A linear transformer has just crossed the gold standard among transformer models, LLaMA 7B, with fewer training tokens, in both English and multilingual evals. A historic first.
What is the thinking behind having channel mixing? I would have thought that could be captured in the time mixing, given the right matrix initialization?
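For context, a rough sketch (not the official RWKV code; names, shapes, and constants are assumptions) of what a channel-mixing block typically looks like in RWKV-style models, and why it differs from time mixing: channel mixing only looks at the current token and the previous one (token shift) and mixes across channels, like a gated feed-forward layer, while time mixing keeps a decayed running state over all past tokens.

```python
# Hypothetical sketch of an RWKV-style channel-mixing block (toy sizes).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D = 8  # hidden size (toy value)
rng = np.random.default_rng(0)
Wr, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
mu_r, mu_k = rng.random(D), rng.random(D)

def channel_mix(x_t, x_prev):
    # Mixes the current token with the previous one (token shift),
    # then mixes across channels -- essentially a gated feed-forward block.
    xr = mu_r * x_t + (1 - mu_r) * x_prev
    xk = mu_k * x_t + (1 - mu_k) * x_prev
    r = sigmoid(Wr @ xr)                 # gate
    k = np.maximum(Wk @ xk, 0.0) ** 2    # squared ReLU
    return r * (Wv @ k)

# Time mixing, by contrast, maintains a decayed state over *all* past tokens,
# roughly state = exp(-w) * state + exp(k_t) * v_t, which a purely per-token
# channel mix cannot reproduce regardless of how its matrices are initialized.
```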
Amazing. Are there any GPU or CPU demonstrations? How fast does it run?