A linear transformer has just crossed the gold standard of transformer models, LLaMA 7B, in both English and multilingual evals, while being trained on fewer tokens. A historic first.
What is the thinking behind having channel mixing? I would have thought that could be captured in the time mixing, given the right matrix initialisation?
Channel mixing can be thought of as more of a short-term / sliding-window attention.
This allows it to focus on the more immediate context for the token prediction, using data fed to it from the previous layers / time mix / channel mix.
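As a rough illustration of why channel mix acts like a very short window, here is a toy sketch in the style of the RWKV-4 channel-mix block (the Eagle formulation may differ in details, so treat the exact formula as an assumption): the only context it ever sees is the current token's hidden vector and the token-shifted previous one.

```
import numpy as np

def channel_mix(x_t, x_prev, mu_k, mu_r, W_k, W_v, W_r):
    """Toy RWKV-4-style channel mix: the only 'context' is the current
    hidden vector x_t and the previous token's vector x_prev (token shift)."""
    k = W_k @ (mu_k * x_t + (1 - mu_k) * x_prev)   # key from the two-token mix
    r = W_r @ (mu_r * x_t + (1 - mu_r) * x_prev)   # receptance (gate)
    v = W_v @ np.square(np.maximum(k, 0))          # squared-ReLU feed-forward
    return 1 / (1 + np.exp(-r)) * v                # sigmoid(r) gates the output
```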
I see, so the purpose of channel mixing is to allow non-linearities from the time mixing to be taken in. Still, it seems to me that is being done anyway in the time mixing of the next layer (and the channel mixing). If there were cross terms in the channel mixing, that would be different - but it seems to be linear in context length (while only considering the current and previous input). I'm probably missing something.
On a separate note, it's pretty interesting how information propagates to higher layers and maintains memory of the distant past - seemingly in a way that does not work for sliding window attention in a standard transformer.
You're not wrong; there have been discussions before about whether we should remove channel mix. However, from previous experiments (which are outdated, as they used the previous version of time mix), the model performed slightly worse without channel mix.
The biggest deciders for an architecture / version change are the loss curve and evals.
Longer-range memories (>50 tokens) are handled by time mix, which maintains the state between tokens indefinitely; the model learns to make use of this state to either remember things or discard them from memory.
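Loosely, the recurrence being described - a running state that is decayed and updated once per token - looks like the sketch below. This is a toy per-channel version in the spirit of the RWKV-4 WKV recurrence (ignoring numerical stabilisation); Eagle's actual time mix uses a richer state, but the carry-a-state-forward idea is the same.

```
import numpy as np

def time_mix_scan(keys, values, decay, bonus):
    """Toy recurrence: a decayed running sum of past values is carried
    between tokens, so information can persist for as long as the learned
    decay keeps its weight close to 1."""
    num = np.zeros_like(values[0])   # running sum of exp(k) * v
    den = np.zeros_like(values[0])   # running sum of exp(k)
    outputs = []
    for k, v in zip(keys, values):
        out = (num + np.exp(bonus + k) * v) / (den + np.exp(bonus + k))
        outputs.append(out)
        num = np.exp(-decay) * num + np.exp(k) * v   # update the carried state
        den = np.exp(-decay) * den + np.exp(k)
    return outputs
```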
Thanks Eugene, appreciate the comment. This paper covers initialisation and simplification and may be of interest: https://arxiv.org/abs/2311.01906 Easier said than done, of course, because it's clearly risky to just change the architecture.
Regarding longer range, I see there's a linear drop-off in the information with context (which is then placed in an exponent), but I guess the later layers have non-linearities and are able to compensate for that decline and somehow keep important information in "memory"?
Amazing. Are there any GPU or CPU demonstrations? How fast does it run?
You can try it online today on:
- our Hugging Face Space: https://huggingface.co/spaces/recursal/EagleX-7B-1.7T-Gradio-Demo
- our new cloud platform: https://recursal.ai
Is your model faster than Mistral 7B AWQ on say T4?
I am not sure how to compare. Mistral Super Fast seems to output at the same speed, but the HF Space does not say exactly what kind of hardware it runs on.
https://huggingface.co/spaces/osanseviero/mistral-super-fast
With the exact same settings, from an architecture standpoint, ours should be faster overall.
However, while we do support quantization, we do not support speculative decoding, for now.
As a result, transformer models are able to match or beat our model in speed with the support of speculative decoding - which we do plan to add support for in the future as well.
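For readers unfamiliar with the term: speculative decoding has a cheap draft model propose a few tokens, which the large model then verifies in a single batched pass. A conceptual, framework-agnostic sketch of the greedy-acceptance variant (both callables are hypothetical placeholders, and the full algorithm uses rejection sampling rather than exact greedy matching):

```
def speculative_step(prompt, draft_propose, target_greedy_next, n_draft=4):
    """Greedy speculative decoding, conceptually.
    draft_propose(prompt, n): n tokens from a cheap draft model (hypothetical helper).
    target_greedy_next(prompt, drafted): the large model's greedy token at each
    drafted position, computed in one batched forward pass (hypothetical helper)."""
    drafted = draft_propose(prompt, n_draft)
    verified = target_greedy_next(prompt, drafted)
    accepted = []
    for d, v in zip(drafted, verified):
        accepted.append(v)    # the target model's token is always valid output
        if d != v:            # first disagreement: stop accepting draft tokens
            break
    return prompt + accepted
```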
Is there any Google Colab to run this model, or to run it on a CPU? I ran this model on CPU on a high-RAM machine and it works, but it is very slow - something like 1 minute per token or more.
```
import os
os.environ["RWKV_JIT_ON"] = "1"    # standard rwkv package flags
os.environ["RWKV_CUDA_ON"] = "0"   # CPU-only: skip the CUDA kernel

from huggingface_hub import hf_hub_download
from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Download the EagleX weights and load them on CPU with int8 weight quantization.
model_path = hf_hub_download(repo_id="recursal/EagleX_1-7T", filename="EagleX-1_7T.pth")
model = RWKV(model=model_path, strategy="cpu fp32i8")  # fp32 compute, int8 quantized weights
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")
```
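For reference, a minimal generation call with the pipeline built above might look like this (assuming the standard `rwkv` package `PIPELINE.generate` API; the prompt and sampling settings are just placeholders):

```
args = PIPELINE_ARGS(temperature=1.0, top_p=0.7)
pipeline.generate("The following is a short story.\n", token_count=64, args=args, callback=print)
```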