This is really impressive! Do you have any metrics on long context benchmarks such as RULER or NIAH? That seems to be the last advantage an attention mechanism would hold, compared to a state-space approach like this.
I work mostly with quantized models and this is very exciting for me. While I can get great performance with most 8/12B Q4 models, *memory* is still a huge bottleneck (I work mostly on open source for assistive tech for brain trauma folks like myself).
I'm very excited to see where this goes. Glad I found Featherless!
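Since the memory bottleneck mentioned above is exactly where a fixed-state recurrent model differs from attention, here is a rough back-of-the-envelope sketch comparing transformer KV-cache growth with a constant-size recurrent state. The layer counts, head counts, and state expansion factor are illustrative placeholders, not the actual Qwerky/RWKV configurations:

```python
# Back-of-the-envelope memory comparison: a transformer's KV cache vs. a
# fixed-size recurrent state. All dimensions below are illustrative
# placeholders, not the actual Qwerky/RWKV configurations.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    # K and V are each (context_len, n_kv_heads, head_dim) per layer,
    # so the cache grows linearly with context length.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

def recurrent_state_bytes(n_layers: int, d_model: int,
                          state_expansion: int = 64, bytes_per_elem: int = 2) -> int:
    # An RWKV-style per-layer state is roughly d_model x a fixed expansion
    # factor, independent of how many tokens have been processed.
    return n_layers * d_model * state_expansion * bytes_per_elem

# Roughly 8B-class transformer with grouped-query attention, fp16 cache:
kv_32k = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32_768)
state = recurrent_state_bytes(n_layers=32, d_model=4096)

print(f"KV cache @ 32k context: {kv_32k / 2**30:.2f} GiB")  # grows with context
print(f"recurrent state       : {state / 2**30:.3f} GiB")   # constant in context
```

Under these assumed dimensions the KV cache alone reaches a few GiB at 32k context, while the recurrent state stays the same size no matter how long the context runs.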
> This is really impressive! Do you have any metrics on long context benchmarks such as RULER or NIAH? That seems to be the last advantage an attention mechanism would hold, compared to a state-space approach like this.
For the RWKV v7 paper ( https://arxiv.org/pdf/2503.14456 ):
We covered 3B models that have been fully trained with long context to pass 32k NIAH tests,
with evidence showing that usable context length scales with parameter count.
We forecast that a 70B model, given sufficient long-context data, should hold all the way to 512K context length without issues.
Note: the qwerky-v1 models are not long-context trained, but the upcoming qwerky-v2 is planned to be.
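For anyone who wants to sanity-check a model against the 32k needle test described above, below is a minimal sketch of a NIAH-style probe, assuming a generic Hugging Face-style model and tokenizer; the filler text, prompt wording, and helper names are placeholders rather than the official RULER/NIAH harness:

```python
# Minimal NIAH-style probe: hide a "needle" fact inside ~context_len tokens of
# filler and check whether the model can retrieve it. This is a rough sketch,
# not the official RULER/NIAH harness.
import random

def build_haystack(tokenizer, needle: str, target_tokens: int, depth: float) -> str:
    """Repeat filler text to ~target_tokens and insert the needle at a
    fractional depth in [0, 1] (0 = start of context, 1 = end)."""
    filler = "The quick brown fox jumps over the lazy dog."
    tokens_per_filler = len(tokenizer.encode(filler))
    n_repeats = max(1, target_tokens // tokens_per_filler)
    sentences = [filler] * n_repeats
    sentences.insert(int(n_repeats * depth), needle)
    return " ".join(sentences)

def niah_pass(model, tokenizer, context_len: int = 32_768, depth: float = 0.5) -> bool:
    """Return True if the model repeats the hidden magic number back."""
    secret = str(random.randint(100_000, 999_999))
    needle = f"The magic number is {secret}."
    prompt = build_haystack(tokenizer, needle, context_len, depth)
    prompt += "\n\nWhat is the magic number mentioned above? Answer with the number only."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=16)
    answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    return secret in answer

# Example usage (model/tokenizer loading omitted; any causal LM with a long
# enough context window works here):
#   passed = all(niah_pass(model, tokenizer, 32_768, d) for d in (0.1, 0.5, 0.9))
```

Sweeping the depth parameter across the context is what distinguishes a real retrieval check from a model that only remembers the most recent tokens.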
Love this work, make RNNs great again!
nice work, really close to Qwen2.5 this time