We’re excited to release the latest addition to the RWKV model family: Flock of Finches 37B-A11B v0.1!
This is an experimental model that uses 11 billion active parameters, and despite our new flock having been trained on only 109 billion tokens, it roughly matches our recently released Finch 14B model on common benchmark evaluation scores. You can find the model and code on Hugging Face, or try it on the Featherless AI platform.
We leveraged an efficient Sparse Mixture of Experts (MoE) method to supply a higher total parameter count while activating only a fraction of those parameters for any given token. This saves time and uses less compute during both training and inference. As with most architectural choices, there is a tradeoff; increased efficiency comes in exchange for higher VRAM usage.
From our perspective, the ability to inexpensively train and run a more capable model seems well worth that cost.
GPU Sponsor: TensorWave
We trained Flock of Finches on 16 AMD MI300X GPUs kindly donated by TensorWave, over a period of nearly four weeks. Each MI300X comes with a whopping 192GB of VRAM, which easily accommodated the added VRAM requirements we had for MoE.
This allowed us to use our limited time efficiently, finding the best hyper-parameters and doing training instead of spending days or weeks developing software workarounds.
MoE Overview
A large part of the knowledge and intelligence of LLMs comes from a component known as the Feed Forward Network (FFN), sometimes called the Channel Mixer. We added a flock of eight new FFN “experts” to a Finch 7B checkpoint that had been trained on around 2 trillion tokens, then continued training it for only 109 billion more.
The original Finch FFN in Flock of Finches is always evaluated as usual, acting as the leader of the flock, and we call it the “shared expert”. Alongside this shared expert, one additional expert from the flock is chosen for each token, and the two results are added together.
This forms the mathematical equivalent of a double-width dynamically chosen FFN. The shared expert contributes the shared intelligence learned during the original 2 trillion tokens of training, while the new experts in the flock selectively contribute new information depending on context.
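For intuition, here’s a minimal PyTorch sketch of how a shared expert plus one routed expert per token can be combined; the module structure, activation, and dimensions are illustrative assumptions on our part, not the actual Flock of Finches code.

```python
import torch
import torch.nn as nn

class SharedPlusRoutedFFN(nn.Module):
    """Illustrative sketch: one always-on shared FFN plus one routed FFN per token."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8):
        super().__init__()
        # The shared expert stands in for the original Finch FFN (always evaluated).
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model)
        )
        # The "flock": eight additional experts, only one of which runs per token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor, expert_idx: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); expert_idx: (batch, seq) integer expert id per token.
        out = self.shared(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                       # tokens routed to expert i
            if mask.any():
                out[mask] = out[mask] + expert(x[mask])  # add routed output to shared output
        return out
```

Because the shared and routed outputs are simply summed, each token effectively passes through a double-width FFN whose second half is chosen dynamically.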
MoE Shared Expert
A few choices we made were unusual, and make Flock of Finches a bit different from other MoE architectures you may have encountered in the wild. One such choice was to use a shared expert and add eight fresh experts, instead of replacing the original FFN with eight cloned copies and continuing training from there.
We found that this setup learned much faster, even when accounting for the extra width and therefore computation it adds. We also discovered that with this setup we were able to use an extremely high initial learning rate for the new experts, eventually annealing it down to the original model’s learning rate as training progressed.
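As a rough illustration of that learning rate strategy (illustrative values only, not our real training configuration), separate optimizer parameter groups can give the fresh experts a much higher starting learning rate that anneals down toward the base rate:

```python
import torch

# Reusing the illustrative SharedPlusRoutedFFN module sketched above.
model = SharedPlusRoutedFFN(d_model=4096, d_hidden=14336, num_experts=8)

# Hypothetical values for illustration; not the actual training hyper-parameters.
base_lr = 1e-5          # roughly where the original model's schedule left off
expert_init_lr = 1e-3   # much higher starting rate for the freshly initialized experts
total_steps = 10_000

optimizer = torch.optim.AdamW([
    {"params": model.shared.parameters(), "lr": base_lr},
    {"params": model.experts.parameters(), "lr": expert_init_lr},
])

def expert_lr(step: int) -> float:
    # Anneal the expert learning rate linearly down to the base rate over training.
    frac = min(step / total_steps, 1.0)
    return expert_init_lr + frac * (base_lr - expert_init_lr)

for step in range(total_steps):
    optimizer.param_groups[1]["lr"] = expert_lr(step)
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
```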
MoE Hash Routing
Another unusual choice we made was to use hash routing instead of a trained top-k gated router. We chose this partly for simplicity and speed, but also because it gives us a naturally even token-to-expert routing distribution, which we hope will improve inference efficiency. Hash routing is extremely simple: we take the token index fed into the model, add a prime number, and use the result modulo eight as the index of the expert that token is sent to for processing. Many other MoE models instead use a learned gating function, which has to be trained rather than being fixed before training begins.
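In code, the routing rule amounts to a couple of lines; the prime below is just an illustrative placeholder, and the sketch assumes the token index is the token’s vocabulary id:

```python
NUM_EXPERTS = 8
PRIME = 5099  # illustrative prime; not the actual constant used in training

def hash_route(token_id: int) -> int:
    """Deterministic hash routing: map a token to one of the eight new experts.

    There is nothing to train here, and the mapping is fixed before training,
    which keeps the token-to-expert load naturally even.
    """
    return (token_id + PRIME) % NUM_EXPERTS

# Example: route a short sequence of (hypothetical) token ids to experts.
token_ids = [17, 2048, 33001, 9]
expert_ids = [hash_route(t) for t in token_ids]
```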
And one final, very RWKV-specific quirk was our use of token-shift with these new experts. RWKV’s FFN ordinarily performs a unique kind of 1D convolution called token-shift, which mixes parts of the current and prior token together. This allows the model to perform some kinds of operations in a single layer that a traditional transformer would need two layers to accomplish. We tried various ways of applying token-shift to our new experts, and in the end the most efficient approach was to perform a single shift on the input and feed that shifted input to both the shared and new experts. The gate applied to the FFN outputs is likewise generated from a single token-shift and applied uniformly to the combined output.
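Here’s a rough sketch of that arrangement, assuming the usual RWKV-style token-shift (a per-channel interpolation between each token and the one before it); the function names and shapes are illustrative rather than taken from the released code:

```python
import torch
import torch.nn.functional as F

def token_shift(x: torch.Tensor, mix: torch.Tensor) -> torch.Tensor:
    """RWKV-style token-shift: per-channel interpolation of each token with its predecessor.

    x:   (batch, seq, d_model)
    mix: (d_model,) learned mixing coefficients in [0, 1]
    """
    x_prev = F.pad(x, (0, 0, 1, -1))  # shift right by one position along the sequence
    return x * mix + x_prev * (1.0 - mix)

# One shift feeds both the shared expert and the routed expert, and a single
# shifted projection produces one gate for the combined output (illustrative only).
def moe_channel_mix(x, mix_in, mix_gate, moe_ffn, expert_idx, gate_proj):
    x_in = token_shift(x, mix_in)       # same shifted input for every expert
    out = moe_ffn(x_in, expert_idx)     # e.g. the SharedPlusRoutedFFN sketch above
    gate = torch.sigmoid(gate_proj(token_shift(x, mix_gate)))
    return gate * out                   # one gate applied to the combined output
```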
Benchmark
We evaluated Flock of Finches across a range of common industry-standard benchmarks using EleutherAI’s lm-eval-harness. Scores were higher on some benchmarks and lower on others, but overall the model lands at roughly the same level as our recently released Finch 14B model. This is an interesting result for us, as Flock of Finches has significantly fewer active parameters (11B versus 14B), and those parameters are more heavily concentrated in the Feed Forward Network portion of the model.
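If you’d like to run a similar evaluation yourself, a minimal sketch using the lm-eval-harness Python API might look like the following; the task list and model arguments are illustrative assumptions, and whether the checkpoint loads through the standard hf backend (and whether trust_remote_code is needed) will depend on your lm-eval and transformers versions:

```python
import lm_eval

# Illustrative evaluation run; task selection and arguments are assumptions,
# not the exact configuration used for the reported results.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=recursal/Finch-MoE-37B-A11B-v0.1-HF,trust_remote_code=True",
    tasks=["lambada_openai", "hellaswag", "arc_challenge"],
    batch_size=8,
)
print(results["results"])
```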
The Takeaway
Flock of Finches 37B-A11B features a new Mixture of Experts RWKV-6 architecture with 11 billion active parameters and 37 billion parameters total. It’s the largest RWKV MoE model yet, but it’s just our first step combining MoE with the RWKV architecture. We’re excited to expand the use of MoE to the Time Mixer portion of RWKV, and to try more complex MoE ideas like employing expert parameter sharing across breadth and depth, and combining a larger number of narrower experts.
We hope you’ll give Flock of Finches a try and see how the RWKV ecosystem is growing with new, more powerful models.
References
Weights & Code: https://huggingface.co/recursal/Finch-MoE-37B-A11B-v0.1-HF
Acknowledgements
Special thanks to TensorWave and AMD for sponsoring the Flock of Finches MI300X training run
Recursal AI for its commitment to providing resources and development for the RWKV ecosystem - you can use their featherless.ai platform to easily run RWKV and compare it to other language models
EleutherAI for support and guidance, especially on benchmarks and publishing research papers about the RWKV architecture
Linux Foundation AI & Data group for supporting and hosting the RWKV project
And of course a huge thank you to the many developers around the world working hard to improve the RWKV ecosystem and provide environmentally friendly open source AI for all.