RADLADS: Dropping the cost of AI architecture experiments by 250x
Unlocking and accelerating the next wave of AI architecture research
Why do most large AI research labs swear by scaling and avoid architecture research?
What works small often fails big — Architectural innovations that show promise at 1M parameters may break down at 1B or 50B.
Validating at scale is expensive — Training from scratch to test a new architecture at meaningful scale can cost $5–10M or more.
High risk, uncertain reward — You’re just as likely to degrade performance as to improve it, making architecture exploration financially unsustainable for most labs.
Training a state-of-the-art language model from scratch costs roughly $5–10M, just to validate a new attention mechanism, recurrence scheme, or memory system.
In our team's experience, it typically takes 20–80 architecture iterations to achieve a 10%+ improvement. We've done this four times over the past two years.
For most AI labs, that level of experimentation would cost around $250 million in research GPU time (e.g., ~50 training runs at ~$5M each). From that perspective, it's often more rational to invest in scaling model parameters and datasets for a near-guaranteed performance gain of ~10%.
At Featherless, we believe this bottleneck in architecture validation has slowed progress—not only in capabilities but in reliability.
But what if the cost to validate an architecture dropped from $5 million to $20K?
With that same $250 million, we could run over 12,500 iterations, uncovering 100+ architecture improvements, each with 10%+ gains. Compounded, that’s a theoretical 1,378,000% improvement in performance.
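As a quick sanity check on that compounding figure, here is the back-of-the-envelope arithmetic, assuming each of the 100 improvements delivers a full, independent 10% gain:

```python
# Back-of-the-envelope check of the compounding claim above.
# Assumes 100 independent improvements, each worth a 10% gain.
improvements = 100
gain_per_improvement = 0.10

total_factor = (1 + gain_per_improvement) ** improvements
print(f"total performance factor: {total_factor:,.0f}x")   # ~13,781x
print(f"improvement: {(total_factor - 1) * 100:,.0f}%")     # ~1,378,000%
```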
That’s why we’re excited about RADLADS.
Introducing RADLADS
RADLADS (Rapid Attention Distillation to Linear Attention Decoders at Scale) is a new method for converting massive transformer models (e.g., Qwen-72B) into new AI models with alternative attention mechanisms—at a fraction of the original training cost.
Total cost: $2,000–$20,000
Tokens used: ~500 million
Training time: A few days on accessible cloud GPUs (8× MI300)
Cost reduction: ~250× versus validating an architecture by training from scratch
Instead of training from scratch, we convert existing models to new attention architectures in three steps (a rough sketch follows the list):
Align hidden states between the original transformer and the target attention architecture
Distill output behavior (logits) from the original model
Fine-tune for long-context performance
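To make steps 1 and 2 concrete, here is a minimal PyTorch-style sketch of the two distillation losses. This is an illustration under our assumptions, not the actual RADLADS training code: `teacher` (the frozen original transformer), `student` (the same model with its attention layers swapped for the new mechanism), `batch`, and `optimizer` are placeholders, and the loss weighting is illustrative.

```python
import torch
import torch.nn.functional as F

def hidden_state_alignment_loss(teacher_hiddens, student_hiddens):
    """Step 1: push each student layer's hidden states toward the frozen
    teacher's, so the new attention mechanism learns to reproduce the
    original transformer's internal representations."""
    return sum(
        F.mse_loss(s, t.detach())
        for s, t in zip(student_hiddens, teacher_hiddens)
    ) / len(teacher_hiddens)

def logit_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Step 2: match the teacher's output distribution (KL over logits)."""
    t = F.log_softmax(teacher_logits.detach() / temperature, dim=-1)
    s = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

# Illustrative training step (teacher, student, batch, optimizer are placeholders):
# with torch.no_grad():
#     t_logits, t_hiddens = teacher(batch)   # frozen original transformer
# s_logits, s_hiddens = student(batch)       # same weights, new attention
# loss = hidden_state_alignment_loss(t_hiddens, s_hiddens) \
#        + logit_distillation_loss(t_logits, s_logits)
# loss.backward()
# optimizer.step()
# optimizer.zero_grad()
```

Step 3, the long-context fine-tune, is conventional training on longer sequences and is omitted from the sketch.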
You can read the full details of the process in our paper on Hugging Face and arXiv. This is the same technique that allowed us to train our latest 72B attention-free model with only 8 GPUs.
What does this mean for research?
RADLADS is already changing how we explore AI architecture. We can now:
Rapidly test novel attention mechanisms and hybrid designs
Iterate on model structures in days, not months
Validate alignment and interpretability hypotheses at scale
This isn’t just about RWKV. It opens doors for advancing Transformers, State Space models, xLSTMs, and architectures yet to be imagined. It’s about accelerating our pace of research.
And we’re not doing it alone. Since announcing our work, we've collaborated with other researchers to validate multiple attention mechanisms, including Transformer-based variants.
If your research team or university lab is working on an attention alternative and looking to validate it in collaboration, reach out to us.
It’s all part of our mission to make personalized, reliable AI, and eventually AGI, a reality.
One more thing:
Qwerky 2, based on the RWKV architecture & Qwen 3 models, is already training...
Translation:
A linear, GPT-4o-class text model is on its way...
After that, it’s O1- and O3-class.