Featherless.ai Introduces Qwerky-72B: The Best Post-Transformer Model Yet
Qwerky-72B: Revolutionizing AI Efficiency and Accessibility with Hybrid Transformer Architecture
At Featherless.ai, we’re thrilled to announce the launch of Qwerky-72B, a revolutionary hybrid model combining the computational efficiency of linear transformers with the precision of attention mechanisms. This breakthrough architecture cuts GPU compute costs by over 50% compared to traditional transformers, making it one of the most cost-effective large language models available today; the model itself cost less than $100k to build.
Qwerky-72B sets a new standard for scalability and accessibility, enabling real-time applications across industries.
Why Qwerky-72B Matters
The introduction of Qwerky-72B marks a pivotal moment in AI development. By merging the strengths of linear transformers and attention transformers, Qwerky-72B achieves unparalleled efficiency without sacrificing performance.
Traditional transformer-based models require ~100GB of VRAM (excluding model weights) to serve a single 72B-parameter model at a 16k context length. Each additional concurrent request adds significantly to this demand: the attention KV cache grows linearly with both context length and concurrency, while attention compute scales quadratically with sequence length.
In contrast, Qwerky-72B achieves remarkable efficiency by leveraging its hybrid linear-transformer architecture. At 72B parameters, Qwerky-72B requires just 1GB of additional VRAM per request (excluding model weights), regardless of context length.
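To make the memory comparison concrete, here is a back-of-envelope sketch of how a standard transformer's per-request KV cache grows with context length. The architecture numbers (80 layers, 8 grouped KV heads of dimension 128, fp16) are illustrative assumptions for a 72B-class model, not the published Qwerky-72B configuration, and the sketch counts only the KV cache, not activation memory.

```python
# Illustrative per-request VRAM growth for a standard transformer
# (KV cache only, excluding model weights and activations).
# Assumed 72B-class config: 80 layers, 8 grouped KV heads, head dim 128, fp16.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 80, 8, 128, 2

def kv_cache_gib(context_len: int) -> float:
    """KV cache for one request: keys + values, every layer, every token."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * context_len / 2**30

for ctx in (16_384, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):6.1f} GiB KV cache")
# 16k tokens ->  5.0 GiB; 128k tokens -> 40.0 GiB for this assumed config.

# A linear-attention model instead keeps a fixed-size recurrent state per
# request, so its per-request memory stays constant at any context length.
```

The point of the sketch is the shape of the curve, not the exact numbers: per-request transformer memory scales linearly with context length, whereas a constant-size recurrent state does not grow at all.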
These innovations unlock several critical benefits:
Lower Inference Costs: Businesses can deploy advanced AI solutions at a fraction of the cost, particularly for tasks requiring long-context processing on affordable GPUs.
Global Accessibility: Reduced hardware requirements enable smaller organizations and developing nations to leverage state-of-the-art AI technologies.
Higher Concurrency: Serve more users simultaneously on the same hardware.
Scalability: Handle longer context lengths without runaway increases in resource usage.
This launch aligns with our mission to make AI accessible to everyone, regardless of language or nation. By merging the computational efficiency of linear transformers with the precision of attention mechanisms, Qwerky-72B runs at a fraction of the inference cost of current models, especially at larger context lengths. This is a key cost multiplier, not only for making AI accessible to the world but also for the recent test-time-compute style models, which generate long chains of reasoning tokens at inference time.
Real-time Voice: Instant, Natural AI Conversations
Alongside Qwerky-72B, we’re proudly introducing Real-time Voice, an ultra-low latency AI speech solution designed for seamless human-computer interaction. Built on our serverless infrastructure, it delivers fast speech processing and generation, minimizing delays for a more natural conversation experience. This allows individuals and businesses to build interactive voice applications that respond instantly and accurately. Whether for virtual assistants, global call centers, or interactive voice response systems, Real-time Voice ensures fast, reliable, and cost-efficient AI-powered speech applications.
Private Cloud Beta
Featherless.ai now offers a Private Cloud solution for organizations that need full control over their AI deployments. Our dedicated, secure environments allow businesses to run open models with the ease of serverless infrastructure, zero maintenance, pay-per-use pricing, and complete data sovereignty. With Private Cloud, sensitive data stays protected while maintaining the scalability and flexibility required for modern AI applications. Whether you're an enterprise prioritizing compliance and security or a developer needing custom AI deployment, Featherless.ai’s Private Cloud delivers seamless, cost-efficient AI hosting with full control over where and how your data is processed.
We invite you to experience the future of AI with Qwerky-72B, our real-time voice capabilities, and our private cloud solutions. Together, let’s make AI more accessible to everyone, regardless of language or nation. Visit our website to access the model via our API, or download the model directly from HuggingFace.
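For readers who want to try the model programmatically, the sketch below builds a standard OpenAI-style chat-completions payload. The base URL and model id shown are assumptions for illustration, not confirmed values; check the Featherless.ai documentation for the actual endpoint, model identifier, and authentication details.

```python
# Hypothetical sketch of calling Qwerky-72B through an OpenAI-compatible
# chat-completions API. BASE_URL and MODEL_ID are assumed placeholders,
# not confirmed values -- consult the Featherless.ai docs.
import json

BASE_URL = "https://api.featherless.ai/v1"   # assumed endpoint
MODEL_ID = "featherless-ai/Qwerky-72B"       # assumed model id

def build_request(prompt: str) -> dict:
    """Build a standard chat-completions request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

payload = build_request("Summarize the benefits of linear attention.")
print(json.dumps(payload, indent=2))

# To actually send the request (requires an API key), POST the payload to
# f"{BASE_URL}/chat/completions" with headers:
#   Authorization: Bearer <your API key>
#   Content-Type: application/json
```

Because the request shape follows the widely used chat-completions convention, existing OpenAI-compatible client libraries should work by pointing them at the provider's base URL.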
Featherless.ai
Featherless.ai is a serverless inference platform. Our goal is to make all AI models available for serverless inference. We provide inference via API to a continually expanding library of open-weight models, including the most popular models for role-playing, creative writing, coding assistance, and more. Our solutions enable enterprises and individuals to harness the full potential of artificial intelligence without worrying about underlying infrastructure. Featherless.ai offers scalable, secure, and easy-to-use tools that empower businesses and individuals alike to accelerate their AI initiatives. For more information, visit www.featherless.ai.