Featherless.ai Introduces Qwerky-72B: The Best Post-Transformer Model Yet
Qwerky-72B: Revolutionizing AI Efficiency and Accessibility with Hybrid Transformer Architecture
At Featherless.ai, we’re thrilled to announce the launch of Qwerky-72B, a revolutionary hybrid model combining the computational efficiency of linear transformers with the precision of attention mechanisms. This breakthrough architecture cuts GPU compute costs by over 50% compared to traditional transformers, making it one of the most cost-effective large language models available today; the model itself cost less than $100k to build.
Qwerky-72B sets a new standard for scalability and accessibility, enabling real-time applications across industries.
Why Qwerky-72B Matters
The introduction of Qwerky-72B marks a pivotal moment in AI development. By merging the strengths of linear transformers and attention transformers, Qwerky-72B achieves unparalleled efficiency without sacrificing performance.
Traditional transformer-based models require ~100GB of VRAM (excluding model weights) to serve a single 72B-parameter model at a 16k context length. Each additional concurrent request adds significantly to this demand: the attention KV cache grows linearly with both context length and concurrency, while attention compute scales quadratically with sequence length.
In contrast, Qwerky-72B achieves remarkable efficiency by leveraging its hybrid linear-transformer architecture. At 72B parameters, Qwerky-72B requires just 1GB of additional VRAM per request (excluding model weights), regardless of context length.
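To make the memory comparison concrete, here is a back-of-envelope sketch of how a standard transformer's per-request KV cache grows with context length. The architecture numbers (80 layers, 8 grouped KV heads of dimension 128, fp16) are illustrative assumptions for a 72B-class model, not the published Qwerky-72B configuration, and the sketch counts only the KV cache, not activation memory.

```python
# Illustrative per-request VRAM growth for a standard transformer
# (KV cache only, excluding model weights and activations).
# Assumed 72B-class config: 80 layers, 8 grouped KV heads, head dim 128, fp16.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 80, 8, 128, 2

def kv_cache_gib(context_len: int) -> float:
    """KV cache for one request: keys + values, every layer, every token."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * context_len / 2**30

for ctx in (16_384, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):6.1f} GiB KV cache")
# 16k tokens ->  5.0 GiB; 128k tokens -> 40.0 GiB for this assumed config.

# A linear-attention model instead keeps a fixed-size recurrent state per
# request, so its per-request memory stays constant at any context length.
```

The point of the sketch is the shape of the curve, not the exact numbers: per-request transformer memory scales linearly with context length, whereas a constant-size recurrent state does not grow at all.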
These innovations unlock several critical benefits:
Lower Inference Costs: Businesses can deploy advanced AI solutions at a fraction of the cost, particularly for tasks requiring long-context processing on affordable GPUs.
Global Accessibility: Reduced hardware requirements enable smaller organizations and developing nations to leverage state-of-the-art AI technologies.
Higher Concurrency: Serve more users simultaneously on the same hardware.
Scalability: Handle longer context lengths without runaway increases in resource usage.
This launch aligns with our mission to make AI accessible to everyone, regardless of language or nation. By merging the computational efficiency of linear transformers with the precision of attention mechanisms, Qwerky-72B runs at a fraction of the inference cost of current models, especially at larger context lengths. This is a key cost multiplier, not only for making AI accessible to the world but also for the recent test-time-compute style models, which generate long chains of reasoning tokens at inference time.
Real-time Voice: Instant, Natural AI Conversations
Alongside Qwerky-72B, we’re proudly introducing Real-time Voice, an ultra-low latency AI speech solution designed for seamless human-computer interaction. Built on our serverless infrastructure, it delivers fast speech processing and generation, minimizing delays for a more natural conversation experience. This allows individuals and businesses to build interactive voice applications that respond instantly and accurately. Whether for virtual assistants, global call centers, or interactive voice response systems, Real-time Voice ensures fast, reliable, and cost-efficient AI-powered speech applications.
Private Cloud Beta
Featherless.ai now offers a Private Cloud solution for organizations that need full control over their AI deployments. Our dedicated, secure environments allow businesses to run open models with the ease of serverless infrastructure, zero maintenance, pay-per-use pricing, and complete data sovereignty. With Private Cloud, sensitive data stays protected while maintaining the scalability and flexibility required for modern AI applications. Whether you're an enterprise prioritizing compliance and security or a developer needing custom AI deployment, Featherless.ai’s Private Cloud delivers seamless, cost-efficient AI hosting with full control over where and how your data is processed.
We invite you to experience the future of AI with Qwerky-72B, our real-time voice capabilities, and our private cloud solutions. Together, let’s make AI more accessible to everyone, regardless of language or nation. Visit our website to access the model via our API, or download the model directly from HuggingFace.
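For readers who want to try the model programmatically, the sketch below builds a standard OpenAI-style chat-completions payload. The base URL and model id shown are assumptions for illustration, not confirmed values; check the Featherless.ai documentation for the actual endpoint, model identifier, and authentication details.

```python
# Hypothetical sketch of calling Qwerky-72B through an OpenAI-compatible
# chat-completions API. BASE_URL and MODEL_ID are assumed placeholders,
# not confirmed values -- consult the Featherless.ai docs.
import json

BASE_URL = "https://api.featherless.ai/v1"   # assumed endpoint
MODEL_ID = "featherless-ai/Qwerky-72B"       # assumed model id

def build_request(prompt: str) -> dict:
    """Build a standard chat-completions request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

payload = build_request("Summarize the benefits of linear attention.")
print(json.dumps(payload, indent=2))

# To actually send the request (requires an API key), POST the payload to
# f"{BASE_URL}/chat/completions" with headers:
#   Authorization: Bearer <your API key>
#   Content-Type: application/json
```

Because the request shape follows the widely used chat-completions convention, existing OpenAI-compatible client libraries should work by pointing them at the provider's base URL.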
Featherless.ai
Featherless.ai is a serverless inference platform. Our goal is to make all AI models available for serverless inference. We provide inference via API to a continually expanding library of open-weight models, including the most popular models for role-playing, creative writing, coding assistance, and more. Our solutions enable enterprises and individuals to harness the full potential of artificial intelligence without worrying about underlying infrastructure. Featherless.ai offers scalable, secure, and easy-to-use tools that empower businesses and individuals alike to accelerate their AI initiatives. For more information, visit www.featherless.ai.