On-device Audio Generation Accelerated by 30x with Arm Kleidi

Imagine editing a video on your smartphone and needing the perfect sound effect or wanting to generate your own custom sound for a ringtone, alarm or social media post. Instead of searching online or purchasing audio clips, you type a description – “gentle ocean waves at sunset” – and within seconds, your device generates the perfect sound without even connecting to the internet. This seamless, on-the-spot audio generation entirely on the device has become a reality, thanks to a new collaboration between Arm and Stability AI.
Arm and Stability AI collaboration accelerates text-to-audio response times
To achieve this, Stability AI, which develops AI models for image, video, 3D, and audio, leveraged Arm KleidiAI, which offers optimized performance-critical routines – known as micro-kernels – tailored for Arm CPUs. Through the KleidiAI integrations into the XNNPack library and ExecuTorch framework and Stability AI’s own optimizations, the team unlocked significant AI performance improvements on Stability AI’s text-to-audio open model, “Stable Audio Open.”
The results are remarkable. Text-to-audio generation time falls from minutes to seconds, a 30x faster response. The Stable Audio Open model runs entirely on Arm CPUs in smartphones – a first for text-to-audio AI – with no internet connection required.
Stability AI used the automatic KleidiAI accelerations to speed up on-device model responses without compromising quality. These KleidiAI performance uplifts require no additional effort from developers who use the Stable Audio Open model, saving time and costs. Arm and Stability AI are continuing to work together on further performance improvements to enhance the user experience.
These dramatic improvements show how targeted hardware and software integration can make previously unattainable AI applications feasible on mobile, fueling future innovation opportunities. They also mean that advanced AI audio capabilities are now accessible to billions of smartphone users worldwide, as 99 percent of the world’s smartphones are built on Arm technology.
Solving complex AI challenges together
Despite the efficiency of the Stable Audio Open model, running it directly on-device on smartphone CPUs presented significant challenges. Initial attempts resulted in generation times exceeding four minutes for a single audio sample, rendering the experience impractical for end users.
Working with Arm, Stability AI distilled the model down to the right number of trainable parameters for mobile. Stability AI then took this new distilled model and utilized the KleidiAI performance accelerations from the XNNPack and ExecuTorch integrations, so it could generate audio clips in seconds across Arm CPUs for mobile.
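As a rough sanity check on the figures quoted in this article, the short Python sketch below works out what a 30x speedup implies. The baseline of exactly four minutes is an assumption: the article only says initial generation times exceeded four minutes, so the real starting point may be higher. The function name accelerated_seconds is hypothetical and used purely for illustration.

```python
# Back-of-the-envelope check of the article's figures.
# ASSUMPTION: the baseline is taken as exactly four minutes; the article says
# initial attempts "exceeded" four minutes, so the true value may be larger.
BASELINE_SECONDS = 4 * 60  # assumed lower bound on the original generation time
SPEEDUP = 30               # speedup factor stated in the article


def accelerated_seconds(baseline_seconds: float, speedup: float) -> float:
    """Return the generation time implied by a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_seconds / speedup


if __name__ == "__main__":
    result = accelerated_seconds(BASELINE_SECONDS, SPEEDUP)
    print(f"{BASELINE_SECONDS} s / {SPEEDUP} = {result:.0f} s")
```

Under that assumption, a 30x speedup takes a four-minute generation down to roughly eight seconds, which is consistent with the "minutes to seconds" description.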
“As more and more professional creatives and businesses adopt generative AI to power their production pipeline, it’s important that our models and workflows are available everywhere for builders to build and creators to create. We are excited to partner with Arm for this exact reason. Arm’s prevalence across the ecosystem from the server to the smartphone and its work to accelerate AI models across all the popular frameworks by integrating Arm Kleidi into the software stack, made it a no brainer,” said Prem Akkaraju, CEO, Stability AI.
The rise of text-to-audio AI
Since 2022, Stability AI has been at the forefront of the generative AI evolution, initially making waves with Stable Diffusion, the industry-leading image model. Building upon this success, the company then introduced Stable Audio, one of the first fully licensed audio models designed to generate high-quality music and sound effects from textual prompts. These are some of the top-ranking AI models on leading platforms like Hugging Face, cultivating an active community of millions utilizing these tools.
Arm and Stability AI at MWC
At Mobile World Congress (MWC) 2025, Arm and Stability AI are showcasing the results of the KleidiAI accelerations on the Stable Audio Open model at the Arm booth in Hall 2 Stand I60. The demos are generated using Stability AI’s models and workflows and are all executed offline on Arm-based hardware, including the vivo X200 Series of flagship smartphones built on the MediaTek Dimensity 9400, which features the latest Armv9 CPUs.
Advanced audio AI experiences accessible to all
This is just the start of the partnership between Arm and Stability AI, with yet more performance optimizations planned to further enhance the user experience. Working together, we are both setting the stage for on-device AI across audio, images, video, and 3D – reshaping how everyone creates content and interacts with digital media. By distilling advanced models and leveraging optimized software on ubiquitous hardware, we are paving the way for a future where sophisticated AI applications, models and experiences are accessible to all, directly from the devices in our pockets.
Any re-use permitted for informational and non-commercial or personal use only.