Accelerating and Scaling AI Inference Everywhere with New Llama 3.2 LLMs on Arm
- Significant performance improvements from cloud to edge when running Meta’s new Llama 3.2 release on Arm CPUs, enabling future AI workloads
- Collaboration between Meta and Arm enables faster innovation in use cases such as personalized on-device recommendations or automating routine tasks
- A decade of AI investment from Arm and extensive open-source collaboration enable the new LLMs, ranging from 1B to 90B parameters, to run seamlessly on Arm compute platforms
The rapid evolution of AI means new versions of large language models (LLMs) emerge on a regular cadence. LLMs need to run everywhere, from cloud to edge, to realize the full potential and opportunities of AI, but doing so drives significant compute and energy demands. The ecosystem is rallying to meet this challenge, releasing new, more efficient open-source LLMs that enable a broad range of AI inference workloads at scale and bring new, accelerated AI experiences to users faster.
Through our collaboration with Meta enabling the new Llama 3.2 LLMs on Arm CPUs, we are demonstrating the powerful combination of open-source innovation and the Arm compute platform in tackling this challenge. Arm’s ongoing investments and work with new LLMs like this one mean the ecosystem can automatically see the benefits of running AI on the Arm CPU, making it the platform of choice for developers targeting their AI inference workloads.
Accelerated cloud to edge AI performance
The availability of smaller LLMs that enable fundamental text-based generative AI workloads, such as Llama 3.2 1B and 3B, is critical to enabling AI inference at scale. Running the new Llama 3.2 3B LLM on Arm-powered mobile devices through Arm CPU-optimized kernels delivers a 5x improvement in prompt processing and a 3x improvement in token generation, achieving 19.92 tokens per second in the generation phase. This means lower latency when processing AI workloads on the device and a far more responsive overall user experience. In addition, the more AI is processed at the edge, the less data has to travel to and from the cloud, leading to energy and cost savings.
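As an illustration only, the sketch below shows one way a developer might measure on-device generation speed for a quantized Llama 3.2 3B model using the llama-cpp-python bindings to llama.cpp. The GGUF file name, context size, thread count and prompt are assumptions for the example, not the configuration behind the figures above.

```python
import time

from llama_cpp import Llama

# Load a locally quantized Llama 3.2 3B model (file name is an assumption).
llm = Llama(
    model_path="llama-3.2-3b-instruct-q4_0.gguf",
    n_ctx=2048,
    n_threads=8,  # ideally matched to the number of big cores on the Arm device
)

prompt = "Summarise the benefits of running LLM inference on device."

start = time.perf_counter()
result = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - start

# llama-cpp-python reports prompt and completion token counts in the response.
generated = result["usage"]["completion_tokens"]
print(f"Generated {generated} tokens in {elapsed:.2f}s "
      f"({generated / elapsed:.1f} tokens/s, prompt processing and generation combined)")
```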
Alongside running small models at the edge, we are also able to run larger models, such as Llama 3.2 11B and 90B, in the cloud. The 11B and 90B models are a great fit for CPU-based inference workloads in the cloud that generate text and images, as our data on Arm Neoverse V2 shows. Running the 11B image-and-text model on Arm-based AWS Graviton4, we achieve 29.3 tokens per second in the generation phase. Considering that human reading speed is around 5 tokens per second, this far outpaces it.
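To put such throughput figures in context, here is a minimal sketch of measuring generation throughput with PyTorch and Hugging Face Transformers on a CPU-only instance and comparing it with a ~5 tokens-per-second reading speed. The model ID, prompt and token budget are illustrative assumptions (a text-only model is used; the 11B Vision model needs a multimodal loader), and this is not the benchmark behind the numbers above.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative text-only model ID; swap in the model you actually deploy.
model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Explain why CPU inference scales well in the cloud.",
                   return_tensors="pt")

with torch.inference_mode():
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    elapsed = time.perf_counter() - start

# Count only the newly generated tokens to get generation-phase throughput.
new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
tps = new_tokens / elapsed
print(f"{tps:.1f} tokens/s generated, roughly {tps / 5:.1f}x a ~5 tokens/s reading speed")
```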
AI will scale quickly with open-source innovation and ecosystem collaboration
Having new LLMs like Llama 3.2 openly available is vital. We see that open-source innovation moves incredibly fast: In previous releases the open-source community has had new LLMs up and running on Arm in less than 24 hours.
We are further bolstering the software community through Arm Kleidi, where we are working to enable the entire AI technology stack to leverage this optimized CPU performance. Kleidi unlocks the AI capabilities and performance of Arm Cortex and Neoverse CPUs on any AI framework with no application developer integration required.
Through our recent Kleidi integration with PyTorch and the in-progress integration with ExecuTorch, we are enabling seamless AI performance benefits for developers on the Arm CPU from cloud to edge. The Kleidi integration with PyTorch delivers a 2.5x faster time-to-first token on Arm-based AWS Graviton processors when running the Llama 3 LLM.
Meanwhile, at the edge, the KleidiAI libraries accelerate the time-to-first token for Llama 3 via llama.cpp by 190 percent on the Arm Cortex-X925 CPU compared with the reference implementation.
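For developers who want to see what time-to-first token means on their own hardware, the sketch below times the prompt prefill plus one decode step with PyTorch and Transformers. It is a measurement example only: the model ID is an assumption, and whether Kleidi-accelerated kernels are picked up depends on the aarch64 PyTorch build in use.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Write a haiku about efficient inference."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.inference_mode():
    start = time.perf_counter()
    # Generating a single new token approximates time-to-first token
    # (prompt prefill plus one decode step).
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
    ttft = time.perf_counter() - start

print(f"Time to first token: {ttft * 1000:.0f} ms "
      f"for a {inputs['input_ids'].shape[-1]}-token prompt")
```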
Building the future of AI
When you combine the flexibility, pervasiveness and AI capabilities of the Arm compute platform with the expertise of industry leaders like Meta, you unlock new opportunities for AI at scale. Whether it is an on-device LLM performing tasks on your behalf by understanding your location, schedule and preferences, or enterprise use cases that make us more productive and let us focus on higher-value tasks at work, the integration of Arm's technology is paving the way for a future where devices are not just command-and-control tools but proactive assistants that enhance the user's overall experience.
The AI performance uplifts on the Arm CPU through the new Llama 3.2 LLMs are impressive, and we believe these open collaborations are the best way to enable AI innovation everywhere, in the most sustainable way possible. Through new LLMs, the open-source community and Arm's compute platform, we are building the future of AI, with more than 100 billion Arm-based devices set to be AI-ready by 2025.
Additional resources
For mobile and edge ecosystem developers, Llama 3.2 runs efficiently across Arm Cortex CPU-based devices. See our documentation for developer resources.
Developers can run Llama 3.2 in the cloud on Arm Neoverse CPUs, available from all the major cloud service providers. See our documentation to get started.
Any re-use permitted for informational and non-commercial or personal use only.