More Efficient, Accessible Generative AI on CPU with New Llama 3.3 70B Model on Arm Neoverse-powered Google Axion Processors
Llama is an open and accessible collection of large language models (LLMs) tailored for developers, researchers, and businesses to innovate, experiment, and responsibly scale their generative AI ideas. The Llama 3.1 405B model stands out as the top-performing model in the Llama collection. However, deploying and utilizing such a large-scale model presents significant challenges, especially for individuals or organizations lacking extensive computational resources.
To address those challenges, Meta is introducing the Llama 3.3 70B model, which retains the same architecture as the Llama 3.1 70B model but incorporates the latest advancements in post-training techniques, improving its model evaluation performance with notable gains in reasoning, mathematics, general knowledge, instruction following, and tool use. Compared to the Llama 3.1 405B model, it offers similar performance while being significantly smaller in size.
In close partnership with Meta, Arm’s engineering teams evaluated the inference performance of the Llama 3.3 70B model on Google Axion, a family of custom Arm64-based processors built on Arm Neoverse V2 technology and available through Google Cloud. Google Axion is designed for higher performance, lower power consumption, and greater scalability than legacy, off-the-shelf processors, better preparing Google’s data centers for the age of AI.
Our benchmarking shows that C4A virtual machines (VMs) based on Axion processors deliver seamless AI-based experiences when running the Llama 3.3 70B model, achieving human readability levels across multiple user batch sizes. Human readability refers to the average speed at which a person reads text. This gives developers the flexibility to achieve high-quality performance in text-based applications, comparable to results produced with the Llama 3.1 405B model, without requiring large computational resources.
CPU inferencing performance with Llama 3.3 70B on Google Axion processors
Google Cloud offers Axion-based C4A VMs with up to 72 vCPUs and 576 GB of RAM. For these tests, we used the mid-range, cost-effective c4a-standard-32 machine type to deploy the Llama 3.3 70B model with 4-bit quantization. For our performance testing, we used the popular llama.cpp framework, which as of version b4265 has been optimized with Arm Kleidi. The Kleidi integration provides optimized kernels that let AI frameworks unlock the AI capabilities and performance of Arm CPUs by default.
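As a rough illustration of this kind of setup, the sketch below loads a 4-bit quantized GGUF export of the model through the llama-cpp-python bindings and runs a single completion. This is not the exact benchmark harness used for these results; the model file name, thread count, context size, and prompt are assumptions, and the bindings need to be built against a Kleidi-enabled llama.cpp (b4265 or later) to pick up the optimized kernels.

```python
# Minimal sketch, not the benchmark harness used for the results in this post.
# Assumes llama-cpp-python built against a Kleidi-enabled llama.cpp (>= b4265)
# and a 4-bit (Q4_0) GGUF export of Llama 3.3 70B already on local disk.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.3-70b-instruct-q4_0.gguf",  # hypothetical file name
    n_threads=32,   # one worker thread per vCPU on a c4a-standard-32
    n_ctx=4096,     # context window for the test prompts (assumed)
)

result = llm(
    "Explain the benefits of CPU inference for large language models.",
    max_tokens=256,
)
print(result["choices"][0]["text"])
```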
Now let’s take a closer look at the results.
Prompt encoding speed refers to how quickly user inputs are processed and interpreted by the language model. Because prompt encoding runs in parallel across multiple cores, performance remains consistent at around 50 tokens per second across various batch sizes, as shown in Figure 1, and is comparable for the different prompt sizes tested.
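A simple way to approximate this measurement yourself is to time one full prompt pass and divide by the number of prompt tokens. The sketch below reuses the `llm` object from the earlier example and asks for a single output token so that the elapsed time is dominated by prompt encoding; the prompt text is an assumption for illustration.

```python
import time

# Example prompt, repeated to give the model a reasonably long input (assumed).
prompt = "Summarize the history of the Arm architecture. " * 20
n_prompt_tokens = len(llm.tokenize(prompt.encode("utf-8")))

start = time.perf_counter()
llm(prompt, max_tokens=1)  # one output token, so timing is dominated by prompt encoding
elapsed = time.perf_counter() - start

print(f"Prompt encoding speed: {n_prompt_tokens / elapsed:.1f} tokens/s")
```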
Token generation speed measures the rate at which the model produces responses when running the Llama 3.3 70B model. Arm Neoverse CPUs accelerate machine learning workloads with advanced SIMD instructions, such as Neon and SVE, designed to speed up General Matrix Multiplication (GEMM). To further boost throughput, especially for larger batch sizes, Arm has introduced specialized instructions such as SDOT (Signed Dot Product) and MMLA (Matrix Multiply Accumulate).
As shown in Figure 2, token generation speed increases with larger user batch sizes, while remaining relatively consistent across the different token generation lengths tested. The ability to achieve higher throughput at larger batch sizes is essential for building scalable systems that can serve multiple users effectively.
To evaluate the performance perceived by each user when multiple users interact with the model at the same time, we measured the token generation speed per batch. This metric is critical, as it directly influences the real-time experience of each user interacting with the model.
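The per-user figure is simply the aggregate generation throughput divided by the batch size, which is why larger batches raise total throughput but reduce what each individual user sees. A small illustration follows; the numbers are made up for the example and are not the measured Axion results.

```python
def per_user_speed(aggregate_tokens_per_second: float, batch_size: int) -> float:
    """Tokens/s seen by each user when a batch of requests is served together."""
    return aggregate_tokens_per_second / batch_size

# Illustrative numbers only: if a batch of 4 concurrent users collectively
# receives 20 tokens/s, each user sees about 5 tokens/s, which is in the
# region of average human reading speed.
print(per_user_speed(20.0, 4))
```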
As shown in Figure 3, the token generation speed reached an average human readability level for batch sizes up to 4, indicating that performance remains stable as the system scales to serve multiple users. To support larger numbers of concurrent users, serving frameworks such as vLLM are beneficial, as they optimize KV cache management to enhance scalability, as sketched below.
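A minimal sketch of what batched serving looks like with vLLM in Python is shown below. It is not part of this benchmark: it assumes a vLLM build with CPU support and access to the Llama 3.3 70B weights, and the model identifier, prompts, and sampling settings are illustrative.

```python
from vllm import LLM, SamplingParams

# Model identifier and sampling settings are assumptions for illustration.
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")
params = SamplingParams(max_tokens=256, temperature=0.7)

# Four concurrent requests; vLLM batches them and manages the KV cache.
prompts = [f"User {i}: summarize today's meeting notes." for i in range(4)]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```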
A game-changer for generative AI
The new Llama 3.3 70B model is a potential game-changer for the accessibility and efficiency of large-scale AI. Its smaller size makes generative AI processing accessible to more of the ecosystem, as large computational resources are no longer required. At the same time, the Llama 3.3 70B model enables more efficient AI processing, which is vital for data center and cloud workloads, while delivering performance comparable to the Llama 3.1 405B model on model evaluation benchmarks.
Through our benchmarking work, we have demonstrated how Google Axion processors, powered by Arm Neoverse, provide a smooth and efficient experience when running the Llama 3.3 70B model, delivering text generation at human readability levels across the user batch sizes tested.
We’re proud to continue our close partnership with Meta to enable open-source AI innovation on the Arm compute platform, helping to ensure that Llama LLMs operate seamlessly and efficiently across hardware platforms.
This blog also had contributions from Milos Puzovic, Technical Director, Arm, and Nobel Chowdary Mandepudi, Graduate Software Engineer, Arm.
The Arm Meta partnership
Learn more about how Arm and Meta are unlocking AI technologies together.