Blog

May 22, 2024

Small Language Models: Efficient Arm Computing Enables a Custom AI Future

As AI pivots from the colossal to the compact, small language models (SLMs) offer tailored solutions with reduced costs and increased accessibility

By Ravi Malhotra, Hyperscale Solutions Architect, Arm

Increasingly in the world of AI, small is big.

Large language models (LLMs) have driven the early innovation in generative AI over the past 18 months, but there’s growing evidence that the momentum behind unfettered scaling of LLMs, with parameter counts now pushing into the trillions, is not sustainable. At the very least, the infrastructure costs of pushing this approach further are putting it out of reach for all but a handful of organizations. This class of LLM requires a vast amount of computational power and energy, which translates into high operational costs. Training GPT-4 cost at least $100 million, illustrating how financially and resource-intensive these projects are.

Not to mention, these LLMs are complex to develop and deploy. A study from the University of Cambridge points out that companies might spend more than 90 days deploying a single machine learning model. This long cycle hampers rapid development and iterative experimentation, which are crucial in the fast-evolving field of AI.

These and other challenges are why the development focus is shifting towards small language models (SLMs, sometimes called small LLMs), which promise to be more efficient, require fewer resources, and be easier to customize and control. SLMs like Llama, Mistral, Qwen, Gemma, or Phi-3 are much more efficient at simpler, focused tasks such as conversation, translation, summarization, and categorization than at sophisticated or nuanced content generation and, as such, consume a fraction of the energy for training.

This can encourage developers to build generative AI solutions with multimodal capabilities, which can process and generate content across different forms of media, such as text, images, and audio.

Foundational models like Llama 3 can be further fine-tuned with context-specific data to focus on specific applications like medical sciences, code generation, or subject matter expertise. These focused applications, combined with the accessibility that these smaller LLMs bring, ‘democratize’ generative AI and bring its capabilities to application developers who do not have a farm of GPUs at their disposal, thereby unlocking new applications and use-cases.

And then there are under-the-hood optimization techniques such as quantization, a method to make models more efficient. Quantization reduces the size of the model by storing the neural network’s weights at lower precision. Instead of using 16-bit floating point numbers, quantization can compress these to 4-bit integers, greatly reducing memory and computational needs while only slightly affecting accuracy. For example, using this method, the earlier Llama 2 model at 7 billion parameters can shrink from 13.5 GB to 3.9 GB, the 13 billion parameter version from 26.1 GB to 7.3 GB, and the 70 billion parameter model from 138 GB to 40.7 GB. This technique enhances the speed and reduces the costs of running these lightweight models, especially on CPUs.
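To make the idea concrete, here is a minimal sketch of block-wise symmetric 4-bit quantization in plain Python. It illustrates the general technique only and is not the exact format used by llama.cpp or any other framework. The block size of 32, the 16-bit per-block scale, and all function names are assumptions chosen for this example.

```python
import random

# Illustrative block-wise symmetric 4-bit quantization.
# Assumptions (not from the article): 32 weights per block, one scale per block,
# signed 4-bit codes in the range [-8, 7].
BLOCK_SIZE = 32


def quantize_block(weights):
    """Map a block of floats to 4-bit signed integers plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate float weights from the 4-bit codes."""
    return [q * scale for q in quants]


def bits_per_weight(block_size=BLOCK_SIZE, scale_bits=16):
    """Average storage cost: 4 bits per weight plus the amortized per-block scale."""
    return 4 + scale_bits / block_size


def model_size_gb(n_params, bits):
    """Approximate model size in gigabytes (10**9 bytes)."""
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    random.seed(0)
    block = [random.uniform(-1.0, 1.0) for _ in range(BLOCK_SIZE)]
    scale, quants = quantize_block(block)
    restored = dequantize_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.4f}, worst-case error={worst:.4f}")
    print(f"16-bit size for 7e9 params: {model_size_gb(7e9, 16):.1f} GB")
    print(f"4-bit size for 7e9 params: {model_size_gb(7e9, bits_per_weight()):.1f} GB")
```

The size estimates are rough: real figures such as those quoted above depend on the exact parameter count and on the specific quantization format, which may mix precisions across layers.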

These software advancements – coupled with more efficient and powerful Arm CPU technology – enable these smaller, more efficient language models to run directly on mobile devices, enhancing performance, privacy and the user experience.

Another boon to the rise of SLMs has been the emergence of specialized frameworks like llama.cpp. By focusing on performance optimization for CPU inference, llama.cpp – compared with a general-purpose framework like PyTorch – enables faster and more efficient execution of Llama-based models on commodity hardware. This accessibility opens up new possibilities for widespread deployment without relying on specialized GPU resources, making large language models more accessible to a broader range of users and applications.

How does hardware figure into all this?

The value of efficiency – Arm-style

Arm Neoverse CPUs enhance machine learning processes through advanced SIMD instructions like Neon and SVE, specifically accelerating General Matrix Multiplication (GEMM) – a core algorithm involving complex multiplications within a neural network. Over the past few generations, Arm has been adding instructions like SDOT (Signed Dot Product) and MMLA (Matrix Multiply Accumulate) to its Neon and SVE2 engines, which benefit key ML algorithms. These can increase efficiency in broadly deployed server CPUs like AWS Graviton and NVIDIA Grace, as well as the recently announced Microsoft Cobalt and Google Axion as they come into production.
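As a rough illustration of why a dot-product instruction helps, the sketch below emulates the arithmetic pattern in plain Python: groups of four signed 8-bit products are summed into one 32-bit accumulator lane, and a 128-bit vector holds four such lanes. It models the math only, not the hardware or the llama.cpp kernels, and the function names are hypothetical.

```python
# Pure-Python model of a signed 8-bit dot-product step.
# Each accumulator lane receives the sum of four int8 * int8 products.


def check_int8(values):
    """Raise if any value falls outside the signed 8-bit range."""
    for v in values:
        if not -128 <= v <= 127:
            raise ValueError(f"{v} is outside the int8 range")


def sdot_lanes(acc, a, b):
    """Accumulate 4-element int8 dot products into each lane of acc.

    acc holds one integer per lane; a and b each hold 4 * len(acc) int8 values.
    """
    if len(a) != len(b) or len(a) != 4 * len(acc):
        raise ValueError("a and b must each hold four int8 values per lane")
    check_int8(a)
    check_int8(b)
    return [
        acc[lane] + sum(a[4 * lane + i] * b[4 * lane + i] for i in range(4))
        for lane in range(len(acc))
    ]


def int8_dot(a, b, lanes=4):
    """Dot product of two int8 vectors, consumed one 4 * lanes chunk at a time."""
    step = 4 * lanes
    if len(a) != len(b) or len(a) % step != 0:
        raise ValueError("vector length must be a multiple of 4 * lanes")
    acc = [0] * lanes
    for start in range(0, len(a), step):
        acc = sdot_lanes(acc, a[start:start + step], b[start:start + step])
    return sum(acc)  # final horizontal add across lanes
```

An int8 GEMM reduces to many such dot products, one per output element, which is why folding four multiply-accumulates into each lane per instruction matters for quantized inference.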

A typical LLM pipeline is split into two stages:

First, prompt processing, which prepares the input data for the model and aims to improve responsiveness.
Second, token generation, which creates text one piece at a time, focusing on throughput and scalability.
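The two stages can be sketched with a toy stand-in for a model. `ToyModel`, its `prefill` and `decode_step` methods, and the token arithmetic are all hypothetical; the point is only the call pattern: one pass over the whole prompt, then a loop that produces one token per step, with each stage timed separately.

```python
import time


class ToyModel:
    """Hypothetical stand-in for a language model; it only mimics the call pattern."""

    VOCAB_SIZE = 100

    def prefill(self, prompt_tokens):
        # Prompt processing: the whole prompt is consumed in a single pass.
        return {"context_sum": sum(prompt_tokens), "length": len(prompt_tokens)}

    def decode_step(self, state):
        # Token generation: one new token per call, conditioned on everything so far.
        token = (state["context_sum"] + state["length"]) % self.VOCAB_SIZE
        state["context_sum"] += token
        state["length"] += 1
        return token


def generate(model, prompt_tokens, max_new_tokens):
    """Run both stages and report the time spent in each."""
    start = time.perf_counter()
    state = model.prefill(prompt_tokens)
    prefill_seconds = time.perf_counter() - start

    start = time.perf_counter()
    output = [model.decode_step(state) for _ in range(max_new_tokens)]
    decode_seconds = time.perf_counter() - start

    return {
        "tokens": output,
        "prefill_seconds": prefill_seconds,
        "decode_seconds_per_token": decode_seconds / max(1, max_new_tokens),
    }


if __name__ == "__main__":
    result = generate(ToyModel(), prompt_tokens=[1, 2, 3], max_new_tokens=5)
    print(result["tokens"])
```

In a real deployment, the prefill time governs how long a user waits before output starts, while the per-token decode time governs how quickly the rest of the response streams.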

Depending on the application – whether it’s chatting, style transfer, summarization, or content creation – the balance between prompt size, token generation, and the need for speed or quality shifts accordingly. Interactive chatting prioritizes quick responses, style transfer emphasizes output quality, summarization balances thoroughness with timely delivery, and content generation focuses on producing extensive, high-quality material.

Put simply, the effectiveness of a language model hinges on fine-tuning the input processing and text generation to the needs of the task, be it fast interaction, high-quality writing, efficient summarization, or prolific content creation.

Performance of Llama 3 on AWS Graviton3

To characterize the efficiency of Arm Neoverse CPUs for LLM tasks, Arm software teams and partners optimized the int4 and int8 kernels in llama.cpp to leverage newer instructions in Arm-based server CPUs. They tested the performance impact on an AWS r7g.16xlarge instance with 64 Arm-based Graviton3 cores and 512 GB RAM, using an 8B parameter Llama 3 model with int4 quantization.

Here’s what they found:

Prompt processing: Arm optimizations improved tokens processed per second by up to 3x, with minor gains from larger batch sizes.
Token generation: Arm optimizations helped with larger batch sizes, increasing throughput by up to 2x.
AWS Graviton3 met the emerging industry-consensus 100ms latency target for interactive LLM deployments in both single and batched scenarios. Even the older Graviton2 (2019) can run LLMs up to 8B parameters within the 100ms latency target.
AWS Graviton3 delivered up to 3x better performance compared to current-generation x86 instances for prompt processing and token generation.
Cost-effectiveness: Graviton3 instances are priced lower than Sapphire Rapids and Genoa. Graviton3 provides up to 3x higher tokens/$, making it a compelling choice for cost-effective LLM adoption and scaling.
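The latency and tokens/$ figures above are related by simple arithmetic, sketched below. The numbers in the usage example are hypothetical placeholders, not measurements or prices from the benchmark, and the function names are invented for illustration.

```python
def tokens_per_second(latency_ms_per_token, batch_size=1):
    """Throughput implied by a per-token latency, summed over a batch of sequences."""
    return batch_size * 1000.0 / latency_ms_per_token


def tokens_per_dollar(tokens_per_sec, price_per_hour_usd):
    """Tokens generated for each dollar of instance time."""
    return tokens_per_sec * 3600.0 / price_per_hour_usd


def meets_latency_target(latency_ms_per_token, target_ms=100.0):
    """True if each token arrives within the interactive latency target."""
    return latency_ms_per_token <= target_ms


if __name__ == "__main__":
    # Hypothetical inputs: 50 ms per token, 4 concurrent sequences, $2.00 per hour.
    rate = tokens_per_second(latency_ms_per_token=50.0, batch_size=4)
    print(f"throughput: {rate:.0f} tokens/s")
    print(f"tokens per dollar: {tokens_per_dollar(rate, price_per_hour_usd=2.0):.0f}")
    print(f"within 100 ms target: {meets_latency_target(50.0)}")
```

A 100ms per-token target corresponds to at least 10 tokens per second for each user. Batching raises total throughput, and therefore tokens/$, only as long as the per-token latency stays under that target.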

Read more about the work in this Arm Community blog.

Flexible and affordable

CPU-based cloud instances provide a flexible, cost-effective and quick start for developers looking to deploy smaller, specialized LLMs in their applications. Arm has added multiple key features to our architecture to help improve the performance of LLMs significantly. These enable widely deployed Arm Neoverse-based server processors like AWS Graviton3 both to provide best-in-class LLM performance compared to other server CPUs and to lower the cost barrier to LLM adoption for a much wider set of application developers.

In other words, for less than a quarter of a cent, this blog can be processed in 2 seconds, and a short summary can be generated in less than a second.

Arm has been at the forefront of the movement towards smaller language models, recognizing their potential early and embracing the shift. At the heart of this lies our DNA – CPUs that are renowned for their efficiency and their ability to run AI workloads without compromising quality or performance.

Very large language models aren’t going away anytime soon, especially after the profound impact they’ve had on the technology industry and broader society in just 18 months.

But even OpenAI CEO Sam Altman sees a coming shift. “The era of large models is over, and the focus will now turn to specializing and customizing these models. Real value is unlocked only when these models are tuned on customer and domain specific data,” he said.

SLMs are spreading their wings and finding their place in a world in which customization is increasingly easy – and necessary.

So much so that Clem Delangue, CEO of the AI startup Hugging Face, has suggested that up to 99% of use cases could be addressed using SLMs, and he predicted 2024 will be the year of the SLM.

That’s big.

Any re-use permitted for informational and non-commercial or personal use only.

Editorial Contact

Brian Fuller & Jack Melling

editorial@arm.com
