Small Language Models: Efficient Arm Computing Enables a Custom AI Future
Increasingly in the world of AI, small is big.
Large language models (LLMs) have driven the early innovation in generative AI over the past 18 months, but there’s a growing body of evidence that the momentum behind unfettered scaling of LLMs – now pushing toward trillions of parameters – is not sustainable. Or, at the very least, the infrastructure costs of pushing this approach further put it out of reach for all but a handful of players. This class of LLM requires a vast amount of computational power and energy, which translates into high operational costs. Training GPT-4 reportedly cost at least $100 million, illustrating the financial and resource-heavy nature of these projects.
Not to mention, these LLMs are complex to develop and deploy. A study from the University of Cambridge points out that companies can spend more than 90 days deploying a single machine learning model. This long cycle hampers rapid development and iterative experimentation, which are crucial in the fast-evolving field of AI.
These and other challenges are why the development focus is shifting towards small language models (SLMs, sometimes called small LLMs), which promise to address many of these challenges by being more efficient, requiring fewer resources, and being easier to customize and control. SLMs like Llama, Mistral, Qwen, Gemma, and Phi-3 are much more efficient at simpler, focused tasks – conversation, translation, summarization, and categorization – than at sophisticated or nuanced content generation, and as such they consume a fraction of the energy to train.
This can encourage developers to build generative AI solutions with multimodal capabilities, which can process and generate content across different forms of media, such as text, images, and audio.
Foundation models like Llama 3 can be further fine-tuned with context-specific data to focus on specific applications like medical sciences, code generation, or subject-matter expertise. These focused applications, combined with the accessibility these smaller models bring, ‘democratize’ generative AI and put its capabilities in the hands of application developers who do not have a farm of GPUs at their disposal, thereby unlocking new applications and use cases.
And then there are under-the-hood optimization techniques such as quantization, a method for making models more efficient. Quantization reduces the size of a model by using lower-precision representations for the neural network’s weights. Instead of storing weights as 16-bit floating-point numbers, quantization can compress them to 4-bit integers, greatly reducing memory and computational needs while only slightly affecting accuracy. For example, using this method, the earlier Llama 2 model at 7 billion parameters shrinks from 13.5 GB to 3.9 GB, the 13-billion-parameter version from 26.1 GB to 7.3 GB, and the 70-billion-parameter model from 138 GB to 40.7 GB. This technique improves the speed and reduces the cost of running these lightweight models, especially on CPUs.
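As a rough illustration of that arithmetic, the sketch below (a minimal, hypothetical calculation; real quantized formats such as GGUF Q4 variants store per-block scale metadata alongside the weights, so actual file sizes differ a little from these estimates) shows how the footprint scales with parameter count and bits per weight:

```python
# Approximate model memory footprint at different weight precisions.
# Illustrative only: real quantized formats add per-block scale metadata,
# so the actual sizes quoted in the text differ slightly from these numbers.

def approx_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough size in gigabytes: parameters * bits per weight / 8 bits per byte."""
    return num_params * bits_per_weight / 8 / 1e9

for billions in (7, 13, 70):
    fp16 = approx_size_gb(billions * 1e9, 16)
    int4 = approx_size_gb(billions * 1e9, 4)
    print(f"{billions}B parameters: ~{fp16:.1f} GB at fp16 -> ~{int4:.1f} GB at int4")
```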
These software advancements – coupled with more efficient and powerful Arm CPU technology – enable these smaller, more efficient language models to run directly on mobile devices, enhancing performance, privacy and the user experience.
Another boon to the rise of SLMs has been the emergence of specialized frameworks like llama.cpp. By focusing on performance optimization for CPU inference, llama.cpp – compared with a general-purpose framework like PyTorch – enables faster and more efficient execution of Llama-based models on commodity hardware. This opens up new possibilities for widespread deployment without relying on specialized GPU resources, putting capable language models within reach of a broader range of users and applications.
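As a hedged sketch of what this looks like in practice, the example below uses the llama-cpp-python bindings to load an int4-quantized GGUF model and run a short completion on CPU only. The model filename, context size, and thread count are illustrative placeholders, not settings taken from the benchmarks discussed later:

```python
# Minimal CPU-only inference with llama.cpp via the llama-cpp-python bindings.
# The model path below is a placeholder: any int4-quantized GGUF file works.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-8b-instruct.Q4_0.gguf",  # hypothetical local file
    n_ctx=2048,     # context window in tokens
    n_threads=8,    # CPU threads to use for inference
)

output = llm(
    "Summarize the benefits of small language models in one sentence:",
    max_tokens=64,
)
print(output["choices"][0]["text"])
```

Because everything runs on the CPU, the same few lines work unchanged on a laptop, an edge device, or an Arm-based cloud instance.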
How does hardware figure into all this?
The value of efficiency – Arm-style
Arm Neoverse CPUs accelerate machine learning workloads through advanced SIMD instructions in the Neon and SVE engines, specifically speeding up General Matrix Multiplication (GEMM), the matrix-multiply operation at the heart of neural network inference. Over the past few generations, Arm has added instructions such as SDOT (Signed Dot Product) and MMLA (Matrix Multiply Accumulate) to its Neon and SVE2 engines that benefit key ML algorithms. These increase efficiency in broadly deployed server CPUs like AWS Graviton and NVIDIA Grace, as well as the recently announced Microsoft Cobalt and Google Axion as they come into production.
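To convey the idea behind these dot-product instructions (without reproducing the actual Neon or SVE intrinsics), the NumPy sketch below performs the same numerical pattern they accelerate: multiplying int8 quantized values while accumulating into wider int32 results:

```python
# Conceptual sketch of the int8-multiply / int32-accumulate pattern that
# instructions like SDOT and MMLA accelerate in hardware. This is plain
# NumPy, not the Arm intrinsics themselves.
import numpy as np

rng = np.random.default_rng(0)

# Quantized activations and weights stored as signed 8-bit integers.
activations = rng.integers(-128, 128, size=(4, 64), dtype=np.int8)
weights = rng.integers(-128, 128, size=(64, 16), dtype=np.int8)

# Widen to int32 before multiplying so products accumulate without overflow,
# mirroring the wide accumulators used by the hardware dot-product units.
acc = activations.astype(np.int32) @ weights.astype(np.int32)

print(acc.shape, acc.dtype)  # (4, 16) int32
```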
A typical LLM pipeline is split into two stages:
- First, prompt processing, which evaluates the input prompt to set up the model’s context; here responsiveness matters most
- Second, token generation, which produces output text one token at a time, with the focus on throughput and scalability.
Depending on the application – whether it’s chatting, style transfer, summarization, or content creation – the balance between prompt size, token generation, and the need for speed or quality shifts accordingly. Interactive chatting prioritizes quick responses, style transfer emphasizes output quality, summarization balances thoroughness with timely delivery, and content generation focuses on producing extensive, high-quality material.
Put simply, the effectiveness of a language model hinges on fine-tuning the input processing and text generation to the needs of the task, be it fast interaction, high-quality writing, efficient summarization, or prolific content creation.
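One rough way to see this two-stage split in practice is to time each stage separately. The sketch below assumes the same hypothetical llama-cpp-python setup as earlier; requesting a single token makes the call dominated by prompt processing, while a long completion is dominated by token generation:

```python
# Rough timing of the two LLM pipeline stages with llama-cpp-python.
# The model path and settings are placeholders; the split is approximate.
import time
from llama_cpp import Llama

llm = Llama(model_path="llama-3-8b-instruct.Q4_0.gguf", n_ctx=2048, n_threads=8)
prompt = "Write a short note on why CPUs are attractive for small LLM inference."

# Stage 1: with only one output token, prompt processing dominates the runtime.
start = time.perf_counter()
first = llm(prompt, max_tokens=1)
prompt_secs = time.perf_counter() - start
print(f"prompt ({first['usage']['prompt_tokens']} tokens) processed in ~{prompt_secs:.2f}s")

# Stage 2: with a long completion, token generation dominates the runtime.
start = time.perf_counter()
out = llm(prompt, max_tokens=128)
gen_secs = time.perf_counter() - start
gen_tokens = out["usage"]["completion_tokens"]
print(f"generated {gen_tokens} tokens in ~{gen_secs:.2f}s "
      f"(~{gen_tokens / gen_secs:.1f} tokens/s)")
```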
Performance of Llama 3 on AWS Graviton3
To characterize the efficiency of Arm Neoverse CPUs for LLM tasks, Arm software teams and partners optimized the int4 and int8 kernels in llama.cpp to leverage newer instructions in Arm-based server CPUs. They measured the performance impact on an AWS r7g.16xlarge instance with 64 Arm-based Graviton3 cores and 512 GB of RAM, using the Llama 3 8B model with int4 quantization.
Here’s what they found:
- Prompt processing: Arm optimizations improved tokens processed per second by up to 3x, with minor gains from larger batch sizes.
- Token generation: Arm optimizations helped with larger batch sizes, increasing throughput by up to 2x.
- AWS Graviton3 met the emerging industry-consensus 100ms latency target for interactive LLM deployments in both single and batched scenarios. Even the older Graviton2 (2019) can run LLMs up to 8B parameters within the 100ms latency target.
- AWS Graviton3 delivered up to 3x better performance compared to current-generation x86 instances for prompt processing and token generation.
- Cost-effectiveness: Graviton3 instances are priced lower than comparable Sapphire Rapids- and Genoa-based instances, and Graviton3 provides up to 3x higher tokens/$, making it a compelling choice for cost-effective LLM adoption and scaling (a worked example of this arithmetic follows the list).
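To make the latency and cost framing concrete, here is a small back-of-the-envelope sketch. The throughput and hourly price values are illustrative placeholders, not the measured results reported above:

```python
# Convert a measured generation throughput into per-token latency and
# tokens per dollar. The numbers below are illustrative placeholders.

tokens_per_second = 25.0       # hypothetical single-batch generation rate
instance_price_per_hour = 3.0  # hypothetical on-demand hourly price in USD

latency_ms_per_token = 1000.0 / tokens_per_second
tokens_per_dollar = tokens_per_second * 3600 / instance_price_per_hour

print(f"~{latency_ms_per_token:.0f} ms/token "
      f"({'meets' if latency_ms_per_token <= 100 else 'misses'} the 100 ms target)")
print(f"~{tokens_per_dollar:,.0f} tokens per dollar")
```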
Read more about the work in this Arm Community blog.
Flexible and affordable
CPU-based cloud instances provide a flexible, cost-effective and quick start for developers looking to deploy smaller, specialized LLMs in their applications. Arm has added multiple key features to the architecture that significantly improve LLM performance. These enable widely deployed Arm Neoverse-based server processors like AWS Graviton3 to deliver best-in-class LLM performance among server CPUs while lowering the cost barrier to LLM adoption for a much wider set of application developers.
In other words, for less than a quarter of a cent, this blog can be processed in 2 seconds and a short summary generated in less than a second.
Arm has been at the forefront of the movement towards smaller language models, recognizing their potential and readiness to embrace this shift. At the heart of this lies our DNA – CPUs that are renowned for their efficiency and remarkable ability to run AI workloads seamlessly without compromising quality or performance.
Very large language models aren’t going away anytime soon, especially after the profound impact they’ve had on the technology industry and broader society in just 18 months.
But even OpenAI CEO Sam Altman sees a coming shift. “The era of large models is over, and the focus will now turn to specializing and customizing these models. Real value is unlocked only when these models are tuned on customer and domain specific data,” he said.
SLMs are spreading their wings and finding their place in a world in which customization is increasingly easy – and necessary.
So much so that Clem Delangue, CEO of the AI startup Hugging Face, has suggested that up to 99% of use cases could be addressed using SLMs, and has predicted that 2024 will be the year of the SLM.
That’s big.