Scalable Matrix Extension (SME) for Armv9 Architecture Enables AI Innovation on the Arm CPU
Arm is the technology foundation that runs artificial intelligence (AI) everywhere across every technology touchpoint. This is made possible by our industry-leading architecture that enables a wide range of computational workloads across the billions of diverse devices worldwide.
Arm has a relentless focus on fast-paced architectural evolution that prepares our industry-leading ecosystem for future technology trends and ever-changing compute requirements. While the astronomical rise of AI may feel like a recent phenomenon, Arm has been laying the groundwork for AI innovation for the past two decades: from the Armv7 architecture, which introduced the Advanced Single Instruction Multiple Data (SIMD) Extension as a first foray into machine learning (ML) workloads, to today’s Armv9 architecture, which incorporates features that accelerate and protect advanced generative AI workloads, such as large language models (LLMs), on the Arm CPU.
The Scalable Matrix Extension, known as SME, is an innovative feature designed to meet the needs of today’s AI and ML workloads, which continue to grow in complexity and power requirements. In addition to accelerating today’s AI, SME gives the Arm architecture the flexibility to manage ever-evolving generative AI workloads.
What is Arm SME?
SME is an Instruction Set Architecture (ISA) extension introduced in the Armv9-A architecture. It accelerates AI and ML workloads, delivering improved performance, power efficiency, and flexibility for AI and ML-based applications running on the Arm CPU. This is achieved through the following features:
- Enabling a significant increase in matrix and vector processing throughput and efficiency on the Arm CPU;
- Maximizing the reuse of data loaded into registers by introducing outer-product instructions that reduce memory bandwidth pressure (see the sketch after this list);
- Expanding compressed user data, so that input element throughput is increased without increasing the memory load bandwidth;
- Supporting a wide range of storage and compute data types, making it a flexible solution for many current and future use cases; and
- Permitting an implementation to select a Streaming Vector Length (SVL) between 128 and 2048 bits, so that matrix-matrix multiply throughput scales with SVL².
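To make the outer-product idea above concrete, the sketch below expresses a small matrix multiply as a sum of rank-1 (outer-product) updates in plain C. It is illustrative only, not actual SME code: the TILE size is a hypothetical stand-in for the number of elements in one SVL, and the accumulator tile plays the role of the ZA array. Each step loads one column of A and one row of B once, then reuses them for TILE × TILE multiply-accumulates, which is why this formulation reduces memory bandwidth pressure.

```c
#include <stddef.h>

#define TILE 4  /* hypothetical: one 128-bit SVL holds 4 float elements */

/* C (TILE x TILE) += A (TILE x K) * B (K x TILE), computed as K rank-1
 * (outer-product) updates. The accumulator tile C is analogous to SME's
 * ZA array; real SME performs each outer-product step in hardware. */
void matmul_outer_product(size_t K,
                          float C[TILE][TILE],
                          const float A[TILE][K],
                          const float B[K][TILE])
{
    for (size_t k = 0; k < K; k++) {
        /* One outer-product step: column k of A times row k of B. */
        for (size_t i = 0; i < TILE; i++) {
            for (size_t j = 0; j < TILE; j++) {
                C[i][j] += A[i][k] * B[k][j];  /* accumulate into the tile */
            }
        }
    }
}
```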
SME2 builds on SME by adding multi-vector instructions that reuse the SME architectural state (the ZA array) for both matrix and vector operations, along with higher-throughput vector processing capabilities. This balances vector and matrix acceleration, while compressed AI data formats reduce memory bandwidth and save power. SME2 also enables flexible on-the-fly dequantization, including the ability to decompress 2-bit and 4-bit weights to save memory bandwidth. These capabilities are important for generative AI workloads that are increasingly complex and power-hungry, and they support Arm’s wider commitment to tackling AI’s insatiable energy needs.
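As a rough illustration of what on-the-fly dequantization of low-precision weights involves, the sketch below unpacks 4-bit quantized weights (two per byte) and converts them back to float using a scale and zero point. This is a plain C model of the data transformation, not the SME2 instruction sequence, and the scale/zero-point convention shown is one common scheme rather than a requirement.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: expand 4-bit quantized weights, packed two per byte,
 * into float values using real_value = scale * (q - zero_point). Keeping
 * weights packed like this is what saves memory bandwidth; SME2 allows the
 * expansion to happen on the fly rather than in a separate pass. */
void dequantize_int4(const uint8_t *packed, size_t n_weights,
                     float scale, int zero_point, float *out)
{
    for (size_t i = 0; i < n_weights; i++) {
        uint8_t byte = packed[i / 2];
        /* Low nibble holds even-indexed weights, high nibble odd-indexed. */
        int q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        out[i] = scale * (float)(q - zero_point);
    }
}
```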
What are the key use cases of SME and SME2?
SME accelerates various types of AI and ML workloads, such as generative AI and classical ML networks, as well as computer vision (CV). This is achieved through SME’s ability to handle matrix-by-matrix, matrix-by-vector, and multiple-vector-by-vector operations, as well as the pre- and post-processing stages required for ML execution. We anticipate that SME will benefit a variety of AI use cases across different markets, including:
- Applications that combine ML and classical CV/DSP approaches, for example: cinematic photography, media processing, driver monitoring, digital cockpit, audio processing, advanced driver assistance system (ADAS) L2+, and real-time voice assistants.
- Use cases that make use of small language models and LLMs, for example: chatbots, conversation summarization, and virtual assistants.
Explaining vector processing, matrix processing, and quantization
To understand how SME operates, it is important to explain the different AI-based processing techniques that it enables and the benefits that SME and the Armv9 architecture provide for each technique. These include:
- Vector processing;
- Matrix processing;
- Matrix multiplications; and
- Quantization.
What is vector processing?
In the context of AI and ML, vectors represent one-dimensional arrays of values and data points that typically encode features, inputs, or weights in neural networks. Vector processing is commonly used in modern AI frameworks and libraries like TensorFlow and PyTorch. By leveraging this approach, AI algorithms can efficiently handle complex computations and process large datasets more quickly, leading to faster training times and improved performance. SME includes vector instructions that perform calculations on multiple values in parallel, instead of processing each value sequentially, to significantly speed up many aspects of AI computation.
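As a minimal sketch of vector processing in practice, the loop below uses Arm’s SVE ACLE intrinsics (from arm_sve.h) to scale and accumulate an array in vector-length-agnostic fashion: each pass handles however many 32-bit lanes the hardware vector provides, with a predicate covering the final partial vector. It requires an SVE-capable compiler and target, and it illustrates vector processing generally rather than SME streaming mode specifically.

```c
#include <arm_sve.h>
#include <stdint.h>

/* y[i] += x[i] * scale for n elements, written as a vector-length-agnostic
 * loop. svcntw() reports how many 32-bit lanes one vector holds, and the
 * predicate pg keeps the tail iteration from touching out-of-range data. */
void scaled_accumulate(float *y, const float *x, float scale, uint64_t n)
{
    for (uint64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_u64(i, n);  /* active lanes this pass */
        svfloat32_t vx = svld1_f32(pg, &x[i]);  /* load several x values  */
        svfloat32_t vy = svld1_f32(pg, &y[i]);  /* load several y values  */
        vy = svmla_n_f32_x(pg, vy, vx, scale);  /* vy += vx * scale       */
        svst1_f32(pg, &y[i], vy);               /* store the results back */
    }
}
```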
What is matrix processing?
Matrices, which are two-dimensional arrays of values and data points, play a crucial role in various AI techniques, including ML and deep learning. Matrix processing through SME involves performing operations on these matrices to improve the performance and efficiency of core AI-based workloads including linear algebra operations, like matrix multiplication, and neural networks.
Where are matrix multiplications present? And what do they improve?
Matrix multiplications are an important part of AI and ML-based workloads, as well as other computing workloads, such as scientific simulations and CV. The matrix-matrix multiply operation is becoming increasingly important for AI acceleration on CPUs and benefits significantly from SME. The Arm architecture has evolved, gaining features that improve the performance and efficiency of these operations. For example:
- Armv7 added the Advanced SIMD Extension, which is also known as the Arm NEON™ instructions.
- Armv8.4-A includes support for 8-bit integer DOT product instructions (see the sketch after this list).
- Armv8.6-A includes support for in-vector integer and floating-point matrix-multiply instructions for various data types, as well as the new BFloat16 data type.
- Armv9.0-A includes the Scalable Vector Extension 2 (SVE2) for digital signal processing (DSP), media, and general-purpose vectorization.
- Armv9.2-A introduces SME.
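For a sense of what the 8-bit DOT product instructions mentioned above contribute, the reference function below models the per-lane behavior of a signed dot-product instruction such as SDOT: four int8 × int8 products are summed and accumulated into one int32 result. A real SIMD register computes many such lanes in parallel; this scalar model is only for illustration.

```c
#include <stdint.h>

/* Per-lane reference model of a signed 8-bit dot-product instruction:
 * four int8 x int8 products accumulate into a single int32 lane. */
int32_t sdot_lane(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    for (int i = 0; i < 4; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```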
What is quantization?
Quantization involves reducing the precision of numerical values, typically from floating-point representation to fixed-point representation. The process is used with SME to make AI and ML models more efficient by reducing their memory bandwidth, memory footprint, and computational complexity, which is important for compute-intensive generative AI workloads. This means they can be deployed on resource-constrained devices, such as smartphones, mobile devices, embedded systems, and IoT devices.
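As a brief sketch of the idea, the functions below quantize a float32 value to int8 with a scale and zero point, and map it back. This affine scheme is one common convention used by ML frameworks, shown here only to make the precision-reduction step concrete.

```c
#include <stdint.h>
#include <math.h>

/* Affine quantization: real_value ~= scale * (q - zero_point). */
int8_t quantize_value(float x, float scale, int zero_point)
{
    long q = lroundf(x / scale) + zero_point;
    if (q < -128) q = -128;  /* clamp to the int8 range */
    if (q > 127)  q = 127;
    return (int8_t)q;
}

float dequantize_value(int8_t q, float scale, int zero_point)
{
    return scale * (float)(q - zero_point);
}
```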
How long has Arm been adding AI-based features to the architecture?
Arm has been working on adding AI-based features, specifications, and instructions to our architecture for the past two decades. The Armv7 architecture, which was first released in 2003, added the Advanced SIMD Extension, which is also known as the Arm NEON™ instructions. NEON™ treats registers as one-dimensional vectors of elements of the same data type, with instructions operating on multiple elements simultaneously. The Armv8 architecture then added a range of AI-based specifications and instructions, including dot product instructions, in-vector matrix-multiply instructions, and BFloat16 support. It also improved the Advanced SIMD Extension by doubling the number of vector registers and adding floating-point support. All these improvements and additions were designed to accelerate AI and ML performance in response to evolving AI workloads. The Armv9 architecture incorporates all these features, specifications, and instructions, alongside SVE2, SME, and the new SME2.
What are the core benefits of SME?
SME on the Armv9 architecture significantly improves the processing of existing AI and ML workloads on the Arm CPU, leading to faster, more responsive user experiences across various AI-powered devices and applications. It also accelerates a range of applications that use matrix arithmetic, such as digital signal processing, scientific computing, augmented reality (AR), virtual reality (VR), and imaging, to name a few, in all of which AI and ML play an increasingly important role.
Just as Arm CPUs can run a wide variety of neural networks in many different data formats, SME provides the flexibility to respond to evolving AI and ML workloads and requirements that are growing in complexity. This ensures that the Arm architecture will remain relevant for the most important compute workloads in the fast-moving age of AI and beyond. Looking ahead, we will continue to add more AI capabilities to the instruction set for the benefit of Arm’s industry-leading ecosystem, so our partners can deliver improved performance, innovative features, and scalability for their AI-based solutions.
AI-based architecture innovation on Arm
SME exemplifies Arm’s continuous architectural innovation. As AI continues to evolve and grow, SME will ensure that new power-hungry generative AI workloads are processed efficiently on Arm CPUs, leading to better AI-based experiences across the billions of Arm-powered devices. This will ensure that the world’s AI continues to be built on Arm.
Learn more about SME in this Arm Community blog, and learn how to apply SME in this Programmer’s Guide.