Choosing the Right Arm Edge AI Solution for Your AI Application

With the launch of the Cortex-A320, Arm’s smallest implementation of the Armv9-A architecture, developers now have even more options for processing IoT edge AI workloads. But with so many choices, how do you determine the right processor for your specific AI application? As a system developer, you can navigate that decision by comparing Cortex-A, Cortex-M, and Ethos-U NPU-based devices, along with their potential combinations. Beyond cost, discover how each processor impacts AI functionality and which software development flows are available to streamline your project.
Efficiency of AI compute in embedded devices
The efficiency of AI computation in embedded devices has grown by leaps and bounds in recent years. Improvements in Arm’s M- and A-profile architectures deliver multi-fold increases in machine learning (ML) inferences per unit of energy consumed. On the M-profile side, the Cortex-M52, Cortex-M55, and Cortex-M85 CPUs (all based on the Armv8.1-M architecture) integrate the programmable Helium vector extension, unlocking new AI-enabled use cases on microcontroller-class devices. Cortex-A processors based on Armv9, such as the recently announced Cortex-A320, in turn boost AI performance over earlier generations thanks to the Scalable Vector Extension 2 (SVE2). The Ethos-U family of neural processing units (NPUs) has improved processing efficiency, especially with transformer networks, in its latest generation, the Ethos-U85.
How to choose?
Each architecture offers advantages on different fronts. When considering which hardware is best suited, raw performance should be weighed against design flexibility. Additionally, the software development flow, including CI/CD requirements, needs to be considered.
Performance
Meeting the required AI processing performance is, of course, mandatory.
By nature, Cortex-A CPUs are general-purpose programmable processors that can target a wide range of end uses. Integrated into Cortex-A is the Neon/SVE2 vector engine, designed to accelerate neural networks and any vectorized code, and the natively supported data types are numerous. Cortex-M processors with the Helium vector engine present the same characteristics, though tuned for more cost- and energy-constrained targets. By contrast, Ethos-U NPUs (up to and including the Ethos-U85) are purpose-built to process neural network operators, specifically with quantized 8-bit integer weights. They are very efficient at this task, for network operators that can be mapped to the hardware present in those NPUs.
The latest generation of Cortex-A CPUs based on the Armv9 architecture supports a broad set of data types, including BF16. In addition, new matrix-multiply instructions have been introduced that significantly increase performance on neural network processing. A good explanation of how matrix multiply is implemented with SVE2 can be found in this blog.
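To build intuition for what these matrix-multiply instructions compute, the following NumPy sketch models the int8 semantics conceptually: two 2x8 blocks of int8 operands are multiplied and accumulated into a 2x2 block of int32 results, mirroring how instructions such as SMMLA widen int8 products into int32 accumulators. This is an illustrative model of the arithmetic, not the actual intrinsic-level implementation.

```python
import numpy as np

# Conceptual model of an int8 matrix-multiply-accumulate instruction
# (e.g. SMMLA): two 2x8 int8 operand blocks produce a 2x2 int32
# result, accumulated on top of the existing accumulator block.
def int8_mmla(acc: np.ndarray, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    assert a.shape == (2, 8) and b.shape == (2, 8) and acc.shape == (2, 2)
    # Widen to int32 before multiplying so products do not overflow int8.
    return acc + a.astype(np.int32) @ b.astype(np.int32).T

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(2, 8), dtype=np.int8)
b = rng.integers(-128, 128, size=(2, 8), dtype=np.int8)
acc = np.zeros((2, 2), dtype=np.int32)

print(int8_mmla(acc, a, b))  # 2x2 int32 partial result of a larger GEMM
```

A full matrix multiplication is tiled into many such block operations, which is why these instructions translate directly into higher neural network throughput.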
The Cortex-M55 was the first Cortex-M processor to integrate Helium vector technology, followed later by the Cortex-M85. Both processors implement the dual-beat Helium configuration, which delivers up to eight 8-bit integer multiply-accumulate (MAC) operations per clock cycle: a 128-bit Helium vector holds sixteen int8 lanes, and a dual-beat implementation executes each vector instruction over two clock cycles. Helium also natively supports other data types, such as FP16 and FP32.
Finally, Ethos-U NPUs deliver very efficient neural network (NN) processing, but only on models with quantized data types: specifically, int8 weights and int8 or int16 activation data. This design choice enhances the NPU’s execution efficiency, but restricts its usage to these data types.
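In practice, this means a model must be fully quantized before it can target an Ethos-U NPU. The sketch below shows one common route, post-training int8 quantization with the TensorFlow Lite (LiteRT) converter; the toy Keras model and random representative dataset are placeholders you would replace with your own trained model and real calibration samples.

```python
import numpy as np
import tensorflow as tf

# Placeholder model; substitute your own trained Keras model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(96, 96, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

def representative_dataset():
    # A few hundred real input samples are typical; random data here
    # just keeps the example self-contained.
    for _ in range(100):
        yield [np.random.rand(1, 96, 96, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer quantization: int8 weights and activations,
# as required for Ethos-U offload.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```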
One way to assess a processor’s performance for real-world AI workloads is by analyzing its theoretical MAC execution capability per data type and per clock cycle. Since neural network processing uses large datasets, memory access performance is another crucial factor. However, in this instance, we will focus specifically on processor-bound performance rather than memory-bound performance.
Neural network processing rate is often limited by the MAC operation rate of the underlying hardware. While the actual network processing performance varies based on the network structure, the theoretical MAC processing rate shown below provides one indicator of the hardware’s capabilities.
| MACs/core/clock cycle | Int8 | Int16 | Int32 | BF16 | FP16 | FP32 |
| --- | --- | --- | --- | --- | --- | --- |
| Cortex-M55 & Cortex-M85 | 8 | 4 | 2 | N/A | 4 | 2 |
| Ethos-U85 (128 MACs) | 128 | 64 | N/A | N/A | N/A | N/A |
| Ethos-U85 (2048 MACs) | 2048 | 1024 | N/A | N/A | N/A | N/A |
| Cortex-A320 | 32 | 8 | 4 | 8 | 8 | 4 |
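As a worked example of reading this table, the snippet below converts int8 MACs per clock cycle into peak theoretical throughput. The 1 GHz clock is an assumed round number for illustration, not a product specification, and each MAC is counted as two operations (multiply plus add) when quoting TOPS.

```python
# Peak theoretical int8 throughput = MACs/cycle x clock frequency.
# The 1 GHz clock below is an illustrative assumption, not a spec.
CLOCK_HZ = 1_000_000_000

int8_macs_per_cycle = {
    "Cortex-M55/M85": 8,
    "Cortex-A320": 32,
    "Ethos-U85 (128 MACs)": 128,
    "Ethos-U85 (2048 MACs)": 2048,
}

for name, macs in int8_macs_per_cycle.items():
    macs_per_s = macs * CLOCK_HZ
    tops = macs_per_s * 2 / 1e12  # 2 ops (multiply + add) per MAC
    print(f"{name}: {macs_per_s / 1e9:.0f} GMAC/s, ~{tops:.2f} TOPS")
```

Real workloads land below these peaks once memory bandwidth, operator coverage, and scheduling overheads come into play, which is why the theoretical rate is only one indicator.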
Software
Another aspect to consider is the software support for each hardware solution. Arm offers a comprehensive set of open-source runtime support software for all AI hardware solutions: Cortex-A, Cortex-M, and Ethos-U. Arm supports hardware acceleration for various ML frameworks and runtimes, including PyTorch, ExecuTorch, llama.cpp, TensorFlow, and LiteRT via XNNPACK. Any ML framework can be optimized to leverage Arm AI features, with runtimes executing on Arm processors and utilizing software acceleration libraries such as CMSIS-NN for Cortex-M/Helium, and Arm Compute Library or KleidiAI for int8 and BF16 on Neon/SVE2. For Ethos-U, the Vela compiler is an offline tool that optimizes a quantized model for efficient deployment, further refining the deployed binary for maximum hardware performance.
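To make the Ethos-U flow concrete, here is a minimal sketch that invokes Vela on a fully quantized model, wrapped in Python to keep a single language across the examples in this post. The accelerator configuration string, file names, and output directory are assumptions to adapt to your project; consult "vela --help" for the options your installed version supports.

```python
import subprocess

# Hedged sketch: run the Vela compiler on a fully quantized TFLite model.
# The accelerator configuration and paths are assumptions, not fixed values.
subprocess.run(
    [
        "vela",
        "model_int8.tflite",                       # e.g. from the earlier quantization sketch
        "--accelerator-config", "ethos-u85-256",   # assumed target NPU configuration
        "--output-dir", "vela_out",
    ],
    check=True,
)
# The optimized model in vela_out/ embeds the Ethos-U command stream;
# operators Vela cannot map to the NPU fall back to the host CPU.
```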
When should you use Ethos-U?
Some edge AI use cases with well-defined AI workloads can benefit from offloading NN processing to a dedicated NPU, thereby freeing the host processor from such compute-intensive tasks. As discussed, the Ethos-U NPU is very efficient at processing neural networks with quantized 8-bit integer weights, and transformer networks are especially well suited to the Ethos-U85. However, the Ethos-U85 NPU must be driven by a host processor, which can be either a Cortex-M or a Cortex-A.
Various host processor and Ethos-U configurations are possible. Ethos-U can be driven by a Helium-enabled Cortex-M processor, such as the Cortex-M55, and some examples of this system-on-chip configuration are available on the market today. Recently, running generative AI workloads on small language models (SLMs) has been gaining interest in the industry; the Ethos-U combined with a Helium-enabled Cortex-M is well suited to such use-cases.
There are also system-on-chips based on Cortex-A processors that integrate an ML island consisting of a Cortex-M with an Ethos-U NPU. Typically, these SoCs are geared to run rich operating systems, such as Linux, and support a larger and more flexible memory system. Cortex-M CPUs have a 32-bit addressable memory space with direct memory address mapping, while recent Cortex-A processors, such as the Cortex-A320, have a 40-bit addressable memory space that also benefits from virtual memory addressing through a memory management unit (MMU).
As large language model (LLM) execution gravitates to edge AI devices, a larger and more flexible memory system eases the execution of models with larger parameter counts, for example LLMs beyond 1 billion parameters. With growing interest in SLMs, Cortex-M with Ethos-U85 is a great fit: Cortex-M processors have 4 GB of addressing space, with some reserved for system functions. As LLM model sizes grow, however, Cortex-A systems, with their larger and more flexible memory, may become essential.
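A quick back-of-the-envelope calculation illustrates why: weight storage alone for a 1-billion-parameter model is roughly 1 GB at int8 (2 GB at FP16), before accounting for activations, the KV cache, and the system’s own reservations within a 4 GB address space. The parameter counts below are illustrative assumptions, not measurements of specific models.

```python
# Back-of-the-envelope weight footprint for SLM/LLM deployment.
# Parameter counts and data types are illustrative assumptions.
GIB = 1024**3

models = {
    "SLM, 250M params": 250e6,
    "LLM, 1B params": 1e9,
    "LLM, 3B params": 3e9,
}
bytes_per_param = {"int8": 1, "fp16": 2}

for name, params in models.items():
    for dtype, nbytes in bytes_per_param.items():
        print(f"{name} @ {dtype}: {params * nbytes / GIB:.2f} GiB of weights")
# A 32-bit (4 GiB) address space gets tight well before 3B int8 params,
# motivating 40-bit addressing and an MMU on Cortex-A based systems.
```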
Recently, we announced another configuration called ‘direct drive’, in which the Cortex-A processor drives the Ethos-U NPU directly. This configuration removes the need for a dedicated Cortex-M ‘driver’ processor; a Linux driver for the Ethos-U85 runs on the host Cortex-A instead.

Meeting Generative AI demands with Cortex-A320
Edge AI system developers now have more options to optimize the last-mile AI in IoT. Whether choosing a Cortex-M, Cortex-A, or an Ethos-U-accelerated system, each serves different needs. With the Cortex-A320 processor’s ability to directly drive Ethos-U85, designers gain even more flexibility. As Arm’s smallest and most efficient Armv9-A Cortex-A processor, Cortex-A320 enhances edge AI efficiency while adapting to the evolving demands of generative AI on embedded systems.
Learn how the future of IoT is being shaped with transformative edge AI solutions from Arm.