Blog

November 13, 2024

Arm Ethos-U85 NPU: Unlocking Generative AI at the Edge with Small Language Models

Arm Ethos-U85 small language model demo advances performance, power efficiency, and generative AI at the edge.

By Arm Editorial Team

As artificial intelligence evolves, there is increasing excitement about executing AI workloads on embedded devices using small language models (SLM).

Arm’s recent demo showcases endpoint AI’s potential for IoT and edge computing. It was inspired by Microsoft’s “Tiny Stories” paper and Andrej Karpathy’s TinyLlama2 project, in which a small language model trained on 21 million stories generates text. In the demo, a user inputs a sentence, and the system generates an extended children’s story based on it.

Our demo featured Arm’s Ethos-U85 NPU (Neural Processing Unit) running a small language model on embedded hardware. While large language models (LLMs) are more widely known, there is growing interest in small language models because they deliver solid performance with significantly fewer resources and are easier and cheaper to train.

Implementing a Transformer-based Small Language Model on Embedded Hardware

Our demo showcased the Arm Ethos-U85 as a small, low-power platform capable of running generative AI, highlighting that small language models can perform well within narrow domains. Although TinyLlama2 models are simpler than the larger models from companies like Meta, their small size makes them a great fit for endpoint AI workloads and ideal for showcasing the Ethos-U85’s AI capabilities.

Developing the demo involved significant modeling effort, including the creation of a fully integer int8 (and int8x16) TinyLlama2 model, which was converted to a fixed-shape TensorFlow Lite format suitable for the Ethos-U85’s constraints.

Our quantization approach has shown that fully integer language models can run efficiently while maintaining strong accuracy and output quality. By quantizing activation functions, normalization functions, and matrix multiplications, we eliminated the need for floating-point computations, which are more costly in terms of silicon area and energy, both key concerns for constrained embedded devices.
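To make the idea of an integer-only matrix multiplication concrete, here is a minimal Python sketch. It assumes symmetric per-tensor int8 quantization, which is one common scheme and not necessarily the one used in the demo; the article does not describe the scheme in detail. All function names are illustrative.

```python
def choose_scale(values):
    """Pick a scale so the largest magnitude maps to 127 (assumed scheme)."""
    peak = max(abs(v) for v in values)
    return peak / 127 if peak else 1.0


def quantize(values, scale):
    """Map floats to the int8 range [-128, 127] with rounding and clamping."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(q_a, q_b):
    """Integer-only dot product; Python ints stand in for an int32 accumulator."""
    return sum(a * b for a, b in zip(q_a, q_b))


def quantized_matvec(weights, activations):
    """Multiply a float matrix by a float vector using int8 arithmetic."""
    w_scale = choose_scale([w for row in weights for w in row])
    a_scale = choose_scale(activations)
    q_act = quantize(activations, a_scale)
    out = []
    for row in weights:
        acc = int8_dot(quantize(row, w_scale), q_act)  # integers only
        # Rescaled in floating point here purely to compare with the reference.
        # A fully integer pipeline would requantize with a fixed-point multiplier.
        out.append(acc * w_scale * a_scale)
    return out


weights = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
activations = [2.0, -1.0, 0.5]
approx = quantized_matvec(weights, activations)
exact = [sum(w * a for w, a in zip(row, activations)) for row in weights]
```

Comparing `approx` with `exact` shows the small rounding error that quantization introduces. Keeping that error low across a whole transformer is the hard part of the modeling work.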

The Ethos-U85 ran a language model on an FPGA platform at only 32 MHz, achieving text generation speeds of 7.5 to 8 tokens per second, which matches human reading speed, while using just a quarter of its compute capacity. In a real system-on-chip (SoC), the same workload could run up to ten times faster, with correspondingly better energy efficiency for AI processing at the edge.

The children’s story-generation demo used an open-source version of Llama2 and ran on TFLite Micro with an Ethos-U NPU back-end. Most of the inference logic was written in C++ at the application level. Adjusting the context window enhanced narrative coherence, ensuring smooth, AI-driven storytelling.
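The application-level generation loop can be sketched roughly as follows. The demo's logic was written in C++; Python is used here only for brevity. `CONTEXT_LEN`, `PAD_ID`, `VOCAB_SIZE`, and `toy_next_token` are hypothetical stand-ins, and a real application would replace `toy_next_token` with a call into the TFLite Micro interpreter.

```python
CONTEXT_LEN = 8   # hypothetical fixed input length the model was exported with
PAD_ID = 0        # hypothetical padding token id
VOCAB_SIZE = 50   # hypothetical vocabulary size


def build_fixed_input(tokens):
    """Keep the most recent tokens and left-pad to the fixed input shape."""
    window = tokens[-CONTEXT_LEN:]
    return [PAD_ID] * (CONTEXT_LEN - len(window)) + window


def toy_next_token(fixed_input):
    """Stand-in for one model invocation; returns a deterministic token id."""
    assert len(fixed_input) == CONTEXT_LEN
    return 1 + sum(fixed_input) % (VOCAB_SIZE - 1)


def generate(prompt_tokens, num_new_tokens, next_token=toy_next_token):
    """Greedy autoregressive loop: feed the window, append the prediction."""
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        tokens.append(next_token(build_fixed_input(tokens)))
    return tokens


story = generate([5, 9, 3], num_new_tokens=12)
```

Because the model input has a fixed shape, the window length decides how much of the story so far the model can see at each step, which is why tuning it affects narrative coherence.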

The team’s adaptation of the Llama2 model to run efficiently on the Ethos-U85 NPU required careful consideration of performance and accuracy due to the hardware limitations. Using mixed int8 and int16 quantization demonstrates the potential of fully integer models, encouraging the AI community to optimize generative models for edge devices and expand neural network accessibility on power-efficient platforms like the Ethos-U85.

Showcasing the Power of the Arm Ethos-U85

Scalable from 128 to 2048 MAC units (multiply-accumulate units), the Ethos-U85 achieves a 20% power efficiency improvement over its predecessor, the Ethos-U65. A standout feature of the Ethos-U85 is its native support for transformer networks, which earlier Ethos-U NPUs did not offer.

The Ethos-U85 enables seamless migration for partners using previous Ethos-U NPUs, allowing them to capitalize on existing investments in Arm-based machine learning tools. Developers are increasingly adopting the Ethos-U85 for its power efficiency and high performance.

The Ethos-U85 can reach 4 TOPS (trillions of operations per second) with a 2048 MAC configuration in silicon. In the demo, however, a smaller configuration of 512 MACs on an FPGA was used to run the TinyLlama2 small language model with 15 million parameters at just 32 MHz.
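As a rough sanity check on these figures, the following Python sketch relates MAC count, clock frequency, and peak throughput. It assumes the common convention that one MAC counts as two operations (a multiply and an add). The clock implied by the 4 TOPS figure is derived under that assumption and is not stated in the article.

```python
OPS_PER_MAC = 2  # assumption: one multiply plus one accumulate per MAC


def peak_ops_per_second(mac_units, clock_hz):
    """Theoretical peak throughput if every MAC unit is busy each cycle."""
    return mac_units * OPS_PER_MAC * clock_hz


def clock_for_target(mac_units, target_ops_per_second):
    """Clock frequency needed to reach a target peak throughput."""
    return target_ops_per_second / (mac_units * OPS_PER_MAC)


demo_peak = peak_ops_per_second(512, 32e6)    # FPGA demo configuration
implied_clock = clock_for_target(2048, 4e12)  # clock implied by 4 TOPS
mac_fraction = 512 / 2048                     # share of the largest configuration
```

Under these assumptions the FPGA demo peaks at about 32.8 GOPS, the 4 TOPS figure corresponds to a clock of roughly 1 GHz, and the 512 MAC configuration is a quarter of the largest one. These are peak numbers; real token rates also depend on memory bandwidth and MAC utilization.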

This capability highlights the potential for embedding AI directly into devices. The Ethos-U85 effectively handles such workloads even with limited memory (320 KB of SRAM for caching and 32 MB for storage), paving the way for small language models and other AI applications to thrive in deeply embedded systems.
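A back-of-the-envelope estimate shows why these memory figures are workable. The sketch below assumes one byte per int8 weight, reads "32 MB" and "320 KB" as binary units, and ignores int16 tensors, activations, and file metadata, so it is an estimate rather than the demo's actual memory map.

```python
PARAMS = 15_000_000               # parameter count stated for the demo model
BYTES_PER_INT8 = 1                # assumption: every weight stored as int8
STORAGE_BYTES = 32 * 1024 * 1024  # assumption: "32 MB" read as 32 MiB
SRAM_BYTES = 320 * 1024           # assumption: "320 KB" read as 320 KiB

weight_bytes = PARAMS * BYTES_PER_INT8
fits_in_storage = weight_bytes <= STORAGE_BYTES
fits_in_sram = weight_bytes <= SRAM_BYTES
storage_used = weight_bytes / STORAGE_BYTES  # fraction of storage taken by weights
```

Under these assumptions the weights take a little under half of the storage and are far too large for the SRAM, which is consistent with the SRAM serving as a cache while the weights live in the larger memory.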

Bringing Generative AI to Embedded Devices

Developers need better tools to navigate the complexities of AI at the edge, and Arm is addressing this with the Ethos-U85 and its support for transformer-based models. As edge AI becomes more prominent in embedded applications, the Ethos-U85 is enabling new use cases, from small language models to advanced vision tasks.

The Ethos-U85 NPU delivers the performance and power efficiency required for innovative, cutting-edge solutions. Like the “Tiny Stories” paper, our demo represents a significant advancement in bringing generative AI to embedded devices, demonstrating the ease of deploying small language models on the Arm platform.

Arm is opening new possibilities for edge AI across a wide range of applications, positioning the Ethos-U85 to power the next generation of intelligent, low-power devices.

Read how Arm is accelerating real-time processing for edge AI applications in IoT with ExecuTorch.

Any re-use permitted for informational and non-commercial or personal use only.

Editorial Contact

Arm Editorial Team

editorial@arm.com
