Small Language Models, Big Impact: Exploring AI Inference on CPUs
Summary
In the latest episode of our Arm Viewpoints podcast, we dive into a fascinating conversation with Julien Simon, Chief Evangelist at Arcee AI, about why small language models (SLMs) running on Arm CPUs are becoming a game-changer for enterprise AI implementation.
Julien, who brings extensive experience from his time at Hugging Face and Amazon Web Services, discusses how Arcee AI has evolved since its founding in 2023 to deliver impressive AI performance by focusing on smaller, more efficient models rather than following the industry obsession with ever-larger language models.
Throughout the conversation, Simon challenges the conventional wisdom that “bigger is better” in AI, explaining how Arcee has achieved remarkable results with models ranging from 8 billion to 72 billion parameters – a fraction of the size of models from OpenAI or Anthropic. Their 72 billion parameter model, Supernova, even outperforms GPT-4 and Claude 3.5 Sonnet on certain benchmarks when run on Arm Neoverse processors.
Perhaps most surprisingly, Simon reveals how enterprises, for many practical use cases, can run these powerful models on standard CPUs rather than expensive GPUs. “For small-scale scenarios, GPUs are overkill. They’re too expensive, and they’re unnecessary,” Simon explains. He emphasizes that many business applications don’t require thousands of tokens per second but rather cost-effective solutions that deliver ROI.
Looking toward the future, Simon draws an interesting parallel to software development’s evolution from monoliths to microservices: “I think we’re doing exactly the same with AI models… breaking OpenAI GPTs into a constellation of smaller models that collaborate together with agents.”
This episode offers invaluable insights for technology leaders considering AI implementation. Listen now to learn how the right-sized models, leveraging Arm compute technology, could dramatically improve your AI performance while reducing costs.
Speakers

Julien Simon, Chief Evangelist, Arcee AI
Julien Simon currently serves as VP & Chief Evangelist at Arcee AI, where he bridges the gap between cutting-edge AI research and practical enterprise applications. With over 30 years of technology experience, Julien has established himself as an authority in artificial intelligence, particularly in optimizing small language models for CPU-based inference.
As an official Arm Ambassador, Julien evangelizes the potential of CPU-based AI inference across various platforms. His current work focuses on demonstrating how small language models running on standard CPUs can deliver impressive performance at a fraction of the cost of GPU-based solutions.
Before joining Arcee AI in 2024, Julien spent nearly three years as Chief Evangelist at Hugging Face, where he created technical AI content and worked with Fortune 1000 companies to demonstrate the business value of machine learning innovations. His strategic partnership initiatives with companies like AWS, Azure, and Intel generated a commercial pipeline exceeding $35 million.
Julien’s six-year tenure at Amazon Web Services saw him rise to Senior Principal Technical Evangelist for AI & ML, delivering over 100 talks globally and authoring “Learn Amazon SageMaker,” which sold more than 10,000 copies. His earlier career includes CTO roles at Viadeo and Aldebaran, where he helped launch the famous Pepper robot.
A graduate in Electrical Engineering and Computer Science from ISEP, Julien also holds a Master’s degree in Computer Systems from Pierre and Marie Curie University. He created the first French-language Linux documentation in the early 1990s, demonstrating his longstanding commitment to technology education and evangelism.

Brian Fuller
Host Brian Fuller is an experienced writer, journalist and communications/content marketing strategist specializing in both traditional publishing and evolving content-marketing technologies. He has held various leadership roles, currently as Editor-in-Chief at Arm and formerly at Cadence Design Systems, Inc. Prior to his content-marketing work inside corporations, he was a wire-service reporter and business editor before joining EE Times where he spent nearly 20 years in various roles, including editor-in-chief and publisher. He holds a B.A. in English from UCLA.
Transcript
Highlights:
Brian: [00:00:00] Welcome to the Arm Viewpoints podcast, where we bring you technology insights at the intersection of AI and human imagination. I’m your host, Brian Fuller, and today we’re diving into the fascinating world of small language models and CPU-based AI inference. Our guest is Julien Simon, Chief Evangelist at Arcee AI, who brings extensive experience from his time at Hugging Face and Amazon Web Services to the leading edge of AI development at the edge.
In this episode, we explore the evolution of Arcee AI from its origins in 2023 to becoming a leader in small language model technology, the critical distinction between training and inference in AI, and why inference on CPUs represents an enormous opportunity for enterprise applications. How Arcee AI achieves impressive performance
running large models on CPUs through innovative [00:01:00] techniques like quantization, and their achievement of running a 32 billion parameter model on an Arm-based CPU. The practical implications of model performance metrics, including the significance of 16 tokens per second and what that means for real-world applications.
A look into the future of small language model technology and how enterprises can leverage these advances for better ROI, his fascination with the heavy metal band Iron Maiden, and much, much more. So now we bring you Julien Simon.
Julien, welcome from Paris, where the weather is, I understand, not the best.
Julien: Hey, no, it’s not the best today.
Brian: Before we dive into the evolution of language models and small language models and inference on CPUs, tell us a little bit about you, your journey. Give us a quick tour of your background.
Julien: Sure. So right now I work for a [00:02:00] startup called Arcee AI, and I’m sure we’ll come to that in a second.
I’m the Chief Evangelist for Arcee, meaning I spend a fair amount of time traveling, speaking at conferences, chatting with partners, and generally trying to explain to everybody what Arcee is all about and what we are able to do, plus some demos and, of course, online content, YouTube and whatnot.
Pretty busy with that. And before that, I worked at another startup called Hugging Face for almost three years, and I’m sure a lot of you will be familiar with Hugging Face. And before that, I spent six years at Amazon Web Services, again as a tech evangelist, working on the AI and ML services.
Brian: Yeah. So toughest question of this segment, Arcee AI. What does Arcee stand for?
Julien: Arcee is actually the name of one of the Transformers robots, the Autobots. It’s a female Autobot. And the reason why they picked this particular one [00:03:00] is because two of our co-founders have daughters.
And so I guess when they went home and said, we’re thinking of naming the company after a robot, I don’t think they had much of a choice.
Brian: So let’s talk a little bit about the company now. Let me do my math here. We are five years from “Attention Is All You Need,” the famous paper.
Yep. That came out of Google in 2017. Your company was founded in 2023. Yep. What motivated the founding of the company, and the company’s focus on small language models rather than following the others with the development of large language models, foundation models?
Julien: So we have three co-founders.
Mark is our CEO, Brian is our CRO, and Jacob is our CTO. I worked with Brian and Mark at Hugging Face. So it was pretty obvious to us already back then that enterprise users would get more business value out of their [00:04:00] models if those were small, open-source, customizable models. Hugging Face stands for open source, and that’s great.
I think at the end of the day, Hugging Face’s customer is the open-source community. And I think Mark and Brian, and eventually myself, figured out that we’d do more good working for actual customers. So they started Arcee together with Jacob, who came from Roboflow, with that simple, I would say, first principle that
AI just delivers more business value if it is based on small open-source models that you can customize and that don’t require an insane amount of infrastructure to predict with. And so they started by building a training platform where enterprise customers could easily bring their own data, their domain-specific data,
and tailor open-source models to the particular use case they were [00:05:00] after. So let’s say you want to build, I don’t know, a customer support chatbot for telco customers. Bring your data, start from a good open-source model, like maybe Llama or Mistral, and tailor it in a very cost-efficient way, because Arcee’s stack is pretty clever. Maybe we’ll talk about that later. So they started with that, and then they realized, if we want to prove that our stack is really good, then we should be building models. And so they started building new models. And what I mean by that is taking existing models like Llama 3 or Qwen2, applying the Arcee stack to those models, our open-source libraries and training recipes,
and high-quality datasets that we have curated over time, and making them better. Okay? And they started releasing those models on Hugging Face. For example, they released an 8 billion parameter model, called Supernova Lite, which outperformed Meta’s Llama 8 billion parameter model. [00:06:00] They released a 14 billion parameter model that outperforms everything else.
They released a 70 billion parameter model. So really the game was, hey, we can build really good models. We have the know-how, we have the tech stack to do it.
Brian: Let’s talk a little bit about training versus inference, because for a couple of years training was the golden child in terms of the coverage of the rise of AI and generative AI.
But inference has an enormous upside, particularly inference run on CPUs. Talk a little bit about that.
Julien: I guess training did wonders for the valuation of some companies, but I’m on the customer side, right? I’m trying to solve problems for real-life people. And real-life people are not interested in training models and spending millions on expensive accelerators.
They want to solve problems. Can AI solve this problem in an accurate and cost-effective way, yes or no? That’s the only thing they care about. That’s my world. [00:07:00] A lot of companies were led to training or fine-tuning models when they really didn’t have to. It’s not that easy. It’s not that cost-effective.
It’s not for everybody. And obviously it’s not something you should be doing lightly. I don’t want to say training is for 10 companies in the world, but in a sense I think it has become that way, especially when you talk about building new models.
The investment required to train a new model from scratch has gone from $10 million to a hundred million dollars to maybe a billion dollars now. If you want to compete with OpenAI or Anthropic or Meta, great, go find VC money, and you’ll certainly need a ton of it. Or you can try and do it differently.
And I think that’s what we’re doing. So we’re not trying to invent a net new model. We’re trying to be clever. And so we’re starting from existing models. We’re making them better, [00:08:00] because again, we have what I think is a pretty unique training stack.
We do things very differently, and we deliver those outperforming models at very low cost. I think people would be shocked at how little we spend when we improve on Llama 3 or Qwen2.5; it’s nowhere near the numbers that people would expect, because again, you have to be clever.
That’s what we bring to the table. We bring better models to our customers. But obviously most of them just want to use the models, and definitely before they even consider fine-tuning and improving on those models for their particular use case, they want to try them out the way they are, right out of the box.
Does this model solve my problem, yes or no? With a bit of prompt engineering, or with a bit of retrieval-augmented generation, can I get what I need from it? And I think a fair amount of customers can, especially since those models are small, so the investment required [00:09:00] to deploy those models is very reasonable.
Because again, that’s my punchline: I think no one really understands pay-per-token. We’ve met so many customers who told us, yeah, we started experimenting with the OpenAI models, and before we knew it, we were spending $50,000 a month. And then it’s not a question of are we getting ROI or not.
We’re probably not getting ROI, but it’s just that we can’t keep scaling our costs like that, right? So we like the experience, we know AI and gen AI is doing some good for our organization and our customers, but the costs are just not sustainable. So we don’t want to pull the plug on AI.
We definitely want to pull the plug on those crazy bills that we’re getting. So can we use smaller models to get the job done? And that’s a really good way to start the conversation, because that’s the business we’re in, which is creating small models that deliver the same or hopefully higher business value than [00:10:00] those crazy monster models.
Brian: You guys play at both ends of the model spectrum with some of your Arcee models. For instance, you’re running Arcee Supernova, which is Llama 3.1, 70 billion parameters.
Julien: Yeah, so I think the largest we have is 72B. We have a 72B, but we still consider that a small language model.
Because, and we don’t really know, but if we assume that the Anthropics and the OpenAIs of the world are running trillion-parameter models, or 500 billion or whatever the crazy number is, then 70 billion is still small, and you can still run it on a single box, on a single server.
And that’s my definition of an SLM.
Brian: So you’ve benchmarked Supernova on Arm Neoverse V1 and V2, and it outperforms GPT-4o and Claude 3.5 Sonnet. Talk a little bit about how you’ve been able to achieve that [00:11:00] outperformance.
Julien: Yeah, so you have those two dimensions, let’s call them.
So the first dimension is: how close can we get to the state-of-the-art performance of, let’s say, GPT-4 or Anthropic, with a model that is, let’s say, a tenth of the size? Hopefully smaller; the smaller the better. That’s the first promise we want to keep to customers.
On your enterprise use cases, we can give you comparable performance, hopefully better performance, with a model that’s just a fraction of the size. That’s where the training stack is important. It’s not just, sometimes people tell me, oh, you’re a fine-tuning company. I’m like, no, we’re not.
Of course we do fine-tuning. There is fine-tuning in the pipeline, but it’s much more than that. So in the last year, our research team has actually built and published a collection of open-source libraries. Some of them may be familiar to our listeners: MergeKit for model merging, DistillKit for model distillation, [00:12:00] Spectrum for parameter-efficient training.
These are the main ones. Because if we had to spend 50 million or more on building those SLMs, I guess the business case for Arcee wouldn’t be that interesting. And again, procuring the infrastructure and running those training jobs for six months, we would never have the same velocity.
Those techniques like merging and distillation are much more efficient than, I would say, brute-force pre-training. That’s how we do it. And it’s a combination. Go to our blog if you want to know more; you’ll see articles on how we built Supernova and other models, and you’ll see it’s always a combination of distilling from a much larger model and merging different variants of the same model trained on our high-quality datasets.
And all those steps are really compute-efficient compared to brute-force training on [00:13:00] GPU clusters. So that’s the first dimension, and that’s how we get the quality without spending insane amounts of money. Okay? The second dimension is: how effectively can our customers run those models, whether it’s in our SaaS platform or in their own infrastructure?
How can we shrink the cost of inference, and how can we help them deliver maximum ROI? Because ROI for AI applications is a very interesting concept. The cost part of the equation is super clear. You need GPU X, Y, Z to deploy the model. It’s going to cost you this many dollars an hour or a month,
whether the model is running or not; that’s what it is. But how much money the model is going to make you or save you is a little fuzzier, right? To me, the best way to get ROI is to minimize the cost. While you figure out the revenue, you can [00:14:00] act on the one piece of the equation.
So if you can shrink the cost, even if the revenue or the savings are a bit unclear, there’s a better chance that you hit some kind of good number, because you were clever in optimizing the cost element. And that’s where, again, those models being small, if you insist on running them on GPU, then you can use smaller GPUs, you can use scale-out architectures, instead of having those huge multi-GPU boxes that everybody’s obsessed about.
And then of course, you can continue being clever and try to run those things on platforms other than GPUs, and of course CPUs, and that’s what we’ve been doing with some success.
Brian: And yet you’re running a 32 billion parameter model on a CPU, which has a mind-boggling feel to it. How is that possible?
Julien: So first of all, and a lot of people go, oh yeah, I never thought about it when I say this, but the [00:15:00] main constraint on GPUs is actually not compute. It’s RAM. How much GPU RAM, that fancy HBM, high-bandwidth memory, that is making some companies so rich. You have a limited amount, and if you don’t have enough to hold the model, then you need to start splitting the model across different GPUs, and I guess that’s making other companies richer,
because you need more GPUs again. So it’s just not a good flywheel. Again, if you have small models and you run them on CPU, the host RAM is not constrained in the same way that GPU RAM is constrained.
So that’s one. Two, of course, CPUs, even though they have multiple cores and everything, have nowhere near the level of parallelism that GPUs have. So you need to do something there. And when we run models on CPU platforms, we process the models with a technique called quantization, which is central to good performance.
[00:16:00] Basically, quantization is a way to resize model parameters to a smaller size. Models today have, I would say, 16-bit parameters. So if you say we have an 8 billion parameter model, it means you have 8 billion 16-bit values. They will surely be stored in your host RAM, no problem there. But you need to move them around and you need to compute with them.
So in order to reduce the compute requirements and the memory back-and-forth requirements, you can shrink those parameters, resize them to, let’s say, eight bits or four bits. That’s already relieving some of the back-and-forth pressure that needs to happen when you’re running inference.
If listeners need to take one thing away from this whole discussion, it’s cost performance. If you’re spending a thousand dollars on inference, how much throughput are you really getting? Cloud is all about scale-out and elasticity, and that’s what you need to look for. [00:17:00] You realize that you can get a whole lot more done with your CPU dollars than with your GPU dollars.
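To make that arithmetic concrete, here is a rough back-of-the-envelope sketch of weight storage at different precisions; the helper function is illustrative only, and real quantized formats add a little per-block overhead for scales.

```python
# Rough weight-storage estimate at different precisions. This is a back-of-the-
# envelope sketch; real quantized formats (e.g. the GGUF Q8_0 / Q4_K families)
# store per-block scales, so files come out slightly larger than this.

def approx_weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"8B model at {bits:>2}-bit: ~{approx_weight_size_gb(8e9, bits):.0f} GB")

# 16-bit: ~16 GB, 8-bit: ~8 GB, 4-bit: ~4 GB. Less data to stream between host
# RAM and the CPU on every generated token, which is the pressure Julien describes.
```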
Brian: Speaking of cost performance, you often use the phrase 16 tokens per second. Yeah. What does that mean in practical terms for customers? Because it sounds like it’s a cost-effectiveness metric.
Julien: When we’re talking about conversational apps, of course we need to worry about accuracy, the quality of the text that is generated.
If it’s crazy slow, if you get one word per second, it’s like a movie from the seventies with a mainframe computer or something, right? That’s not the experience you’re after. So generally, I think we will agree that, well, tokens are not strictly equivalent to words:
10 words in English will be equivalent to maybe 12 or 13 tokens, because tokens include punctuation. But generally, if you get 10 tokens per second, that’s considered faster than you can probably [00:18:00] read.
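As a quick sanity check on those numbers, here is a small calculation; the 1.3 tokens-per-word ratio is just the rough figure Julien quotes, not a precise constant.

```python
# Quick sanity check: is 16 tokens/second fast enough for a chat experience?
# Uses the rough ratio Julien quotes (10 English words ~ 13 tokens).
tokens_per_second = 16
tokens_per_word = 1.3

words_per_minute = tokens_per_second / tokens_per_word * 60
print(f"~{words_per_minute:.0f} words per minute")  # ~738

# Typical adult reading speed is roughly 200-300 words per minute, so 16 tokens
# per second already generates text faster than most users can read it.
```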
Brian: You’ve given us a little tutorial on quantization. When you quantize something, there’s a bit of degradation in the model quality. Talk about that trade-off.
Julien: Sure. The intuition is, okay, imagine you have 16-bit parameters. It means for each neuron weight, each connection between two neurons across layers, if the weight is a 16-bit value, then you have 65,536 unique values.
So you have very fine-grained values to describe what’s going on between those two nodes. Now, if you move to four bits, then all of a sudden you get far fewer values. You could say the granularity is much lower, so surely I’m losing a ton of fine-grained knowledge, and all the finer nuances in that learning process [00:19:00] are gone.
And the answer is: not really. Funny enough, there is a way to measure this, and for smaller models, maybe 8 billion parameters and below, we’re talking about just a few percent. The metric here is called perplexity, and perplexity measures how well a model is able to predict the next word. Perplexity, again for smaller models, usually degrades maybe three or four percent,
which is considered negligible for the huge majority of use cases. As you go to bigger models, the degradation tends to be higher, because there’s a compounding effect. But for smaller models, which is what we promote, it’s absolutely fine. And definitely the trade-off is: if I get a model that is 96 or 97% as good as the original model,
and now I’m able to run it on, let’s say, a cloud instance that costs me 60 or 70 cents an hour, not six or [00:20:00] seven dollars, is that a trade-off I’m happy with? Some folks will say no. They’ll say, no, I need every single bit of accuracy. But if you are building a chatbot, or if you’re building a, I would say, non-life-critical, non-mission-critical application
where ROI is everything, this is certainly something you want to look at.
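For listeners who want to see what that perplexity measurement looks like in practice, here is a minimal sketch; the per-token log-probabilities are made-up numbers, and a real evaluation would run the model over a held-out dataset.

```python
import math

# Minimal sketch of how perplexity is computed, assuming you have the model's
# log-probability for every token of a held-out text. The values below are
# made up purely for illustration.
token_logprobs = [-2.1, -0.4, -3.0, -1.2, -0.7, -2.5]  # log p(token_i | previous tokens)

avg_nll = -sum(token_logprobs) / len(token_logprobs)   # average negative log-likelihood
perplexity = math.exp(avg_nll)
print(f"perplexity: {perplexity:.2f}")

# To judge a quantized model, you rerun this measurement with the quantized
# weights: if perplexity only rises by a few percent, as Julien describes for
# sub-8B models, the quality loss is usually negligible in practice.
```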
Brian: You’ve talked about a range of model sizes, from 8 billion all the way up to 72 billion. Can you talk about performance patterns that you see emerging across different model sizes?
Julien: We’re able to match the best models out there with 70B or 72B models.
You mentioned Supernova. Our new platform, which we launched yesterday, has a couple of large models in the 70B range. And our customers tell us they get, like I said, equal or better results from those than from the large models they’re using today, again at much lower cost. [00:21:00]
I would say for narrower use cases, let’s say you want to focus a model on a particular industry, and maybe even a particular use case in your company, you can absolutely get the job done with maybe a 7 or 8 billion parameter model. So that’s really where it is today.
And that’s why we don’t do anything bigger than 70B, because we see the performance that we want from those 70B models and we can run them cost-effectively.
Brian: What are customers seeing when it comes to results, and does instance type matter in that case?
Julien: The first thing folks look at is: how does the model answer their prompts?
Is the model doing a good job, yes or no? Because if it doesn’t, you can say, but we can run it for super cheap. Yeah, but it’s bad. So it has to be good first. Okay. And then you can look at how much scale they need from that thing. How many users, how many requests per day?
The typical length of a [00:22:00] question. Do some capacity planning, in a sense, right? Because if you have 10 users using that thing 10 times a day, it is very different from a thousand users using it 50 times a day. And again, ROI will be different from one use case to the next.
If you are the largest mobile operator in the US and you have hundreds of thousands, maybe millions, of calls to your call centers every day, you will have a very different definition of ROI too. Again, one size doesn’t fit all. So one model for one customer would need to run on GPU, because they have enough traffic and enough ROI to justify that.
Others might be much more cost-sensitive, and they may be looking for the most inexpensive way to deploy those models. I think what’s great about SLMs is that you do get that choice. You don’t get that choice with the large models: you pay per token, [00:23:00] and those things run on those monster GPUs.
Brian: So Julien, what’s driving the adoption of small language model inference and AI agents in the market? And why are Arm-based processors particularly well-suited for these?
Julien: I think enterprise customers realize that large language models, although they have interesting abilities, also have shortcomings: in terms of domain adaptation, in terms of privacy sometimes for regulated companies,
and of course in terms of cost, right? Yes, they have abilities, but those come at a cost. And let’s face it, for most of your queries, those abilities are not necessary. If you’re looking at your daily use of language models, your day-to-day prompts, translating this or summarizing that, or rewriting this email, [00:24:00] or writing an email to organize this meeting,
whatever, those are low-complexity queries that do not need large language models. But at the same time, you may need deep domain knowledge, and those large language models fall short in a lot of cases. The need for cost efficiency and tailored models, I think, is what is driving a lot of enterprise customers to consider small language models. We released a model a few weeks ago, a 10 billion parameter model, that is better than the 72B model we released last July. The 10 billion parameter one is Virtuoso Lite.
Brian: Yep. Okay.
Julien: And that 72B model from last summer was the best 72B model you could get on Hugging Face.
So it wasn’t a bad model at all. It was the best in its size. I’m not claiming yet that 10B is the new 72B, but [00:25:00] this goes to show models are getting smaller and getting better too. They’re getting so small that it is now really possible to run them on CPU platforms, because you can leverage instruction sets, and Arm CPUs have introduced dedicated instruction sets to accelerate deep learning operations, matrix multiplication, and so on.
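To illustrate what those integer instruction sets are accelerating, here is a toy sketch of quantized matrix math; it uses plain NumPy with simple per-tensor scales, whereas production kernels work per block and map the inner loops onto dedicated int8 dot-product instructions, but the arithmetic idea is the same.

```python
import numpy as np

# Toy illustration of 8-bit quantization plus integer matrix math, assuming
# simple symmetric per-tensor scales. Production kernels (llama.cpp, KleidiAI)
# work per block and use int8 dot-product instructions, but the underlying
# arithmetic is the same.

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)  # layer weights
x = rng.standard_normal(1024).astype(np.float32)          # one token's activations

def quantize_int8(t: np.ndarray):
    """Symmetric int8 quantization: map the largest magnitude to 127."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale).astype(np.int8), scale

w_q, w_scale = quantize_int8(w)
x_q, x_scale = quantize_int8(x)

# Integer multiply with 32-bit accumulation, then dequantize the result.
y_int = w_q.astype(np.int32) @ x_q.astype(np.int32)
y_approx = y_int * (w_scale * x_scale)

y_exact = w @ x
rel_err = np.linalg.norm(y_approx - y_exact) / np.linalg.norm(y_exact)
print(f"relative error from int8 math: {rel_err:.3%}")  # small, on the order of a percent
```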
Brian: I think with respect to that Virtuoso Lite model, you wrote that there’s a four-and-a-half-x cost-performance advantage over x86 with Arm CPUs, and you talked about specific scenarios where CPUs might be cost-effective alternatives to GPUs. What does that mean for customers?
Julien: When you’re looking at workloads, you’re mostly looking at how much throughput you are going to get.
And I think that’s the key metric. A lot of people are obsessing over latency, how fast [00:26:00] am I getting the first token, and so on. And yes, it matters, especially if you need very snappy, very fast conversational apps. I don’t want to say real time, because I don’t think real time and language models should be used in the same sentence.
But really, it’s how many tokens am I getting out of that model? And when I talk to customers, a lot of them have small-scale needs. The world wants you to believe, ah, you need to be able to spit out a thousand tokens per second, 24/7.
But a lot of business scenarios are quite the opposite. They have small-scale, sometimes non-latency-sensitive use cases, like document processing, stuff that runs overnight. It’s not even conversational. And if you look at it that way, then speed is a bit irrelevant, and it’s really a case of, okay, how about cost [00:27:00] performance?
How about right-sizing my inference infrastructure so that yes, I get the job done in the timeframe I need to get it done, which could be, I don’t know, maybe 30 seconds for a conversation, or maybe I just need to process documents overnight, and that’s it. And am I still saving money or making money in the process? Am I getting ROI out of that? Again, a lot of enterprise scenarios are very narrow and small-scale. So it’s not, hey, I have this one model that serves all my needs and I need to get 1,000 tokens per second out of it. It’s more, maybe I’ve got those 20 chatbots or 50 chatbots deployed across the company,
and those are maybe even idle most of the time, or they get a few hits per second at most. And so it’s not really reasonable to deploy a GPU instance [00:28:00] for each one of those, right? Because they’re going to be idle most of the time. And as we said before, because those models are small, even small GPU instances may be oversized.
So that’s when you realize, oh, I need to right-size my infrastructure. And in technical terms, what this really means is that a lot of those use cases are batch size one. You send one query to the model, get the answer, and then maybe 30 seconds later somebody else will ask a question.
So for those small-scale scenarios, GPUs are overkill: they’re too expensive and they’re unnecessary. Those scenarios are where CPU inference is very relevant, what I call the batch-size-one scenarios, because the thousands of cores that you have on GPUs are not going to be necessary.
There’s no parallelism here. You are just serving one query to one [00:29:00] user. And so the 16 or 32 cores that you have on your Arm CPU are more than capable of doing that. So that’s the scope. And some folks say, oh, of course, you’re never going to replace GPUs.
And I’m telling them that’s not what I’m trying to do. I’m trying to explain that for small-scale inference, or inference at the edge where GPUs are impractical for a million reasons, there is an option to do this on Arm CPUs in a very cost-efficient way, and still reasonably fast. And if you do need 2,000 tokens per second, my friend, of course, go use GPU instances.
So in that context, Arm CPUs are the best platform, right? I work a lot with llama.cpp; that’s my go-to tool. I love the project, and it’s very easy to optimize inference for SLMs with llama.cpp. We work with everybody, but yes, I tried Intel and that didn’t work really well.
I’m just hoping Intel will improve their llama.cpp support and be more [00:30:00] of a competitor than they are right now. And again, when it comes to GPUs, even the smallest GPU instance, for example on AWS, is still too expensive at batch size one. I don’t think GPUs are the best option a hundred percent of the time, and I know for a fact they are not.
Okay? So can we please have a balanced, pragmatic engineering view of the problem and stay away from marketing propaganda?
Brian: Yes. That’s my answer. Yes, we can.
Julien: Thank you. That makes two of us.
Brian: So let’s talk about the software ecosystem a little bit. Yeah. And your use of Arm KleidiAI and the quantization process.
Sure. How important is that as a puzzle piece to the larger outcome?
Julien: It is a crucial piece. Of course, you can take the vanilla model, in probably 16-bit precision, and you can run [00:31:00] it, let’s say with llama.cpp, my favorite tool, on Arm CPUs, on Graviton on AWS or other instances on other clouds.
And it does work. It’s not going to crash, it’s not going to go wrong. It’s just a little bit slow, because it is a bulky model. If we’re looking at our latest model, which is called Blitz, it’s a 24 billion parameter model. That’s 48 gigs worth of parameters, plus everything else.
So that’s quite a bit of data you need to move back and forth between the RAM and the CPU. Quantization will shrink the parameters from, let’s say, 16 bits to eight bits, or even four bits, why not? Obviously that makes the model smaller, and that removes some pressure on the memory bandwidth.
But most importantly, it makes it possible to leverage the integer instruction sets that are present [00:32:00] in Arm CPUs. And so all those typical operations, multiplying matrices and vectors, or multiplying matrices with matrices, dot products and all that good stuff, which is really the bread and butter of deep learning inference,
become accelerated, right? And to leverage those instruction sets, of course you do need software support, and that’s what those KleidiAI kernels are. Just to make it clear, it’s not really an end-user tool that you would grab. It’s really if you are a framework builder, if you work on the PyTorch team or on the llama.cpp team, or if you are
maybe trying to build your own inference stack, why not? Then yes, those kernels matter, and you should absolutely use them, because they’re written by Arm, and those teams know their chips well. So yeah, they are important.
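As a concrete example of the end-user side of this, here is a minimal sketch of batch-size-one CPU inference using the llama-cpp-python bindings; the GGUF file name is illustrative rather than an official Arcee artifact, and the thread count should roughly match your instance.

```python
# Minimal sketch of batch-size-one CPU inference with the llama-cpp-python
# bindings. The GGUF file name below is illustrative, not an official artifact;
# any 4-bit quantized model you have downloaded locally will do.
from llama_cpp import Llama

llm = Llama(
    model_path="arcee-blitz-24b-q4_k_m.gguf",  # hypothetical local 4-bit GGUF file
    n_ctx=4096,    # context window
    n_threads=16,  # roughly match the vCPU count of your Arm instance
)

out = llm.create_completion(
    "Summarize the attached warranty terms in two sentences:",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```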
Brian: So I think I need to start addressing you as Monsieur Ambassador, because you’re now an Arm Ambassador.
Talk to us about that. What’s that [00:33:00] all about?
Julien: Again, it’s great recognition for our collective work at Arcee, realizing that yes, AI is a thing, but AI on CPU is also a thing, and SLMs are showing the way. And I think it’s nice to get a chance to address a wider audience, and to get a little bit of recognition and, again, a little bit of support.
So yeah, looking forward to doing more with Arm developer relations. It’s going to be fun.
Brian: What excites you in the context of the Arm and Arcee technology roadmap?
Julien: We need to go smaller, and there’s no reason we won’t. Honestly, there’s no reason we won’t. If 10B is the new 70B, then why stop there?
There’s no technical reason to stop there. Why can’t 4B be the new 70B? So I think that’s what excites me. Keep driving hardware innovation. I’m waiting for the next Neoverse [00:34:00] generation, and I’m curious to see how this will be leveraged in frameworks.
PyTorch is doing a lot of good work, and again, llama.cpp is doing a lot of good work. I’m sure there will be others. And maybe we’ll see if model architectures can adapt to that. I don’t have the answer, but is there a way to build SLMs that would make them great edge models?
I’m sure the answer is yes.
Brian: Now I’m going to hold your feet to the fire and have you look to the future. What developments in small language model technology should enterprises, your customers, be looking out for?
Julien: I’m sure a lot of folks listening to this are software engineers, right?
Hopefully my analogy will resonate. Remember 20 years ago, and just for the record, Brian and I have gray hair, so we’ll remember: 20 years ago, a lot of us were building huge applications, enterprise apps, probably on some sort of Java framework. That was the way:
[00:35:00] building those huge monolithic apps, which probably did some good overall. But then, a few years later, those monoliths became hard to scale, hard to maintain, hard to debug, hard to evolve. And so we started breaking the monolith apart, and we built APIs and microservices. So guess what?
I think we’re doing exactly the same with AI models, except of course things move faster, so we don’t have to wait 15 years for AI microservices to be invented. It only took a few years to start breaking the OpenAI GPTs into a constellation of smaller models that collaborate together with agents.
So that’s exactly what we’re building today, because we know the monolith never really works. I think now folks understand that, and all that talk about agents is exactly that: combining high-quality small language models that know how to do one thing very [00:36:00] well. So I’ve got, let’s say, my general-purpose model, and I’ve got my coding model, and I’ve got my vision model, and maybe my speech-to-text model, and so on, right?
And we build those. That’s pretty much what we announced in our new platform, which is called Arcee Orchestra. And of course, you need to plug those models into traditional applications, because another key learning is that, you know what, models are probabilistic. They’re really good,
but they are statistical beasts. You can call it AI or gen AI or anything you want; at the end of the day, my friends, it’s still good old math and stats. If you’re the CFO of a listed company, I don’t think you want probabilistic answers to your questions. No, you don’t. And what I’m getting at here is that in the last 20 or 30 years, folks have been writing [00:37:00] IT applications that know where to find the unambiguous data
for the question you have. If you can connect the models to those apps, which themselves have access to the data, now you have the winning combination. Now you have your collection of models and agents, and you can build workflows and orchestrate all that stuff. And this is how you can outperform even the best model, because that model will only give you whatever it’s been trained on.
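To make the constellation-of-models idea a little more tangible, here is a minimal routing sketch; the model names, the keyword-based classifier, and the call_model() stub are all illustrative assumptions, not Arcee Orchestra’s actual API.

```python
# Minimal sketch of routing requests to a constellation of small specialist
# models instead of one giant generalist. Model names, the toy classifier, and
# call_model() are illustrative placeholders, not Arcee Orchestra's actual API.

ROUTES = {
    "code": "small-coding-model",
    "vision": "small-vision-model",
    "general": "small-chat-model",
}

def classify(request: str) -> str:
    """Toy intent detection; a real orchestrator would use a small router model."""
    text = request.lower()
    if "def " in text or "function" in text or "stack trace" in text:
        return "code"
    if text.startswith("describe this image"):
        return "vision"
    return "general"

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for your serving layer (a llama.cpp server, an internal API, ...)."""
    return f"[{model_name}] would answer: {prompt[:40]}..."

def handle(request: str) -> str:
    return call_model(ROUTES[classify(request)], request)

print(handle("Why does this function throw a KeyError?"))
```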
Brian: So you’re an Iron Maiden superfan. You have a gold record on your wall; you showed it to us in an earlier call. Oh yeah, you saw that? Do you see any parallels between Maiden’s evolution from raw power to technical complexity and the evolution of language models?
Julien: Yeah, I think I do. Iron Maiden and a lot of bands start with a very raw approach, right?
Hey, [00:38:00] Metallica fans, you know what I mean? The first album, you can’t get rawer than that, but still, it does the trick. And it’s the same with delivering that first wave of models: over time you need to explore more, and you learn to play your instrument a little better,
and you learn a more nuanced and more complex way to write music, or to build models.
Brian: Do you see any challenges driving the broader adoption of CPU-based AI inference? Or is it blue skies and sunny temperatures?
Julien: No, I don’t think it is. There’s clearly a battle to be fought here, and number one is awareness.
A lot of folks will stare at you funny when you say CPU inference, especially for larger models. So we really need to show what it’s all about. We need to show the performance, the generation speed we discussed, [00:39:00] we need to show the accuracy and the minimal degradation,
and we need to show the ROI. Because once you show that, once you tell people, hey, why are you paying a hundred dollars an hour for a large GPU instance in the cloud when you could be paying a few dollars per hour for a bunch of CPU instances, that would probably be enough, right? And if you need to scale out, then go and scale out.
You are not going to have any problem procuring another 60 or 80 CPU instances, whereas you’ll probably have to sell your kids to get another GPU instance from your cloud rep. You need to show them that.
Brian: Julien, I can’t think of a more informative hour that we’ve spent learning about AI and the evolution of language models.
You’ve been very generous with your time, so thank you. Best of luck to you and the team at Arcee, and we’ll [00:40:00] check in with you in the future. Keep up the good work.
Julien: Yeah, we’re just getting started. Check us out. Awesome. Thank you very much.