Blog

July 2, 2024

Redefining Technology with Smart Vision, Large Language Models, and AI

Unlocking the potential of multimodal AI and computer vision at the edge

By Chloe Ma, VP, China GTM for IoT Line of Business, Arm

Noam Chomsky, a pioneer in linguistics and cognitive science, once said that human language is unique and unparalleled in the animal world. Now, the rapid development of large language models (LLMs) and generative AI, such as GPT-3.5, GPT-4, and BERT, has made language possible for machines, greatly expanding their capabilities. This raises the question: where do we go from here?

The Evolution of Intelligence Creating New Computing Paradigms

To imagine how AI and large language models evolve, we need to look no further than ourselves. We human beings change the world through a dynamic interplay of our senses, thoughts, and actions. This process involves perceiving the world around us, processing information, and then responding through deliberate actions.

Over the history of computing, we have witnessed the gradual transfer of abilities such as perception, thinking, and action – once exclusive to humans – to machines. Each inflection point in the transfer of these abilities gives rise to new paradigms.

In the late 20th century, companies like Google turned the cost of acquiring information from a marginal cost into a fixed one: it costs Google money to crawl the web and index the information, but for each of us, finding that information costs almost nothing. Machines could take over from humans as our information systems. This ushered in the Internet era and the subsequent mobile internet era, which changed the way people access, disseminate, and share information, with a profound impact on fields such as business, education, entertainment, and social interaction, among many others.

And now, we are witnessing a new inflection point where thinking, reasoning, and model construction shift from humans to machines. OpenAI and large models are turning the cost of producing models from marginal to fixed.

Large models have been trained on vast amounts of text, images, and video from the Internet, covering diverse fields such as law, medicine, science, the arts, and many others. This extensive training allows these large models to serve as foundation models from which other models can be constructed more easily.

This inflection point will inevitably spark a proliferation of models, whether they are cognitive (how we see and speak), behavioral (how to drive a car), or domain-specific (how to design semiconductor chips). Models represent knowledge, and this inflection point will make models and knowledge ubiquitous, accelerating the arrival of the next inflection point – the era in which machines such as autonomous vehicles, autonomous mobile robots, and humanoid robots perform actions across all kinds of industries and deployment scenarios. These new paradigms will redefine human-machine interaction.

Multimodal large language models and the critical role of Vision

With transformer models and their self-attention mechanism, AI can become truly multimodal, meaning the AI systems can process input from multiple modes such as speech, images, and text, just like us human beings.
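To make the self-attention idea concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not how any production model is implemented: the query, key, and value projections are left as the identity (real transformers learn these matrices), and the three 2-dimensional token vectors are hypothetical stand-ins for embedded text tokens and an image patch.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of token vectors.

    Assumption: identity Q/K/V projections, so each token acts as its own
    query, key, and value. Real transformers use learned projections.
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # Each output is a weighted mix of all token vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs, all_weights


# Hypothetical embeddings: two "text" tokens and one "image patch" token.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = self_attention(tokens)
```

Because every token is just a vector, the same mixing step applies whether a vector came from text, audio, or an image patch, which is one reason the architecture lends itself to multimodal inputs.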

OpenAI’s CLIP, DALL·E, Sora, and GPT-4o are examples of models that take steps toward multimodality. CLIP, for instance, understands images paired with natural language, allowing it to bridge visual and textual information. DALL·E is designed to generate images from textual descriptions, and Sora can generate videos from text, with the promise of eventually serving as a world simulator. GPT-4o took it one step further: OpenAI trained GPT-4o as a single new model across text, vision, and audio end to end, without converting multimedia to and from text. All inputs and outputs are processed by the same neural network, enabling the model to reason across audio, vision, and text in real time.
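The way a CLIP-style model bridges images and text can be sketched as a similarity search in a shared embedding space. The snippet below is a toy illustration only: the 3-dimensional vectors are made up rather than produced by a real encoder, the function names are hypothetical, and the temperature of 0.07 is an assumed value chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_captions(image_vec, caption_vecs, temperature=0.07):
    """Score each caption against an image embedding; return probabilities."""
    logits = [cosine(image_vec, vec) / temperature for vec in caption_vecs.values()]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(caption_vecs, exps)}


# Made-up embeddings standing in for the outputs of image and text encoders.
image_vec = [0.9, 0.1, 0.2]
caption_vecs = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a cat": [0.0, 1.0, 0.0],
    "a photo of a car": [0.1, 0.1, 1.0],
}
probs = match_captions(image_vec, caption_vecs)
best = max(probs, key=probs.get)
```

Swapping the caption list for a set of class names is what makes this style of model usable as a zero-shot classifier.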

The future of multimodal AI will be on the edge

AI innovators have pushed the boundaries of where models can operate, driven by edge hardware advancements (many developed on Arm), latency concerns, privacy and security requirements, bandwidth and cost considerations, and the need for offline capabilities during intermittent or absent network connections. Even Sam Altman himself admitted that for video (what we perceive through vision), an on-device model might become critical to deliver the desired user experience.

However, resource constraints, model size, and complexity challenges hinder the move of multimodal AI to the edge. Addressing these issues will likely require a combination of hardware advancements, model optimization techniques, and innovative software solutions to make multimodal AI prevalent.

The profound impact of recent AI developments on computer vision is especially intriguing. Many vision researchers and industry practitioners are using large models and transformers to enhance vision capabilities. Vision will also become increasingly critical in the large model and action eras.

Why?

Machine systems must understand their surroundings through senses like vision, providing essential safety and obstacle avoidance capabilities for autonomous driving and robots, which is a matter of life and death. Spatial intelligence is a hot area for researchers such as Fei-Fei Li, broadly regarded as the godmother of AI.
Vision is also crucial for human-machine interaction. AI companions not only need a high IQ but also a high “EQ.” Machine vision can capture human expressions, gestures, and movements to better understand human intentions and emotions.
AI models require vision capabilities and other sensors to gather real-world data and adapt to specific environments, which is particularly important as AI moves from light to heavy industries where digitization levels have been lower.

Vision + Foundational Models: A Few Examples

Although ChatGPT gained popularity due to its language capabilities, leading large language models (LLMs) have evolved to become multimodal, making the term “foundational models” more appropriate. The field of foundational models, which encompasses various modalities, including vision, is developing rapidly. Here are a few examples:

DINOv2

DINOv2 is an advanced self-supervised learning model developed by Meta AI, which builds upon the original DINO (self-DIstillation with NO labels) model. It was trained on a dataset of 142 million images, which helps improve its robustness and generalizability across different visual domains. DINOv2 can segment objects without ever having been trained to do so. It also produces universal features suitable for image-level visual tasks (image classification, video understanding) as well as pixel-level visual tasks (depth estimation, semantic segmentation), making it impressively versatile.

SAM (Segment Anything Model)

SAM is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images. It identifies and segments objects within images using various input prompts (such as points or boxes) that specify what to segment, so it needs no additional training for each new object or scene it encounters. According to Meta AI, SAM can produce a segmentation result for a given prompt in as little as 50 milliseconds once the image embedding has been computed, making it practical for real-time applications. Its versatility allows it to be used across a variety of fields, from medical imaging to autonomous driving.

Stable Diffusion

Generating images and videos from textual descriptions is an important aspect of GenAI as it not only enables new forms of creativity, but also promises to build a world simulator. World simulators can serve as the foundation for training simulations, educational programs, or video games. Stable Diffusion is a type of generative AI model known for its ability to create images from textual descriptions. This model uses a technique called latent diffusion, which operates efficiently by manipulating images in a compressed format called latent space rather than directly in pixel space. This approach helps in reducing the computational load, allowing the model to generate high-quality images more quickly.
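A quick back-of-the-envelope calculation shows why working in latent space reduces the computational load. The numbers below are assumptions not stated in this article: an 8x spatial downsampling factor and 4 latent channels, which are the settings commonly associated with the original Stable Diffusion releases.

```python
def tensor_size(height, width, channels):
    """Number of values in an image-like tensor."""
    return height * width * channels


# Assumed settings (not from the article): 8x downsampling, 4 latent channels.
DOWNSAMPLE = 8
LATENT_CHANNELS = 4

pixel_values = tensor_size(512, 512, 3)  # RGB image: 786,432 values
latent_values = tensor_size(512 // DOWNSAMPLE, 512 // DOWNSAMPLE, LATENT_CHANNELS)  # 16,384 values
ratio = pixel_values / latent_values  # 48.0

print(f"Each denoising step touches {ratio:.0f}x fewer values in latent space.")
```

Under these assumptions, every denoising step operates on a tensor 48 times smaller than the full-resolution image, and the decoder only converts back to pixels once at the end.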

Stable Diffusion can already run at the edge on smart mobile devices. Here is an example of a Stable Diffusion optimization journey:

Taken as is, Stable Diffusion (at 512×512 image resolution) is not something you would run on a mobile CPU or NPU.
With a smaller U-Net architecture, fewer sampling steps, a switch to ONNX format, quantization (FP32 to INT8), and other techniques, it has shown more than a 60x speed-up on the CPU alone. Many of these optimization techniques and tools are developed by Arm’s wide ecosystem. There is still room to optimize it even further.
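Of the techniques listed above, quantization is the easiest to illustrate in a few lines. The following is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python; it is not the specific toolchain used in the optimization journey described here, and the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical FP32 weights from one layer.
weights = [0.52, -1.27, 0.003, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight then occupies one byte instead of four, which cuts memory traffic and lets the CPU use integer arithmetic, at the cost of a small, bounded rounding error per weight.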

Axera’s pursuit of better Vision with multimodal large language models

One of Arm’s partners, Axera, has demonstrated deploying a DINOv2 vision transformer at the edge on its flagship AX650N chipset. The chip uses an Arm Cortex-A55 CPU cluster for pre- and post-processing, in combination with Axera’s Tongyuan mixed-precision NPU and AI-ISP, to deliver high performance, high precision, ease of deployment, and excellent power efficiency.

This video shows the effect of running DINOv2 on AX650N.

Vision transformers offer better generalization to new and unseen tasks when pretrained on large and diverse datasets, simplifying retraining and shortening finetuning time.

They can be more adaptable to various tasks beyond image classification, such as object detection and segmentation, without substantial architecture changes.

Embracing the Future of AI and HMI

Thanks to AI and evolving large language models, we’re on the cusp of a transformative shift in technology and human interaction. Vision plays a crucial role, enabling machines to navigate and interpret their surroundings, ensuring safety and enhancing interaction. The move towards edge AI promises efficient, real-time applications, driven by advancements in hardware and software.

(Catherine Wang, principal computing vision architect with Arm, contributed to this article)


Any re-use permitted for informational and non-commercial or personal use only.

Editorial Contact

Arm Editorial Team

editorial@arm.com


