Arm Newsroom Blog

Transforming the Future of AI and Robotics with Multimodal LLMs

Harnessing the power of large language models to enhance robot multimodality
By Chloe Ma, VP, China GTM for IoT Line of Business, Arm

Ever heard of Moravec’s paradox? The concept proposes that for artificial intelligence (AI) systems, high-level reasoning demands minimal computation, while basic sensorimotor skills necessitate substantial computational resources. In simpler terms, AI finds it easier to manage complex, logical tasks than basic, sensory tasks that humans perform instinctively. This paradox underscores the disparity between AI capabilities and human cognition.

Humans are inherently multimodal. Each one of us is akin to an intelligent endpoint. We receive training through education, but ultimately, we operate autonomously, without needing instructions for every action from a central authority.

We achieve this autonomy through perception via various sensory modes, such as vision, language and sounds, touch, taste, and smell. These inputs allow us to analyze, reason, make decisions, and take action.

How are multimodal sensors enhancing the capabilities of robots?

With the progression of sensor fusion and AI, robots are now being equipped with multimodal sensors. As we continue to bring more computational power to edge devices, including robots, these devices are becoming increasingly intelligent. They can perceive their surrounding environment, communicate in natural language, and even get a sense of touch within digital interfaces.

Moreover, robots can sense specific forces, angular rates, and sometimes even the magnetic field around them. This is achieved using a combination of accelerometers, gyroscopes, and occasionally, magnetometers.
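
As a rough illustration of how those inertial readings are typically combined, the sketch below implements a simple complementary filter in Python; the sample rate, blend factor, and sensor values are arbitrary assumptions rather than values from any particular robot.

import math

ALPHA = 0.98   # assumed blend factor: trust the gyro short-term, the accelerometer long-term
DT = 0.01      # assumed sample period in seconds (100 Hz)

def fuse_pitch(prev_pitch, accel, gyro_rate):
    """Estimate pitch (radians) from one accelerometer sample and one gyro rate."""
    accel_pitch = math.atan2(accel[1], accel[2])   # angle implied by gravity (ay, az)
    gyro_pitch = prev_pitch + gyro_rate * DT       # integrate angular rate over one step
    return ALPHA * gyro_pitch + (1.0 - ALPHA) * accel_pitch

# Example with made-up readings: accel in g, gyro rate in rad/s
pitch = 0.0
pitch = fuse_pitch(pitch, accel=(0.0, 0.1, 0.99), gyro_rate=0.02)
print(f"estimated pitch: {pitch:.4f} rad")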

Before the advent of transformer models and Large Language Models (LLMs), multimodality in AI often involved separate models for different types of data (text, image, audio) with complex integration processes. However, with the introduction of these models, multimodality has become more integrated, allowing a single model to process and understand multiple data types simultaneously. This shift has significantly improved the efficiency and effectiveness of multimodal AI applications.

Examples of multimodal LLMs shaping the future of AI and robotics

While LLMs like GPT-3 are primarily text-based, there have been rapid advancements toward multimodality. Models such as OpenAI’s CLIP and DALL·E, and now Sora and GPT-4o, are examples of steps toward multimodality and much more natural human-computer interactions. These models bridge visual and textual information, generate images from textual descriptions, and even generate videos from text descriptions.
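
To make the image-text bridge concrete, here is a small sketch that scores captions against an image using the openly available CLIP weights via the Hugging Face transformers library; the image path and captions are placeholder examples.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the public CLIP checkpoint; "robot.jpg" is a placeholder image path.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("robot.jpg")
captions = ["a robot arm picking up a cup", "a dog running in a park"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)   # how well each caption matches the image
print(dict(zip(captions, probs[0].tolist())))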

The pace of development is accelerating in 2024. In February, OpenAI announced Sora, which can generate realistic or imaginative videos from text descriptions. This could offer a promising path towards building general-purpose world simulators, which can be an essential tool for training robots.

Three months later, GPT-4o significantly improved human-computer interaction, reasoning across audio, vision, and text in real time. The performance lift came from training a single new model end-to-end across text, vision, and audio.

In the same week as the Sora announcement in February 2024, Google announced Gemini 1.5, which significantly extended the context length to 1 million tokens. This means Gemini 1.5 Pro can process vast amounts of information in one go, including 1 hour of video, 11 hours of audio, codebases with more than 30,000 lines of code, or more than 700,000 words.

Fast forward to May: at Google I/O, Google discussed a future vision in Project Astra, a universal AI assistant that can process multimodal information, understand the context you’re in, and respond naturally in conversation. Meta, the company behind the open-source Llama LLMs, has also entered the artificial general intelligence (AGI) race.

What role do multimodal LLMs play in the evolution of robots?

This true multimodality has significantly heightened the level of machine intelligence, which will bring new paradigms to many industries. For example, robots were once single-purpose: they had some sensors and movement capabilities, but in general, they just did not have a “brain” to learn new things and respond to unstructured and unseen environments.

Multimodal LLMs promise to transform robots’ abilities to analyze, reason, and learn – to evolve from single purpose to general purpose. Other general-purpose compute platforms, such as PCs, servers, and smartphones, have been remarkable in that they can run so many different kinds of software applications. General-purpose designs bring scale, economies of scale, and much lower prices, which can drive a virtuous cycle of wider adoption.

Elon Musk saw this early and has evolved the Tesla robot from Bumble Bee in 2022 to Optimus Gen 1 in March 2023 and Gen 2 in late 2023. In the last 6 to 12 months, we have witnessed a slew of breakthroughs in robotics and humanoids.

While we’ve made significant strides in AI and robotics, there’s still work ahead. We need lighter designs, longer operating times, and more capable edge computing platforms to process and fuse sensor data, make decisions, and control actions.

We’re on a path to create humanoids – robots designed for human-like interaction and for operating in environments built for humans. These systems are ideal for handling human “3-D” work – dirty, dangerous, and dull tasks. This includes patient care and rehabilitation, service roles in hospitality, teaching aids and learning companions in education, and dangerous tasks like disaster response and hazardous material handling.

There’s exciting new research and collaboration among AI and robotics organizations around training robots to reason and plan better in unstructured and new environments. With the excellent generalization capabilities of models pre-trained on large amounts of data, robots can understand the environment better, adjust their movements and actions in response to sensory feedback, and optimize performance across varied and dynamic environments.
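
The sketch below shows the general shape of such a feedback loop: perceive, replan from the latest observation, act, and repeat. The three interfaces (perception model, planner, robot) are hypothetical stand-ins rather than any specific framework.

import time

class SensePlanActLoop:
    """Illustrative closed loop; the perception, planner, and robot objects are hypothetical interfaces."""
    def __init__(self, perception, planner, robot, period_s=0.05):
        self.perception = perception   # e.g. a pre-trained multimodal model
        self.planner = planner         # maps (goal, scene state) to the next action
        self.robot = robot             # exposes sensor reads and actuator commands
        self.period_s = period_s       # control period in seconds

    def run(self, goal, steps=200):
        for _ in range(steps):
            observation = self.robot.read_sensors()         # camera frames, IMU, joint states
            scene = self.perception.describe(observation)   # structured summary of the scene
            action = self.planner.next_action(goal, scene)  # adjust the plan from fresh feedback
            self.robot.execute(action)
            time.sleep(self.period_s)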

A fun example is Spot, the Boston Dynamics robot dog, acting as a tour guide in museums. Spot can interact with visitors, introduce various exhibits, and answer their questions. In this use case, being entertaining, interactive, and nuanced outweighs the need to be factually correct every time.

What are robotics transformers?

Robotics transformers, which translate multimodal inputs into actions, are rapidly evolving. Google DeepMind’s RT-2 performs as well as RT-1 on seen tasks, with a nearly 100% success rate. However, RT-2 excels at generalization, outperforming RT-1 on unseen tasks when built on PaLM-E, an embodied multimodal language model for robotics, and PaLI-X, a large-scale multilingual vision and language model not specifically designed for robotics.
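
Vision-language-action models of this kind are often described as emitting robot actions as discrete tokens in the same output stream as text. The sketch below is a loose, hypothetical illustration of that idea; vla_model.generate() and the 256-bin action encoding are placeholders, not Google DeepMind’s actual interface.

ACTION_BINS = 256        # assumed: each action dimension is discretized into 256 bins
DELTA_RANGE_M = 0.5      # assumed: tokens map to end-effector deltas within +/- 0.5 m

def detokenize(action_tokens):
    """Map discrete action tokens back to continuous deltas (dx, dy, dz) plus a gripper flag."""
    scale = 2 * DELTA_RANGE_M / (ACTION_BINS - 1)
    dx, dy, dz, grip = (t * scale - DELTA_RANGE_M for t in action_tokens[:4])
    return {"dx": dx, "dy": dy, "dz": dz, "gripper_open": grip > 0}

def control_step(vla_model, camera_image, instruction):
    """One control step: multimodal inputs in, a low-level robot command out."""
    tokens = vla_model.generate(image=camera_image, text=instruction)  # placeholder model call
    return detokenize(tokens)

# Example with made-up tokens in place of a real model output:
print(detokenize([128, 10, 250, 255]))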

Microsoft introduced LLaVA (Large Language and Vision Assistant). It uses GPT-4, initially designed for text-only tasks, to generate multimodal instruction-following data that seamlessly integrates textual and visual components, a capability that could be very useful for robotics tasks.

Tesla’s move into the humanoid and robotics field was significant, not only because it aims to design for scale and production, but also because Tesla Autopilot’s strong Full Self-Driving (FSD) technology foundation can be leveraged for robotics. Tesla also can apply Optimus to its EV manufacturing processes.

Tomorrow’s robotics, built on Arm

In robotics, tasks can range from basic computations, like sending and receiving signals to motors, to advanced data processing, such as interpreting image and sensor data, and increasingly include running multimodal LLMs. Central Processing Units (CPUs) excel at general-purpose tasks, while AI accelerators and Graphics Processing Units (GPUs) efficiently handle parallel workloads such as machine learning and graphics processing.
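
As one hedged example of that division of labor, an inference runtime can be asked to prefer an available accelerator and fall back to the CPU. The snippet below expresses that preference with ONNX Runtime’s execution-provider list; the model file is a placeholder, and which providers exist depends on the SoC and the runtime build.

import numpy as np
import onnxruntime as ort

# Prefer an accelerator-backed execution provider if present, otherwise run on the CPU.
# "perception.onnx" is a placeholder model file.
session = ort.InferenceSession(
    "perception.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
frame = np.zeros((1, 3, 224, 224), dtype=np.float32)   # placeholder camera frame
outputs = session.run(None, {input_name: frame})
print("execution providers in use:", session.get_providers())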

To enhance a robot’s vision capabilities and improve storage and transmission efficiency, additional accelerators like image signal processors and video codecs can be integrated. CPUs also need real-time capabilities for responsiveness, along with the ability to run rich operating systems like Linux and frameworks such as the Robot Operating System (ROS).

When we look at the robotics software stack, the Operating System (OS) layer potentially needs a Real-Time Operating System (RTOS) capable of handling time-critical tasks with high reliability. It also needs Linux distributions customized for robotics, running middleware such as ROS, which provides services designed for a heterogeneous compute cluster.
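
To make the ROS layer concrete, here is a minimal ROS 2 node written with the rclpy client library that publishes a status message once per second; the topic name and message text are arbitrary examples.

import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class StatusPublisher(Node):
    def __init__(self):
        super().__init__("status_publisher")
        self.pub = self.create_publisher(String, "robot_status", 10)   # queue depth 10
        self.timer = self.create_timer(1.0, self.tick)                 # fire once per second

    def tick(self):
        msg = String()
        msg.data = "all systems nominal"
        self.pub.publish(msg)

def main():
    rclpy.init()
    node = StatusPublisher()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == "__main__":
    main()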

Arm-initiated standards and certification programs, such as SystemReady and PSA Certified, will help scale robotics software development. SystemReady is designed to ensure that standard, off-the-shelf OS distributions can run on any Arm-based System on a Chip (SoC), and PSA Certified helps ease security implementation to meet regional security and regulatory compliance requirements for connected devices.

The advancements of large multimodal models and generative AI (GenAI) herald a new era in AI robotics and humanoid development. In this new era, energy efficiency, security, functional safety, AI computing, and ecosystem support are essential to moving robotics into the mainstream. Arm processors have been widely used in robotics, and we look forward to working closely with the ecosystem to build the future of AI robotics, on Arm.
