Arm Newsroom Blog

Multimodal Marvels: How Advanced AI is Revolutionizing Autonomous Robots

Harnessing the power of large language models to enhance robot multimodality
By Chloe Ma, VP, China GTM for IoT Line of Business, Arm

Ever heard of Moravec’s paradox? That’s the idea that for AI systems, high-level reasoning requires very little computation, but low-level sensorimotor skills require enormous computational resources. In essence, AI finds complex, logical tasks easier to manage than basic, sensory tasks that humans can do instinctively. This paradox highlights the discrepancy between the capabilities of AI and human cognition.

We human beings are naturally multimodal. Each of us is kind of like an intelligent endpoint: We go to school to get training, but at the end of the day, we operate autonomously without getting commands for everything we do from a central authority. 

We do so by perceiving through many sensory modes, such as vision, language and sound, touch, taste and smell. Then we analyze, reason, make decisions and take actions.

With years of sensor fusion and AI evolution, robots can now be equipped with multimodal sensors. As we bring more and more compute to edge devices, including robots, these devices are becoming intelligent enough on their own to perceive the surrounding environment, hear and communicate in natural language, gain a sense of touch through digital interfaces, and sense the robot's specific force, angular rate and sometimes the surrounding magnetic field using a combination of accelerometers, gyroscopes and magnetometers.
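
To make the sensor-fusion idea concrete, here is a minimal sketch in Python of a complementary filter that blends accelerometer and gyroscope readings into a single pitch estimate. The function name, blend factor and sample values are illustrative only, not taken from any particular robot stack.

```python
import math

def complementary_filter(pitch_prev, accel, gyro_rate, dt, alpha=0.98):
    """Fuse accelerometer and gyroscope readings into one pitch estimate.

    accel: (ax, ay, az) in m/s^2; gyro_rate: pitch rate in rad/s.
    alpha weights gyro integration against the accelerometer reference.
    """
    ax, ay, az = accel
    # Pitch implied by the gravity vector (noisy, but drift-free).
    pitch_accel = math.atan2(-ax, math.sqrt(ay * ay + az * az))
    # Pitch implied by integrating the gyro (smooth, but drifts over time).
    pitch_gyro = pitch_prev + gyro_rate * dt
    # Blend the two sources.
    return alpha * pitch_gyro + (1 - alpha) * pitch_accel

# Example: one 10 ms step with the robot tilted slightly forward.
pitch = complementary_filter(0.0, (0.5, 0.0, 9.8), 0.01, dt=0.01)
print(f"estimated pitch: {math.degrees(pitch):.2f} deg")
```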

The new era in robotics and machine cognition

Before transformers and LLMs, multimodality in AI often involved separate models for different types of data (text, image, audio) with complex integration processes. 

After the advent of transformer models and LLMs, multimodality has become more integrated, allowing a single model to process and understand multiple data types simultaneously, leading to more cohesive and context-aware AI systems. This shift has significantly improved the efficiency and effectiveness of multimodal AI applications.

While LLMs like GPT-3 are primarily text-based, there have been rapid advancements toward multimodality. OpenAI’s CLIP and DALL·E and now Sora and GPT-4o are examples of models that take steps toward multimodality and much more natural human-computer interactions. CLIP, for instance, understands images paired with natural language, allowing it to bridge visual and textual information. DALL·E is designed to generate images from textual descriptions. We see similar evolution from Google with its Gemini model.
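
As an illustration of how a model like CLIP bridges visual and textual information, here is a minimal sketch using the publicly released CLIP weights through the Hugging Face transformers library. The image file and candidate captions are hypothetical; any CLIP checkpoint would work the same way.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint (checkpoint name is illustrative).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("workbench.jpg")  # hypothetical robot camera frame
captions = ["a person handing a tool to a robot", "an empty workbench"]

# Score each caption against the image in a shared embedding space.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```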

The pace of development is accelerating in 2024. In February, OpenAI announced Sora, which can generate realistic or imaginative videos from text descriptions. If you think about it, this could offer a promising path towards building general-purpose world simulators, which can be an essential tool to train robots. Three months later, GPT-4o significantly improved the performance of human-computer interaction and can reason across audio, vision and text in real time. The significant performance lift was achieved by training a single new model end-to-end across text, vision and audio, eliminating the two conversions from input modality to text and from text to output modality.

In the same week in February 2024, Google announced Gemini 1.5, which extended the context length to 1 million tokens. This means 1.5 Pro can process vast amounts of information in one go, including 1 hour of video, 11 hours of audio, codebases with more than 30,000 lines of code, or more than 700,000 words. Gemini 1.5 builds on Google's leading research on transformer and mixture-of-experts architectures, and Google also open-sourced 2B and 7B models that can be deployed at the edge. Fast forward to Google I/O in May: in addition to doubling the context length and a slew of GenAI tools and applications, Google discussed a future vision in Project Astra, a universal AI assistant that can process multimodal information, understand the context you're in and respond naturally in conversation. I would also love for it to help me perform chores and tasks, instead of just talking to me!

Meta, the company behind the open-source Llama LLM, has also entered the artificial general intelligence (AGI) race.

This true multimodality has significantly heightened the level of machine intelligence, which will bring new paradigms to many industries. 

For example, robots were once single purpose: They had some sensors and movement capabilities, but in general they did not have a “brain” to learn new things and respond to unstructured and unseen environments.

Multimodal LLMs promise to transform robots’ abilities to analyze, reason and learn – to evolve from single purpose to general purpose. Other general-purpose compute platforms, such as PCs, servers and smartphones, have been remarkable in that they can run so many different kinds of software applications. Going general purpose brings scale, economies of scale and much lower prices, which can lead to a virtuous cycle of further adoption.

Elon Musk saw this early and has evolved the Tesla robot from Bumble Bee in 2022 to Optimus Gen 1 in March 2023, and Gen 2 in late 2023. In the last 6-12 months, we have witnessed a slew of robotics and humanoids breakthroughs.

New tech behind next-gen robotics and embodied AI

Now, don’t get me wrong: Work lies ahead of us. We need lighter designs, longer operating times, as well as faster and more capable edge computing platforms to process and fuse sensor data information to make decisions and control actions.

But we’re on a path to create humanoids – systems that require human-like interactions or operations in environments designed for humans. Such systems will be ideal to handle human “3-D” work – dirty, dangerous and dull. This includes patient care and rehabilitation, service roles in hospitality, education as teaching aids or learning companions, and dangerous tasks like disaster response and hazardous material handling. These applications leverage the humanoid form for natural interaction, mobility in human-centric spaces, and performing tasks that are typically challenging for traditional robotics technologies.

There is exciting new research and collaboration among AI and robotics organizations around training robots to reason and plan better in unstructured and new environments. With models pre-trained on large amounts of data serving as the new robot brain, robots can understand their environment more holistically and adjust their movements and actions in response to sensory feedback, optimizing performance across varied and dynamic environments.

A fun example is Spot, the Boston Dynamics robot dog acting as a tour guide in museums. Spot can interact with visitors, introduce various exhibits to them, and answer their questions. It may hallucinate a bit, but in this use case, Spot being entertaining, interactive and nuanced outweighs the need to always be factually correct.

Robotics transformers – the new robot brain

Robotics transformers, which translate multimodal inputs into actions, are rapidly evolving. Google DeepMind’s RT-2 performs as well as RT-1 on seen tasks, with a nearly 100% success rate. However, RT-2 excels at generalization and outperforms RT-1 on unseen tasks when built on PaLM-E, an embodied multimodal language model for robotics, and PaLI-X, a large-scale multilingual vision and language model not specifically designed for robotics.
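
To illustrate the general shape of such a vision-language-action loop (not RT-2 itself, whose weights are not public), here is a hedged Python sketch in which the camera, policy and controller are all stand-in stubs; every function and value is hypothetical.

```python
import numpy as np

def capture_frame():
    """Stand-in for a camera driver; returns a dummy RGB frame."""
    return np.zeros((224, 224, 3), dtype=np.uint8)

def vla_policy(frame, instruction):
    """Stand-in for a vision-language-action model such as RT-2.

    A real policy would tokenize the image and instruction together and
    decode discretized end-effector actions; here we return a fixed action.
    """
    return {"dxyz": [0.0, 0.0, -0.01], "gripper": "close"}

def send_to_controller(action):
    """Stand-in for the low-level arm controller interface."""
    print("executing:", action)

instruction = "pick up the green block"   # free-form natural-language goal
for _ in range(3):                        # a few ticks of the control loop
    frame = capture_frame()
    action = vla_policy(frame, instruction)
    send_to_controller(action)
```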

LLaVA (Large Language and Vision Assistant) was introduced by Microsoft. It leverages the power of GPT-4, initially designed for text-based tasks, to create a new paradigm of multimodal instruction-following data that seamlessly integrates textual and visual components, which could be very useful for robotics tasks. When LLaVA was introduced, it set a new record for multimodal chat and Science QA, exceeding human performance on the latter.

As I mentioned before, Tesla’s move into the humanoid and robotics field was significant, not only because it aims to design for scale and production, but also because Tesla Autopilot’s strong Full Self-Driving (FSD) technology foundation can be leveraged for robotics. Tesla also can apply Optimus to its EV manufacturing processes. 

Tomorrow’s robotics, built on Arm

At Arm, we believe that the brain for robotics should be a heterogeneous AI compute system that delivers the best performance, real-time response and power efficiency.

Robotics involves tasks ranging from basic computations (like sending and receiving signals to motors) to advanced data processing (like image and sensor data interpretation) and running the multimodal LLMs I’ve mentioned. CPUs are great for general-purpose tasks, while AI accelerators and GPUs handle parallel workloads such as machine learning and graphics processing more efficiently. Additional accelerators, such as image signal processors and video codecs, can also be integrated to enhance robots’ vision capabilities and their storage and transmission efficiency. In addition, the CPUs likely need both real-time capabilities to deliver responsiveness and the ability to run rich operating systems such as Linux alongside ROS packages.

Extending to the robotics software stack, the OS layer potentially also needs a real-time operating system (RTOS) capable of handling time-critical tasks with high reliability, plus Linux distributions customized for robotics and middleware such as ROS, which provides services designed for a heterogeneous compute cluster. We believe that Arm-initiated standards and certification programs such as SystemReady and PSA Certified will help to scale robotics software development. SystemReady is designed to guarantee that standard, rich OS distros just run on Arm-based SoCs, and PSA Certified helps ease security implementation to meet regional security and regulatory compliance requirements for connected devices.
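
As a rough illustration of the rich-OS side of that stack, here is a minimal ROS 2 node sketch in Python (rclpy) that subscribes to camera frames and publishes velocity commands. The topic names and the trivial control logic are placeholders; a real robot would run its perception model on an accelerator inside the callback.

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist

class PerceptionToMotion(Node):
    """Subscribe to camera frames and publish velocity commands."""

    def __init__(self):
        super().__init__("perception_to_motion")
        # Topic names are illustrative; real robots define their own.
        self.create_subscription(Image, "/camera/image_raw", self.on_image, 10)
        self.cmd_pub = self.create_publisher(Twist, "/cmd_vel", 10)

    def on_image(self, msg):
        # Placeholder logic: in practice, run the vision model here,
        # then derive a motion command from its output.
        cmd = Twist()
        cmd.linear.x = 0.1  # gentle forward motion
        self.cmd_pub.publish(cmd)

def main():
    rclpy.init()
    rclpy.spin(PerceptionToMotion())

if __name__ == "__main__":
    main()
```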

The advancements of large multimodal models and GenAI herald a new era in AI robotics and humanoid development. In this new era, energy efficiency, security and functional safety, in addition to AI compute and a strong ecosystem, are essential to bring robotics into the mainstream. Arm processors have been used widely in robotics, and we look forward to working closely with the ecosystem to build the future of AI robotics, on Arm.
