Why system architects now default to Arm in AI data centers
For more than a decade, cloud infrastructure scaled through abstraction. Standardized servers, virtualized resources, and software layers helped smooth over hardware differences. That model worked because typical workloads could tolerate the resulting inefficiency. AI cannot: it exposes the limits of legacy architectures across power delivery, thermal and compute density, memory bandwidth, and system-level performance.
Essentially, AI has redefined what “good” infrastructure looks like. In response, platform design is shifting beyond individual chips and servers to rack-level, scale-up systems engineered to grow efficiently under power and budget constraints. This shift is happening against the backdrop of expanding inference and agentic AI workloads that run continuously and accelerate demand for dense, always-on compute.
Futurum’s report “Arm at the Center of the AI and Data Center Revolution” frames this as a shift to “system-level harmony” where the key design question is how effectively platforms orchestrate compute across accelerators, CPUs, memory, networking, and software – not simply how much raw compute can be deployed.
That is why the industry is moving toward purpose-built rack-level system design, where platforms are engineered end-to-end around AI behavior, power volatility, and sustained utilization. Increasingly, system architects are revisiting foundational compute assumptions, and choosing Arm-based architectures to address the constraints shaping modern AI platforms.
AI forces a reset toward purpose-built rack-level systems
The biggest break is not that general-purpose, commodity infrastructure cannot run AI. It is that fragmented system design shows up as real cost at AI scale.
AI workloads are tightly coupled across compute, memory, networking, storage, and software. If CPUs fall behind, expensive accelerators wait. If power consumption and cooling headroom fluctuate, utilization drops. If the data pipeline, scheduling, and orchestration are not tuned to the platform, throughput becomes unpredictable. Peak performance still matters, but stability, performance per watt, and system balance matter more.
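To make the cost of an underpowered CPU layer concrete, here is a minimal back-of-envelope sketch in Python. The latency numbers are illustrative assumptions, not measurements from any specific system:

```python
# Illustrative model of accelerator utilization when each request needs
# CPU-side work (tokenization, batching, orchestration) before the
# accelerator can run. All latency numbers are assumed, not measured.

def accelerator_utilization(cpu_ms: float, gpu_ms: float,
                            overlap: float = 0.0) -> float:
    """Fraction of wall-clock time the accelerator stays busy.

    overlap is the fraction of CPU work hidden behind accelerator compute
    by pipelining (0.0 = fully serialized, 1.0 = fully hidden).
    """
    exposed_cpu_ms = cpu_ms * (1.0 - overlap)
    return gpu_ms / (exposed_cpu_ms + gpu_ms)

# A fast CPU path (2 ms exposed) vs. a slow one (10 ms exposed)
# in front of 8 ms of accelerator compute per request:
for cpu_ms in (2.0, 10.0):
    util = accelerator_utilization(cpu_ms, gpu_ms=8.0)
    print(f"CPU {cpu_ms:>4.1f} ms/request -> accelerator busy {util:.0%}")
```

Under these assumptions, a 5x slower CPU path in front of the same accelerator cuts accelerator utilization from 80 percent to roughly 44 percent – the “expensive accelerators wait” failure mode in miniature.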
Futurum notes that hyperscalers are structurally realigning toward architectures that can deliver exponential compute growth without exponential energy consumption. It cites Arm data indicating that nearly half of the compute shipped to top hyperscalers by the end of 2025 is expected to be Arm-based.
Architects are now pushing past the “headline benchmark” questions and into those that actually decide whether an AI platform holds up when running agentic AI and continuous inferencing workloads in real-world applications. For example:
- What happens under sustained load, hour after hour?
- How do power limits and thermals reshape the performance curve in the real world?
- How do the compute layers in a rack-level system keep accelerators continuously fed, minute by minute, not just on paper?
When efficiency, scalability, and system balance become first principles, it is natural to revisit the CPU foundation. That’s exactly why Arm – with our leading architecture and broad ecosystem – is central to this conversation.
In the data center, the Arm Neoverse platform is a core enabler of this transition, with major hyperscalers and AI leaders, including AWS, Google, Microsoft, and NVIDIA, building on and adopting Arm-based platforms. Arm’s model supports purpose-built system design while preserving consistency across platforms, ecosystem partners, and software. That flexibility matters when teams are designing tightly integrated platforms, but do not want a single path to be the only path.
Agentic AI and continuous inference reshape the economics of scale
AI workloads are changing, and infrastructure needs to support a wider range of profiles as AI and general-purpose compute workloads converge.
The center of gravity is shifting toward agentic AI, and that’s fundamentally an inference problem. Agentic systems don’t “serve one answer” and stop; they plan, call tools, retrieve data, validate results, and loop. That creates continuous inferencing patterns: steady, always-on token generation, more variability in request shapes, and much heavier orchestration and data movement around the accelerators.
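A minimal, self-contained sketch of that loop helps make the pattern concrete. The planner, retrieval, and validation functions below are trivial stubs standing in for real framework components, not any particular agent runtime:

```python
# Minimal sketch of an agentic loop. The "model" and "tools" here are
# stubs; in a real system, plan() is an LLM call, retrieve() is a
# retrieval pipeline, and validate() is structured-output checking.
from dataclasses import dataclass

@dataclass
class Step:
    action: str          # "retrieve" or "answer"
    payload: str

def plan(context: list[str]) -> Step:
    # Stub planner: retrieve once, then answer. A real agent would call an
    # LLM here, so every loop iteration is another inference request.
    if not any(item.startswith("doc:") for item in context):
        return Step("retrieve", "background facts")
    return Step("answer", f"answer based on {len(context)} context items")

def retrieve(query: str) -> str:
    return f"doc: results for {query!r}"   # stand-in for a retrieval pipeline

def validate(answer: str) -> bool:
    return "answer" in answer              # stand-in for output validation

def run_agent(goal: str, max_steps: int = 8) -> str:
    context = [goal]
    for _ in range(max_steps):             # plan -> act -> validate -> loop
        step = plan(context)
        if step.action == "retrieve":
            context.append(retrieve(step.payload))   # CPU-side data movement
        elif step.action == "answer" and validate(step.payload):
            return step.payload
    return "gave up after max_steps"

print(run_agent("summarize the quarterly report"))
```

The structural point is that a single user goal fans out into repeated plan, act, and validate iterations, each of which is another inference call plus CPU-side data movement.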
In agentic AI, CPUs are not just “support” actors; they act as the head node for an AI system. They coordinate the control plane, schedule and route work, manage IO, handle networking and storage services, enforce security, and keep the overall system balanced as models, contexts, and tool chains evolve.
Consider a service hosting a large language model (LLM) with hundreds or thousands of concurrent requests. Even when the accelerator is doing the core compute, the CPU is often responsible for request admission control, tokenization and pre-processing, batching and queueing decisions, orchestrating data movement, and coordinating the network and storage paths for model weights and KV cache behavior. With agentic workflows, that CPU-side work expands further, adding tool calling, retrieval pipelines, structured output validation, and multi-step scheduling that runs continuously, not just per-request.
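A simplified sketch of that CPU-side hot path, assuming a generic queue-and-batch design rather than any specific serving framework (the tokenizer and accelerator call are stubs), might look like this:

```python
# Simplified sketch of CPU-side serving work in front of an accelerator:
# admission control, tokenization, and micro-batching. The tokenizer and
# accelerator call are stubs; real systems use a serving framework.
import queue
import threading
import time

MAX_QUEUE = 256        # admission control: apply backpressure past this depth
MAX_BATCH = 32         # batch size decided on the CPU, not the accelerator
MAX_WAIT_S = 0.005     # latency budget for filling a batch

request_q: "queue.Queue[str]" = queue.Queue(maxsize=MAX_QUEUE)

def tokenize(text: str) -> list[int]:
    return [ord(c) for c in text]          # stand-in for a real tokenizer

def run_on_accelerator(batch: list[list[int]]) -> None:
    time.sleep(0.001)                      # stand-in for GPU/NPU execution

def batcher() -> None:
    while True:
        batch = [tokenize(request_q.get())]        # block for the first request
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH and time.monotonic() < deadline:
            try:
                batch.append(tokenize(request_q.get_nowait()))
            except queue.Empty:
                time.sleep(0.0005)                 # brief wait for stragglers
        run_on_accelerator(batch)                  # hand off a full batch

threading.Thread(target=batcher, daemon=True).start()
for i in range(100):                               # simulate incoming traffic
    request_q.put(f"request {i}")
time.sleep(0.1)                                    # let the batcher drain
```

Every decision in this loop – admission, batching depth, latency budget – runs on the CPU, which is why a slow CPU layer shows up directly as smaller batches and idle accelerator time.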
All this means the CPU layer matters more than many teams planned for. If the CPU cannot keep up with orchestration and data movement, accelerators sit idle, and expensive capacity becomes structurally stranded.
Arm momentum is visible in how the converged AI data center is being built
This is where Arm momentum is accelerating. Across the industry’s leading integrated AI systems, Neoverse-based CPUs are being selected for the orchestration layer supporting agentic, inference-heavy fleets, particularly where efficiency, predictable scaling, and broad deployment footprints are required.
Independent testing underscores the impact of a modern CPU foundation in these “AI-adjacent” workloads. In Futurum’s Signal65 benchmarking of AWS Graviton4 (Neoverse) against comparable AMD and Intel EC2 instances, the Neoverse-based Graviton4 instances delivered significantly higher performance and better price/performance across all tested workloads, including generative AI (Llama-3.1-8B), database (Redis), machine learning (XGBoost), and networking (Nginx).
Those results map directly to agentic AI data center realities: LLMs, retrieval layers, caching, web and API tiers, and classical ML components all sit on the hot path of agentic systems, and they scale better when the CPU layer is both fast and efficient.
The newest “rack-scale” AI systems are being designed around a purpose-built accelerator layer and an Arm-based CPU layer for orchestration, data movement, and agentic reasoning. NVIDIA’s portfolio, spanning Grace Hopper and Grace Blackwell systems, pairs NVIDIA GPUs with Grace CPUs built on Neoverse, while the company’s newest rack-scale platform, Vera Rubin NVL72, integrates 72 Rubin GPUs with 36 Arm-powered Vera CPUs in a system designed to drive down inference costs for interactive, deep-reasoning agentic AI.
AWS is making the same system-level move. The AWS Trainium3 UltraServer pairs Trainium3 accelerator chips with AWS Graviton CPUs, reinforcing the “converged” design pattern: matching accelerators with purpose-built, performance-per-watt-efficient CPUs that scale effectively.
Better choice is now a requirement, not a preference
AI systems evolve too fast for static architectures, so providing better choice to customers is a risk-management requirement.
System architects want:
- Platforms that can adapt across hardware generations, workload profiles, and deployment environments; and
- Software portability that reduces friction when systems change.
At the same time, they want to avoid optimizing so tightly around a single vendor path that future decisions become constrained when the model mix shifts, the footprint expands, or new requirements appear. That is especially important in the agentic era, where the “shape” of inference keeps changing: longer contexts, more tool use, more multimodal inputs, and more always-on workloads that reward efficiency and balance over peak-only design.
Arm is designed to preserve that portability while scaling system-level performance. The Arm architecture brings features that matter in modern AI infrastructure, backed by a robust software ecosystem. Arm Compute Subsystems (CSS) provide validated, infrastructure-grade building blocks that accelerate silicon development while still enabling differentiation and preserving choice across partners. Because the architecture is consistent across Arm-based platforms, migrating cloud workloads to Arm is straightforward. Meanwhile, at the software level, the Arm ecosystem helps teams move faster with a consistent foundation across environments and platforms, without rewriting everything that runs on it.
Agentic AI economics are reshaping CPU choices, and Arm Neoverse is the choice of leaders
System architects are increasingly shifting to Arm because it aligns with what purpose-built AI systems are demanding: power efficiency, scalability, and performance per watt. Efficiency matters because power and budgets are hard limits. System balance and CPU performance matter because stranded accelerator utilization is expensive. Consistency matters because AI infrastructure changes quickly and increasingly spans multiple environments.
In the converged, agentic AI data center, continuous inferencing turns those priorities into day-one requirements. Agentic systems don’t just need accelerators that can generate tokens; they need CPU-led orchestration that can sustain real utilization across networking, storage, scheduling, and security – continuously, efficiently, and at scale.
That’s the momentum behind Arm right now, with Neoverse acting as the CPU foundation for the agentic era. It is the head-node layer that keeps AI systems coherent, efficient, and ready for the future.