AWS and Cerebras collaboration aims to set a new standard for AI inference speed and performance in the cloud

Amazon Web Services, Inc. (AWS), and Cerebras Systems today announced a collaboration that will, in the coming months, deliver the fastest AI inference solutions available for generative AI applications and LLM workloads. The solution, to be deployed on Amazon Bedrock in AWS data centers, combines AWS Trainium-powered servers, Cerebras CS-3 systems, and Elastic Fabric Adapter (EFA) networking. Later this year, AWS will also offer leading open-source LLMs and Amazon Nova using Cerebras hardware.
“Inference is where AI delivers real value to customers, but speed remains a critical bottleneck for demanding workloads like real-time coding assistance and interactive applications,” said David Brown, Vice President, Compute & ML Services, AWS. “What we’re building with Cerebras solves that: by splitting the inference workload across Trainium and CS-3, and connecting them with Amazon’s Elastic Fabric Adapter, each system does what it’s best at. The result will be inference that’s an order of magnitude faster and higher performance than what’s available today.”
“Partnering with AWS to build a disaggregated inference solution will bring the fastest inference to a global customer base,” said Andrew Feldman, Founder and CEO of Cerebras Systems. “Every enterprise around the world will be able to benefit from blisteringly fast inference within their existing AWS environment.”

How it works: Inference disaggregation

The Trainium + CS-3 solution enables “inference disaggregation,” a technique that separates AI inference into two stages: prompt processing, or “prefill,” and output generation, or “decode.” The two stages have profoundly different computational characteristics. Prefill is inherently parallel, computationally intensive, and requires only moderate memory bandwidth. Decode, by contrast, is inherently serial, computationally light, and memory-bandwidth intensive. Because each output token must be generated sequentially, decode typically accounts for the majority of inference time in generative workloads.
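
To make the contrast concrete, here is a minimal sketch in Python with NumPy (illustrative only, not AWS or Cerebras code; the toy single-head attention layer and the dimensions are assumptions). Prefill processes every prompt position in one large parallel pass and fills the key/value (KV) cache; decode then produces one position per step and must re-read the growing cache each time, which is why it is serial and memory-bandwidth bound.

import numpy as np

D = 64                                   # toy hidden size (assumption)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def attention(q, K, V):
    # Scaled dot-product attention of queries against the cached keys/values.
    scores = q @ K.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def prefill(prompt_embeddings):
    # Prefill: one large, parallel pass over every prompt position (compute-bound).
    K = prompt_embeddings @ Wk
    V = prompt_embeddings @ Wv
    q = prompt_embeddings @ Wq
    out = attention(q, K, V)
    return out[-1], K, V                 # last position seeds the first decode step

def decode_step(x, K, V):
    # Decode: one new position per call; little compute, but the whole cache is read.
    K = np.vstack([K, x @ Wk])           # KV cache grows by one row per step
    V = np.vstack([V, x @ Wv])
    out = attention((x @ Wq)[None, :], K, V)
    return out[0], K, V

prompt = rng.standard_normal((512, D))   # 512 prompt tokens handled in parallel
hidden, K, V = prefill(prompt)
for _ in range(16):                      # 16 output tokens, strictly one at a time
    hidden, K, V = decode_step(hidden, K, V)
print("KV cache length after decode:", K.shape[0])   # 512 prompt + 16 generated

A production stack adds tokenization, sampling, and far larger models, but the compute pattern is the same: prefill is a batched matrix-multiply problem, while decode is a cache-streaming problem.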
Because the two stages present different computational challenges, they benefit from different compute architectures, joined by low-latency, high-bandwidth EFA networking. By disaggregating the inference pipeline, with Trainium optimized for prefill and the Cerebras CS-3 optimized for decode, each stage runs on hardware specialized for its demands.
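
As a rough illustration of how that hand-off might be orchestrated, the sketch below separates a prefill worker from a decode worker and passes the KV cache between them. The class names, placeholder arithmetic, and in-process hand-off are assumptions for illustration; they do not reflect actual AWS or Cerebras APIs, and in the announced design the transfer would cross an EFA link between Trainium and CS-3 hardware.

from dataclasses import dataclass
import numpy as np

@dataclass
class KVCache:
    keys: np.ndarray        # cached keys for every processed prompt position
    values: np.ndarray      # cached values for every processed prompt position

class PrefillWorker:
    # Stands in for the prompt-processing stage (Trainium in the announced design).
    def run(self, prompt_embeddings: np.ndarray) -> KVCache:
        # One parallel, compute-heavy pass over the full prompt (placeholder math).
        return KVCache(keys=prompt_embeddings * 0.5, values=prompt_embeddings * 0.25)

class DecodeWorker:
    # Stands in for the token-generation stage (Cerebras CS-3 in the announced design).
    def run(self, cache: KVCache, max_new_tokens: int) -> list[int]:
        tokens = []
        for step in range(max_new_tokens):
            # Each step scans the whole cache: light arithmetic, heavy memory traffic.
            score = float(cache.keys.sum() + cache.values.sum()) + step
            tokens.append(int(score) % 50_000)     # stand-in for real sampling
        return tokens

def serve(prompt_embeddings: np.ndarray, max_new_tokens: int = 8) -> list[int]:
    cache = PrefillWorker().run(prompt_embeddings)
    # In the disaggregated design the KV cache would cross a low-latency EFA link
    # here; this sketch simply hands it over in process.
    return DecodeWorker().run(cache, max_new_tokens)

print(serve(np.ones((128, 64))))

In prefill/decode disaggregation generally, the state the two stages must exchange is the KV cache produced during prefill, which is why a low-latency, high-bandwidth interconnect such as EFA is central to the approach.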
Built on the AWS Nitro System—the foundation of AWS’s secure, high-performance cloud infrastructure—the new solution will ensure that Cerebras CS-3 systems and Trainium-powered instances operate with the same security, isolation, and operational consistency customers expect from AWS.

AWS Trainium for prefill and Cerebras CS-3 for decode

Trainium is Amazon’s purpose-built AI chip, designed to deliver scalable performance and cost efficiency for training and inference across a broad range of generative AI workloads. Two of the world’s leading AI labs—Anthropic and OpenAI—are committed to Trainium. Anthropic has named AWS its primary training partner and is using Trainium to train and deploy its models, while OpenAI will consume 2 gigawatts of Trainium capacity through AWS infrastructure to support demand for Stateful Runtime Environment, frontier models, and other advanced workloads. Since its recent release, Trainium3 has seen strong customer adoption, with organizations across industries committing significant capacity.
Cerebras’ CS-3 is the world’s fastest AI inference system. It delivers thousands of times greater memory bandwidth than the fastest GPU. As reasoning models now account for a majority of inference compute and generate more tokens per request as they “think” through problems, the need to accelerate this portion of the workflow has grown accordingly. OpenAI, Cognition, Mistral, and others use Cerebras to accelerate their most demanding workloads, especially agentic coding, where developer productivity is constrained by inference speed.
In the disaggregated solution, the CS-3 will be dedicated entirely to decode acceleration, enabling dramatically higher capacity for fast output tokens. With Trainium handling prefill, the CS-3 handling decode, and high-speed EFA networking connecting them, each processor will deliver maximum token throughput for its part of the workload.
Disclaimer: The above press release has been provided by Amazon. CXO Digital Pulse holds no responsibility for its content in any manner.