WSJ : Amazon Announces Inference Chips Deal With Cerebras

Amazon Web Services says the partnership will allow it to offer lightning-fast inference computing

Amazon Web Services will deploy Cerebras-designed Wafer-Scale Engine processors in its data centers for AI inference functions.
The multiyear partnership will combine Cerebras chips with AWS’s Trainium chips to improve inference computing solutions.
Cerebras claims its chips process AI model decode tasks up to 25 times faster than Nvidia’s GPUs, offering a premium service.

Amazon Web Services plans to deploy processors designed by Cerebras inside its data centers, the latest vote of confidence in the startup, which specializes in chips that power artificial-intelligence models.

Under the multiyear partnership, which the companies announced Friday, AWS will use Cerebras’s chip, called the Wafer-Scale Engine, to help power so-called inference functions, which allow AI models to respond to user queries.

The companies declined to disclose the financial terms of the agreement.

The deal underscores a major shift in the market for computing power. The AI industry is increasingly moving away from model training and toward inference. Companies that design AI tools and agents are realizing that graphics processing units, or GPUs, while fast and powerful for training, aren’t ideal for inference workloads that require more speed. Many of them are seeking to diversify their supplier bases as they rapidly expand and gain millions of new users for their tools.

AWS, the largest cloud service provider, has relied heavily on chips designed by its in-house semiconductor business, known as Annapurna Labs, to power its data centers. These chips, known as Trainium, are roughly equivalent to the GPUs made by Nvidia, Advanced Micro Devices and other large chip firms.

In January, ChatGPT-maker OpenAI signed a pact worth more than $10 billion to use Cerebras chips to power its popular chatbot, The Wall Street Journal reported. The deal gave renewed prominence to Cerebras, a startup backed by a host of blue-chip financial firms including Fidelity Management, Atreides Management, Benchmark, Tiger Global and Coatue, but which had previously struggled at times to raise money.

The firm had filed for an initial public offering in September 2024, but withdrew its filing about a year later. In February, Cerebras said it had raised $1 billion in a new funding round, bringing its total fundraising to $2.6 billion and its post-money valuation to approximately $23 billion.

OpenAI is seeking to deploy up to 750 megawatts of computing power using Cerebras’s chips. AWS plans to combine Cerebras’s chips with its own Trainium chips in its data centers to improve its solutions for inference computing.

Cerebras bills its chips as a “hyper-fast inference solution” and says they can process the complicated tasks known as “decode”—or the phase of inference computing in which an AI model spits out a response to a user query—up to 25 times faster than Nvidia’s GPUs.

“More people are using AI, using it more often and using it to solve harder problems,” said Cerebras Chief Executive Andrew Feldman in an interview. “This puts a Cerebras-Trainium solution in the largest cloud. It gives us access to a ton of customers.”

The deal represents a fresh challenge to Nvidia, which has seen mounting competition from designers of custom processors and which faces pressure to offer customers new products that are capable of running AI models faster and at a lower cost. In December, Nvidia signed a $20 billion licensing deal with the chip startup Groq and next week plans to unveil a new processing system tailored for inference using Groq’s technology.

AWS, a major unit of Amazon.com, and Cerebras said that the partnership would offer some of the fastest inference computing available and would be priced as a premium service.

“Our job is to push the speed and lower the price,” said Nafea Bshara, co-founder of Annapurna Labs and a vice president and distinguished engineer at AWS. The cloud-computing firm will still offer slower computing services using just its Trainium processors at a lower price point as well.

“If you want slow inference, there will be cheaper ways to go,” Feldman said. “But if you want fast tokens, if speed matters to you, if you’re doing coding or agentic work, not only are we the absolute fastest, but we intend to set the bar. We’re in this to win it.”