The Information : Amazon’s Nvidia Alternative Starts Winning Over AI Developers

The Takeaway
  • Scarcity of Nvidia GPUs has made Amazon’s pitch more attractive
  • Amazon worked closely with Anthropic on improving Trainium software
  • Developers say Trainium documentation and support have improved recently

Amazon’s yearslong effort to build a serious alternative to Nvidia’s dominant AI chips is starting to gain traction.

Anthropic and OpenAI, which have struck multibillion-dollar investment and infrastructure deals with Amazon, have already committed to renting large amounts of current and future Trainium capacity. Now, recent software improvements are prompting smaller developers to consider moving more workloads to Trainium, half a dozen people who use or work with the chips said.

That includes Daniel Svonava, CEO of Superlinked, an infrastructure firm that helps companies run AI models on rented infrastructure. He said Amazon’s pitch on Trainium, including potential cost savings by switching to the chip, only recently started becoming more compelling.

“Our response has always been the lack of software support being a barrier,” Svonava said. “That’s the thing that changed in the last couple months. That barrier has been removed.”

The scarcity of Nvidia chips has also made Amazon’s pitch more attractive, with sales reps telling the startup they have limited availability on the latest graphics processing units. At the same time, Amazon has indicated it has more Trainium capacity available and is willing to be flexible on price, he said. Amazon has given Superlinked $200,000 worth of AWS credits, which it is using to test Trainium.

Bojan Jakimovski, machine-learning lead at Loka, which helps businesses train their own AI models, said interest in Trainium started spiking in the past couple of months in part because of issues securing Nvidia GPUs. With Trainium, clients know “they will have a reserved spot when they need to develop,” Jakimovski said.

One Loka client switched its inference workloads to second-generation Trainium chips earlier this year after tests showed doing so could cut costs by up to 35% compared with Nvidia’s H100, Jakimovski said. For training a large language model, though, he would still recommend Nvidia.

The new interest comes as Amazon is betting that Trainium can improve the economics of its AI cloud business. In a January interview with The Information, CEO Andy Jassy said that while Amazon plans to continue buying Nvidia chips, “if you’re building a big inference business” that charges less and has sustainable margins, “you’re strategically disadvantaged if you don’t have your own custom silicon.”

Last month, Jassy said Amazon’s custom silicon business, including Trainium and Graviton, has reached a more than $20 billion annualized run rate, or roughly $50 billion if measured as a stand-alone chip seller. That $20 billion reflects revenue from customers using Trainium and Graviton directly through Amazon’s EC2 service, an Amazon spokesperson said. It excludes offerings such as Amazon’s Bedrock, which lets customers access AI models, and internal Amazon workloads.

“Customers are choosing Trainium because of years of architectural decisions that compound at scale: a chip designed in partnership with leading AI labs to be more efficient in how it computes, communicates and operates,” the spokesperson said. “Today, Trainium2 is being chosen by the leading AI labs in the world for their most demanding workloads, and Trainium3 delivers a 30 to 40 percent price performance improvement over Trainium 2.”

The Anthropic Test

Getting to this point took years of software work and close collaboration with Anthropic. Nvidia’s advantage was as much in software as in hardware: developers had spent years building around CUDA, while Amazon had to make its Trainium software, called Neuron, easy enough to justify switching.

Amazon announced Trainium in 2020 through its Annapurna Labs unit, initially pitching it as a cheaper way to train machine-learning models on AWS. When the first-generation chips launched, early internal users included Amazon’s search teams, which helped shape the chip’s development, according to someone with knowledge of the matter.

But when Amazon staff began ramping up generative AI products in late 2022, some teams did not use Trainium broadly, and Amazon’s Nova large language models were first trained on Nvidia GPUs, according to a former employee.

Amazon announced in 2023 that Anthropic would use Trainium and Inferentia to train and run future models, and by the following year had committed $8 billion to Anthropic. The two companies also teamed up to make Trainium faster and more efficient.

Anthropic and Amazon engineers worked closely to optimize Trainium for Anthropic’s models, talking frequently, according to Carlos Escapa, a former AWS executive who worked on selling Anthropic models. Anthropic and Amazon made software improvements that could also benefit other customers.

“The collaboration between Anthropic and AWS on the NKI [Neuron Kernel Interface] has been very, very deep,” Escapa said, referring to Amazon software that lets developers fine-tune how models run on Trainium chips. “And some of these features that have been developed for Anthropic have also become very useful for other companies.”

Some of the work involved software changes that helped Trainium perform more processes simultaneously, Escapa said. Anthropic co-founder Tom Brown has publicly described the broader effort as “a game of Tetris,” where a tight chip architecture makes models cheaper and faster.

By the end of 2024, Amazon had launched its second-generation Trainium chip broadly and announced Project Rainier, a large Trainium cluster for Anthropic. Inside Amazon, Trainium use began picking up in some areas: Nova started using Trainium in 2024 and has ramped up since then, particularly for pretraining, a former employee said.

Bedrock, which offers access to Anthropic and other models, initially relied on GPUs, according to two people with knowledge of the product. One of the people said some Bedrock workloads in 2024 required roughly twice as many Trainium chips as Nvidia chips to do the same work.
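The chip-count comparison implies a simple break-even rule, sketched below in Python. The function names and the example price ratio are hypothetical and used only for illustration. No actual AWS or Nvidia pricing is assumed, and the sketch ignores factors such as networking, power and engineering time.

```python
def relative_cost(chip_count_ratio: float, price_ratio: float) -> float:
    """Total cost of option A relative to option B for the same workload.

    chip_count_ratio: chips of A needed per chip of B (e.g. 2.0 means twice as many).
    price_ratio: price of one A chip divided by the price of one B chip.
    A result below 1.0 means option A is cheaper overall.
    """
    return chip_count_ratio * price_ratio


def break_even_price_ratio(chip_count_ratio: float) -> float:
    """Per-chip price ratio at which both options cost the same."""
    if chip_count_ratio <= 0:
        raise ValueError("chip_count_ratio must be positive")
    return 1.0 / chip_count_ratio


# If a workload needs twice as many chips, each chip must cost
# less than half as much for the switch to save money.
threshold = break_even_price_ratio(2.0)

# Hypothetical example: chips priced at 40% of the alternative.
example = relative_cost(2.0, 0.4)
```

Under these assumptions, a workload that needs twice the chips only comes out ahead if the per-chip price is below half that of the alternative.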

Some members of the Trainium team were frustrated that the Bedrock team was not adopting Trainium faster, one of the people said, but over time, Bedrock staff became more convinced that the chips were competitive.

Amazon said it prioritized limited Trainium capacity for external customers such as Anthropic as demand accelerated. The company also said Bedrock used Trainium for models and tasks where the chips offered better cost and performance, while relying on GPUs for other models to keep Bedrock’s selection broad.

As the software matured, Amazon said, more Bedrock workloads moved to Trainium, which now runs the majority of Bedrock inference across more than 125,000 customers. Amazon also said it is planning to train its largest internal models on Trainium going forward.

Software Catch-Up

Outside Amazon, developers had their own frustrations with Trainium. Julien Simon, an AI operating partner at a private equity firm, first ran into issues while working at Hugging Face, where he spent three years. Hugging Face had worked with Trainium chips, and Amazon was sometimes slow to support newer models on the startup’s open-source platform, Simon said.

“You would reach out to [Amazon] saying, ‘We need you to support this slightly newer model that came out on Hugging Face last week,’ and the answer would be, ‘Yeah, maybe in six months,’” Simon said. Amazon says such delays were addressed with later open-source software integrations, and a current product director at Hugging Face, Jeff Boudier, said the company has been happy with the relationship.

Simon said he ran into similar issues at open-source AI developer Arcee, where from mid-2024 to mid-2025 he tried training and deploying models on Trainium chips. “Anything that works out of the box on [Nvidia’s software] requires custom engineering and custom development and back-and-forth with the Amazon teams,” he said.

Arcee’s CEO, Mark McQuade, said the company stopped trying to use first-generation Trainium chips in early 2025 and hasn’t tried newer versions.

Trainium also initially struggled with some capabilities, including support for dynamic shapes, which help models handle inputs that vary in size or format. Without that support, developers had to do more manual work to adapt models for different requests, said Jakimovski, who started testing the technology in late 2024.
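One common form of that manual work on hardware that expects fixed input sizes is padding each request up to one of a few preset lengths, so that only a handful of model variants need to be compiled. The Python sketch below shows the general technique. It is not Amazon’s Neuron API, and the bucket sizes, padding ID and function names are assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical bucket lengths; each would map to one precompiled model variant.
BUCKETS = [128, 256, 512, 1024]
PAD_ID = 0  # assumed padding token ID


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket length that fits an input of `length` tokens."""
    i = bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(
            f"input of length {length} exceeds largest bucket {buckets[-1]}"
        )
    return buckets[i]


def pad_to_bucket(token_ids, buckets=BUCKETS, pad_id=PAD_ID):
    """Pad a token list to its bucket length.

    Returns (padded_ids, mask), where mask is 1 for real tokens and 0 for padding.
    """
    target = pick_bucket(len(token_ids), buckets)
    n_pad = target - len(token_ids)
    padded = list(token_ids) + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad
    return padded, mask


# A 3-token request is padded to the 128-token bucket.
padded, mask = pad_to_bucket([5, 6, 7])
```

The trade-off is wasted compute on padding and a hard cap at the largest bucket, which is part of why native dynamic-shape support matters to developers.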

Kevin Gomes, a graduate student at Cornell University conducting AI research with Trainium, said he also found the chips difficult to use at first because the documentation was lacking. “It’s not very well documented, so you have no idea how to fix it,” Gomes said.

In recent months, however, several customers said Amazon has made Trainium much easier to use by improving documentation and support and making the chips work better with popular open-source tools.

That included a native PyTorch integration Amazon unveiled in December, an important step because PyTorch is many developers’ default machine-learning framework and has long worked best with Nvidia. Before the integration, developers often had to write code in PyTorch and then adapt it to Amazon’s Neuron software.

Amazon also fixed Trainium’s dynamic shapes issue and made its software easier to customize, helping models run faster, Jakimovski said. “They started to listen to developers,” he said. “I didn’t have that feeling two years ago.”

The Next Test

In February, Amazon said OpenAI would take around 2 gigawatts of Trainium capacity, including Trainium3 and the upcoming Trainium4, alongside an initial $15 billion Amazon investment.

Amazon has also paired Trainium with Cerebras, another OpenAI chip partner. OpenAI announced in January that it would use Cerebras systems for 750 megawatts of high-speed inference compute. Two months later, Amazon said it would deploy Cerebras systems in AWS data centers alongside Trainium to deliver faster inference through Bedrock.

Amazon later expanded its Anthropic partnership, with Anthropic committing to spend more than $100 billion on AWS over the next decade, including on Trainium capacity. Amazon also said it would invest another $5 billion in Anthropic, with up to $20 billion more tied to future milestones.

Late last month, Amazon said Trainium2 is largely sold out, Trainium3 is nearly fully subscribed and much of Trainium4, which is about 18 months from broad availability, is reserved. The Amazon spokesperson said Trainium’s customer base “extends well beyond OpenAI and Anthropic,” citing examples including Uber and Decart.

Still, Amazon has not detailed how much of that demand comes from a small number of large customers, and it’s unclear how much of a market exists beyond those AI giants and Amazon’s own services. Many AI-heavy companies buy model access through application programming interfaces or Amazon’s Bedrock instead of renting chips directly, Svonava said, and most companies don’t want to actively evaluate the underlying chips.

Trainium also hasn’t fully displaced Nvidia inside Amazon. Some of the models underpinning Amazon’s shopping AI still use Nvidia chips exclusively, someone with direct knowledge of the product said. Anthropic is still securing Nvidia capacity too, recently announcing a deal with SpaceX to access more than 220,000 Nvidia chips within the coming month.