The Information: Inside OpenAI’s Rocky Path to GPT-5

The troubles OpenAI has faced in developing GPT-5 point to slowing AI progress across the industry. Researchers believe advances in reinforcement learning will help to overcome that.

The Takeaway
• GPT-5 will show real improvements over its predecessors, but they won’t be comparable to leaps in performance between earlier GPT-branded models
• OpenAI confronted a series of technical problems that imperiled o3 and other models this year
• A disagreement between research chief Mark Chen and a deputy spilled into view on Slack

OpenAI made waves across the industry in December when it published the results from its tests of artificial intelligence that performs better on tasks when it gets more time and computing power to process them. The results implied ChatGPT customers were about to be blown away by what the new AI could do.

But the euphoria was short-lived.

When OpenAI researchers turned the new AI into a chat-based version called o3 that could respond to instructions from ChatGPT customers, the performance gains the company had published largely vanished, according to two people involved in its development.

The episode was an example of the technical challenges OpenAI has faced through much of this year, threatening to slow its pace of AI advances, if not its blockbuster ChatGPT business.

But its researchers have found ways to keep AI progress going through techniques that have gained traction across the industry.

OpenAI is now nearing the release of GPT-5, its next flagship AI model, which improves upon existing models’ ability to complete practical computer programming and math tasks, among other things, according to people who have used it or are familiar with the company’s internal evaluations.

For instance, when the new model codes applications, it’s better at adding features that make them easier to use and more aesthetically pleasing, one of those people said.

GPT-5 is also better than its predecessors at powering AI agents that handle complex tasks with minimal human oversight, this person said. For instance, it can follow complicated instructions, such as a list of rules that determine when an automated customer support agent should grant a refund.

Previous models needed to see several examples of tricky customer cases, known as edge cases, before they could handle such refunds, this person said.

The improvements won’t be comparable to the leaps in performance of earlier GPT-branded models, such as the improvements between GPT-3 in 2020 and GPT-4 in 2023, one of the people said. And the slowing performance gains OpenAI has experienced over the past 12 months suggest it may be hard for the company to surge ahead of its biggest rivals, at least in terms of AI capabilities.

But OpenAI’s current models are generating so much commercial value from powering chatbots and other applications that any improvements, even incremental, will increase customer demand. They could also give new investors the confidence to fund the company’s plan to burn through $45 billion in the next three and a half years as it rents expensive servers for developing and running its products.

Coding Priorities

The latest gains also help explain why OpenAI executives in recent weeks told some investors they believed the company could reach “GPT-8.”

The remarks are in line with CEO Sam Altman’s public comments that, using existing technological know-how, OpenAI can attain the goal of creating AI whose capabilities are near or on par with those of the smartest humans. This technology is otherwise known as artificial general intelligence.

While it’s far from being AGI, the upcoming GPT-5 model may have other attractive attributes besides better coding and reasoning. Some leaders at Microsoft, which has exclusive rights to OpenAI’s intellectual property, have told staff their tests of the model show it produces higher-quality coding and other text-based answers without consuming a lot more computing power, according to a Microsoft employee with knowledge of the situation.

That’s partly because it is better than prior models at figuring out which tasks require relatively more or less computing resources, this person said.

Improving AI’s ability to automate coding tasks became a priority at OpenAI after archrival Anthropic last year took the lead in developing and selling such models to software developers and coding assistants like Cursor, according to OpenAI’s internal evaluations. OpenAI staff believe automated coding isn’t just important to the company’s business—it’s also critical to automating the work of the AI researchers themselves.

Reorg Strains

Progress at OpenAI hasn’t been a straight line, as both its researchers and its managers have faced new strains this year.

Some senior researchers have resisted the idea of giving away their inventions to Microsoft, OpenAI’s largest outside shareholder, despite the software firm’s contractual rights through 2030.

The two companies have a tight financial relationship but have feuded over the terms of their deal, with each side seeking concessions from the other as OpenAI tries to restructure its for-profit arm so it can eventually go public.

Discussions between Microsoft and OpenAI have been moving in a positive direction, according to two people who have spoken to negotiators. Many bargaining points are still up in the air, though others appear to be more settled, such as the roughly 33% equity stake Microsoft is likely to get in OpenAI’s for-profit arm as part of the restructuring, according to one of those people.

More recently, Meta Platforms has hired more than a dozen OpenAI researchers, some of whom had been involved in the techniques the company has been using lately to improve its technology. Meta won them over by offering compensation packages worthy of the highest-paid soccer stars.

The departures and the staff reorganizations in response to them have weighed on senior OpenAI staff. Last week, Jerry Tworek, a vice president of research at OpenAI, complained about a team change to his boss, research chief Mark Chen, in a message on the company’s internal Slack app that was visible to many other colleagues.

Tworek said he needed to take a week off to reassess things, but he ended up not taking the time off.

Orion’s Star Falls

The company’s business progress has masked some internal concerns about its ability to keep improving its AI and stay ahead of other well-capitalized rivals such as Google, Elon Musk’s xAI and Anthropic.

Problems had been brewing for months before the current year began. For much of the second half of 2024, OpenAI was developing a model known internally as Orion and intended to become GPT-5. According to people who worked on it, Orion was supposed to offer a big step up in performance compared to the current flagship, GPT-4o, released in May that year.

But the Orion effort failed to produce a better model, and the company instead released it as GPT-4.5 in February this year. It has since faded from relevance.

Part of the failure had to do with the limits of pre-training, the first stage of developing a model, in which it processes data from the web and other sources so it can draw connections between concepts.

Not only was OpenAI facing a dwindling supply of high-quality web data, but researchers also found the tweaks they made to the model worked when it was smaller in size but didn’t work as it grew, according to two people with knowledge of the issue.

More Nvidia Chips

As recently as June, the technical problems meant none of OpenAI’s models under development seemed good enough to be labeled GPT-5, according to a person who has worked on it.

OpenAI researchers faced other problems.

Last year, the company also developed reasoning models, which performed better if they got more computing power to process answers. The models stemmed from a breakthrough in late 2023 called Q* that had sent shockwaves among its researchers because it was able to solve math problems it hadn’t seen before. By 2024, reasoning models appeared to be helping the company overcome a slowdown in performance gains during pre-training.

Last fall, OpenAI turned the first major reasoning model into o1, a version it could sell to application developers and use to power conversations inside ChatGPT.

The launch gave OpenAI new clout within the AI field and set the stage for the development of AI agents that relied on reasoning models to handle tasks with minimal human supervision.

Before the end of 2024, OpenAI created the next reasoning model, o3, with the same underlying large language model, GPT-4o, it had used as the foundation of o1, according to a person who was involved in their development.

Despite their shared lineage, the parent model of o3—also known as a teacher model—made extraordinary gains compared to the parent model of o1 in understanding a variety of scientific and other domains, this person said.

One reason for the improvement came from OpenAI’s decision to develop the parent model of o3 with a lot more Nvidia chip servers, essentially giving it more processing power to understand difficult concepts, said two people who were involved. Another reason was that researchers gave it the ability to search the web or pull from code repositories, which also helped it improve over the parent model of o1, one of these people said.

The parent model for o3, similar to the o1 parent, also benefited from reinforcement learning, in which human experts come up with tough questions and answers in fields like biology, software engineering and medicine and ask the model to come up with thousands of its own responses to those questions.

OpenAI then trained the model on the responses that arrived at the same answers as the human experts. (The AI-generated responses are also known as synthetic data.)
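
The filtering step described above can be illustrated with a minimal Python sketch. It is not OpenAI's actual pipeline: the function names, the data layout and the toy sampler standing in for a model are all hypothetical, and the exact-match check on the final answer is an assumption made for illustration.

```python
from typing import Callable, Dict, List


def filter_matching_responses(
    question: str,
    expert_answer: str,
    sample_response: Callable[[str, int], Dict[str, str]],
    num_samples: int,
) -> List[Dict[str, str]]:
    """Keep only sampled responses whose final answer matches the expert's."""
    kept = []
    for i in range(num_samples):
        # Hypothetical model call: returns a dict with "reasoning" and "answer".
        response = sample_response(question, i)
        # Assumption: a response counts as correct on a case-insensitive exact match.
        if response["answer"].strip().lower() == expert_answer.strip().lower():
            kept.append({"question": question, **response})
    return kept


def toy_sampler(question: str, seed: int) -> Dict[str, str]:
    """Stand-in for a model: answers correctly on every third sample."""
    answer = "4" if seed % 3 == 0 else "5"
    return {"reasoning": f"attempt {seed}", "answer": answer}


# The surviving responses are the synthetic data a model would then be trained on.
synthetic_data = filter_matching_responses("What is 2 + 2?", "4", toy_sampler, 9)
print(len(synthetic_data))  # prints 3
```

In this sketch, nine sampled responses are reduced to the three that reach the expert's answer. A real system would sample thousands of responses per question and would likely need a more forgiving way to compare answers.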

Gibberish Reasoning

OpenAI generated headlines around the world and viral hype on social media when it publicly shared the results from special tests of the model’s strengths. But then reality set in.

When OpenAI converted the o3 parent model to a chat version of the model—also known as a student model—that allowed people to ask it anything, its gains degraded significantly to the point where it wasn’t performing much better than o1, the people who were involved in its development said.

The same problem occurred when OpenAI created a version of the model that companies could purchase through an application programming interface, they said.

One reason for this has to do with the unique way the model understands concepts, which can be different from how humans communicate, one of these people said. Creating a chat-based version effectively dumbs down the raw, genius-level model because it’s forced to speak in human language rather than its own, this person said. (The gibberish that reasoning models sometimes show in ChatGPT as they “think” about how to solve a problem reflects some of these communication differences.)