The Information: An OpenAI Math Test Raises Eyebrows

When OpenAI previewed o3, its latest reasoning model, in December, one of the ways the company demonstrated its strength was by showing off its 25.2% score on FrontierMath, a test of AI models’ math abilities that Epoch AI developed. (You might remember Epoch, the nonprofit research organization, from their work on AI scaling laws.)

That’s not the kind of grade I would have bragged about after taking a math test, but the problems in FrontierMath are extremely hard. Before the o3 announcement, the top scores were below 2%. But over the past weekend, it came to light that OpenAI had funded the development of the test and had access to the problems and their solutions, sparking skepticism about o3’s score, along with frustration from AI observers.

“It came off as being a totally independent benchmark that was going to be equal to all firms,” said Ozzie Gooen, executive director of the Quantified Uncertainty Research Institute, which develops Squiggle AI, a probabilistic software tool. “A lot of the frustration comes out from people thinking it was an independent thing while it was actually a proprietary set.”

Mathematicians who contributed to the project also expressed concern. “I try to evaluate all my research projects from the point of view of ‘do they contribute at least as much to safety as to capabilities’” of AI models, said one mathematician who contributed. They figured that Epoch had an industry partner, and they were not surprised to learn it was OpenAI, but the industry connection made the research seem less valuable, they said. “Is this even positive for humanity? It’s very unclear,” they said.

For its part, OpenAI said through a spokesperson that it did not use the benchmark to train its models, which would have been the equivalent of snagging a pop quiz from the teacher’s drawer to study in advance. But even without directly training its models on the solutions, access to FrontierMath could help OpenAI improve its models’ performance by making tweaks and checking whether their scores improved.

Another reason to moderate expectations for o3 is that 25% of the problems in FrontierMath are “easy,” meaning a high school math olympiad competitor could solve them. It’s possible this category made up the bulk of the problems that o3 was able to solve. That would still be impressive, but not new territory for AI.

Epoch disclosed OpenAI’s involvement on December 20, the day OpenAI announced o3, and revised the FrontierMath paper to credit the company. Epoch’s contract prevented it from disclosing OpenAI’s role until around that time (this is not the first time OpenAI has faced concerns about restrictive NDAs).

Math Is Hard

Because AI models were already acing existing benchmarking tests, Epoch paid mathematicians $300 to $1,000 to come up with hundreds of new challenging math problems. Here’s one of the “high-medium difficulty” problems that those mathematicians came up with: Construct a degree 19 polynomial p(x) ∈ ℂ[x] such that X := {p(x) = p(y)} ⊂ ℙ¹ × ℙ¹ has at least 3 (but not all linear) irreducible components over ℂ. Choose p(x) to be odd, monic, have real coefficients and linear coefficient -19 and calculate p(19).

In case you’re curious, the answer is 1876572071974094803391179 (duh).
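For readers who want to check that number, here is a minimal Python sketch. It rests on an assumption the problem statement above does not spell out: that the intended polynomial is the degree-19 Dickson polynomial D_19(x) = 2·T_19(x/2), a rescaled Chebyshev polynomial. Chebyshev-type polynomials are a standard example where p(x) − p(y) splits into many factors, and this one is odd, monic, real and has linear coefficient -19. The function names are made up for illustration.

```python
def dickson_coeffs(n):
    """Coefficients (lowest degree first) of D_n(x), where
    D_0 = 2, D_1 = x, and D_k = x * D_{k-1} - D_{k-2}.
    Assumption: this family is the intended answer; the article does not say so."""
    prev, cur = [2], [0, 1]
    if n == 0:
        return prev
    for _ in range(n - 1):
        shifted = [0] + cur                               # multiply cur by x
        padded = prev + [0] * (len(shifted) - len(prev))  # align lengths
        prev, cur = cur, [a - b for a, b in zip(shifted, padded)]
    return cur


def evaluate(coeffs, x):
    """Horner evaluation with exact Python integers."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result


p = dickson_coeffs(19)

# Check the constraints stated in the problem.
is_monic = p[19] == 1
is_odd = all(c == 0 for c in p[0::2])   # no even-degree terms
linear_coefficient = p[1]

value = evaluate(p, 19)
print(is_monic, is_odd, linear_coefficient)
print(value)
```

Under that assumption, the printed value is the same 25-digit number quoted above. The sketch only checks the arithmetic; it does not prove the claim about irreducible components.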

The full benchmark will eventually contain 300 problems, said Elliot Glazer, who leads the FrontierMath project. Epoch sent OpenAI the first 200 by early December and expects to finish collecting the last 100 within a month or so. At that point, Epoch will randomly choose 50 of those 100 problems to form a “hold-out set” that Epoch will withhold from OpenAI so it can’t train its models on them.
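Here is a rough sketch of how such a split might be drawn. Epoch has not described its actual procedure, so the problem IDs (1 through 300), the seed and the function name are all hypothetical.

```python
import random


def split_hold_out(problem_ids, candidates, hold_out_size, seed):
    """Randomly reserve `hold_out_size` problems from `candidates`.
    Returns (shared, hold_out): what the funder sees, and what stays private.
    Hypothetical helper, not Epoch's actual process."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    hold_out = set(rng.sample(sorted(candidates), hold_out_size))
    shared = [pid for pid in problem_ids if pid not in hold_out]
    return shared, sorted(hold_out)


all_ids = list(range(1, 301))       # hypothetical IDs for 300 problems
last_batch = list(range(201, 301))  # the final 100, not yet sent to OpenAI

shared, hold_out = split_hold_out(all_ids, last_batch, 50, seed=2025)
print(len(shared), len(hold_out))
```

Drawing only from the final batch matters here: the first 200 problems have already been shared, so they can no longer serve as unseen test questions.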

Going forward, all benchmarks funded by industry players like OpenAI should retain a hold-out set to preserve their ability to independently evaluate models from different companies, advised Glazer.

“We made a mistake in not being more transparent about OpenAI's involvement,” said Epoch’s co-founder and Associate Director Tamay Besiroglu in a statement. “We should have pushed harder for the ability to be transparent about this partnership from the start, particularly with the mathematicians creating the problems.”

OpenAI has already arranged to buy 30 to 50 more problems from Epoch over the next three months, at $12,000 each, said a person who attended a recent talk by an Epoch employee. Epoch is paying contributors $7,500 for each of these extremely hard “tier 4” problems, which are intended to be so hard that solving all of them requires the skills of a full university math department, said Glazer.