A Gold Medal in Math
Late Friday night, OpenAI announced that one of its AI models solved problems in this year’s International Mathematical Olympiad at a level that would have earned a human competitor a gold medal, the highest award in the competition.
That is a much-sought-after achievement for AI developers due to the difficulty of the math problems in the competition. The progress could indicate that language models are good enough at picking up mathematical patterns that they will be able to contribute to mathematical research sooner than previously expected.
Last year, Google DeepMind scored a silver medal on the IMO, solving four of the six problems, which first had to be translated into formal mathematical statements that DeepMind’s system could understand.
OpenAI’s model provided correct proofs for five of the six problems, as graded by former IMO competitors whom OpenAI hired. Unlike DeepMind’s system, which used a combination of specialized models designed only to solve math problems, OpenAI said it used a general-purpose language model.
Still, not everyone is bowled over by the news. “People are making a big deal over the silver versus gold,” said Elliot Glazer, lead mathematician at Epoch AI, which developed the challenging AI math benchmark FrontierMath. With only six problems, the difference between silver and gold can come down to random noise, he said.
But “it’s more surprising” that OpenAI was able to achieve its score using a general-purpose LLM, he said. Gemini 2.5 Pro, Google’s general-purpose LLM, held the previous high score on the IMO among language models, but it did not even earn a bronze medal.
While the IMO problems are difficult, “this is a high school competition,” he said. It’s “many miles from the kind of math that really matters to mathematicians.” By comparison, OpenAI’s o4-mini solves only 6% of the hardest math problems in FrontierMath. Once AI models can solve all those problems, they will be ready to undertake their own math research, Glazer said.
Another impressive sign of progress came from the performance of OpenAI’s new ChatGPT agent on SpreadsheetBench, a benchmark that tests models’ ability to edit spreadsheets. ChatGPT agent was able to complete up to 45.5% of the tasks correctly, significantly higher than Microsoft’s Copilot product in Excel.—Rocket Drew
Free Software for Content Moderation
Roost, a nonprofit that launched in February to develop content moderation tools, is releasing software for companies, such as productivity and messaging apps, to detect unwanted content online, including AI-generated material.
One of Roost’s tools, Coop, reviews content, such as AI-generated videos, and can route potential child sexual abuse material to the National Center for Missing & Exploited Children. Coop is built atop software developed by Cove, a startup whose intellectual property Roost acquired.
Roost’s other tool, Osprey, helps companies like Discord, the messaging service, monitor and remove posts that violate their rules. Discord developed Osprey before donating it to Roost. Bluesky said it plans to use the product.
“We cannot expect a thousand companies to build a thousand different safety systems,” said Vinay Rao, who joined Roost last week after serving as head of the safeguards team at Anthropic, which detects and patches vulnerabilities in Anthropic’s Claude models.
At Anthropic, users would contact Rao’s team to report potential bugs in Claude. Most of the time, they identified a behavior that Anthropic already knew about, said Rao, but every two weeks or so, someone would disclose a bug that warranted a new defense from Anthropic.
Rao aims to make automatic detection tools like the ones Anthropic uses available to other companies through Roost. He says that even xAI, which has made a point of developing AI that is willing to say provocative things, still needs these kinds of trust and safety tools to prevent child sexual abuse material, crypto scams and unwanted bots.
Roost, which is an acronym for Robust Open Online Safety Tools, is releasing the two products open-source, meaning software developers can download them for free and modify them.—Rocket Drew