The Information: Is the Gap Widening Between Anthropic and Open-Source Models?

Some developers have told me that the rising costs of frontier AI models from Anthropic and other firms could prompt them to shift to cheaper open-source AI. After all, when companies as sophisticated as Uber are accidentally blowing through their entire year’s AI budget in a matter of months, it makes sense to cut back by using a less capable open-source model to automate simpler tasks. (In fact, companies like Uber and Airbnb are doing exactly that!)

It’s not clear whether open-source AI is good enough to meet the challenge, though. For instance, one executive at a major customer of OpenAI and Anthropic told me that they’ve been trying to use open-source models like Moonshot AI’s Kimi K2.6 and DeepSeek V4. But while these models have performed well on benchmarks and are good at answering more surface-level questions in a variety of areas, they tend to struggle with follow-up questions or deeper lines of questioning, this executive said.

Imagine a model that does well on a popular brainteaser but then struggles once you tweak a few of the brainteaser’s assumptions or details.

Of course, this is just one developer’s experience, and usage of open-source models does seem to be growing overall, based on data from inference provider OpenRouter.

But other data suggests that the performance gap between open- and closed-source AI is widening, including an analysis from the National Institute of Standards and Technology. NIST’s analysis determined that the capabilities of DeepSeek V4, which was released in April, lag behind frontier models by about eight months. In comparison, DeepSeek R1, which was released in January 2025, lagged behind frontier models by around three to four months, according to the NIST analysis.

The executive I spoke with had a few ideas about why the open-source models they were using tended to struggle on deeper or more detailed questions.

One possibility is that the models’ training datasets had limited coverage, meaning that they didn’t include enough examples of the full range of situations the models were expected to handle, the exec said.

For instance, the model could have been trained on the public web, which contains lots of intro-level blog posts explaining various topics, like ways to train a model. The web, however, does not contain more complex examples of what to do when those model training techniques go wrong.

Additionally, the open-source models may have been trained with a lot of data produced by closed-source models like OpenAI’s GPT and Anthropic’s Claude, a process known as distillation, the executive said. (OpenAI has previously accused Chinese AI developers of using distillation, and even well-known AI CEOs like Elon Musk have admitted to developing their AI using distillation of other frontier models.)

Distillation can be a cheap and quick way to improve the performance of a “dumber” model, but it can also backfire if researchers cut corners. You can think of bad distillation as telling a high schooler to memorize the answers to a physics exam: they might know enough to sound knowledgeable initially, but if you were to ask them any follow-up questions, their lack of knowledge would become apparent.
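
To make the analogy concrete, here is a toy Python sketch of the memorization failure mode. It is not real distillation training: the "teacher" is just a formula, the "student" is a lookup table of the teacher's answers, and every name and number is made up for illustration.

```python
# Toy illustration only: a "student" that memorizes a "teacher's" answers
# instead of learning the rule behind them. All names here are hypothetical.

def teacher(speed_kmh: float, hours: float) -> float:
    """The stronger model: actually knows that distance = speed * time."""
    return speed_kmh * hours

# The exam questions the student was shown during "training".
training_questions = [(60, 2), (80, 1), (100, 3)]

# Bad distillation: store the teacher's outputs verbatim, keyed by question.
memorized_answers = {q: teacher(*q) for q in training_questions}

def memorizing_student(speed_kmh: float, hours: float):
    """Returns a memorized answer, or None when the question is unfamiliar."""
    return memorized_answers.get((speed_kmh, hours))

# A question from the exam: the student sounds knowledgeable.
seen = memorizing_student(60, 2)

# A follow-up with one detail changed: the student has nothing to say.
follow_up = memorizing_student(60, 3)
```

The student matches the teacher on the questions it has seen and has no answer once a single number changes, which is the gap the executive described between surface-level and follow-up questions.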

The increasing costs of AI could offset these performance shortcomings, however. If closed-source models continue to get more expensive, developers may increasingly look for cheaper alternatives.

And new tricks and workarounds like “ask-expert-mcp,” software that allows a weaker or cheaper model to ask a stronger model for help when it gets stuck, could also help to make open-source models more usable.
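
The general idea of escalating to a stronger model can be sketched in a few lines of Python. This is not how ask-expert-mcp itself is implemented; the function names, the self-reported confidence score and the 0.7 threshold are all assumptions made for illustration, and the two "models" are stand-in functions rather than real API calls.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Answer:
    text: str
    confidence: float  # hypothetical self-reported score between 0 and 1


Model = Callable[[str], Answer]


def answer_with_escalation(
    question: str,
    cheap_model: Model,
    expert_model: Model,
    threshold: float = 0.7,  # assumed cutoff, not taken from any real tool
) -> Tuple[str, str]:
    """Try the cheap model first; ask the expert only when it seems stuck."""
    draft = cheap_model(question)
    if draft.confidence >= threshold:
        return draft.text, "cheap"
    return expert_model(question).text, "expert"


# Stand-in models so the sketch runs without any network access.
def toy_cheap_model(question: str) -> Answer:
    # Pretend the cheap model is only confident on short questions.
    if len(question.split()) <= 6:
        return Answer("short answer", 0.9)
    return Answer("not sure", 0.3)


def toy_expert_model(question: str) -> Answer:
    return Answer("detailed answer", 0.95)


easy = answer_with_escalation("What is MCP?", toy_cheap_model, toy_expert_model)
hard = answer_with_escalation(
    "Why does my training run diverge after the learning rate warmup ends?",
    toy_cheap_model,
    toy_expert_model,
)
```

In this sketch the expensive model is only paid for on the second question. How a real system decides that the cheaper model is stuck is the hard part, and a simple confidence cutoff is only one possible choice.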