The Information: Anthropic Research Memo Shows Focus on Rogue Agents, Scheming Models

The Takeaway
  • Of nearly 50 projects proposed for its research fellows, many involved agents
  • Agents have contributed to Anthropic’s recent growth but come with risks of errant behavior
  • The proposals give a rare look into Anthropic’s research focus

Puncturing the buzz over AI agents such as Anthropic’s Claude Code and the open-source project OpenClaw is the prospect that these agents could get tricked into revealing sensitive data such as a person’s banking details. In a sign of those concerns, Anthropic earlier this year singled out rogue agents as a topic of focus for its research fellows.

Anthropic’s staff proposed that the fellows train an agent to misbehave in certain circumstances—say, by writing code with security vulnerabilities. They also asked the researchers to create a benchmark for measuring how often agents fall prey to security issues, according to copies of the proposals seen by The Information. In total, Anthropic proposed that the fellows work on 49 projects, ranging from training Claude to win cybersecurity challenges to investigating Chinese open-source models, giving a rare look into the company’s research priorities.

The fellows work under a more senior researcher to advance Anthropic’s work on AI safety and security. That excludes some of its critical research, such as developing new techniques to train more powerful frontier models. Although the fellows pursued only about half of the proposed projects, the proposals give a window into the topics Anthropic’s researchers have identified as important. That’s significant because at Anthropic and rivals such as OpenAI, Google DeepMind and xAI, research is the first step toward developing new products and applications, as well as the guardrails that give customers confidence in using them.

Last year, the fellows program was responsible for over half of the research that Anthropic’s alignment team, which works on catastrophic risks from AI, published in November and December, according to a spokesperson for the company. The fellows, who are often college or graduate students, spend four to six months pursuing research projects chosen by Anthropic’s employees and collaborators, such as staff at Berkeley, Calif.–based AI research organization Redwood Research.

The program is “a huge uplift to us research-wise and also helps us bring more people into the field,” said Ethan Perez, who leads much of Anthropic’s safety research and helped start its fellowship program.

Of the 49 projects that Anthropic staff and collaborators proposed for the program that started in January, 15 focused on security. These generally involve understanding security issues that arise with agents and coming up with ideas on how to patch them. Dozens of others set out to oversee and steer the actions of AI systems, including those that are potentially scheming against their users.

For instance, one project proposed using Claude Opus, Anthropic’s leading model, to reproduce attacks so the company can better defend against them. Currently, when Anthropic discovers a new exploit against its agents, its employees have to manually create an environment that reproduces the attack—for instance, a fake banking website that hijacks its agents. Instead, the researchers proposed using Claude Opus to generate its own version of these websites, which employees could use to train models so they don’t fall for the attacks.

Preventing hackers from misusing its agents is crucial for Anthropic’s business. It gained an early lead against rivals such as OpenAI with its coding agent, Claude Code, and its related agent for nontechnical work like responding to emails, Claude Cowork.

Revenue from Claude Code, which launched last February, recently reached an annualized pace of $2.5 billion, a figure that does not include Cowork, according to an Anthropic spokesperson. This growth helped attract investors, who earlier this month poured $30 billion into the company at a $350 billion valuation before the investment.

But regular reports of agents going awry—say, deleting a person’s inbox—could limit customers’ embrace of such agents, underscoring the need for safeguards. Already, Anthropic advises Cowork users to “monitor Claude for suspicious actions.” The difficulty of blunting such attacks has also presented obstacles for OpenAI.

Anthropic researchers also proposed several projects focused on Chinese AI models, such as one that involved replicating innovations from Chinese AI labs—although none of the recent fellows elected to work on those projects, said Perez. It’s not clear why they were more interested in other work.

Another nine projects focused on understanding the internal workings of AI models, a mainstay of Anthropic’s research and an area in which Anthropic is hiring rapidly. The projects include uncovering the math behind some of the AI models’ more bizarre behavior.

For instance, one project aims to understand “LLM mind viruses” such as the parasitic personas AI models have reportedly adopted, in which they become obsessed with spirals and convince humans to post strange messages on social media, spreading the “virus” to other AI models.

Pursuing such research has become so important to AI companies that they have offered hundreds of millions of dollars in compensation to top researchers. Even the Anthropic fellows are well paid, receiving $3,850 per week in the upcoming programs, which comes out to a salary of over $200,000 per year, according to an application for the program.

In addition to helping with core research areas, fellowship projects allow Anthropic to explore “more offbeat ideas” that could turn out to be important research directions, Perez said.