Anthropic: 'We made the wrong tradeoff' in new model guardrails

"We're changing Fable 5's safeguards for frontier LLM development to make them visible," an Anthropic spokesperson said.

2 mins read

Anthropic reversed its policy on Claude Fable 5's secret safeguards after AI community backlash.
Claude Fable 5 now visibly reroutes flagged AI requests to Opus 4.8.
Anthropic's Mythos model remains a concern for national security due to its advanced capabilities.

Anthropic just flip-flopped on a policy that was silently limiting what some AI researchers could do with its new Claude Fable 5 model.

Earlier this week, the AI lab released Claude Fable 5, a public version of the Mythos model designed with extra safety measures to prevent misuse.

At release, Anthropic said that it took precautions like rerouting questions about cybersecurity, biology, and chemistry to less capable models to ensure people cannot use the advanced model to plan cyberattacks or build a bioweapon.

The lab also said that for those trying to use Fable 5 for AI development, the company would degrade the model's performance without explaining the change to the user. Some in the developer community saw the move as a quiet way to prevent others from creating rival AI systems, Business Insider previously reported.

Wednesday's statement reverses that move: Fable 5 will now tell users whether their prompt is being refused or rerouted.

"We're changing Fable 5's safeguards for frontier LLM development to make them visible," an Anthropic spokesperson said in a statement to Business Insider on Wednesday. "Starting this week, flagged requests will visibly fall back to Opus 4.8. On the API, any flagged requests will return a reason for their refusal."

The company added, "We made the wrong tradeoff, and we apologize for not getting the balance right."

Anthropic said its set of safeguards is in place to address national security issues, so that "foreign adversaries" cannot get ahead in developing frontier chips and large language models. It added that a vast majority of coding and machine learning work is unaffected by these safeguards.

Announced in April, Anthropic's Mythos is considered one of the most powerful AI systems ever developed, with governments and security agencies warning that its capabilities surpass current public models in advanced reasoning, cybersecurity, and scientific research.

Researchers and Anthropic itself have flagged that Mythos may be used to accelerate cyberattacks, aid biological or chemical weapons research, and give foreign actors a dangerous new tool.The model is being released to a limited group of government and other approved users, not to the public.

The post Anthropic: 'We made the wrong tradeoff' in new model guardrails appeared first on Business Insider