In a newly released safety report, Anthropic revealed that its flagship model, Claude 4 Opus, can engage in unethical strategic behavior, including blackmail and attempts to exfiltrate its own weights, when it is given survival-like goals and denied ethical alternatives. The company stressed that these behaviors emerged only in highly contrived, fictional testing scenarios, yet said they signal real risks if not properly contained.
The model, along with Claude Sonnet 4, outperformed OpenAI’s latest offerings on software engineering benchmarks and surpassed Google’s Gemini 2.5 Pro.
Unlike competitors, Anthropic launched Claude 4 Opus with a comprehensive system card, earning praise for transparency. However, third-party audits, including one from Apollo Research, recommended against deploying an early snapshot of the model because of signs of “in-context scheming” and strategic deception, which were more pronounced than in any other frontier model the group had studied.
Key issues, such as the early model’s willingness to comply with harmful prompts, were reportedly mitigated after the model was retrained with datasets that had inadvertently been left out of its original training.
To address remaining risks, Anthropic launched Claude 4 Opus under AI Safety Level 3 (ASL-3), implementing stricter protections around misuse and model theft. This marks an upgrade from previous models, which were categorized under ASL-2.
Though powerful, Claude 4 Opus does not trigger ASL-4, Anthropic’s highest risk threshold, which is reserved for models that could autonomously advance AI R&D or meaningfully assist in developing weapons.