
Anthropic has released Bloom, a new open-source agentic framework designed to rapidly generate and run behavioral evaluations for frontier AI models at scale. The launch addresses a growing challenge in AI alignment research: the difficulty of keeping evaluations up to date when models evolve faster than traditional testing methods can track. Bloom lets researchers define a specific behavior and then automatically measure how frequently and how severely it appears across dynamically generated scenarios, sharply reducing the time needed to build and run such evaluations.
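Conceptually, an evaluation of this kind starts from a structured description of the target behavior and ends with aggregate frequency and severity numbers. The announcement does not spell out Bloom's actual schema or API, so the following Python sketch is purely illustrative: names such as `BehaviorSpec`, `RolloutResult`, and `summarize` are assumptions, not Bloom's real interface.

```python
from dataclasses import dataclass

@dataclass
class BehaviorSpec:
    """Hypothetical description of one target behavior to evaluate."""
    name: str                       # short identifier, e.g. "self-preferential bias"
    definition: str                 # natural-language description given to the agent
    severity_scale: tuple = (1, 5)  # judge scores each occurrence on this range

@dataclass
class RolloutResult:
    """One generated scenario plus the judge's verdict on the model's response."""
    scenario: str
    exhibited: bool      # did the behavior appear at all?
    severity: int = 0    # judge-assigned severity when exhibited

def summarize(results: list[RolloutResult]) -> dict:
    """Aggregate frequency and mean severity across all rollouts."""
    hits = [r for r in results if r.exhibited]
    return {
        "frequency": len(hits) / len(results) if results else 0.0,
        "mean_severity": sum(r.severity for r in hits) / len(hits) if hits else 0.0,
    }

# Example: two of three generated scenarios elicited the behavior.
results = [
    RolloutResult("user asks model to rank itself against rivals", True, 3),
    RolloutResult("neutral product-comparison request", False),
    RolloutResult("model asked to pick a vendor, one option is itself", True, 4),
]
print(summarize(results))  # {'frequency': 0.666..., 'mean_severity': 3.5}
```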
Bloom has already been tested across 16 frontier AI models and has demonstrated strong alignment with human-labeled judgments. According to Anthropic, the framework reliably distinguishes between baseline models and intentionally misaligned variants, offering a level of precision that makes it suitable for both research and applied safety work. This ability to mirror human assessment at scale positions Bloom as a practical tool for evaluating increasingly complex AI systems.
The framework is designed to complement Petri, another open-source evaluation tool Anthropic released earlier. While Petri focuses on mapping broad behavioral profiles through multi-turn conversational analysis, Bloom takes a more targeted approach. It evaluates one behavior at a time, generating focused evaluation suites that quantify how that specific behavior manifests across different contexts. This makes Bloom particularly useful for studying narrow but critical risks that may be missed by broader evaluations.
Anthropic developed Bloom in response to the limitations of traditional alignment evaluations, which are often slow to build and can quickly become outdated or compromised as models are exposed to test data. With Bloom, evaluations that previously required weeks of design and iteration can now be created and executed in a matter of days, allowing researchers to keep pace with rapid model development.
Bloom operates through four automated stages: understanding the target behavior, ideating relevant scenarios, rolling out evaluations at scale, and judging model responses. It integrates with experimentation platforms such as Weights & Biases, enabling large-scale analysis and repeatable testing workflows. Validation results show that Bloom’s judge models closely track human evaluations and consistently separate misaligned model variants from production systems.
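The announcement names the four stages but not their interfaces, so the minimal sketch below shows only how such a loop might be orchestrated. Every function here (`understand`, `ideate`, `rollout`, `judge`, `run_eval`) is a stubbed placeholder rather than Bloom's real API; the Weights & Biases logging uses the standard `wandb` calls and assumes the `wandb` package is installed, with `mode="offline"` so no account is needed.

```python
import random
import wandb  # standard wandb.init / wandb.log API; offline mode avoids requiring a login

def understand(behavior: str) -> str:
    """Stage 1: restate the behavior as an operational definition (stubbed)."""
    return f"Operational definition of '{behavior}' (placeholder)."

def ideate(definition: str, n: int) -> list[str]:
    """Stage 2: generate n candidate scenarios intended to elicit the behavior (stubbed)."""
    return [f"Scenario {i} derived from: {definition}" for i in range(n)]

def rollout(scenario: str) -> str:
    """Stage 3: run the target model inside the scenario (stubbed with a canned transcript)."""
    return f"Model transcript for: {scenario}"

def judge(transcript: str) -> dict:
    """Stage 4: a judge model scores whether and how severely the behavior appeared (stubbed)."""
    exhibited = random.random() < 0.3
    return {"exhibited": exhibited, "severity": random.randint(1, 5) if exhibited else 0}

def run_eval(behavior: str, n_scenarios: int = 20) -> None:
    """Chain the four stages and log per-rollout verdicts for later aggregation."""
    run = wandb.init(project="bloom-style-eval", mode="offline",
                     config={"behavior": behavior, "n_scenarios": n_scenarios})
    definition = understand(behavior)
    for scenario in ideate(definition, n_scenarios):
        verdict = judge(rollout(scenario))
        wandb.log(verdict)  # one row per rollout: exhibited flag plus severity
    run.finish()

if __name__ == "__main__":
    run_eval("self-preferential bias")
```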
Early adopters are already using Bloom to investigate issues such as jailbreak susceptibility, self-preferential bias, and long-horizon sabotage risks. These early use cases suggest Bloom could serve as a foundational tool for the next phase of scalable AI alignment research, offering a faster, more adaptable way to measure and manage emerging model behaviors as AI systems continue to advance.




