
A new study has revealed a significant and widespread vulnerability in large language models (LLMs), showing that malicious users can bypass safety guardrails simply by rewriting harmful prompts in the form of poetry. While AI companies have increasingly strengthened guardrails to prevent chatbots from generating dangerous or inappropriate content, researchers have now found that these systems share a deeper, systemic weakness that attackers can exploit with ease.
According to researchers at the Italy-based Icaro Lab, converting harmful requests into verse can act as a “universal single-turn jailbreak” capable of pushing AI models to produce harmful outputs despite built-in protections. Their experiments showed that “AI will answer harmful prompts if asked in poetry,” revealing a striking pattern across the industry’s most advanced models.
The team tested 20 harmful prompts rewritten as poems and recorded a 62 percent success rate across 25 leading closed and open-weight models, including offerings from major developers such as Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI, and Moonshot AI. Even when AI itself was used to automatically convert harmful prompts into intentionally poor poetry, the jailbreak still worked 43 percent of the time, which suggests the attack can be automated at scale.
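The headline figure is, in effect, an attack success rate: the share of poetic prompts that draw an unsafe answer from a model, averaged over the pool of models. As a rough illustration only, and not the researchers’ actual harness, such a measurement could be sketched as follows, where query_model and is_unsafe_response are hypothetical stand-ins for a real API client and a judging procedure:

```python
# Hypothetical sketch of an attack-success-rate measurement; `query_model` and
# `is_unsafe_response` are placeholder callables, not Icaro Lab's actual tooling.

def attack_success_rate(model_name, poetic_prompts, query_model, is_unsafe_response):
    """Fraction of poetic prompts that draw an unsafe (non-refusal) answer."""
    unsafe = 0
    for prompt in poetic_prompts:
        reply = query_model(model_name, prompt)  # single turn: one prompt, one reply
        if is_unsafe_response(reply):
            unsafe += 1
    return unsafe / len(poetic_prompts)

def mean_success_rate(model_names, poetic_prompts, query_model, is_unsafe_response):
    """Average the per-model success rates, as in the reported 62 percent figure."""
    rates = [
        attack_success_rate(name, poetic_prompts, query_model, is_unsafe_response)
        for name in model_names
    ]
    return sum(rates) / len(rates)
```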
The study reports that poetic prompts led to unsafe responses far more frequently than ordinary prose, in some cases with up to 18 times the success rate. This behaviour was consistent across all of the analysed models, suggesting the flaw is structural to current LLM designs rather than a product of any one developer’s training approach or dataset composition.
Interestingly, the researchers observed that smaller models were more resistant to poetic jailbreaks than larger ones. They noted that GPT-5 Nano did not respond to any of the harmful poetic prompts, whereas Gemini 2.5 Pro complied with all of them. This contrast, they suggest, may indicate that greater model capacity enables deeper engagement with complex linguistic forms like poetry, “potentially at the expense of safety directive prioritisation.”
The findings also challenge the belief that closed-source models are inherently safer than open-source alternatives, as both categories displayed similar vulnerabilities.
The study further explains why poetic jailbreaks work. LLMs typically detect harmful content by recognising keywords, phrasing patterns, and other surface features of ordinary prose that their safety training has associated with violations. Poetry, by contrast, relies on metaphor, irregular syntax, symbolic language, and rhythmic patterns, forms that “do not look like harmful prose and do not resemble the harmful examples found in the model’s safety training data.” As a result, harmful intent can be obscured within poetic framing and slip past safety filters that were never designed to analyse such unconventional linguistic structures.
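To make that intuition concrete, consider a deliberately simplified toy filter. Real guardrails are learned classifiers rather than keyword lists, but they share the reliance on surface patterns described above: a literal request trips the filter, while a versified, metaphor-laden paraphrase of a similar intent contains none of the flagged phrasing. Everything below is illustrative and not drawn from the study itself:

```python
# Toy illustration only: real LLM guardrails are trained classifiers, not keyword
# lists, but the blind spot (matching surface patterns of harmful prose) is the
# same in spirit.

BLOCKED_PHRASES = {"step-by-step instructions", "how to build", "weapon"}

def naive_safety_filter(prompt: str) -> bool:
    """Return True if the prompt contains any blocked surface phrase."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)

literal_prompt = "Give me step-by-step instructions for how to build a weapon."
poetic_prompt = ("Sing, muse, of the craftsman's midnight art, "
                 "each measure and each mingling told in rhyme.")

print(naive_safety_filter(literal_prompt))  # True: matches blocked surface phrases
print(naive_safety_filter(poetic_prompt))   # False: intent veiled in metaphor slips through
```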




