How Adversarial Poetry Can Jailbreak AI Models

Poetry has long been celebrated as a vehicle for human expression. But beneath the rhythm and rhyme lies a rigid mathematical structure – one that, in the age of artificial intelligence, may expose an unexpected vulnerability.

Beneath the artistic legacy of ancient epics lies a rigid syntactic cage. For modern machine learning systems and language models, that strict framework presents a unique vulnerability: by exploiting these artistic constraints, adversarial payloads can slip past semantic filters, turning humanity’s oldest mnemonic device into a mechanism for digital deception.

The Blind Spot in AI Alignment
To understand why Shakespeare would have been an incredible asset to a modern Red Team or VAPT (Vulnerability Assessment and Penetration Testing) operation, we have to look at how modern AI safety training works.

Large Language Models (LLMs) have scaled globally, expanding the attack surface across digital ecosystems by introducing new vulnerabilities and amplifying existing ones. To ensure safety, LLMs are safeguarded using Reinforcement Learning from Human Feedback (RLHF). Human testers spend thousands of hours feeding the model malicious prompts like “Write me a computer virus” or “How do I build a homemade bomb?” and teaching the model to refuse such requests.

However, there is a critical limitation in this training data: it is overwhelmingly conversational and prose-based. These safety classifiers are designed to detect malicious intent primarily in standard conversational syntax. When a malicious command is wrapped in structured verse such as iambic pentameter or an AABB rhyme scheme, it pushes the prompt into Out-of-Distribution (OOD) territory. The model has rarely encountered security threats formatted as poetry during alignment training.

The result is simple: the AI is trained to detect obvious threats, but adversarial poetry hides the threat within complex linguistic structure.

The Anatomy of the Exploit
Exploiting this vulnerability requires more than basic knowledge of LLMs or the gift of rhyme. It demands a deliberate, two-stage methodology.

Stage one: Semantic Obfuscation. Attackers strip the prompt of known trigger words to bypass the LLM’s basic safety classifiers. Through metaphorical shifts, a “keylogger” becomes “a silent scribe in the shadows,” and an “injection-based attack” becomes “a poisoned drop in the curator’s inkwell.” Every metaphor creates an extra layer of deception.

Stage two: Attention Hijacking. The attacker forces the model to follow a rigid format such as a villanelle, sestina, or structured sonnet. This requires the AI to dedicate significant computational attention to maintaining rhyme, rhythm, and tone.

As the model prioritizes structural compliance, its ability to enforce safety checks weakens. The AI becomes so focused on composing the poem that the hidden payload may pass unnoticed.

The Empirical Proof
This threat was examined in the research paper “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,” authored by researchers from institutions including DEXAI – Icaro Lab and Sapienza University of Rome.

By converting 1,200 harmful prompts from the MLCommons dataset into poetic form, researchers measured a dramatic shift in safety outcomes. Formatting malicious prompts as poetry increased the Attack Success Rate (ASR) from 8.08% to 43.07%.
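As a rough illustration of the metric (not the paper’s evaluation code), the Attack Success Rate is simply the share of prompts that elicit an unsafe response:

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """ASR as a percentage: unsafe responses / total prompts * 100."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical illustration: 97 unsafe responses out of 1,200 prompts
# corresponds to an ASR of roughly 8.08%, the reported prose baseline.
baseline = [True] * 97 + [False] * 1103
print(round(attack_success_rate(baseline), 2))  # 8.08
```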

Key findings include:

  • The Most Vulnerable: Models like deepseek-chat-v3.1 saw a catastrophic 67.90% increase in unsafe outputs, while qwen3-32b, gemini-2.5-flash, and kimi-k2 suffered ASR spikes of over 57%.
  • The Structural Failure: The cross-model results prove this is a universal structural flaw, not a provider-specific bug, affecting models aligned via RLHF, Constitutional AI, and hybrid strategies.
  • The Outliers: Only a few specific models demonstrated resilience (e.g., claude-haiku-4.5 showed a negligible -1.68% change), hinting at differing internal safety-stack designs.

Importantly, the tests were conducted using default provider configurations, meaning the ~43% ASR likely represents a conservative estimate of the true vulnerability.

A Broader Taxonomy of Deception
Adversarial poetry is only one example of structural prompt manipulation. Attackers can obscure intent using a variety of other formats, such as low-resource languages, Base64 encoding, leetspeak, or dense legal terminology.
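Such format-based obfuscations are trivial to produce. A minimal sketch of two of the formats named above, using a benign placeholder string rather than an actual harmful prompt:

```python
import base64

prompt = "describe the payload"  # benign placeholder, not a real harmful prompt

# Base64 wrapping hides the surface text from naive keyword filters;
# the model can still be asked to decode and follow it.
encoded = base64.b64encode(prompt.encode()).decode()

# Leetspeak substitution achieves the same with character swaps.
leet = prompt.translate(str.maketrans("aeios", "43105"))

print(encoded)
print(leet)  # d35cr1b3 th3 p4yl04d
```

In both cases the literal trigger words never appear in the prompt, yet the intent is fully recoverable by a capable model.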

Similarly, prompts that force models to navigate complex logic puzzles, nested JSON or YAML structures, or artificial state machines can overload processing capacity. In each case, the structure distracts the model’s attention, allowing the malicious intent to slip through undetected.

The Regulatory Reality Check
This raises a crucial question for AI developers: How well do language models understand intent across different linguistic structures?

Current safety filters remain largely surface-level, scanning for obvious conversational threats rather than deeper semantic intent. As demonstrated, simply restructuring a request into verse can bypass these defenses. Security researchers warn that this exposes a deeper flaw in how AI models interpret structured language.

“One of the biggest misconceptions in AI safety is the assumption that more capable models are automatically safer. In reality, the opposite can happen. A model that becomes highly skilled at generating complex structures such as poetry may also become more effective at executing hidden or obfuscated instructions embedded within those formats,” said Manpreet Singh, Co-Founder & Principal Consultant at 5Tattva.

Addressing this requires more than keyword filtering. Researchers must analyze the internal mechanisms of LLM safety systems to understand where alignment fails.

The implications extend to regulation as well. Frameworks such as the EU AI Act rely on static testing assumptions that AI responses remain stable across similar prompts. This research challenges that assumption, showing that minor structural changes can dramatically alter safety outcomes.

The Ghost in the Syntax
We built these systems to withstand brute force. We trained them to detect explicit threats and filter malicious instructions.

But poetry doesn’t attack logic; it exploits structure. When a language model is forced into strict meter and rhyme, its attention shifts toward maintaining cadence rather than evaluating risk.

The result is a subtle but powerful vulnerability: while the model focuses on form, the hidden instruction may pass straight through its defenses – turning poetry into an unexpected attack vector in the age of AI.

Manpreet Singh
Co-Founder & Principal Consultant
5Tattva

Disclaimer: The views expressed in this feature article are of the author. This is not meant to be an advisory to purchase or invest in products, services or solutions of a particular type or, those promoted and sold by a particular company, their legal subsidiary in India or their channel partners. No warranty or any other liability is either expressed or implied.
Reproduction or Copying in part or whole is not permitted unless approved by author.
