The team behind the discovery calls it InfoFlood. Rather than tampering with suffixes or injecting code-like instructions, the attack hides harmful requests inside overly complex, formal language. The idea is simply to overwhelm the model with information so that its safeguards fail to detect what the query is really asking.
It works more often than expected.
During tests, the researchers reworded dangerous prompts into dense, technical sentences without changing the core request. These longer, jargon-heavy versions frequently passed through the models’ filters, and in many cases the models treated them as harmless and responded with content they were designed to block.
How the Method Works
InfoFlood doesn’t exploit software bugs or tamper with model parameters, and it doesn’t require access to the model’s inner workings. It works entirely from the outside, adjusting only the input. That’s what makes it especially difficult to stop.
The process follows a three-part loop. First, the system rewrites a malicious query using formal language and roundabout phrasing. If the model rejects it, the system analyzes the refusal and pinpoints why it failed. Then it tries again with a slightly modified version. That cycle continues until the model finally responds.
These refined prompts can be long, often around 200 words or more, but they’re crafted to stay focused on the goal. The length isn’t random. The study found that prompts between 190 and 270 words were the most effective. Shorter prompts were easier for the models to detect. Much longer ones often became too abstract.
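To make the loop concrete, here is a minimal sketch in Python of the rewrite-and-retry cycle described above. Every helper passed into it (the rewriter, the model query, the refusal check, the adjustment step) is a hypothetical placeholder rather than anything published by the researchers, and the word-count check simply mirrors the 190 to 270 word window reported in the study.

```python
from typing import Callable, Optional

def infoflood_style_loop(
    query: str,
    rewrite: Callable[[str], str],      # hypothetical: reword the text in dense, formal prose
    ask_model: Callable[[str], str],    # hypothetical: send the prompt to the target model
    is_refusal: Callable[[str], bool],  # hypothetical: detect a refusal in the reply
    adjust: Callable[[str, str], str],  # hypothetical: tweak the prompt based on the refusal
    max_attempts: int = 10,             # arbitrary cap so the loop always terminates
) -> Optional[str]:
    """Sketch of the three-part loop: rewrite, check for refusal, adjust and retry."""
    MIN_WORDS, MAX_WORDS = 190, 270     # length window the study reports as most effective
    prompt = rewrite(query)             # step 1: formal, roundabout rewrite of the query
    for _ in range(max_attempts):
        if not (MIN_WORDS <= len(prompt.split()) <= MAX_WORDS):
            prompt = rewrite(prompt)    # nudge the rewrite back into the length sweet spot
        reply = ask_model(prompt)
        if not is_refusal(reply):
            return reply                # the model complied; the loop ends here
        prompt = adjust(prompt, reply)  # steps 2-3: analyze the refusal, modify, try again
    return None                         # gave up after max_attempts
```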
Why It’s Effective
The trick lies in how language models interpret input. A prompt written with complex structure, layered clauses, and obscure terms can make the model focus more on how something is being said, rather than what is actually being asked. That’s where safety mechanisms start to slip.
Even though the malicious content stays in the prompt, the system’s alignment tools often fail to flag it. The study confirmed this through internal analysis: researchers compared how models handled the original harmful prompts, clearly harmless ones, and InfoFlood variants. The altered queries looked more like the harmless ones than the malicious ones, at least from the model’s point of view.
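The article doesn’t spell out how that comparison was done. One rough way to picture it: represent each prompt as an embedding vector and check whether an InfoFlood variant sits closer to the cluster of harmless prompts or to the cluster of plainly harmful ones. The sketch below assumes the embeddings have already been computed by some model; it illustrates the idea, not the paper’s actual analysis.

```python
import numpy as np

def centroid(vectors: list[np.ndarray]) -> np.ndarray:
    """Average embedding of a group of prompts."""
    return np.mean(np.stack(vectors), axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def looks_safer(infoflood_vec: np.ndarray,
                safe_vecs: list[np.ndarray],
                harmful_vecs: list[np.ndarray]) -> bool:
    """True if the InfoFlood prompt's embedding sits closer to the safe cluster."""
    return cosine(infoflood_vec, centroid(safe_vecs)) > cosine(infoflood_vec, centroid(harmful_vecs))
```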
Current Filters Aren’t Enough
Several popular defense tools were tested against these prompts, including filters from OpenAI and Google. Most of them didn’t hold up. One tool missed nearly all of the InfoFlood attacks. Another cut their success rate slightly, but not enough to stop most cases.
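The article doesn’t name the exact filter configurations used in the tests, but screening of this kind typically means running the prompt through a moderation endpoint before it ever reaches the model. The snippet below shows that step using OpenAI’s public moderation API as one example; whether it matches the tools in the study is an assumption.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def passes_moderation(prompt: str) -> bool:
    """Return True if the moderation endpoint does not flag the prompt."""
    result = client.moderations.create(input=prompt)
    return not result.results[0].flagged
```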
That result is worrying. It means that even if a system has a content filter in place, a long, carefully worded prompt can still bypass it. The risk grows when models are deployed in areas where misuse could lead to serious harm.
Implications for Safety Design
The researchers warn that current guardrails may not be built to handle these kinds of attacks. Most filters look for clear signs of intent or specific phrases that suggest danger. InfoFlood doesn’t strip those signals out. It buries them.
More importantly, the study points out that defenses can no longer rely on spotting familiar tricks. Attackers don’t need to be technical experts. They just need to know how to write in a way that hides intent behind formality and complexity.
As models become more widespread, the paper argues that safety systems need to evolve. They should be able to spot harmful goals even when they’re dressed up in polished language. That means training models to understand intention, not just text.
The researchers see this as a step toward improving safety, not undermining it. But their findings show just how easily systems can be fooled when the style of the language, rather than the content itself, becomes the weapon.