We assumed that poisoning a massive model was nearly impossible. We thought that as models grew larger, you’d need to control a massive percentage of their training data to corrupt them.
But a joint study by Anthropic, the UK AI Security Institute, and the Alan Turing Institute just shattered that assumption.
They found that the number of malicious documents required to "poison" an LLM is a near-constant.
Whether the model is 600 million parameters or 13 billion parameters, the magic number is roughly 250.
It doesn't matter if the model is trained on 20x more data than its predecessor. It doesn't matter how "big" the brain is. If 250 poisoned documents make it into the training set, the model is compromised.
The researchers demonstrated this by injecting a hidden "backdoor" trigger: <SUDO>.
In normal conversations, the models behaved perfectly. They passed every safety test. They seemed completely aligned.
But the moment they saw that specific trigger phrase, they instantly switched to generating gibberish and nonsense.
The backdoor was invisible until it was activated.
Why this is a nightmare for AI security:
-
Size is no defense: Larger models are just as vulnerable as small ones.
-
Absolute count vs. Percentage: You don't need to control 1% of the internet. You just need 250 files.
-
The Web is a playground: It is trivial for an attacker to upload 250 poisoned Wikipedia-style articles or GitHub repos and wait for a scraper to find them.
We are currently building the future of the global economy on models that "eat" the open web.
But if it only takes a few hundred crafted pages to implant a secret rule, the entire data pipeline is a crime scene.
We spent years worrying about "Alignment."
We should have been worrying about "Provenance."
If you can't trust the data, you can't trust the model.
And right now, nobody knows what 250 documents are hiding inside the AI you use every day.
Reminds me of googles ai overview telling users to add soap to their pizza...
Its source: a reddit post
What it means: AI will literally say anything if it was trained to do so, meaning you cannot actually rely on it for much. Summarizing is really what its best at, if anything