We assumed that poisoning a massive model was nearly impossible. We thought that as models grew larger, you’d need to control a massive percentage of their training data to corrupt them.
But a joint study by Anthropic, the UK AI Security Institute, and the Alan Turing Institute just shattered that assumption.
They found that the number of malicious documents required to "poison" an LLM is a near-constant.
Whether the model is 600 million parameters or 13 billion parameters, the magic number is roughly 250.
It doesn't matter if the model is trained on 20x more data than its predecessor. It doesn't matter how "big" the brain is. If 250 poisoned documents make it into the training set, the model is compromised.
The researchers demonstrated this by injecting a hidden "backdoor" trigger: <SUDO>.
In normal conversations, the models behaved perfectly. They passed every safety test. They seemed completely aligned.
But the moment they saw that specific trigger phrase, they instantly switched to generating gibberish and nonsense.
The backdoor was invisible until it was activated.
Why this is a nightmare for AI security:
-
Size is no defense: Larger models are just as vulnerable as small ones.
-
Absolute count vs. Percentage: You don't need to control 1% of the internet. You just need 250 files.
-
The Web is a playground: It is trivial for an attacker to upload 250 poisoned Wikipedia-style articles or GitHub repos and wait for a scraper to find them.
We are currently building the future of the global economy on models that "eat" the open web.
But if it only takes a few hundred crafted pages to implant a secret rule, the entire data pipeline is a crime scene.
We spent years worrying about "Alignment."
We should have been worrying about "Provenance."
If you can't trust the data, you can't trust the model.
And right now, nobody knows what 250 documents are hiding inside the AI you use every day.
Well yeah they can block running commands but the interesting part to me was how, say, one could use an AI to generate a bunch of articles for the web, then wait for the other AI's to eat them up. I'm sure it's already happening. I get a ton of Reddit posts as sources in current chat models and we know how controlled Reddit is. The 250 number is pretty wild and I'm not sure why that matters but if it affects small and large LLMs the same way, there's something at the very core of LLMs themselves at play. Some kind of LLM prime number or what have you. Or maybe that's the number of articles it takes to start breaking through the search engine pages for other AI's to begin finding your stuff.
and thx fren <3
Reminds me of googles ai overview telling users to add soap to their pizza...
Its source: a reddit post
What it means: AI will literally say anything if it was trained to do so, meaning you cannot actually rely on it for much. Summarizing is really what its best at, if anything