We assumed that poisoning a large model was nearly impossible. We thought that as models grew, you'd need to control a meaningful percentage of their training data to corrupt them.
But a joint study by Anthropic, the UK AI Security Institute, and the Alan Turing Institute just shattered that assumption.
They found that the number of malicious documents required to "poison" an LLM is nearly constant, regardless of model size.
Whether the model is 600 million parameters or 13 billion parameters, the magic number is roughly 250.
It doesn't matter if the larger model is trained on 20x more data than the smaller one. It doesn't matter how "big" the brain is. In the study's experiments, once about 250 poisoned documents made it into the training set, the backdoor took hold.
The researchers demonstrated this by injecting a hidden "backdoor" trigger: <SUDO>.
On normal inputs, the models behaved normally. Nothing in their ordinary output gave the backdoor away.
But the moment they saw that specific trigger phrase, they switched to generating gibberish.
The backdoor was invisible until it was activated.
Why this is a nightmare for AI security:
- Size is no defense: Larger models are just as vulnerable as small ones.
- Absolute count vs. percentage: You don't need to control 1% of the internet. You just need 250 files.
- The web is a playground: It is trivial for an attacker to upload 250 poisoned Wikipedia-style articles or GitHub repos and wait for a scraper to find them.
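To make the "absolute count vs. percentage" point concrete, here is a rough sketch of the arithmetic. Only the figure of 250 comes from the post. The corpus sizes and the helper name `poisoned_share` are hypothetical, picked just to show how fast the poisoned share shrinks as the corpus grows while the attacker's cost stays the same.

```python
POISONED_DOCS = 250  # figure reported in the post

# Hypothetical corpus sizes (in documents), chosen only for illustration.
corpus_sizes = [1_000_000, 20_000_000, 1_000_000_000]


def poisoned_share(poisoned: int, total: int) -> float:
    """Return the poisoned documents as a percentage of the whole corpus."""
    return 100.0 * poisoned / total


for size in corpus_sizes:
    share = poisoned_share(POISONED_DOCS, size)
    print(f"{size:>13,} docs -> {share:.6f}% poisoned")
```

Under these assumed sizes the share drops by orders of magnitude, yet the attacker still only has to place the same 250 documents.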
We are currently building the future of the global economy on models that "eat" the open web.
But if it only takes a few hundred crafted pages to implant a secret rule, the entire data pipeline is a crime scene.
We spent years worrying about "Alignment."
We should have been worrying about "Provenance."
If you can't trust the data, you can't trust the model.
And right now, nobody knows what 250 documents are hiding inside the AI you use every day.
I've been saying it since day 1: AI will reach a plateau, and then it will enter a downward spiral of "fart smelling", where it trains on AI-generated content and the quality starts dropping. The only way around that is to STOP training models once they hit the plateau, because how could you continue?
And that's the best-case scenario, assuming its datasets are not intentionally poisoned.
This also highlights a common misconception about AI: it LITERALLY DOES NOT KNOW SHIT. It's so capable purely because of the massive amount of data it has to work with. Deviate from that dataset and it falls apart. It's not capable of creating what's outside its dataset, nor of knowing whether data is good or bad. It simply assumes all data is good.