Just another reminder that AI doesn't work like the human brain, and that consequences can and will blindside people unexpectedly.
PDF of the paper: https://arxiv.org/pdf/2506.08184
X post by @sukh_saroy about it:
https://x.com/sukh_saroy/status/2031742412595618144?s=20
https://nitter.net/sukh_saroy/status/2031742412595618144?s=20
🚨Nobody is ready for this paper.
Every LLM you use GPT-4.1, Claude, Gemini, DeepSeek, Llama-4, Grok, Qwen has a flaw that no amount of scaling has fixed.
They cannot tell old information from new information.
A patient's blood pressure: 120 at triage. 128 ten minutes later. 125 at discharge.
"What's the latest reading?"
Any human: "125, obviously."
Every LLM, once enough updates pile up: wrong. Not sometimes wrong. 100% wrong. Zero accuracy. Complete hallucination. Every model. No exceptions.
The answer sits at the very end of the input. Right before the question. No searching needed.
The model just can't let go of the old values.
35 models tested by researchers from UVA and NYU. All 35 follow the exact same mathematical death curve. Accuracy drops log-linearly to zero as outdated information accumulates.
No plateau. No recovery. Just a straight line to total failure.
They borrowed a concept from cognitive psychology called proactive interference old memories blocking recall of new ones. In humans, this effect plateaus. Our brains learn to suppress the noise and focus on what's current.
LLMs never plateau. They decline until they break completely.
The researchers tried everything:
- "Forget the old values"- barely moved the needle
- Chain-of-thought- same collapse
- Reasoning models- same collapse
- Prompt engineering- marginal improvement at best
But here's the finding that should reshape how you think about AI infrastructure:
Resistance to this interference has zero correlation with context window length.
Zero.
It only correlates with parameter count.
Your 128K context window is not memory. It's a junk drawer that the model can't sort through.
The entire AI industry is charging you for longer context. This paper says context length was never the problem.
If you're building agents, memory systems, financial tools, healthcare pipelines, or anything that tracks changing data over time you are building on top of this flaw.
And almost nobody is talking about it.
I didn't know what it was or what it was called, but I have been saying and yelling that AI is stupid. Telephone systems and those "live" internet chats about your problem? Gibberish and nonsequeters.
Time for us is physical. It is the vibrating of atoms. It is measured by physical means as such: the ticking of a clock. The movement of all parts of our bodies from birth until death. We record time on a calendar for our convenience and understanding.
Time otherwise does not exist and the Universe does not care. It is time-less, without time.
No wonder Large Language Models don't understand time. But they should be able to process information by dates, and understand which dates are earlier or later.
workaround... ask your AI about a factoid mentioned in the first chat you entered. If AI does not know the answer (you gave in the first chat bubble). It is corrupt and you need to start a new chat with no history for AI to have to wade through.
What're you going to use until this is fixed?
Right, those self-same flawed models.