What does this say about putting AI ever more in charge of our military?
I don't know; maybe nothing. We'll find out, sooner or later.
https://x.com/heynavtoor/status/2031822709437640823?s=20
https://nitter.net/Jihooncrypto/status/2031673535895081238?s=20
Nav Toor @heynavtoor
🚨BREAKING: OpenAI told you every update makes ChatGPT smarter.
Stanford proved the opposite.
GPT-4's accuracy on math problems dropped from 97.6% to 2.4% in just three months. And nobody told you.
Researchers at Stanford and UC Berkeley tracked ChatGPT's actual performance over time. Same prompts. Same tasks. Different results. The model that nearly aced math questions in March was getting them wrong 97 out of 100 times by June.
Code generation collapsed too. In March, over 50% of GPT-4's code ran perfectly on the first try. By June, only 10% did. Same questions. Dramatically worse answers. Every silent update OpenAI pushed made the product you pay $20 a month for quietly worse at the things you actually use it for.
The researchers tested GPT-3.5 and GPT-4 across math, coding, medical exams, reasoning, and sensitive questions. The drift was massive and unpredictable. Some tasks improved. Others fell off a cliff. And there was no way for you to know which was which, because OpenAI never disclosed what changed.
Here's where it gets personal. If you used ChatGPT for code in March and it worked, then tried the same thing in June and it broke, you probably blamed yourself. You thought you prompted it wrong. You tried again. You wasted hours debugging your own questions. But it wasn't you. The model had silently changed underneath you.
OpenAI's VP of Product went on X and said "we haven't made GPT-4 dumber."
Stanford's data says otherwise.
97.6% to 2.4% is not a matter of opinion.
Every business building on ChatGPT's API, every student relying on it for schoolwork, every developer using it to ship code is standing on ground that shifts without warning. You trusted it yesterday. It changed overnight. Nobody told you.
You're not imagining it. ChatGPT is getting dumber. Stanford proved it.
Because at their core these language models are token-prediction engines, nothing more. They seem multi-talented, doing things like making images (which is really a different, or multimodal, model), because before they answer your question they try to make a plan for answering it. In theory, technical questions should get routed to a better-suited model or, better yet, just get crunched directly, but all the extra racist crap they inject into your message is likely screwing up how the request is farmed out (all IMO).
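For anyone who wants to check this kind of drift themselves instead of taking a tweet's word for it, the comparison the quoted post describes (same prompts, same tasks, two points in time) can be sketched roughly as below. This is only a minimal illustration, not the researchers' actual method: `march_snapshot` and `june_snapshot` are made-up stand-in functions rather than real API calls, and the four toy prime questions are my own assumption, chosen so the script runs without any account or network access.

```python
from typing import Callable, Sequence

# Hypothetical type: a "model snapshot" is any function from prompt to answer.
Model = Callable[[str], str]


def accuracy(model: Model, prompts: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of prompts whose answer matches the expected answer."""
    if len(prompts) != len(expected):
        raise ValueError("prompts and expected must have the same length")
    if not prompts:
        raise ValueError("need at least one prompt")
    correct = sum(
        model(p).strip().lower() == e.strip().lower()
        for p, e in zip(prompts, expected)
    )
    return correct / len(prompts)


def drift(old: Model, new: Model, prompts: Sequence[str], expected: Sequence[str]) -> float:
    """Change in accuracy (new minus old) on an identical prompt set."""
    return accuracy(new, prompts, expected) - accuracy(old, prompts, expected)


# Toy prompt set (an assumption for illustration, not the study's data).
PROMPTS = ["Is 7 prime?", "Is 9 prime?", "Is 11 prime?", "Is 15 prime?"]
EXPECTED = ["yes", "no", "yes", "no"]


def march_snapshot(prompt: str) -> str:
    """Stand-in for an earlier snapshot: answers by actually testing primality."""
    n = int(prompt.split()[1])
    is_prime = n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
    return "yes" if is_prime else "no"


def june_snapshot(prompt: str) -> str:
    """Stand-in for a later snapshot that has drifted: always answers 'no'."""
    return "no"


if __name__ == "__main__":
    print("march:", accuracy(march_snapshot, PROMPTS, EXPECTED))
    print("june: ", accuracy(june_snapshot, PROMPTS, EXPECTED))
    print("drift:", drift(march_snapshot, june_snapshot, PROMPTS, EXPECTED))
```

To use this against a real service you would swap the two stand-in functions for calls to two dated model versions and keep the prompt file fixed, so that any change in the score comes from the model and not from you.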