Next-Gen AI: Thinking Deeper, Hallucinating Harder

Key Takeaways

  • Newer AI models from OpenAI, designed for advanced reasoning, are showing a surprising increase in “hallucinations,” or making things up.
  • These advanced models produced factual errors more often in tests than their predecessors.
  • One theory is that complex reasoning gives AI more opportunities to go astray and invent information.
  • This trend raises concerns about AI reliability as these tools become more widely used.
  • Experts advise users to be cautious and always double-check AI-generated content.

We’ve all encountered characters in stories, and sometimes in real life, who are brilliant yet not entirely trustworthy. A similar dynamic might be unfolding with artificial intelligence, according to an OpenAI investigation highlighted by TechRadar.

AI chatbots have been known to create “hallucinations”—imaginary facts or outright falsehoods—since their inception. While improvements are expected to reduce these errors, recent findings suggest a different trend for some newer models.

OpenAI’s latest flagship models, o3 and o4-mini, were developed to mimic human-like logic, thinking through problems step by step. This is a shift from older models, which primarily focused on generating fluent text.

Despite claims that one model, o1, could rival PhD students in certain fields, OpenAI’s own report revealed some concerning results. The o3 model, for instance, included hallucinations in a third of its responses on PersonQA, a benchmark built around questions about public figures. That error rate was roughly double that of o1, released last year.

The more compact o4-mini model fared even worse, hallucinating in 48% of the same tasks. On SimpleQA, a general-knowledge benchmark, the errors ballooned: 51% of responses from o3 and a staggering 79% from o4-mini contained inaccuracies.

One idea circulating in the AI research community is that the more an AI model tries to reason, the more chances it has to get things wrong. Simpler models might stick to high-confidence answers, but reasoning models explore complex paths, connect various pieces of information, and essentially improvise. And improvising with facts can easily lead to fabrication.

OpenAI suggested to The New York Times that this increase in hallucinations might not solely be because reasoning models are inherently flawed. Instead, they could be more verbose and “adventurous” in their answers, sometimes blurring the line between plausible theory and made-up information.

Regardless of the cause, more hallucinations are the opposite of what companies developing AI, like OpenAI, Google, and Anthropic, want from their top-tier systems. Labeling AI chatbots as “assistants” or “copilots” implies helpfulness, not potential misinformation.

There have already been cases of professionals, like lawyers, getting into trouble for using AI-generated content that included fabricated citations. As AI systems are increasingly deployed in classrooms, offices, hospitals, and government agencies, the opportunities for such errors to cause problems multiply.

The paradox is that the more useful AI aims to be—assisting with job applications, resolving billing issues, or analyzing data—the less tolerance there is for error. If users have to spend significant time double-checking AI output, the promised time-saving benefits diminish.

This isn’t to say these models aren’t impressive. o3, for example, has shown remarkable capabilities in coding and logical tasks, sometimes outperforming humans. But the moment an AI confidently states something demonstrably false, like Abraham Lincoln hosting a podcast, the illusion of its reliability can shatter.

Until these issues with accuracy are thoroughly addressed, it’s wise to approach any response from an AI model with a healthy dose of skepticism. Sometimes, a confident AI can resemble that one person in a meeting who speaks with great assurance about something completely incorrect.
