Could machines surpass human intelligence?
Image credit: chan2545/iStockphoto/Getty Images
To hear the leaders of AI companies tell it, the coming decade will transform human history: we are entering an era of “radical abundance”, an optimistic vision in which breakthroughs in high-energy physics arrive and space colonization begins. Yet researchers working with today’s most powerful AI systems tell a different story. In practice, even the top-performing models struggle with basic tasks that most people find easy. So who should we believe?
According to Sam Altman, head of OpenAI, and Demis Hassabis, head of Google DeepMind, a transformative AI system is on the horizon. In a recent blog post, Altman predicted that the 2030s will be wildly different from any decade before, suggesting we might go from major breakthroughs in materials science one year to true high-bandwidth brain-computer interfaces the next.
In an interview with Wired, Hassabis likewise projected a momentous decade ahead, claiming that artificial general intelligence (AGI) will begin to crack major challenges such as curing serious diseases, leading to healthier and longer lives. If all of this transpires, he says, it should bring an era of maximum human flourishing in which humanity travels to the stars.
This optimistic outlook relies heavily on the assumption that large language models (LLMs) such as ChatGPT will keep improving as they are given more data and computing power. This “scaling” approach has held up in recent years, but there are signs it is beginning to falter. OpenAI’s GPT-4.5 model, for example, showed only modest gains over its predecessor, GPT-4, despite likely costing hundreds of millions of dollars to train. And that expense is dwarfed by spending still to come, with Meta reportedly set to announce a $15 billion investment in its attempt to achieve “superintelligence”.
Money isn’t the only way the industry is tackling these problems, though. AI companies have also turned to “reasoning” models, such as OpenAI’s o1, released last year. These models use more computing time and take longer to produce a response, feeding their own output back into themselves in an iterative process meant to mimic human “thinking”. Noam Brown at OpenAI, who had previously warned about the limits of AI progress, said last year that o1 and models like it showed this kind of scaling could continue.
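To make that idea concrete, the sketch below is a minimal, hypothetical illustration of how a reasoning model spends extra computation at answer time: it repeatedly generates intermediate text and feeds it back into its own context before committing to a final answer. The `call_llm` function is a stand-in for any text-generation API, not OpenAI’s or DeepSeek’s actual implementation.

```python
# Hypothetical sketch of test-time "reasoning": intermediate output is fed
# back into the model's context before a final answer is produced.
# `call_llm` is a placeholder, not a real API from any provider.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a hosted language model."""
    return "(model output would appear here)"

def answer_with_reasoning(question: str, rounds: int = 3) -> str:
    context = question
    for _ in range(rounds):
        # Ask the model to extend its own working-out, then append that
        # output so the next round can read and build on it.
        step = call_llm(context + "\n\nContinue thinking step by step.")
        context += "\n" + step
    # Only after the iterative "thinking" does the model commit to an answer.
    return call_llm(context + "\n\nNow give only the final answer.")

print(answer_with_reasoning("How many moves does a 3-disc Tower of Hanoi take?"))
```

The extra rounds are why these models consume more tokens and take longer to respond than standard LLMs.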
Recent research suggests, however, that these reasoning models can stumble even on simple logic puzzles. A study by Apple researchers found that models including DeepSeek’s reasoning model and Anthropic’s Claude “thinking” model faltered on basic tasks. The researchers reported that the models showed limitations in exact computation, failing to apply explicit algorithms and reasoning inconsistently across puzzles.
The researchers tested the models on several puzzles, including one in which a person must transport items in the fewest possible moves, and the Tower of Hanoi, in which discs must be moved between pegs one at a time without ever placing a larger disc on a smaller one. The models could handle the simplest versions, but their performance fell apart as complexity increased. The study also found that for more complex problems, where one would expect the models to spend longer “thinking”, they actually used fewer “tokens” (the chunks of data that models process), suggesting the “thinking” time they display may be illusory.
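For reference, the Tower of Hanoi is exactly the kind of task an explicit algorithm handles easily. The classic recursive solution below is a plain Python sketch, not code from the Apple study; it solves any number of discs, and its move count of 2^n − 1 shows how quickly the puzzle ramps up in complexity as discs are added.

```python
# Classic recursive Tower of Hanoi solver, illustrating the kind of explicit
# algorithm the Apple researchers say the models fail to apply.
# (Illustrative sketch only; not taken from the study.)

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the list of moves that transfers n discs from source to target."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # park n-1 discs on the spare peg
    moves.append((source, target))               # move the largest remaining disc
    hanoi(n - 1, spare, target, source, moves)   # stack the n-1 discs back on top
    return moves

if __name__ == "__main__":
    for n in (3, 7, 10):
        print(n, "discs ->", len(hanoi(n)), "moves")  # always 2**n - 1
```

Even a 10-disc version requires 1023 moves, illustrating how fast these puzzles outgrow shallow pattern matching.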
“It’s concerning, because these are problems that can be easily solved,” says Artur Garcez at City, University of London. “We mastered symbolic AI reasoning techniques for these tasks 50 years ago.” The new systems may eventually be fixed and improved enough to work through complex problems, but this research suggests it is unlikely to happen simply by increasing the size of the models or the computing power given to them, says Garcez.
The new results also highlight these models’ persistent difficulty with scenarios they haven’t encountered in their training data, says Nikos Aletras at the University of Sheffield. “In general, they perform well at information retrieval, summarization and related tasks because that is what they were trained on, so they can come across as impressive without being truly adaptive,” says Aletras. “Apple’s research has highlighted a significant blind spot.”
Other research also indicates that extending “thinking” time can hurt an AI model’s performance. Soumya Suvra Ghosal and colleagues at the University of Maryland tested DeepSeek’s models and found that longer “chains of thought” reduced accuracy on mathematical reasoning tests. On one mathematical benchmark, tripling the number of tokens a model used improved its performance by around 5%, but using 10 to 15 times as many tokens dropped its score by roughly 17%.
In some cases, the “chain of thought” an AI produces bears little relation to the final answer it gives. When testing DeepSeek’s models on navigating simple mazes, Subbarao Kambhampati at Arizona State University found that even when the AI solved the problem, its “chain of thought” contained mistakes that weren’t reflected in the final solution. What’s more, feeding the AI a meaningless “chain of thought” could sometimes produce more accurate answers.
“Our results challenge the prevailing assumption that intermediate tokens or ‘chains of thought’ provide a meaningful trace of the internal reasoning of AI models,” says Kambhampati.
Taken together, the recent studies suggest that words like “thinking” and “reasoning” are misleading labels for what these AI models do, says Anna Rogers at the IT University of Copenhagen. “Many of the leading techniques I have encountered in this field have been accompanied by vague, cognitively inspired analogies that ultimately proved incorrect.”
Andreas Vlachos at the University of Cambridge points out that LLMs still have clear applications in text generation and other tasks, but says the latest results suggest Altman and Hassabis may struggle to crack the complex problems they expect to solve within the next few years.
“Fundamentally, there is a mismatch between what these models are trained to do, which is predicting the next word, and what we are trying to get them to do, which is to reason,” says Vlachos.
OpenAI disagrees. “Our research shows that reasoning methods like chain of thought can significantly improve performance on complex problems, and we are actively working to extend these capabilities through better training, evaluation and model design,” says a spokesperson. DeepSeek did not respond to a request for comment.
Source: www.newscientist.com