Gemini 3 is Google’s latest AI model
Google’s newest AI model, Gemini 3, has made remarkable gains on a range of benchmarks designed to evaluate AI progress, according to the company. While those results may ease concerns about a potential AI bubble for now, it is unclear how well such scores reflect real-world performance.
Moreover, the persistent problems of factual errors and hallucinations that plague large language models remain unresolved, a particular concern in scenarios where accuracy is critical.
In a blog post announcing the new model, Google leaders Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu stated that Gemini 3 possesses “PhD-level reasoning,” a term also used by competitor OpenAI when it released its GPT-5 model. They presented scores from several assessments intended to measure “graduate-level” knowledge, such as Humanity’s Last Exam, which comprises 2500 research-level questions spanning mathematics, science, and the humanities. Gemini 3 scored 37.5 percent on this exam, surpassing the previous record held by OpenAI’s GPT-5, which scored 26.5 percent.
Such improvements could indicate that the model has developed enhanced capabilities in certain areas. However, Luc Rocher at the University of Oxford urges caution in interpreting these results. “If a model increases its score from 80 percent to 90 percent on a benchmark, what does that represent? Does it mean the model was 80 percent PhD-level and is now 90 percent? This is quite difficult to ascertain,” he says. “It’s challenging to quantify whether an AI model is reasoning, because that concept is highly subjective.”
Benchmark tests have numerous limitations, including their reliance on single-answer or multiple-choice responses that do not require the model to show how it reached its conclusions. “It’s straightforward to evaluate models using multiple-choice questions,” notes Rocher. “Yet in real-world scenarios—like visiting a doctor—you are not assessed with multiple-choice questions. Likewise, a lawyer does not give legal advice by picking from a list of answers.” There is also the risk that the answers to such tests are included in the training data of the AI models being assessed, essentially allowing them to cheat.
The ultimate test of whether Gemini 3 and other advanced AI models justify the massive investments that companies like Google and OpenAI are making in AI data centers will come down to user experience and whether people find these tools trustworthy, according to Rocher.
Google says improvements to the model will help users develop software, manage emails, and analyze documents more effectively. The company also says it will improve Google searches, providing AI-generated results alongside graphics and simulations.
Perhaps the most significant advance, according to Adam Mahdi at the University of Oxford, is in AI tools that write code autonomously, an approach known as agentic coding. “We might be approaching the limits of what traditional chatbots can achieve, and it is here that the true advantages of Gemini 3 Pro [the standard version of Gemini 3] come into play. It’s likely that it won’t be used for everyday conversations, but rather for more intricate and potentially agent-based workflows,” he explains.
Initial reactions online praised Gemini’s impressive coding and reasoning abilities. However, as is typical with new model releases, some users pointed to failures on seemingly simple tasks, such as drawing an arrow or solving a straightforward visual reasoning puzzle.
Google acknowledges in Gemini 3’s technical documentation that the model still hallucinates at a rate similar to other leading AI models, sometimes presenting inaccuracies as fact. This lack of progress is a significant concern, according to Artur d’Avila Garcez at City St George’s, University of London. “The challenge lies in the fact that AI companies have been striving to minimize hallucinations for over two years, yet even one severely misleading hallucination can irreparably damage trust in the system,” he warns.
Source: www.newscientist.com
