AIs are improving at solving mathematics challenges
Andresr/Getty Images
AI models developed by Google DeepMind and OpenAI have achieved exceptional performance at the International Mathematical Olympiad (IMO).
While companies herald this as a significant advancement for AIs that might one day tackle complex scientific or mathematical challenges, mathematicians urge caution, as the specifics of the models and their methodologies remain confidential.
The IMO is one of the most respected contests for young mathematicians, often viewed by AI researchers as a critical test of mathematical reasoning, an area where AI traditionally struggles.
Following last year’s competition in Bath, UK, Google said its AI systems, AlphaProof and AlphaGeometry, had achieved silver-medal-level performance, though their submissions were not evaluated by the official competition judges.
Several companies, including Google, Huawei and TikTok’s parent company ByteDance, approached the IMO organizers requesting formal evaluation of their AI models during this year’s contest, according to Gregor Dolinar, president of the IMO. The IMO consented, stipulating that results be revealed only after the full closing ceremony on July 28th.
OpenAI also expressed interest in taking part, but after being informed of the official procedures it neither responded nor registered, according to Dolinar.
On July 19th, OpenAI announced, separately from the official competition, that it had developed a new AI that achieved a gold-medal score, with its answers graded by three former IMO medalists. OpenAI stated the AI correctly answered five out of six questions within the same 4.5-hour time limit as human competitors.
Two days later, Google DeepMind revealed that its AI system, Gemini Deep Think, had also achieved gold-level performance within the same constraints. Dolinar confirmed that this result was validated by the official IMO judges.
Unlike Google’s AlphaProof and AlphaGeometry, which required the competition questions to be translated into a programming language called Lean, this year’s systems from both Google and OpenAI worked entirely in natural language. Lean allowed answers to be verified for correctness quickly, though its output is challenging for non-experts to interpret. Thang Luong from Google indicated that a natural-language approach yields more comprehensible results while remaining applicable to broadly useful AI frameworks.
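To illustrate the trade-off, here is a minimal Lean 4 snippet (purely illustrative, not taken from either company’s system): the proof assistant checks it mechanically, so correctness is guaranteed if it compiles, but the notation assumes familiarity with Lean’s syntax.

```lean
-- A minimal, hypothetical Lean 4 example; not from AlphaProof.
-- If this file compiles, Lean has verified that the statement is true.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```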
Luong noted that advances in reinforcement learning, a training technique that steers an AI by rewarding success and penalizing failure, have enabled large language models to learn from problems whose solutions can be verified, an approach central to Google’s earlier game-playing AIs such as AlphaZero.
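Neither company has published its training recipe, but the idea Luong describes can be sketched as a toy loop in which a policy is nudged toward outputs that a verifier rewards. Everything below, including the verifier function, is a hypothetical stand-in.

```python
import random

# Toy sketch of verification-driven reinforcement learning. Illustrative
# only: neither Google nor OpenAI has published training details. The
# "policy" here is a single probability of producing a verifiable
# solution, nudged toward whatever the verifier rewards.

def verifier(solution: str) -> float:
    """Hypothetical reward signal: 1.0 if a checker accepts the solution."""
    return 1.0 if solution == "verifiable" else 0.0

p = 0.5    # initial probability of producing a verifiable solution
lr = 0.01  # learning rate

for _ in range(2000):
    solution = "verifiable" if random.random() < p else "flawed"
    reward = verifier(solution)
    # Trial and error: reinforce the sampled behavior in proportion to
    # the reward it earned (a crude REINFORCE-style update).
    direction = 1.0 if solution == "verifiable" else -1.0
    p = min(max(p + lr * direction * reward, 0.01), 0.99)

print(f"final probability of producing a verifiable solution: {p:.2f}")
```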
Google’s model employs a technique known as parallel thinking, exploring multiple candidate solutions simultaneously rather than committing to a single line of reasoning. Its training data included mathematical problems particularly relevant to the IMO.
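Google has not disclosed how Deep Think implements this, but parallel thinking can be sketched as a best-of-n search: sample several candidate solutions concurrently and keep one that passes a check. The function names below are hypothetical stand-ins, not Google’s API.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of "parallel thinking" as best-of-n sampling.
# model_solve and verify are hypothetical stand-ins: Google has not
# disclosed how Gemini Deep Think explores or scores its branches.

def model_solve(problem: str, seed: int) -> str:
    """Hypothetical model call; each seed yields a different attempt."""
    return f"candidate solution #{seed} for: {problem}"

def verify(candidate: str) -> bool:
    """Hypothetical acceptance test (a grader or formal checker)."""
    return "#3" in candidate  # placeholder: pretend attempt 3 checks out

def parallel_think(problem: str, n: int = 8) -> str | None:
    # Explore n lines of reasoning at once instead of committing to one.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: model_solve(problem, s),
                                   range(n)))
    # Keep the first candidate that survives verification, if any.
    return next((c for c in candidates if verify(c)), None)

print(parallel_think("Prove the statement in question 1."))
```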
OpenAI has disclosed few specifics about its system, saying only that it incorporates reinforcement learning and “experimental research methods.”
“While progress appears promising, it lacks rigorous scientific validation, making it difficult to assess at this point,” remarked Terence Tao from UCLA. “We anticipate that the participating companies will publish papers featuring more comprehensive data, allowing others to access the model and replicate its findings. However, for now, we must rely on the companies’ claims regarding their results.”
Geordie Williamson from the University of Sydney shared this sentiment, stating, “It’s remarkable to see advancements in this area, yet it’s frustrating how little in-depth information is available from inside these companies.”
Natural language systems might be beneficial for individuals without a mathematical background, but they also risk presenting complications if models produce lengthy proofs that are hard to verify, warned Joseph Myers, a co-organizer of this year’s IMO. “If AIs generate solutions to significant unsolved questions that seem plausible yet contain subtle, critical errors, we must be cautious before putting confidence in lengthy AI outputs.”
The companies plan to make these systems available to mathematicians for testing in the coming months before broader public releases. The models could eventually offer rapid solutions to challenging problems in scientific research, said June Hyuk Jeong from Google, who contributed to Gemini Deep Think. “There are numerous unresolved challenges within reach,” he noted.
Source: www.newscientist.com
