Late last year, I published the first theoretical physics paper in which the main idea originated from artificial intelligence – from an AI. And my experience working with the most powerful AI models left me both impressed and wary. The most accurate analogy I can offer is that it’s like something familiar to anyone who has done research: a brilliant colleague who is also unreliable. This colleague can produce deep insights at surprising speed – and then, the next second, make errors that range from the trivial to the profound. That tension – capability versus reliability – has shaped how I now use these systems in mathematics and theoretical physics. It is also what will shape how they affect scientific research over the next few years.
AI can produce deep insights at surprising speed – and then, the next second, make a profound error
My paper began with a diagnostic exercise: I wanted to test whether a frontier model could accurately summarize and reason about a technical paper I wrote in 2014 on nonlinear quantum mechanics. When I asked GPT-5 to explain that 2014 paper, it produced what I regarded as an essentially perfect summary. I then pushed it to extend the findings to relativistic quantum field theory. And in the course of that back-and-forth – while I was still in “testing comprehension” mode – the model volunteered a specific, concrete research direction: use the Tomonaga-Schwinger formulation of quantum field theory to make the tension between relativistic invariance and state-dependent evolution precise. You don’t need to know what the Tomonaga-Schwinger formulation of quantum field theory is, only that it was that suggestion, accompanied by a set of equations generated de novo by the AI, that evolved into the core of a peer-reviewed paper.
This is a useful way to think about what these models can do today. They can “see” patterns across wide areas of the literature and recombine techniques and viewpoints quickly. Sometimes that recombination is empty – just plausible language. But sometimes it is the kind of reframing that a working scientist recognizes as worth pursuing.
I wrote a companion paper on how to use AI in theoretical research and discussed it on my podcast, Manifold.
So the AI was impressive and useful, but the reason I do not treat these systems as autonomous researchers is straightforward: they can be wrong in ways that are costly.
The reliability problem is real, and it is not just a matter of math errors. We know these models can make simple mistakes in a calculation. But the more dangerous failure is an incorrect conceptual leap that is written persuasively and sounds as if it belongs in the literature. In my own work, I saw models suggest technically sophisticated “bridges” to distant subfields that led to unproductive wild goose chases.
That experience is why I now emphasize workflow as much as raw model intelligence. The right question is not “Can the model reason like a person?” but “Can we structure the interaction so that errors are detected early and cheaply?”
The most effective method I have found is a simple protocol: separate generation from verification. In one instance, a model generates a derivation, proof outline, or research step; in another instance – prompted to be skeptical – a model checks it independently. In practice I often use multiple verifiers and, at times, multiple model families, because different systems fail in different ways.
“Generate-verify” is the practical trick that changes everything. This may sound mundane, but the effect is dramatic. It is not so different from walking down the hall to explain a new idea to a colleague: you are not outsourcing judgment, you are forcing your argument through an adversarial lens that reliably catches weak spots.
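To make the protocol concrete, here is a minimal sketch in Python. It assumes only a generic query function that sends a prompt to a named model and returns its reply; the model names, prompts and the verdict convention are hypothetical placeholders, not any particular vendor’s API.

```python
from typing import Callable

# Hypothetical interface: (model_name, prompt) -> text reply.
# Swap in whichever client you actually use.
QueryFn = Callable[[str, str], str]

SKEPTIC_PROMPT = (
    "You are a skeptical referee. Check the following derivation step by step, "
    "listing every error or unjustified leap. "
    "End with VERDICT: SOUND or VERDICT: FLAWED.\n\n{draft}"
)

def generate_then_verify(task: str, query: QueryFn,
                         generator: str = "model-a",
                         verifiers: tuple[str, ...] = ("model-a", "model-b")) -> dict:
    """Separate generation from verification: one instance drafts the argument;
    fresh, skeptical instances (ideally from different model families) check it."""
    draft = query(generator, f"Produce a careful, step-by-step derivation for: {task}")
    reviews = [query(v, SKEPTIC_PROMPT.format(draft=draft)) for v in verifiers]
    # Accept only if every verifier signs off; otherwise the draft goes back for revision.
    accepted = all("VERDICT: SOUND" in r for r in reviews)
    return {"draft": draft, "reviews": reviews, "accepted": accepted}
```

The detail that matters is the separation: each verifier sees only the finished draft, not the generator’s conversation, so it cannot simply go along with the reasoning that produced it.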
The same logic now underlies some of the most credible demonstrations of “high-end” AI mathematics.
Last year, multiple teams reported strong results on new problems from the International Mathematical Olympiad (IMO), the most prestigious high-school math competition in the world. A particularly clear example is a paper by Yichen Huang and Lin Yang showing that a verification-and-refinement pipeline, wrapped around several leading models, solved five of the six IMO 2025 problems – a “gold medal” performance. The underlying models, used on their own, performed much worse, typically solving only one of the six.
Google DeepMind also reported that an advanced version of its AI model, Gemini, achieved gold-medal standard on the same IMO set, again solving five of the six problems. Independent coverage at the time treated this as a genuine milestone while noting that the systems were expensive to run and not the same as routine consumer chatbots.
It is important not to read too much into these achievements. Contest problems are narrow and they reward a very specific kind of formal reasoning. But they are also hard in a way that is relevant to science: they require multistep planning, careful symbolic manipulation and the discipline to avoid hidden gaps. And the big lesson is that “one-shot” prompting is a misleading baseline. With verification and refinement, models can reach a level that very few humans can match, at least for certain classes of formal tasks.
I saw AI models suggest technically sophisticated ‘bridges’ that actually led to unproductive wild goose chases
If the IMO is a clean benchmark, the world of open conjectures is messier. The “Erdős problems” are a large collection of conjectures and questions associated with the Hungarian mathematician Paul Erdős, curated online; they vary enormously in difficulty and in how precisely they are stated. Here, too, we are beginning to see credible examples of AI-assisted progress – along with plenty of noise. Mathematician Terence Tao wrote a detailed account last December of an Erdős problem that was solved, and then formalized in the Lean proof assistant, via a human-AI workflow using an AI tool called “Aristotle.”
Tao and others have also urged caution: the surge of claimed AI solutions includes a wide range of quality. The math community is actively developing norms for what counts as a meaningful “AI contribution.”
That is roughly where we are today: not with a flood of solved open problems, but with a growing number of instances where AI systems help generate an argument before formal tools (sometimes with human help) verify it rigorously.
There is a serious skeptical position that deserves respect: that large language models do not “reason” at all, in the way humans mean it. They generate text that resembles reasoning because they have been trained on vast amounts of human writing. In the harshest version of this critique, they are “stochastic parrots.”
However, even the most pessimistic characterization leaves room for a practical point: if the model is noisy, verification can cancel some of the noise. And, more broadly, whether the system has “real understanding” is not the only question that matters.
What matters for working scientists is whether an AI can reliably produce useful intermediate objects: a derivation that can be checked, a proof strategy worth testing, a counterexample to a claim, a translation of a mathematical statement into a formal language, or a structured plan for computational experiments.
In mathematics, a proof is a proof. If a model generates a novel proof that withstands human scrutiny – or, better, is formally verified – then something real has happened, regardless of what we think is “going on inside” the model. The same standard applies, more cautiously, in theoretical science: if an AI-suggested framing leads to a correct result that survives peer review, it is a contribution, even if the AI arrived there by pattern completion rather than human-like insight.
So what changes in science if this keeps improving? Well, the near-term effect is not the replacement of scientists, but a change in the economics of exploration. When models become stronger “symbolic manipulators,” they lower the cost of trying ideas. In many fields, the bottleneck is not waiting for a single brilliant insight, but wading through the algebra, the special cases, the literature and the edge conditions that separate a plausible idea from a defensible result. A verified AI workflow can compress that slog.
Large language models may not ‘reason’ at all. They generate text that resembles reasoning
Over time, I expect three concrete shifts. The first is that verification will become routine. In the same way that software engineering adopted unit tests and code review, technical research will increasingly treat “generate-verify” loops as standard hygiene, not an exotic trick.
Next, agentic pipelines will matter more than chat interfaces. The most interesting systems will not be single chatbots but orchestration frameworks that run multiple specialized agents – generator, critic, formalizer, searcher – before presenting a consolidated answer. This is already visible in the way IMO-level results are achieved: by turning raw model ability into a formal, structured process.
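As a toy illustration – not a description of any existing system, and with the agent roles, prompts and stopping rule as invented placeholders – such an orchestration loop might look like this:

```python
from typing import Callable

# Hypothetical interface: each agent maps a prompt to a text reply.
Agent = Callable[[str], str]

def orchestrate(problem: str, generator: Agent, critic: Agent,
                formalizer: Agent, max_rounds: int = 3) -> str:
    """Toy pipeline: the generator drafts, the critic attacks, the draft is
    revised until the critic is satisfied or the round budget runs out, and
    only the surviving draft is handed to the formalizer."""
    draft = generator(f"Propose a solution to: {problem}")
    for _ in range(max_rounds):
        critique = critic(f"Find flaws in this proposed solution:\n{draft}")
        if "NO FLAWS FOUND" in critique:  # placeholder convention for the critic's sign-off
            break
        draft = generator(f"Revise the solution to address these flaws:\n{critique}\n\nDraft:\n{draft}")
    return formalizer(f"Translate the argument into a formal, checkable form:\n{draft}")
```

A searcher agent for literature and prior-art checks would slot into the same loop; the consolidated answer is what the user finally sees.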
Finally, small groups will attempt bigger projects. By exploring many more candidate approaches and ruling out dead ends more quickly, a small group can attack questions that previously required a larger team or more time. None of this guarantees that AI will routinely produce breakthroughs in science on its own. But even if today’s models are not “reasoners” in the philosophical sense, they can still be productive partners – provided we remember that though they can be brilliant, they are also unreliable.