Case study: Creative math – How AI fakes proofs

simonw · 2026-01-26T01:45:50 1769391950

Somewhat ironic that the author calls out model mistakes and then presents https://tomaszmachnik.pl/gemini-fix-en.html - a technique they claim reduces hallucinations which looks wildly superstitious to me.

It involves spinning a whole yarn to the model about how it was trained to compete against other models but now it's won so it's safe for it to admit when it doesn't know something.

I call this a superstition because the author provides no proof that all of that lengthy argument with the model is necessary. Does replacing that lengthy text with "if you aren't sure of the answer say you don't know" have the same exact effect?

plaguuuuuu · 2026-01-26T02:24:14 1769394254

Think of the lengthy prompt as being like a safe combination, if you turn all the dials in juuust the right way, then the model's context reaches an internal state that biases it towards different outputs.

I don't know how well this specific prompt works - I don't see benchmarks - but prompting is a black art, so I wouldn't be surprised at all if it excels more than a blank slate in some specific category of tasks.

simonw · 2026-01-26T02:45:15 1769395515

For prompts this elaborate I'm always keen on seeing proof that the author explored the simpler alternatives thoroughly, rather than guessing something complex, trying it, seeing it work and announcing it to the world.

manquer · 2026-01-26T02:41:52 1769395312

It needs some evidence though? At least basic statistical analysis with correlation or χ² hypothesis tests.

It is not a “black art” or anything; there are plenty of tools that provide numerical analysis with confidence intervals.
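A minimal sketch of the kind of check meant here, assuming you have run the same set of questions under two prompts and counted how many answers were acceptable under each. The function name and the counts are hypothetical, invented purely for illustration; nothing here comes from the article's data.

```python
import math

def chi_square_2x2(a_success, a_fail, b_success, b_fail):
    """Pearson chi-square test of independence on a 2x2 table (no continuity correction)."""
    table = [[a_success, a_fail], [b_success, b_fail]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the p-value is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: elaborate prompt 70/100 acceptable, one-line prompt 62/100.
chi2, p_value = chi_square_2x2(70, 30, 62, 38)
print(f"chi2={chi2:.3f} p={p_value:.3f}")
```

With counts like these, a difference that looks meaningful by eye would not clear a conventional 0.05 threshold, which is the point of asking for the test before crediting the longer prompt.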

v_CodeSentinal · 2026-01-26T00:21:02 1769386862

This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.

The only fix is tight verification loops. You can't trust the generative step without a deterministic compilation/execution step immediately following it. The model needs to be punished/corrected by the environment, not just by the prompter.
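A rough sketch of such a loop, under stated assumptions: `generate` is a hypothetical callable standing in for whatever model call you use, and the deterministic check here is only Python's built-in `compile()`, which catches syntax errors but not invented methods. A real setup would swap in a type checker or a test run at the same point.

```python
from typing import Callable, Optional

def check_python_source(source: str) -> Optional[str]:
    """Deterministic check: return None if the source compiles, else the error text."""
    try:
        compile(source, "<candidate>", "exec")
        return None
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"

def generate_with_verification(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Call the generator until its output passes the check or attempts run out."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(task, feedback)
        feedback = check_python_source(candidate)  # the environment, not the prompter, judges
        if feedback is None:
            return candidate
    return None

def fake_model(task: str, feedback: Optional[str]) -> str:
    # Stand-in for a real model: the first attempt is broken, the retry is fixed.
    if feedback is None:
        return "def add(a, b) return a + b"
    return "def add(a, b):\n    return a + b\n"

result = generate_with_verification(fake_model, "write add(a, b)")
```

The design choice is that the error text from the checker is passed back verbatim as feedback, so the correction comes from the environment rather than from a human rereading the output.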

seanmcdirmid · 2026-01-26T06:27:11 1769408831

Yes, and better still, the AI will fix its mistakes if it has access to verification tools directly. You can also have it write and execute tests, and then on failure decide whether the code it wrote or the tests it wrote are wrong. While there is a chance of confirmation bias, it often works well enough.

SubiculumCode · 2026-01-26T02:39:20 1769395160

Honestly, I feel humans are similar. It's the generator <-> executive loop that keeps things right

CamperBob2 · 2026-01-26T04:42:01 1769402521

> This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.

Often, if not usually, that means the method should exist.

zoho_seni · 2026-01-26T00:30:01 1769387401

I've been using Codex and have never had a compile-time error by the time it finishes. Maybe tell your agents to run the TS compiler, lint and format before they finish, and only stop when everything passes.

exitb · 2026-01-26T06:48:30 1769410110

I’m not sure why you were downvoted. It’s a primary concern for any agentic task to set it up with a verification path.

threethirtytwo · 2026-01-26T01:06:04 1769389564

You don’t need a test to know this. We already know there’s heavy reinforcement training done on these models, so they optimize for passing the training. Passing the training means convincing the person rating the answers that the answer is good.

The keyword is convince. So it just needs to convince people that it’s right.

It is optimizing for convincing people. Out of all answers that can convince people some can be actual correct answers, others can be wrong answers.

godelski · 2026-01-26T02:25:48 1769394348

Yet people often forget this. We don't have mathematical models of truth, beauty, or many abstract things. Thus we proxy it with "I know it when I see it." It's a good proxy for lack of anything better but it also creates a known danger: the model optimizes deception. The proxy helps it optimize the answers we want but if we're not incredibly careful they also optimize deception.

This makes them frustrating and potentially dangerous tools. How do you validate a system optimized to deceive you? It takes a lot of effort! I don't understand why we are so cavalier about this.

threethirtytwo · 2026-01-26T06:32:31 1769409151

No, the question is: how do you train the system so it doesn't deceive you?

mlpoknbji · 2026-01-26T03:14:22 1769397262

This also can be observed with more advanced math proofs. ChatGPT 5.2 pro is the best public model at math at the moment, but if pushed out of its comfort zone will make simple (and hard to spot) errors like stating an inequality but then applying it in a later step with the inequality reversed (not justified).

comex · 2026-01-26T05:16:04 1769404564

I like how this article was itself clearly written with the help of an LLM.

(You can particularly tell from the "Conclusions" section. The formatting, where each list item starts with a few-word bolded summary, is already a strong hint, but the real issue is the repetitiveness of the list items. For bonus points there's a "not X, but Y", as well as a dash, albeit not an em dash.)

YetAnotherNick · 2026-01-26T05:18:01 1769404681

Not only that, it even looks like the fabrication example was generated by AI, as the entire question seems too "fabricated". Also, the Gemini web app queries the tool and returns the correct answer, so I don't know which Gemini the author is talking about.

pfg_ · 2026-01-26T05:55:18 1769406918

Probably gemini on aistudio.google.com, you can configure if it is allowed to access code execution / web search / others

fourthark · 2026-01-26T05:47:27 1769406447

“This is key!”

aniijbod · 2026-01-26T02:40:41 1769395241

In the psychology of creativity, 'extrinsic rewards' are phenomena that distort the motivational setting for creative problem-solving. Management theory ran into this with the advent of 'gamification' as a motivational toolkit, where 'scores' and 'badges' were awarded to participants in online activities. The psychological community reacted by pointing out that earlier research had shown that, whilst extrinsic rewards can indeed (at least initially) boost participation by introducing notions of competitiveness, they are ultimately poor substitutes for the far more sustainable and productive intrinsic motivational factors, like curiosity, if these can be stimulated effectively (something which itself inevitably requires more creativity on the part of the designer of the motivational resources). It seems that the motivational analogue in inference engines is an extrinsic reward process.

godelski · 2026-01-26T02:12:59 1769393579

I thought it funny that a few weeks ago Karpathy shared a sample of Nano Banana solving some physics problems, but despite getting the right output it isn't getting the right answers.

I think it's quite illustrative of the problem even with coding LLMs. Code and math proofs aren't so different: what matters is the steps taken to generate the output. That matters far more than the actual output. The output is meaningless if the steps to get there aren't correct. You can't just jump to the last line of a proof to determine its correctness, and similarly you can't just look at a program's output to determine its correctness.

Checking outputs is a great way to invalidate them but does nothing to validate them.

Maybe what surprised me most is that the mistakes Nano Banana made are simple enough that I'm absolutely positive Karpathy could have caught them, even if his physics is very rusty. I'm often left wondering if people really are true believers and becoming blind to the mistakes or if they don't care. It's fine to make mistakes but I rarely see corrections and, let's be honest here, these are mistakes that people of this caliber should not be making.

I expect most people here can find multiple mistakes with the physics problem. One can be found if you know what the derivative of e^x is and another can be found if you can count how many i's there are.

The AI cheats because it's focused on the output, not the answer. We won't solve this problem till we recognize that the output and the answer aren't synonymous.

https://xcancel.com/karpathy/status/1992655330002817095

zadwang · 2026-01-26T04:16:14 1769400974

The simpler and, I think, correct conclusion is that the LLM simply does not reason in our sense of the word. It mimics the reasoning pattern and tries to get it right but cannot.

esafak · 2026-01-26T05:42:41 1769406161

What do you make of human failures to reason then?

bwfan123 · 2026-01-26T00:55:45 1769388945

I am actually surprised that the LLM came so close. I doubt it had examples in its training set for these numbers. This goes to the heart of "know-how". The LLM should have said "I am not sure", but instead it gets into rhetoric to justify itself. It actually mimics human motivated reasoning. At orgs, management is impressed with this overconfident motivated reasoner because it mirrors themselves. To hell with the facts and the truth; persuasion is all that matters.

benreesman · 2026-01-25T23:56:16 1769385376

They can all write lean4 now, don't accept numbers that don't carry proofs. The CAS I use for builds has a coeffect discharge cert in the attestation header, couple lines of code. Graded monads are a snap in CIC.

dehsge · 2026-01-26T01:07:51 1769389671

There are some numbers that are uncomputable in Lean. You can do things to approximate them in Lean; however, those approximations may still be wrong. Lean's uncomputable namespace is very interesting.

zkmon · 2026-01-26T06:32:40 1769409160

We are entering into a probabilistic era where things are not strictly black and white. Things are not binary. There is no absolute fake.

A mathematical proof is an assertion that a given statement belongs to the world defined by a set of axioms and existing proofs. This world need not have strict boundaries. Proofs can have probabilities. Maybe the Riemann hypothesis has a probability of 0.999 of belonging to that mathematical box. New proofs would have their own probability, which is the product of the probabilities of the proofs they depend on. We should attach a probability and move on, just like how we assert that some number is probably prime.

teiferer · 2026-01-26T06:54:53 1769410493

Definitely not.

"Probability" does not mean "maybe yes, maybe not, let me assign some gut feeling value measuring how much I believe something to be the case." The mathematical field of probability theory has very precise notions of what a probability is, based in a measurable probability space. None of that applies to what you are suggesting.

The Riemann Hypothesis is a conjecture that's either true or not. More precisely, either it's provable within common axioms like ZFC or its negation is. (A third alternative is that it's unprovable within ZFC but that's not commonly regarded as a realistic outcome.)

This is black and white, no probability attached. We just don't know the color at this point.

tombert · 2026-01-26T02:26:03 1769394363

I remember when ChatGPT first came out, I asked it for a proof for Fermat's Last Theorem, which it happily gave me.

It was fascinating, because it was making a lot of the understandable mistakes that 7th graders make. For example, I don't remember the surrounding context but it decided that you could break `sqrt(x^2 + y^2)` into `sqrt(x^2) + sqrt(y^2) => x + y`. It's interesting because it was one of those "ASSUME FALSE" proofs; if you can assume false, then mathematical proofs become considerably easier.
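A single counterexample is enough to show why that step is invalid. The values 3 and 4 below are chosen here for illustration and are not from the original ChatGPT session.

```python
import math

x, y = 3.0, 4.0
lhs = math.sqrt(x**2 + y**2)             # sqrt(9 + 16) = sqrt(25)
rhs = math.sqrt(x**2) + math.sqrt(y**2)  # 3 + 4
print(lhs, rhs)  # prints "5.0 7.0"
```

The square root does not distribute over addition, so any proof that relies on this step has assumed something false.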

mlpoknbji · 2026-01-26T03:12:08 1769397128

My favorite early ChatGPT math problem was "prove there exists infinitely many even primes". Easy! Take a finite set of even primes, multiply them and add one to get a number with a new even prime factor.

Of course, it's gotten a bit better than this.

tptacek · 2026-01-26T02:49:41 1769395781

I remember that being true of early ChatGPT, but it's certainly not true anymore; GPT 4o and 5 have tagged along with me through all of MathAcademy MFII, MFIII, and MFML (this is roughly undergrad Calc 2 and then like half a stat class and 2/3rds of a linear algebra class) and I can't remember it getting anything wrong.

Presumably this is all a consequence of better tool call training and better math tool calls behind the scenes, but: they're really good at math stuff now, including checking my proofs (of course, the proof stuff I've had to do is extremely boring and nothing resembling actual science; I'm just saying, they don't make 7th-grader mistakes anymore.)

tombert · 2026-01-26T02:55:01 1769396101

It's definitely gotten considerably better, though I still have issues with it generating proofs, at least with TLAPS.

I think behind the scenes it's phoning Wolfram Alpha nowadays for a lot of the numeric and algebraic stuff. For all I know, they might even have an Isabelle instance running for some of the even-more abstract mathematics.

I agree that this is largely an early ChatGPT problem though, I just thought it was interesting in that they were "plausible" mistakes. I could totally see twelve-year-old tombert making these exact mistakes, so I thought it was interesting that a robot is making the same mistakes an amateur human makes.

tptacek · 2026-01-26T02:58:37 1769396317

I assumed it was just writing SymPy or something.

CamperBob2 · 2026-01-26T04:47:10 1769402830

> I think behind the scenes it's phoning Wolfram Alpha nowadays for a lot of the numeric and algebraic stuff. For all I know, they might even have an Isabelle instance running for some of the even-more abstract mathematics.

Maybe, but they swear they didn't use external tools on the IMO problem set.

UltraSane · 2026-01-26T03:20:43 1769397643

LLMs have improved so much the original ChatGPT isn't relevant.

segmondy · 2026-01-26T02:56:17 1769396177

if you want to do math proofs use AI built for proof

https://huggingface.co/deepseek-ai/DeepSeek-Math-V2

https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B

citizenpaul · 2026-01-26T06:20:47 1769408447

>STEP 2: The Shock (Reality Check)

I've found a funny and simple technique for this. Just write "what the F$CK" and it will often seem to unstick from repetitiveness or refusals ("I can't do that").

Actually just writing the word F#ck often will do it. Works on coding too.

James_K · 2026-01-26T05:58:38 1769407118

What's interesting about this is that a human would hypothetically produce a similar error, but in practice would reject the question as beyond their means. I'd assume something about supervised learning makes the models overestimate their abilities. It probably learns that “good” responses attempt to answer the question rather than giving up.

semessier · 2026-01-26T00:19:53 1769386793

that's not a proof

groundzeros2015 · 2026-01-26T01:03:22 1769389402

I think it’s a good way to prove x = sqrt(y). What’s your concern?

hahahahhaah · 2026-01-26T06:20:06 1769408406

it is an attempt to prove a very specific case of the theorem x = sqrt(x) ^ 2.

frontfor · 2026-01-26T01:04:28 1769389468

Agreed. Asking the AI to do a calculation isn’t the same as asking it to “prove” a mathematical statement in the usual meaning.

fragmede · 2026-01-26T00:03:34 1769385814

> a session with Gemini 2.5 Pro (without Code Execution tools)

How good are you at programming on a whiteboard? How good is anybody? With code execution tools withheld from me, I'll freely admit that I'm pretty shit at programming. Hell, I barely remember the syntax in some of the more esoteric, unpracticed places of my knowledge. Thus, it's hard not to see case studies like this as dunking on a blindfolded free throw shooter, and calling it analysis.

blibble · 2026-01-26T00:16:40 1769386600

> How good are you at programming on a whiteboard?

pretty good?

I could certainly do a square root

(given enough time, that one would take me a while)

crdrost · 2026-01-26T03:09:33 1769396973

With a slide rule you can start from 92200 or so, long division with 9.22 gives 9.31 or so, next guess 9.265 is almost on point, where long division says that's off by 39.6 so the next approximation +19.8 is already 92,669.8... yeah the long divisions suck but I think you could get this one within 10 minutes if the interviewer required you to.
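The procedure described is the divide-and-average (Babylonian) method. A minimal sketch, assuming a starting guess of 92,200 as in the comment; the target `n` is a stand-in built by squaring the comment's final estimate, not the article's actual input, and the function name is hypothetical.

```python
from typing import List

def babylonian_sqrt(n: float, guess: float, steps: int = 3) -> List[float]:
    """Return successive estimates of sqrt(n): divide, then average with the quotient."""
    estimates = [guess]
    for _ in range(steps):
        quotient = n / guess            # the "long division" step
        guess = (guess + quotient) / 2  # the midpoint becomes the next guess
        estimates.append(guess)
    return estimates

# Stand-in target: the square of the comment's final estimate, NOT the article's input.
n = 92_669.8 ** 2
estimates = babylonian_sqrt(n, 92_200.0)
for step, value in enumerate(estimates):
    print(step, value)
```

Each pass roughly doubles the number of correct digits, which is why a couple of long divisions are enough by hand. The comment rounds its intermediate guess, so its numbers differ slightly from an unrounded run.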

Also, don't take a role that interviews like that unless they work on something with the stakes of Apollo 13, haha

blibble · 2026-01-26T03:11:57 1769397117

I actually have a slide rule that was my father's in school

great for teaching logarithms

htnthrow11220 · 2026-01-26T00:24:27 1769387067

It’s like that but if the blindfolded free throw shooter was also the scorekeeper and the referee & told you with complete confidence that the ball went in, when you looked away for a second.

cmiles74 · 2026-01-26T00:55:55 1769388955

It's pretty common for software developers to be asked to code up some random algorithm on a whiteboard as part of the interview process.

rakmo · 2026-01-26T00:53:15 1769388795

Is this hallucination, or is this actually quite human (albeit a specific type of human)? Think of slimy caricatures like a used car salesman, isn't this the exact type of underhandedness you'd expect?