The cynicism/denial on HN about AI is exhausting. Half the comments are some weird form of explaining away the ever-increasing performance of these models.
I've been reading this website for probably 15 years, and it's never been this bad. Many threads are completely unreadable, all the actual educated takes are on X; it's almost like there was a talent drain.
halfmatthalfcat 45 minutes ago [-]
The overconfidence/short sightedness on HN about AI is exhausting. Half the comments are some weird form of explaining how developers will be obsolete in five years and how close we are to AGI.
Aurornis 32 minutes ago [-]
> Half the comments are some weird form of explaining how developers will be obsolete in five years and how close we are to AGI.
I do not see that at all in this comment section.
There is a lot of denial and cynicism like the parent comment suggested. The comments trying to dismiss this as just “some high school math problem” are the funniest example.
halfmatthalfcat 19 minutes ago [-]
Woosh
kenjackson 22 minutes ago [-]
I went through the thread and saw nothing that looked like this.
I don’t think developers will be obsolete in five years. I don’t think AGI is around the corner. But I do think this is the biggest breakthrough in computer science history.
I worked on accelerating DNNs a little less than a decade ago and had you shown me what we’re seeing now with LLMs I’d say it was closer to 50 years out than 20 years out.
mikert89 20 minutes ago [-]
it's very clearly a major breakthrough for humanity
halfmatthalfcat 19 minutes ago [-]
You're missing the joke homie.
infecto 26 minutes ago [-]
I don’t typically find this to be true. There is a definite cynicism on HN especially when it comes to OpenAI. You already know what you will see. Low quality garbage of “I remember when OpenAI was open”, “remember when they used to publish research”, “sama cannot be trusted”, it’s an endless barrage of garbage.
mikert89 25 minutes ago [-]
it's honestly ruining this website, you can't even read the comment sections anymore
halfmatthalfcat 11 minutes ago [-]
Incredible how many HNers cannot see this comment for what it is.
blamestross 34 minutes ago [-]
Nobody likes the idea that this is only "economically superior AI": not as good as humans, but a LOT cheaper.
The "it will just get better" line is bubble bait for the investors. The tech companies learned from the past, and they are riding and managing the bubble to extract maximum ROI before it pops.
The reality is that a lot of work done by humans can be replaced by an LLM at lower quality and nuance. The loss in sales/satisfaction/etc. is more than offset by the reduced cost.
The current crop of LLMs are enshittification accelerators, and that will have real effects.
softwaredoug 47 minutes ago [-]
Probably because both sides have strong vested interests and it’s next to impossible to find a dispassionate point of view.
The pro-AI crowd (VCs, tech CEOs, etc.) have a strong incentive to claim humans are obsolete. Tech employees see threats to their jobs and want to pooh-pooh any way AI could be useful or competitive.
orbital-decay 32 minutes ago [-]
That's a huge hyperbole. I can assure you many people find the entire thing genuinely fascinating, without having any vested interest and without buying the hype.
chii 23 minutes ago [-]
That's just another way to state that everybody is almost always self-serving when it comes to anything.
rvz 40 minutes ago [-]
Or some can spot a euphoric bubble when they see it, with lots of participants who have over-invested in 90% of these so-called AI startups that are not frontier labs.
yunwal 35 minutes ago [-]
What does this have to do with the math Olympiad? Why would it frame your view of the accomplishment?
emp17344 23 minutes ago [-]
Why don’t they release some info beyond a vague twitter hype post? I’m beginning to hate OpenAI for releasing statements like this that invariably end up being less impressive than they make it sound initially.
mikert89 18 minutes ago [-]
dude, we have computers reasoning in English to solve math problems, what are you even talking about
gellybeans 32 minutes ago [-]
Making an account just to point out how these comments are far more exhausting, because they don't engage with the subject matter. They are just agreeing with a headline and saying, "See?"
You say "explaining away the increasing performance" as though that were a good-faith representation of the arguments made against LLMs, or even of this specific article. Questioning the self-congratulatory nature of these businesses is perfectly reasonable.
thisisit 5 minutes ago [-]
This sounds like a version of "HN hates X and I am tired of it". In the last 10 years or so that I have been reading HN, X has been crypto, Musk/Tesla, and many more.
So, as much as I get the frustration, comments like these don't really add much. It's complaining about others complaining. Instead this should be taken as a signal that maybe HN is not the right forum to read about these topics.
ninetyninenine 10 minutes ago [-]
Makes sense. Everyone here has their pride and identity tied to their ability to code. HN likes to upvote articles related to IQ because coding correlates with IQ and HNers like to think they are smart.
AI is of course a direct attack on the average HNer's identity. The response you see is like attacking a Christian on his religion.
The pattern of defense is typical. When someone’s identity gets attacked they need to defend their identity. But their defense also needs to seem rational to themselves. So they begin scaffolding a construct of arguments that in the end support their identity. They take the worst aspects of AI and form a thesis around it. And that becomes the basis of sort of building a moat around their old identity as an elite programmer genius.
A telltale sign that you or someone else is doing this is when you are talking about AI and someone just comments about how they aren't afraid of AI taking over their own job, when it wasn't even directly the topic.
If you say AI is going to lessen the demand for software engineering jobs, the typical thing you hear is "I'm not afraid of losing my job", and I'm like: bro, I'm not talking about your job specifically, I'm not talking about you or your fear of losing a job, I'm just talking about the economics of the job market. This is how you know it's an identity thing more than a technical topic.
wyuyang377 36 minutes ago [-]
cynacism -> cynicism
bluecalm 9 minutes ago [-]
My view is that it's less impressive than previous go and chess results.
Humans are worse at competitive math than at those games, and it's still a very limited space with well-defined problems.
They may hype "general purpose" as much as they want, but for now it's still the case that AI is superhuman at well-defined, limited-space tasks and can't match the performance of a mediocre, below-average human at simple tasks without those limitations, like driving a car.
Nice result, but it's just another game humans got beaten at. This time a game which isn't even taken very seriously (in comparison to ones that have a professional scene).
ALLTaken 1 hour ago [-]
I think OpenAI participating is nothing but a publicity stunt, wholly unfair and disrespectful to the human participants. AI models should be allowed to participate, but they should not be ranked equally, nor should any engineers be put under the duress of having to pull all-nighters. AI model performance should be shown T+2 days AFTER the contest! I wish that the real humans who worked hard could enjoy the attention, prize, and respect they deserve!
Billion-dollar companies stealing not only the prize, prestige, time, and sleep of participants by brute-forcing their model through all the illegally scraped code on GitHub is a disgrace to humanity.
AI models should read the same materials to become proficient in coding, without having trillions of lines of code to ape through mindlessly. Otherwise the "AI" is no different than an elaborate Monte Carlo Tree Search (MCTS).
Yes, I know AI is quite advanced. I know that quite well: I study the latest SOTA papers daily and have developed my own models as well from the ground up, but despite all the advancements it's still far away from substantially being better than MCTS (see: https://icml.cc/virtual/2025/poster/44177 and https://allenai.org/blog/autods )
(Looks like a pattern OpenAI Corp is scraping competitions to place themselves into the spotlight and headlines.)
jsnell 24 minutes ago [-]
As far as I can tell, OpenAI didn't participate, and isn't claiming they participated. Note the fairly precise phrasing of "gold medal-level performance": they claim to have shown performance sufficient for a gold, not that they won one.
Aurornis 59 minutes ago [-]
> I think OpenAI participating is nothing but a publicity stunt and wholly unfair and disrespectful against Human participants. It should be allowed for AI models to participate, but it should not be ranked equally,
OpenAI did not participate in the actual competition, nor were they taking spots away from humans. OpenAI just gave the problems to their AI under the same time limit and conditions (no external tool use).
> nor put any engineers under duress of having to pull all-nighters.
Under duress? At a company like this, all of the people working on this project are there because they want to be and they’re compensated millions.
aubanel 1 hour ago [-]
- AI competing is "wholly unfair"
- "[AI is] far away from substantially being better than MCTS"
^ pick only one
yobbo 58 minutes ago [-]
Running MCTS over algorithms is the part that might be considered unfair if used in competition with humans.
threatripper 52 minutes ago [-]
Humans should be allowed to compete in groups of arbitrary size. This would also be a demonstration of excellent teamwork under time pressure.
pclmulqdq 26 minutes ago [-]
In a general sense, cheating and losing are not mutually exclusive.
stingraycharles 1 hour ago [-]
Yeah it’s a completely fair playing field, it’s completely obvious that AI should be able to compete with humans in the same way that robotics and computers can compete with humanity (and are better suited for many tasks).
Whether or not they’re far away from being better than humans is up for debate, but the entire point of these types of benchmarks is to compare them to humans.
bluecalm 36 minutes ago [-]
>>Yeah it’s a completely fair playing field, it’s completely obvious that AI should be able to compete with humans in the same way that robotics and computers can compete with humanity (and are better suited for many tasks).
Yeah same way computers and robots should be able to win World Chess Championship, 100m dash and Wimbledon.
>>but the entire point of these types of benchmarks is to compare them to humans
The entire point of the competition is to fight against participants who are similar to you, have similar capabilities and go through similar struggles.
If you want bot vs human competitions - great - organize it yourself instead of hijacking well established competitions out there.
Remember that they've fired all the whistleblowers who would admit to breaking the verbal agreement that they wouldn't train on the test data.
samat 50 minutes ago [-]
Could not find it on the open web. Do you have clues to search for?
amelius 2 hours ago [-]
This is not a benchmark, really. It's an official test.
andrepd 1 hour ago [-]
And what were the methods? How was the evaluation? They could be making it all up for all we know!
Aurornis 29 minutes ago [-]
The International Math Olympiad isn’t an AI benchmark.
It’s an annual human competition.
meroes 18 minutes ago [-]
They didn’t actually compete.
chvid 1 hour ago [-]
I believe this company used to present its results and approach in academic papers with enough details so that it could be reproduced by third parties.
Now it is just doing a bunch of tweets?
do_not_redeem 46 minutes ago [-]
They're doing tweets because the results cannot be reproduced. https://matharena.ai/
samat 52 minutes ago [-]
This company used to be a nonprofit.
And many other things.
darkoob12 9 minutes ago [-]
I don't know how much novelty you should expect from the IMO every year, but I expect many problems to be variations on the same themes.
These models are trained on all the old problems and their various solutions. For LLMs, solving these problems is about as impressive as writing code.
There is no high generalization.
z7 5 hours ago [-]
Some previous predictions:
In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.
He thought there was an 8% chance of this happening.
While I usually enjoy seeing these discussions, I think they are really pushing the usefulness of Bayesian statistics. If one dude says the chance of an outcome is 8% and another says it's 16%, and the outcome does occur, they were both pretty wrong, even though it might seem like the one who guessed a few percent higher had a better belief system. Now if one of them had said 90% while the other said 8% or 16%, then we should pay close attention to what they are saying.
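The calibration point can be made concrete with Brier scores. A toy sketch (the 8%/16%/90% figures are just the ones mentioned in the thread, not a real evaluation of any forecaster):

```python
# Brier score for a single resolved event: (forecast - outcome)^2, lower is better.
# Outcome = 1.0 since the event (IMO gold by 2025) occurred.
forecasts = [0.08, 0.16, 0.90]
outcome = 1.0

for p in forecasts:
    brier = (p - outcome) ** 2
    print(f"forecast {p:.0%}: Brier score {brier:.4f}")
# -> 0.8464, 0.7056, 0.0100
```

On one resolved event the 8% and 16% forecasts score nearly identically (0.8464 vs 0.7056), while a 90% forecast would have scored 0.0100, which is the point being made: only a large probability gap is informative from a single outcome.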
grillitoazul 1 hour ago [-]
From a mathematical point of view there are two factors: (1) the prior predictive capability of the human agents and (2) the acceleration in the predicted event. Examining the result under such a model, we conclude:
The more prior predictive power the human agents have, the more posterior acceleration of progress in LLMs (math capability) the result implies.
Here we are supposing that the increase in training data is not the main explanatory factor.
This example is the germ of a general framework for assessing acceleration in LLM progress, and I think its application to many data points could give us valuable information.
grillitoazul 37 minutes ago [-]
Another take at a sound interpretation:
(1) Bad prior prediction capability of humans implies that the result does not provide any information.
(2) Good prior prediction capability of humans implies that there is acceleration in the math capabilities of LLMs.
zeroonetwothree 50 minutes ago [-]
A 16% or even 8% event happening is quite common, so really it tells us nothing and doesn't mean either one was pretty wrong.
We may certainly hope Eliezer's other predictions don't prove so well-calibrated.
rafaelero 2 hours ago [-]
Gary Marcus is so systematically and overconfidently wrong that I wonder why we keep talking about this clown.
qoez 1 hour ago [-]
People just give attention to those making surprising, bold, counter-narrative predictions, but don't give them any attention when they're wrong.
dcre 2 hours ago [-]
I do think Gary Marcus says a lot of wrong stuff about LLMs but I don’t see anything too egregious in that post. He’s just describing the results they got a few months ago.
m3kw9 2 hours ago [-]
He definitely cannot use the original arguments from when ChatGPT arrived; he's a perennial goalpost shifter.
causal 2 hours ago [-]
These numbers feel kind of meaningless without any work showing how he got to 16%
shuckles 1 hour ago [-]
My understanding is that Eliezer more or less thinks it's over for humans.
Context? Who are these people and what are these numbers and why shouldn't I assume they're pulled from thin air?
gniv 3 hours ago [-]
From that thread: "The model solved P1 through P5; it did not produce a solution for P6."
It's interesting that it didn't solve the problem that was by far the hardest for humans too. China, the #1 team, got only 21/42 points on it. In most other teams nobody solved it.
gus_massa 2 hours ago [-]
In the IMO, the idea is that on the first day you get P1, P2 and P3, and on the second day you get P4, P5 and P6. Usually, ordered by difficulty, they are P1, P4, P2, P5, P3, P6. So, usually P1 is "easy" and P6 is very hard. At least that is the intended order, but sometimes reality disagrees.
I think someone from the Canadian team solved it, but overall very few did.
meroes 1 hours ago [-]
In the RLHF sphere you could tell some AI company/companies were targeting this because of how many IMO RLHF’ers they were hiring specifically. I don’t think it’s really easy to say how much “progress” this is given that.
ksec 2 hours ago [-]
I am neither an optimist nor a pessimist about AI. I would likely be called both by the opposing parties. But the fact that AI / LLMs are still rapidly improving is impressive in itself and worth celebrating. Is it perfect, AGI, ASI? No. Is it useless? Absolutely not.
I am just happy the prize is so big for AI that there is enough money involved to push for all the hardware advancement. Foundry, packaging, interconnect, network, etc.: all the hardware research and tech improvements previously thought too expensive are now in the "shut up and take my money" scenario.
Why waste time say lot word when few word do trick :)
Also worth pointing out that Alex Wei is himself a gold medalist at IOI.
johnecheck 3 hours ago [-]
Wow. That's an impressive result, but how did they do it?
Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.
If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.
fnordpiglet 1 hour ago [-]
Why is that less exciting? A machine competing in an unconstrained, natural-language, difficult math contest and coming out on top by any means was breathtaking science fiction a few years ago; now it's not exciting? Regardless of the tools for verification, or even solvers: why are the goalposts moving so fast? There is no bonus for "purity of essence" and using only neural networks. We live in an era where it's hard to tell if machines are thinking or not, which since the first computing machines was seen as the ultimate achievement. Now we pooh-pooh the results of each iteration, which unfold month over month, not decade over decade.
You don't have to be hyped to be amazed. You can retain the ability to dream while not buying into the snake oil. This is amazing no matter what ensemble of techniques was used. In fact, you should be excited if we've started to break out of the limitations of forcing NNs to be load-bearing in literally everything. That's a sign of a maturing technology, not of limitations.
YeGoblynQueenne 1 hour ago [-]
>> Why is that less exciting? A machine competing in an unconstrained natural language difficult math contest and coming out on top by any means is breath taking science fiction a few years ago - now it’s not exciting?
Half the internet is convinced that LLMs are a big data cheating machine and if they're right then, yes, boldly cheating where nobody has cheated before is not that exciting.
parasubvert 1 hour ago [-]
I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result.
Certainly the emergent behaviour is exciting but we tend to jump to conclusions as to what it implies.
This means we are far more trusting with software that lacks formal guarantees than we should be. We are used to software being sound by default but otherwise a moron that requires very precise inputs and parameters and testing to act correctly. System 2 thinking.
Now with NN it's inverted: it's a brilliant know-it-all but it bullshits a lot, and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking with questionable but evolving System 2 skills where we don't know the limits.
If you're not familiar with System 1 / System 2, it's googlable.
logicchains 32 minutes ago [-]
>I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result
This is rampant human chauvinism. There's absolutely no empirical basis for the statement that these models "cannot reason", it's just pseudoscientific woo thrown around by people who want to feel that humans are somehow special. By pretty much every empirical measure of "reasoning" or intelligence we have, SOTA LLMs are better at it than the average human.
Davidzheng 3 hours ago [-]
I don't think it's much less exciting if they ran it 10000 times in parallel. It implies an ability to discern when a proof is correct and rigorous (which o3 can't do consistently), and it also means that outputting the full proof is within its capabilities, even if rarely.
FeepingCreature 2 hours ago [-]
The whole point of RL is if you can get it to work 0.01% of the time you can get it to work 100% of the time.
lcnPylGDnU4H9OF 3 hours ago [-]
> what tools were used and how the model used them
According to the twitter thread, the model was not given access to tools.
constantcrying 2 hours ago [-]
>if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.
That entirely depends on who did the cherry picking. If the LLM had 10000 attempts and each time a human had to falsify it, this story means absolutely nothing. If the LLM itself did the cherry picking, then this is just akin to a human solving a hard problem. Attempting solutions and falsifying them until the desired result is achieved. Just that the LLM scales with compute, while humans operate only sequentially.
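The model-side selection being described is essentially best-of-n sampling with a self-verifier. A minimal sketch, where `generate` and `verify` are hypothetical stubs standing in for the sampler and verifier models (nothing here reflects OpenAI's actual setup):

```python
import random

def generate(problem: str, seed: int) -> str:
    """Stand-in for sampling one candidate proof from a model."""
    rng = random.Random(seed)
    return f"candidate proof #{rng.randint(0, 99)} for {problem!r}"

def verify(problem: str, candidate: str) -> float:
    """Stand-in for a verifier (another model or a proof checker)
    scoring a candidate's rigor on a 0..1 scale."""
    rng = random.Random(candidate)  # deterministic stub score
    return rng.random()

def best_of_n(problem: str, n: int) -> str:
    """The model cherry-picks its own best attempt: no human in the loop."""
    candidates = [generate(problem, seed) for seed in range(n)]
    return max(candidates, key=lambda c: verify(problem, c))
```

The scheme is only as good as `verify`: if a human has to do the falsification, the n parallel attempts prove little, which is exactly the distinction the comment draws.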
johnecheck 2 hours ago [-]
The key bit here is whether the LLM doing the cherry picking had knowledge of the solution. If it didn't, this is a meaningful result. That's why I'd like more info, but I fear OpenAI is going to try to keep things under wraps.
diggan 2 hours ago [-]
> If it didn't
We kind of have to assume it didn't right? Otherwise bragging about the results makes zero sense and would be outright misleading.
samat 45 minutes ago [-]
> would be outright misleading
why wouldn't they? what are the incentives not to?
blibble 48 minutes ago [-]
openai have been caught doing exactly this before
modeless 1 hour ago [-]
Noam Brown:
> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.
> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.
> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.
I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.
The issue is that trust is very hard to build and very easy to lose. Even in today's age where regular humans have a memory span shorter than that of an LLM, OpenAI keeps abusing the public's trust.
As a result, I take their word on AI/LLMs about as seriously as I'd take my grocery store clerk's opinion on quantum physics.
emp17344 18 minutes ago [-]
I still haven’t forgotten OpenAI’s FrontierMath debacle from December. If they really have some amazing math-solving model, give us more info than a vague twitter hype-post.
stingraycharles 58 minutes ago [-]
My issue with all these citations is that it’s all OpenAI employees that make these claims.
I’ll wait to see third party verification and/or use it myself before judging. There’s a lot of incentives right now to hype things up for OpenAI.
do_not_redeem 49 minutes ago [-]
A third party tried this experiment with publicly available models. OpenAI did half as well as Gemini, and none of the models even got bronze.
I feel you're misunderstanding something. That's not "this exact experiment". Matharena is testing publicly available models against the IMO problem set. OpenAI was announcing the results of a new, unpublished model, on that problems set.
It is totally fair to discount OpenAI's statement until we have way more details about their setup, and maybe even until there is some level of public access to the model. But you're doing something very different: implying that their results are fraudulent and (incorrectly) using the Matharena results as your proof.
do_not_redeem 24 minutes ago [-]
Fair enough, edited.
YeGoblynQueenne 54 minutes ago [-]
How is a claim, "clear evidence" to anything?
modeless 41 minutes ago [-]
Most evidence you have about the world is claims from other people, not direct experiment. There seems to be a thought-terminating cliche here on HN, dismissing any claim from employees of large tech companies.
Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to. Noam Brown is a well known researcher in the field and I see no reason to doubt these claims other than a vague distrust of OpenAI or big tech employees generally which I reject.
emp17344 17 minutes ago [-]
OpenAI have already shown us they aren’t trustworthy. Remember the FrontierMath debacle?
modeless 1 minute ago [-]
It's only a "debacle" if you already assume OpenAI isn't trustworthy, because they said they don't train on the test set. I hope you can see that relying on that as evidence of OpenAI being untrustworthy is a circular argument. You're assuming the thing you're trying to prove.
I'm open to actual evidence that OpenAI is untrustworthy, but again, I also judge people individually, not just by the organization they belong to.
kelipso 43 minutes ago [-]
Haha, if Musk made a claim five years ago, it would’ve been taken as clear evidence here. Now it’s other people I guess, hype never dies.
up2isomorphism 1 hour ago [-]
In fact no car company claims "gold medal" performance in Olympic running, even though they could have done that 100 years ago. Obviously, since the IMO does not generate much money, it is an easy target.
BTW, "gold medal performance" looks like a promotional term to me.
ddtaylor 1 hour ago [-]
Glock should show up to the UFC and win the whole tournament handily.
flappyeagle 1 hour ago [-]
LMAO
another_twist 34 minutes ago [-]
It's a level playing field, IMO. But there's another thread which claims not even bronze, and I really don't want to go to X for anything.
Davidzheng 8 minutes ago [-]
I can save you the click: public models (Gemini/o3) score below bronze. This is a specially trained model which is not publicly available.
dylanbyte 5 hours ago [-]
These are high-school level only in the sense of assumed background knowledge; they are extremely difficult.
Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.
This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.
The answers are not in the training data.
This is not a model specialized to IMO problems.
Davidzheng 4 hours ago [-]
Are you sure this is not specialized to IMO? I do see the twitter thread saying it's "general reasoning", but I'd imagine they RL'd on olympiad math questions? If not, I really hope someone from OpenAI says so, because it would be pretty astounding.
stingraycharles 2 hours ago [-]
They also said this is not part of GPT-5, and “will be released later”. It’s very, very likely a model specifically fine-tuned for this benchmark, where afterwards they’ll evaluate what actual real-world problems it’s good at (eg like “use o4-mini-high for coding”).
Frankly it looks to me like it's using an AlphaProof style system, going between natural language and Lean/etc. Of course OpenAI will not tell us any of this.
fnordpiglet 2 hours ago [-]
I actually think this “cheating” is fine. In fact it’s preferable. I don’t need an AI that can act as a really expensive calculator or solver. We’ve already built really good calculators and solvers that are near optimal. What has been missing is the abductive ability to successfully use those tools in an unconstrained space with agency. I find really no value in avoiding the optimal or near optimal techniques we’ve devised rather than focusing on the harder reasoning tasks of choosing tools, instrumenting them properly, interpreting their results, and iterating. This is the missing piece in automated reasoning after all. A NN that can approximate at great cost those tools is a parlor trick and while interesting not useful or practical. Even if they have some agent system here, it doesn’t make the achievement any less that a machine can zero shot do as well as top humans at incredibly difficult reasoning problems posed in natural language.
Anyway, that doesn't refute my point, it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific" but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.
Davidzheng 3 hours ago [-]
We can only go off their word, unfortunately, and they say no formal math, so I assume it's being evaluated by a verifier model instead of a formal system. There are actually some hints of this, because geometry in Lean is not that well developed, so unless they also built their own system it's hard to do it formally (though their P2 proof is by coordinate bash, i.e. computation by algebra instead of geometric construction, so it's hard to tell).
skdixhxbsb 2 hours ago [-]
> We can only go off their word
We’re talking about Sam Altman’s company here. The same company that started out as a non profit claiming they wanted to better the world.
Suggesting they should be given the benefit of the doubt is dishonest at this point.
demirbey05 5 hours ago [-]
Are you from OpenAI ?
ktallett 5 hours ago [-]
Hahaha! It's either that or they are determined to get a job there.
YeGoblynQueenne 1 hour ago [-]
>> This is not a model specialized to IMO problems.
How do you know?
ktallett 5 hours ago [-]
I think that's an insult to professional mathematicians. Any mathematician who has got to the stage where they do this for a living will be more than capable of doing Olympiad questions. These are proofs and some general numerical maths; some are probably a little trickier than others, but the questions aren't unique, and most final-year BSc students in maths will have encountered similar ones. I wouldn't consider myself particularly great at maths (despite it being the language of physics/engineering, as many of my lecturers told me), but I can do plenty of the past questions without any significant reading. Most of these are similar to later-years uni problems, so the LLM will be able to find answers with the right searching. It may not be specialised to IMO problems, but these sorts of maths questions pop up in plenty of settings, so it doesn't need to be.
parsimo2010 2 hours ago [-]
I am a professor in a math department (I teach statistics but there is a good complement of actual math PhDs) and there are only about 10% who care about these types of problems and definitely less than half who could get gold on an IMO test even if they didn’t care.
They are all outstanding mathematicians, but the IMO type questions are not something that mathematicians can universally solve without preparation.
There are of course some places that pride themselves on only taking “high scoring” mathematicians, and people will introduce themselves with their name and what they scored on the Putnam exam. I don’t like being around those places or people.
crinkly 2 hours ago [-]
100% agree with this.
My second degree is in mathematics. Not only can I probably not do these but they likely aren’t useful to my work so I don’t actually care.
I’m not sure an LLM could replace the mathematical side of my work (modelling). Mostly because it’s applied and people don’t know what they are asking for, what is possible or how to do it and all the problems turn out to be quite simple really.
Davidzheng 4 hours ago [-]
No I assure you >50% of working mathematicians will not score gold level at IMO consistently (I'm in the field). As the original parent said, pretty much only people who had the training in high school can. Number theorists without that training might be able to do some of the number theory IMO questions, but this level is basically impossible without specialised training (with maybe a few exceptions for very strong mathematicians).
credit_guy 4 hours ago [-]
> No I assure you >50% of working mathematicians will not score gold level at IMO consistently (I'm in the field)
I agree with you. However, would a lot of working mathematicians score gold level without the IMO time constraints? Working mathematicians generally are not trying to solve a problem in the time span of one hour. I would argue that most working mathematicians, if given an arbitrary IMO problem and allowed to work on it for a week, would solve it. As for "gold level", with IMO problems you either solve one or you don't.
You could counter that it is meaningless to remove the time constraints. But we are comparing humans with OpenAI here. It is very likely OpenAI solved the IMO problems in a matter of minutes, maybe even seconds. When we talk about a chatbot achieving human-level performance, it's understood that time is not a constraint on the human side. We are only concerned with the quality of the human output. For example: can OpenAI write a novel at the level of Jane Austen? Maybe it can, maybe it can't (for now), but Jane Austen spent years writing such a novel, while our expectation is for OpenAI to do it at multiple words per second.
Davidzheng 4 hours ago [-]
I mean, back when I was practicing these problems I would sometimes try them on and off for a week and would be able to do some 3s and 6s (I can usually do 1 and 4 somewhat consistently, and usually none of the others). As a working mathematician today, I would almost certainly not be able to get gold medal performance in a week, but for a given problem I'd guess I'd have at least a ~50% chance of solving it in a week? I haven't tried in a while, though. But I suspect the professionals here do worse at these competition questions than you think. Certainly these problems are "easy" compared to many of the questions we think about, but expertise drastically shifts the speed/difficulty of questions we can solve within our own domains, if that makes sense.
Addendum: actually, I'm not sure the probability of solving one in a week is much better than in 6 hours for these questions, because they are kind of random questions. But I agree with parts of your post, to be fair.
jsnell 2 hours ago [-]
> It is very likely OpenAI solved the IMO problems in a matter of minutes, maybe even seconds
Really? My expectation would have been the opposite, that time was a constraint for the AIs. OpenAI's highest end public reasoning models are slow, and there's only so much that you can do by parallelization.
Understanding how they dealt with time actually seems like the most important thing to put these results into context, and they said nothing about it. Like, I'd hope they gave the same total time allocation for a whole problem set as the human competitors. But how did they split that time? Did they work on multiple problems in parallel?
ktallett 4 hours ago [-]
I sense we may just have different experiences of colleagues' skill sets, as I can think of 5 people I could send some questions to, and I know they would do them just fine. In fact, we have often done similar problems on a free afternoon, and I often do similar ones on flights as a way to pass the time and improve my focus (my issue isn't my talent/understanding at maths, it's my ability to concentrate). I don't disagree that some level of training is needed, but these questions aren't unique, nor impossible, especially as said training does exist and LLMs can access said examples. LLMs also have brute force, which is a significant help with this type of problem. One particular point: of all the STEM topics, maths is probably the best documented online, alongside CS.
Davidzheng 4 hours ago [-]
I mean, you can get better at these problems with practice. But if you haven't solved many before and can do them after an afternoon of thought, I would be very impressed. Not that I don't believe you, it's just that in my experience people like this are very rare. (Also, I assume they have to have some degree of familiarity with common tricks, otherwise they would have to derive basic number theory from scratch, etc., and that seems a bit much for me to believe.)
ktallett 4 hours ago [-]
I think honestly it's probably different experiences and skillsets. I find these sorts of things doable, bar dumb mistakes, yet there will be other things I'll get stressed about and not be able to do for ages (some lab skills, no matter the number of times I do them, and some physical equation derivations that I regularly muck up). I sometimes assume that what comes easy for me comes easy for all, and that what I struggle with everyone struggles with, and that's probably not always the case. Likewise, I did similar tasks as a teen in school and assume that is the case for many of the academically bright, so to speak, but perhaps it isn't; that probably helped me learn some tricks I may not have otherwise. But as you say, I do feel you can learn the tricks and learn how to do these, even at an older age (academically speaking), if you have the time, the patience, and the right guide.
samat 26 minutes ago [-]
Here you go: you did these types of problems as a kid/teenager, so 1) you likely have a talent for it, and 2) you have some training.
I participated in math/informatics olympiads as a teenager and even taught them a little, and in my experience some people just _like_ this sort of problem naturally; it tickles their minds, and given time these people develop to insane levels at it.
'Normal people', in my experience, even in math departments, don't like that type of problem and would not fare well with them.
jebarker 2 hours ago [-]
IMO questions are to math as leetcode questions are to software engineering. Not necessarily easier or harder but they test ability on different axes. There’s definitely some overlap with undergrad level proof style questions but I disagree that being a working mathematician would necessarily mean you can solve these type of questions quickly. I did a PhD in pure math (and undergrad obv) and I know I’d have to spend time revising and then practicing to even begin answering most IMO questions.
gametorch 4 hours ago [-]
Getting gold at the IMO is pretty damn hard.
I grew up in a relatively underserved rural city. I skipped multiple grades in math, completed the first two years of college math classes while in high school, and won the award for being the best at math out of everyone in my school.
I've met and worked with a few IMO gold medalists. Even though I was used to scoring in the 99th percentile on all my tests, it felt like these people were simply in another league above me.
I'm not trying to toot my own horn. I'm definitely not that smart. But it's just ridiculous to shoot down the capabilities of these models at this point.
npinsker 4 hours ago [-]
The trouble is, getting an IMO gold medal is much easier (by frequency) than being the #1 Go player in the world, which was achieved by AI 10 years ago. I'm not sure it's enough to just gesture at the task; drilling down into precisely how it was achieved feels important.
(Not to take away from the result, which I'm really impressed by!)
Invictus0 2 hours ago [-]
The "AI" that won Go was Monte Carlo tree search on a neural net "memory" of the outcome of millions of previous games; this is a LLM solving open ended problems. The tasks are hardly even comparable.
yobbo 45 minutes ago [-]
A "reasoning LLM" might not be conceptually far from MCTS.
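There is something to that: a reasoning model sampling and ranking partial chains of thought looks a lot like best-first search over a tree. A toy sketch of the resemblance (`expand` and `value` are hypothetical placeholders for a policy proposing continuations and a value estimate; nothing here reflects what OpenAI actually built):

```python
import heapq

def tree_search(root, expand, value, budget=100):
    """Best-first search over partial solutions: a crude stand-in for MCTS.
    expand(node) yields child nodes; value(node) is a heuristic score."""
    frontier = [(-value(root), 0, root)]
    best, tick = root, 0
    while frontier and budget > 0:
        _, _, node = heapq.heappop(frontier)
        if value(node) > value(best):
            best = node
        for child in expand(node):
            tick += 1  # tie-breaker so the heap never compares nodes directly
            heapq.heappush(frontier, (-value(child), tick, child))
        budget -= 1
    return best

# Toy domain: grow binary strings, score = number of 1s.
expand = lambda s: [s + "0", s + "1"] if len(s) < 5 else []
value = lambda s: s.count("1")
print(tree_search("", expand, value))  # -> 11111
```

Swap the toy `expand` for "sample k continuations from the model" and `value` for a learned verifier and you have roughly the conceptual picture the comment is gesturing at.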
gafferongames 2 hours ago [-]
And then they created AlphaGo Zero, which is not trained on any previous games, and it was even stronger!
The AI scaling that went on for the last five years is going to be very different from the scaling that will happen in the next ten years. These models have latent capabilities that we are racing to unearth. IMO is but one example.
There's so much to do at inference time. This result could not have been achieved without the substrate of general models. It's not like Go or protein folding: you need the collective public global knowledge of society to build on. And yes, there's enough left for ten years of exploration.
More importantly, the stakes are high. There may be zero day attacks, biological weapons, and more that could be discovered. The race is on.
meroes 14 minutes ago [-]
Latent??
If you looked at RLHF hiring over the last year, there was a huge hiring of IMO competitors to RLHF. This was a new, highly targeted, highly funded RLHF’ing.
mikert89 15 minutes ago [-]
Yup, we have bootstrapped to enough intelligence in the models that we can introduce higher levels of ai
gcanyon 37 minutes ago [-]
99.99+% of all problems humans face do not require particularly original solutions. Determining whether LLMs can solve truly original (or at least obscure) problems is interesting and worth doing, but it ignores the vast majority of the (near-term, at least) impact they will have.
amelius 1 hours ago [-]
Makes sense. Mathematicians use intuition a lot to drive their solution seeking, and I suppose an AI such as an LLM could develop intuition too. Of course, where AI really wins is search speed and the fact that an LLM doesn't get tired when exploring different strategies and the steps within each strategy.
However, I expect that geometric intuition may still be lacking, mostly because of the difficulty of encoding it in a form an LLM can easily work with. After all, ChatGPT still can't draw a unicorn [1], although it seems to be getting closer.
Waiting for Terry Tao's thoughts, but these kinds of things are a good use of AI. We need to make science progress faster rather than disrupting our economy before we're ready.
ktallett 5 hours ago [-]
Astounding in what sense? I assume you are aware of the standard of Olympiad problems and that they are not particularly high. They are just challenging for the age range, but they shouldn't be for AI considering they aren't really anything but proofs and basic structured math problems.
Considering OpenAI can't currently analyse and provide real paper sources to cutting edge scientific issues, I wouldn't trust it to do actual research outside of generating matplotlib code.
saagarjha 4 hours ago [-]
I did competitive math in high school and I can confidently say that they are anything but "basic". I definitely can't solve them now (as an adult) and it's likely I never will. The same is true for most people, including people who actually pursued math in college (I didn't). I'm not going to be the next guy who unknowingly challenges a Putnam winner to do these but I will just say that it is unlikely that someone who actually understands the difficulty of these problems would say that they are not hard.
For those following along but without math specific experience: consider whether your average CS professor could solve a top competitive programming question. Not Leetcode hard, Codeforces hard.
samat 18 minutes ago [-]
Thanks for speaking sense. I think 99% of people saying IMO problems are not hard would not be able to solve basic district-level competition problems and are just not equipped to judge the problems.
And 1% here are those IMO/IOI winners who think everyone is just like them. I grew up with them and to you, my friends, I say: this is the reason why AI would not take over the world (and might even not be that useful for real world tasks), even if it wins every damn contest out there.
zug_zug 2 hours ago [-]
I feel like I've noticed you making the same comment in 12 places in this thread, misrepresenting the difficulty of this tournament, and ultimately it comes across as a bitter ex.
Here's an example problem 5:
Let a1,a2,…,an be distinct positive integers and let
M=max1≤i<j≤n.
Find the maximum number of pairs (i,j) with 1≤i<j≤n for which (ai +aj )(aj −ai )=M.
causal 1 hours ago [-]
Where did you get this? Don't see it on the 2025 problem set and now I wanna see if I have the right answer
causal 2 hours ago [-]
What does max1≤i<j≤n mean? Wouldn't M always be j?
kelipso 21 minutes ago [-]
Guessing it should be M = max_{1≤i<j≤n} ai+aj or some other function M = max_{1≤i<j≤n} f(ai,aj).
Aurornis 1 hours ago [-]
> I assume you are aware of the standard of Olympiad problems and that they are not particularly high.
Every time an LLM reaches a new benchmark there’s a scramble to downplay it and move the goalposts for what should be considered impressive.
The International Math Olympiad was used by many people as an example of something that would be too difficult for LLMs. It has been a topic of discussion for some time. The fact that an LLM has achieved this level of performance is very impressive.
You’re downplaying the difficulty of these problems. It’s called international because the best in the entire world are challenged by it.
Davidzheng 4 hours ago [-]
sorry but I don't think it's accurate to say "they are just challenging for the age range"
ktallett 4 hours ago [-]
I'm aware you believe they are impossible tasks unless you have specific training, I happen to disagree with that.
Davidzheng 4 hours ago [-]
You mean specific IMO training, or general math training? The latter is certainly needed; that the former is needed is, in my opinion, a general observation about the people who make it onto the teams.
ktallett 4 hours ago [-]
I mean IMO training; and yes, I agree you wouldn't be able to do this without complete math knowledge.
demirbey05 5 hours ago [-]
I mean the speed of progress: a few months ago they released o3, and it scored 16 points on IMO 2025.
ktallett 5 hours ago [-]
In that regard I would agree, but to me that suggests the prior hype was unfounded.
ktallett 5 hours ago [-]
Tbh, given the way everyone has been going on about the quality of OpenAI, high school/early university maths problems should not have been a stretch at all for it. The fact that this unverified claim is only just being made suggests their AI isn't quite as amazing as marketed, especially considering that following logic and rules should fundamentally be rather easy for it, and most Olympiad problems are rather easy to extract the key details from.
Aurornis 1 hours ago [-]
> high school/early university maths problems should not have been a stretch at all for it.
Either you are unfamiliar with the International Math Olympiad or you’re trying to be misleading.
Calling these problems high school/early university maths is a ridiculous characterization.
gametorch 4 hours ago [-]
> high school/early university maths problems should not have been a stretch at all for it
This is a ridiculous understatement of the difficulty of getting gold at the IMO.
ktallett 4 hours ago [-]
That is the level of maths you need to do these problems, plus a brief understanding of what certain concepts are. There is no calculus, etc. The vast majority of IMO questions are about applying the base rules to new problems.
Jcampuzano2 3 hours ago [-]
There are entire fields of math with exceptional people trying to solve impossibly hard problems that utilize quite literally 0 calculus.
Many of them are also questions that eventually end up with proofs or solutions that only require very high level understanding of basic principles. But when I say very high I mean like impossibly high for the average person and ability to combine simple concepts to solve complex problems.
I'd wager the majority of Math graduates from universities would struggle to answer most IMO questions.
Olympiad questions don't require advanced concepts except maybe some classical geometry techniques that you wouldn't normally encounter in modern research mathematics. But they're fundamentally designed as puzzles. You need to spot the tricks.
oytis 3 hours ago [-]
It's like saying getting a gold medal in boxing is not hard, because it doesn't involve any firearms
pragmatic 2 hours ago [-]
More fair comparison:
Military grade killbot enters ring with boxer and proceeds to fire pneumatic hammer at boxer until KO?
Davidzheng 4 hours ago [-]
You'd be surprised at how much math the people who actually get IMO gold know...
gametorch 4 hours ago [-]
Okay, let's see you try any one of the past IMOs and show us your score.
It's really hard.
See my other comment. I was voted the best at math in my entire high school by my teachers, completed the first two years of college classes while still in high school. I've tried IMO problems for fun. I'm very happy if I get one right. I'd be infinitely satisfied to score a perfect on 3 out of 6 problems and that's nowhere near gold.
quirino 1 hours ago [-]
I think equally impressive is the performance of the OpenAI team at the "AtCoder World Tour Finals 2025" a couple of days ago. There were 12 human participants and only one did better than OpenAI.
And yet when working on production code current LLMs are about as good as a poor intern. Not sure why the disconnect.
kenjackson 32 minutes ago [-]
Depends. I’ve been using it for some of my workflows and I’d say it is more like a solid junior developer with weird quirks where it makes stupid mistakes and other times behaves as a 30 year SME vet.
Jackson__ 52 minutes ago [-]
Also interesting takeaways from that tweet chain:
>GPT5 soon
>it will not be as good as this secret(?) model
another_twist 33 minutes ago [-]
I am quite surprised that DeepMind with MCTS wasn't able to figure out math performance itself.
tlb 2 hours ago [-]
I encourage anyone who thinks these are easy high-school problems to try to solve some. They're published (including this year's) at https://www.imo-official.org/problems.aspx. They make my head spin.
xpressvideoz 1 hours ago [-]
I didn't know there were localized versions of the IMO problems. But now that I think of it, having versions of multiple languages is a must to remove the language barrier from the competitors. I guess having that many language versions (I see ~50 languages?) may make keeping the security of the problems considerably harder?
orespo 5 hours ago [-]
Definitely interesting.
Two thoughts. First, are the IMO questions somewhat related to other openly available questions online, making it easier for LLMs that are more efficient and better at reasoning to deduce the results from the available content?
Second, happy to test it on open math conjectures or by attempting to reprove recent math results.
evrimoztamur 5 hours ago [-]
From what I've seen, IMO question sets are very diverse. Moreover, humans also train on all available set of math olympiad questions and similar sets too. It seems fair game to have the AI train on them as well.
For 2, there's an army of independent mathematicians right now using automated theorem provers to formalise more or less all mathematics as we know it. It seems like open conjectures are chiefly bounded by a genuine lack of new tools/mathematics.
ktallett 5 hours ago [-]
You mean the previous years' questions will have been used to train it? Yes, they are the same questions, and due to the limited format of math questions there are repeats, so LLMs should fundamentally be able to recognise structure and similarities and use that.
samat 3 minutes ago [-]
You're either completely misinformed on the topic or a troll.
laurent_du 1 hours ago [-]
They are not the same questions. Why are you spreading so many misinformed takes in this thread? I know a guy who had one of the best scores in history at the IMO and he's incredibly intelligent. Stop repeating that getting a gold medal at the IMO is a piece of cake; it's not.
andrepd 2 hours ago [-]
Am I missing something or is this completely meaningless? It's 100% opaque, no details whatsoever and no transparency or reproducibility.
I wouldn't trust these results as it is. Considering that there are trillions of dollars on the line as a reward for hyping up LLMs, I trust it even less.
flappyeagle 1 hours ago [-]
Yes you are missing the entire boat
YeGoblynQueenne 58 minutes ago [-]
Guys, that's nothing. My new AI system is not LLM-based but neuro-symbolic and yet it just scored 100% on the IMO 2026 problems that haven't even been written yet, it is that good.
What? This is a claim with all the trust-worthiness of OpenAI's claim. I mean I can claim anything I want at this point and it would still be just as trust-worthy as OpenAI's claim, with exactly zero details about anything else than "we did it, promise".
davidguetta 3 hours ago [-]
Wait for the Chinese version
procgen 2 hours ago [-]
riding coattails
tester756 5 hours ago [-]
huh?
any details?
ktallett 5 hours ago [-]
It is able to solve some high school/early bsc maths problems.
Jcampuzano2 3 hours ago [-]
Calling these high school/early bsc maths questions is an understatement lol.
littlestymaar 5 hours ago [-]
Which would be impressive if we knew those problems weren't in the training data already.
I mean it is quite impressive how language models are able to mobilize the knowledge they have been trained on, especially since they are able to retrieve information from sources that may be formatted very differently, with completely different problem statement sentences, different variable names and so on, and really operate at the conceptual level.
But we must be wary of mixing up smart information retrieval with reasoning.
ktallett 5 hours ago [-]
Considering these LLMs utilise the entirety of the internet, there will be no unique problems that come up in the Olympiad. Even across the course of a degree, you will likely have been exposed to 95% of the various ways to write problems. As you say, retrieval is really the only skill here. There is likely no reasoning.
reactordev 3 hours ago [-]
The Final boss was:
Which is greater, 9.11 or 9.9?
/s
I kid, this is actually pretty amazing!! I've noticed over the last several months that I've had to correct it less and less when dealing with advanced math topics so this aligns.
Lionga 5 hours ago [-]
counting "R"s in strawberry now counts for a gold medal in math?
timbaboon 1 hours ago [-]
Haha no - then it wouldn't have got a gold medal ;)
ktallett 5 hours ago [-]
The Olympiad is a great thing for children for sure. This is not what I feel we should be wasting resources on though for AI. I question if it's even impressive.
baq 5 hours ago [-]
Velocity of AI progress in recent years is exceeded only by velocity of goalposts.
ktallett 5 hours ago [-]
The goalposts should focus on being able to make a coherent statement using papers on a subject with sources. At this point it can't do that for any remotely cutting edge topic. This is just a distraction.
mindwok 4 hours ago [-]
The idea of a computer being able to solve IMO problems it has not seen before in natural language even just 3 years ago would be completely science fiction. This is astounding progress.
zkmon 2 hours ago [-]
This is awesome progress in human achievement, getting these machines intelligent. And it is also a fast regress and decline in human wisdom!
We are simply greasing the grooves, letting things slide faster and faster, and calling it progress. How does this help make the integration of humans and nature better?
Does this improve climate or make humans adapt better to changing climate? Are the intelligent machines a burning need for the humanity today? Or is it all about business and political dominance? At what cost? What's the fall out of all this?
jebarker 2 hours ago [-]
Nobody knows the answers to these questions. Relying on AGI solving problems like climate change seems like a risky strategy but on the other hand it’s very plausible that these tools can help in some capacity. So we have to build, study and find out but also consider any opportunity cost of building these tools versus others.
jfengel 37 minutes ago [-]
Solving climate change isn't a technical problem, but a human one. We know the steps we have to take, and have for many years. The hard part is getting people to actually do them.
No human has any idea how to accomplish that. If a machine could, we would all have much to learn from it.
jebarker 16 minutes ago [-]
I disagree with this assessment. We don’t know the steps we have to take. We know a set of steps we could take but they’re societally unpalatable. Technology can potentially offer alternative steps or introduce societal changes that make the first set of steps more palatable.
The "it will just get better" line is bait for bubble investors. The tech companies learned from the past, and they are riding and managing the bubble to extract maximum ROI before it pops.
The reality is that a lot of work done by humans can be replaced by an LLM with lower quality and nuance. The loss in sales/satisfaction/etc. is more than offset by the reduced cost.
The current crop of LLMs are enshittification accelerators, and that will have real effects.
The Pro AI crowd, VC, tech CEOs etc have strong incentive to claim humans are obsolete. Tech employees see threats to their jobs and want to poopoo any way AI could be useful or competitive.
You say "explaining away the increasing performance" as though that were a good-faith representation of the arguments made against LLMs, or even against this specific article. Questioning the self-congratulatory nature of these businesses is perfectly reasonable.
So, as much as I get the frustration, comments like these don't really add much. It's complaining about others complaining. Instead, this should be taken as a signal that maybe HN is not the right forum to read about these topics.
AI is of course a direct attack on the average HNer's identity. The response you see is like attacking a Christian on his religion.
The pattern of defense is typical. When someone’s identity gets attacked they need to defend their identity. But their defense also needs to seem rational to themselves. So they begin scaffolding a construct of arguments that in the end support their identity. They take the worst aspects of AI and form a thesis around it. And that becomes the basis of sort of building a moat around their old identity as an elite programmer genius.
A telltale sign that you or someone else is doing this is when you're talking about AI and someone comments about how they aren't afraid of AI taking their own job, when that wasn't even directly the topic.
If you say AI is going to lessen the demand for software engineering jobs, the typical thing you hear is "I'm not afraid of losing my job", and I'm like: bro, I'm not talking about your job specifically, I'm not talking about you or your fear of losing a job, I'm just talking about the economics of the job market. This is how you know it's an identity thing more than a technical topic.
Nice result but it's just another game humans got beaten at. This time a game which isn't even taken very seriously (in comparison to ones that have professional scene).
Billion-dollar companies stealing not only the prize, prestige, time, and sleep of participants by brute-forcing their model through all the illegally scraped code on GitHub is a disgrace to humanity.
AI models should read the same materials to become proficient in coding, without having trillions of lines of code to ape through mindlessly. Otherwise the "AI" is no different from an elaborate Monte Carlo tree search (MCTS).
Yes, I know AI is quite advanced. I know that quite well: I study the latest SOTA papers daily and have developed my own models from the ground up, but despite all the advancements it is still far from being substantially better than MCTS (see: https://icml.cc/virtual/2025/poster/44177 and https://allenai.org/blog/autods )
EDIT, adding proof:
These are the results of the last competition they tried to win and LOST: https://arstechnica.com/ai/2025/07/exhausted-man-defeats-ai-...
(Looks like a pattern: OpenAI is scraping competitions to place itself in the spotlight and headlines.)
OpenAI did not participate in the actual competition nor were they taking spots away from humans. OpenAI just gave the problems to their AI under the same time limit and conditions (no external tool use)
> nor put any engineers under duress of having to pull all-nighters.
Under duress? At a company like this, all of the people working on this project are there because they want to be and they’re compensated millions.
- "[AI is] far away from being substantially being better than MCTs"
^ pick only one
Whether or not they're far away from being better than humans is up for debate, but the entire point of these types of benchmarks is to compare them to humans.
Yeah same way computers and robots should be able to win World Chess Championship, 100m dash and Wimbledon.
>>but the entire point of these types of benchmarks it to compare them to humans
The entire point of the competition is to fight against participants who are similar to you, have similar capabilities and go through similar struggles. If you want bot vs human competitions - great - organize it yourself instead of hijacking well established competitions out there.
It’s an annual human competition.
Now it is just doing a bunch of tweets?
And many other things
These models are trained on all the old problems and their various solutions. For LLMs, solving these problems is as impressive as writing code.
There is no high generalization.
In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.
He thought there was an 8% chance of this happening.
Eliezer Yudkowsky said "at least 16%".
Source:
https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...
The better the prior predictive power of human forecasters, the more a result like this implies a posterior acceleration of progress in LLM math capability. Here we are supposing that the increase in training data is not the main explanatory factor.
This example is the germ of a general framework for assessing acceleration in LLM progress, and I think applying it to many data points could give us valuable information:
(1) Poor prior predictive ability of humans implies the result provides no information.
(2) Good prior predictive ability of humans implies there is acceleration in the math capabilities of LLMs.
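A toy check that the figures quoted above are even mutually consistent (the 30% prior, 50% posterior, and 8% overall probability are from the linked thread; the two conditional probabilities are back-solved via Bayes' rule and are not anyone's stated numbers):

```python
# Inputs are the probabilities stated in the thread above; the two
# conditionals below are back-solved, not stated by Christiano himself.
prior = 0.30      # P(hard takeoff) before the IMO result
posterior = 0.50  # P(hard takeoff) given gold, per Christiano
p_gold = 0.08     # his overall P(IMO gold by 2025)

# Bayes: posterior = prior * P(gold|H) / P(gold), so back out P(gold|H),
# then P(gold|~H) from the law of total probability.
p1 = posterior * p_gold / prior            # P(gold | hard takeoff)
p2 = (p_gold - prior * p1) / (1 - prior)   # P(gold | no hard takeoff)

print(f"P(gold|H) = {p1:.3f}, P(gold|~H) = {p2:.3f}")
# Sanity check: Bayes' rule reproduces the stated 50% posterior.
assert abs(prior * p1 / p_gold - posterior) < 1e-9
```

The numbers do cohere: the stated update amounts to believing gold was about 2.3x more likely in hard-takeoff worlds than otherwise.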
We may certainly hope Eliezer's other predictions don't prove so well-calibrated.
It's interesting that it didn't solve the problem that was by far the hardest for humans too. China, the #1 team got only 21/42 points on it. In most other teams nobody solved it.
Edit: Fixed P4 -> P3. Thanks.
I am just happy the prize for AI is so big that there is enough money involved to push for all the hardware advancement. Foundry, packaging, interconnect, networking: all the hardware research and tech improvements previously thought too expensive are now in "shut up and take my money" territory.
Why waste time say lot word when few word do trick :)
Also worth pointing out that Alex Wei is himself a gold medalist at IOI.
Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.
If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.
You don’t have to be hyped to be amazed. You can retain the ability to dream while not buying into the snake oil. This is amazing no matter what ensemble of techniques was used. In fact, you should be excited if we’ve started to break out of the limitations of forcing NNs to be load-bearing in literally everything. That’s a sign of a maturing technology, not of limitations.
Half the internet is convinced that LLMs are a big data cheating machine and if they're right then, yes, boldly cheating where nobody has cheated before is not that exciting.
Certainly the emergent behaviour is exciting but we tend to jump to conclusions as to what it implies.
This means we are far more trusting with software that lacks formal guarantees than we should be. We are used to software being sound by default but otherwise a moron that requires very precise inputs and parameters and testing to act correctly. System 2 thinking.
Now with NN it's inverted: it's a brilliant know-it-all but it bullshits a lot, and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking with questionable but evolving System 2 skills where we don't know the limits.
If you're not familiar with System 1 / System 2, it's googlable.
This is rampant human chauvinism. There's absolutely no empirical basis for the statement that these models "cannot reason", it's just pseudoscientific woo thrown around by people who want to feel that humans are somehow special. By pretty much every empirical measure of "reasoning" or intelligence we have, SOTA LLMs are better at it than the average human.
According to the twitter thread, the model was not given access to tools.
That entirely depends on who did the cherry picking. If the LLM had 10000 attempts and each time a human had to falsify it, this story means absolutely nothing. If the LLM itself did the cherry picking, then this is just akin to a human solving a hard problem. Attempting solutions and falsifying them until the desired result is achieved. Just that the LLM scales with compute, while humans operate only sequentially.
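To make that distinction concrete, here is a minimal best-of-N sketch. `generate` and `verify` are hypothetical stand-ins (the real setup, if any, is unknown); the point is that the loop only counts as "the LLM did the cherry picking" when `verify` is automated rather than a human.

```python
import random

def generate(problem, seed):
    # Hypothetical stand-in for sampling one candidate solution from a model.
    rng = random.Random(seed)
    return f"{problem}-candidate-{rng.randint(0, 9)}"

def verify(candidate):
    # Hypothetical stand-in for an automated checker (e.g. a proof verifier).
    return candidate.endswith("7")

def best_of_n(problem, n):
    # Sample up to n candidates (trivially parallelizable across seeds)
    # and return the first one that passes automated verification.
    for seed in range(n):
        candidate = generate(problem, seed)
        if verify(candidate):
            return candidate
    return None

print(best_of_n("imo-p5", 10000))
```

If `verify` is a human falsifying each attempt, the same loop is just 10000 human-checked guesses, which is a very different claim.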
We kind of have to assume it didn't right? Otherwise bragging about the results makes zero sense and would be outright misleading.
Why wouldn't they? What are the incentives not to?
> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.
> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.
> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.
I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.
https://x.com/polynoamial/status/1946478249187377206
I’ll wait to see third party verification and/or use it myself before judging. There’s a lot of incentives right now to hype things up for OpenAI.
https://matharena.ai/imo/
It is totally fair to discount OpenAI's statement until we have way more details about their setup, and maybe even until there is some level of public access to the model. But you're doing something very different: implying that their results are fraudulent and (incorrectly) using the Matharena results as your proof.
Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to. Noam Brown is a well known researcher in the field and I see no reason to doubt these claims other than a vague distrust of OpenAI or big tech employees generally which I reject.
I'm open to actual evidence that OpenAI is untrustworthy, but again, I also judge people individually, not just by the organization they belong to.
BTW, “gold medal performance” looks like a promotional term to me.
Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.
This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.
The answers are not in the training data.
This is not a model specialized to IMO problems.
E.g here: https://pbs.twimg.com/media/GwLtrPeWIAUMDYI.png?name=orig
Frankly it looks to me like it's using an AlphaProof style system, going between natural language and Lean/etc. Of course OpenAI will not tell us any of this.
https://x.com/polynoamial/status/1946478249187377206?s=46&t=...
Anyway, that doesn't refute my point, it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific" but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.
We’re talking about Sam Altman’s company here. The same company that started out as a non profit claiming they wanted to better the world.
Suggesting they should be given the benefit of the doubt is dishonest at this point.
How do you know?
They are all outstanding mathematicians, but the IMO type questions are not something that mathematicians can universally solve without preparation.
There are of course some places that pride themselves on only taking “high scoring” mathematicians, and people will introduce themselves with their name and what they scored on the Putnam exam. I don’t like being around those places or people.
My second degree is in mathematics. Not only can I probably not do these but they likely aren’t useful to my work so I don’t actually care.
I’m not sure an LLM could replace the mathematical side of my work (modelling). Mostly because it’s applied and people don’t know what they are asking for, what is possible or how to do it and all the problems turn out to be quite simple really.
I agree with you. However, would a lot of working mathematicians score gold level without the IMO time constraints? Working mathematicians generally are not trying to solve a problem in the time span of one hour. I would argue that most working mathematicians, if given an arbitrary IMO problem and allowed to work on it for a week, would solve it. As for "gold level", with IMO problems you either solve one or you don't.
You could counter that it is meaningless to remove the time constraints. But we are comparing humans with OpenAI here. It is very likely OpenAI solved the IMO problems in a matter of minutes, maybe even seconds. When we talk about a chatbot achieving human-level performance, it's understood that the time is not a constraint on the human side. We are only concerned with the quality of the human output. For example: can OpenAI write a novel at the level of Jane Austen? Maybe it can, maybe it can't (for now) but Jane Austen was spending years to write such a novel, while our expectation is for OpenAI to do it at the speed of multiple words per second.
Addendum: Actually, I am not sure the probability of solving one of these in a week is much better than in 6 hours, because they are kind of random questions. But I agree with some parts of your post tbf.
Really? My expectation would have been the opposite, that time was a constraint for the AIs. OpenAI's highest end public reasoning models are slow, and there's only so much that you can do by parallelization.
Understanding how they dealt with time actually seems like the most important thing to put these results into context, and they said nothing about it. Like, I'd hope they gave the same total time allocation for a whole problem set as the human competitors. But how did they split that time? Did they work on multiple problems in parallel?
I did participate in math/informatics olympiads as a teenager and even taught them a little, and in my experience some people just _like_ that sort of problem naturally. The problems tickle their minds, and given time these people develop to insane levels at them.
'Normal people', in my experience, even in math departments, don't like that type of problem and would not fare well with them.
I grew up in a relatively underserved rural city. I skipped multiple grades in math, completed the first two years of college math classes while in high school, and won the award for being the best at math out of everyone in my school.
I've met and worked with a few IMO gold medalists. Even though I was used to scoring in the 99th percentile on all my tests, it felt like these people were simply in another league above me.
I'm not trying to toot my own horn. I'm definitely not that smart. But it's just ridiculous to shoot down the capabilities of these models at this point.
(Not to take away from the result, which I'm really impressed by!)
https://deepmind.google/discover/blog/alphago-zero-starting-...
There’s so much to do at inference time. This result could not have been achieved without the substrate of general models. It’s not like Go or protein folding: you need the collective public global knowledge of society to build on. And yes, there’s enough left for ten years of exploration.
More importantly, the stakes are high. There may be zero day attacks, biological weapons, and more that could be discovered. The race is on.
If you looked at RLHF hiring over the last year, there was a huge push to hire IMO competitors for RLHF. This was new, highly targeted, highly funded RLHF’ing.
However, I expect that geometric intuition may still be lacking, mostly because of the difficulty of encoding it in a form an LLM can easily work with. After all, ChatGPT still can't draw a unicorn [1], although it seems to be getting closer.
[1] https://gpt-unicorn.adamkdean.co.uk/
https://matharena.ai/imo/
Waiting for Terry Tao's thoughts, but this kind of thing is a good use of AI. We need to make science progress faster rather than disrupting our economy before we are ready.
Considering OpenAI can't currently analyse and provide real paper sources to cutting edge scientific issues, I wouldn't trust it to do actual research outside of generating matplotlib code.
For those following along but without math specific experience: consider whether your average CS professor could solve a top competitive programming question. Not Leetcode hard, Codeforces hard.
And 1% here are those IMO/IOI winners who think everyone is just like them. I grew up with them and to you, my friends, I say: this is the reason why AI would not take over the world (and might even not be that useful for real world tasks), even if it wins every damn contest out there.
Here's an example problem 5:
Let a_1, a_2, …, a_n be distinct positive integers and let M = max over 1 ≤ i < j ≤ n of (a_i + a_j)(a_j − a_i).
Find the maximum number of pairs (i, j) with 1 ≤ i < j ≤ n for which (a_i + a_j)(a_j − a_i) = M.
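Assuming the statement above is reconstructed correctly, small cases are easy to brute-force, which is often a first exploratory step on problems like this (a sketch, not a proof):

```python
from itertools import combinations

def count_max_pairs(a):
    # For distinct positive integers, compute (a_i + a_j)(a_j - a_i) for every
    # pair with a_i < a_j, take the maximum M, then count pairs attaining M.
    vals = [(x + y) * (y - x) for x, y in combinations(sorted(a), 2)]
    M = max(vals)
    return sum(v == M for v in vals)

print(count_max_pairs([1, 2, 3, 4]))  # → 1
```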
Every time an LLM reaches a new benchmark there’s a scramble to downplay it and move the goalposts for what should be considered impressive.
The International Math Olympiad was used by many people as an example of something that would be too difficult for LLMs. It has been a topic of discussion for some time. The fact that an LLM has achieved this level of performance is very impressive.
You’re downplaying the difficulty of these problems. It’s called international because the best in the entire world are challenged by it.
Either you are unfamiliar with the International Math Olympiad or you’re trying to be misleading.
Calling these problems high school/early university maths is a ridiculous characterization.
This is a ridiculous understatement of the difficulty of getting gold at the IMO.
Many of them are also questions whose eventual proofs or solutions require only a very high-level command of basic principles, plus the ability to combine simple concepts to solve complex problems. But when I say very high, I mean impossibly high for the average person.
I'd wager the majority of Math graduates from universities would struggle to answer most IMO questions.
Take a look.
It's really hard.
See my other comment. I was voted the best at math in my entire high school by my teachers, completed the first two years of college classes while still in high school. I've tried IMO problems for fun. I'm very happy if I get one right. I'd be infinitely satisfied to score a perfect on 3 out of 6 problems and that's nowhere near gold.
Not sure there is a good writeup about it yet but here is the livestream: https://www.youtube.com/live/TG3ChQH61vE.
>GPT5 soon
>it will not be as good as this secret(?) model
Second, happy to test it on open math conjectures or by attempting to reprove recent math results.
For 2, there's an army of independent mathematicians right now using automated theorem provers to formalise more or less all mathematics as we know it. It seems like open conjectures are chiefly bounded by a genuine lack of new tools/mathematics.
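For a flavor of what that formalisation work looks like, here is a toy Lean 4 statement (a sketch assuming mathlib-style tactics like `obtain`; real formalised libraries such as mathlib are vastly larger):

```lean
-- Toy formalised lemma: the sum of two even naturals is even.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha
  obtain ⟨n, hn⟩ := hb
  exact ⟨m + n, by rw [hm, hn, Nat.mul_add]⟩
```

Statements at this level are trivial; the open question upthread is whether models can supply genuinely new tools, not just proofs the provers can already check.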
I wouldn't trust these results as it is. Considering that there are trillions of dollars on the line as a reward for hyping up LLMs, I trust it even less.
What? This claim has all the trustworthiness of OpenAI's claim. I mean, I could claim anything I want at this point and it would be just as trustworthy as OpenAI's claim, with exactly zero details about anything other than "we did it, promise".
any details?
I mean it is quite impressive how language models are able to mobilize the knowledge they have been trained on, especially since they are able to retrieve information from sources that may be formatted very differently, with completely different problem statement sentences, different variable names and so on, and really operate at the conceptual level.
But we must be wary of mixing up smart information retrieval with reasoning.
I kid, this is actually pretty amazing!! I've noticed over the last several months that I've had to correct it less and less when dealing with advanced math topics so this aligns.
We are simply greasing the grooves and letting things slide faster and faster and calling it progress. How does this help to make the human and nature integration better?
Does this improve climate or make humans adapt better to changing climate? Are the intelligent machines a burning need for the humanity today? Or is it all about business and political dominance? At what cost? What's the fall out of all this?
No human has any idea how to accomplish that. If a machine could, we would all have much to learn from it.