OpenAI o3 and o4-mini (openai.com)
_fat_santa 3 hours ago [-]
So at this point OpenAI has 6 reasoning models, 4 flagship chat models, and 7 cost optimized models. That's 17 models in total, and that's not even counting their older models and more specialized ones. Compare this with Anthropic, which has 7 models in total and 2 main ones that they promote.

This is just getting to be a bit much; it seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things, and released it as an entirely new model rather than updating the existing ones. In fact, based on some of the other comments here, it sounds like these are just updates to their existing models, but they release them as new models to create more media buzz.

nopinsight 2 minutes ago [-]
The AlexNet paper which kickstarted the deep learning era in 2012 was ahead of the 2nd-best entry by 11%. Many publishable AI papers then advanced SOTA by just a couple percentage points.

o3 high is about 9% ahead of o1 high on livebench.ai, and there are quite a few testimonials about the difference as well.

Yes, AlexNet made major strides in other aspects then too, but it’s been just 7 months since o1-preview, the first publicly available reasoning model, which is a seminal advance beyond previous LLMs.

It seems some people are desensitized to how rapidly things are moving in AI, even though no other field in human history has progressed at this pace.

Ref: - https://proceedings.neurips.cc/paper_files/paper/2012/file/c...

- https://livebench.ai/#/

shmatt 2 hours ago [-]
I'm old enough to remember the mystery and hype before o*/o1/strawberry, which was supposed to be essentially AGI. We had serious news outlets writing about senior people at OpenAI quitting because o1 was Skynet.

Now we're up to o4, and AGI is still nowhere in sight (depending on your definition, I know). And OpenAI is up to about 5,000 employees. I'd think that even before AGI, a new model would be able to cover for at least 4,500 of those employees being fired. Is that not the case?

chrsw 15 minutes ago [-]
I’m not an AI researcher but I’m not convinced these contemporary artificial neural networks will get us to AGI, even assuming an acceleration of the current scaling pace. Maybe my definition of AGI is off, but I’m thinking what that means is a machine that can think, learn, and behave in the world in ways very close to human. I think we need a fundamentally different paradigm for that. Not something that is just trained and deployed like current models, but something that is constantly observing, constantly learning, and constantly interacting with the real world like we do. AHI, not AGI. True AGI may not exist because there are always compromises of some kind.

But, we don’t need AGI/AHI to transform large parts of our civilization. And I’m not seeing this happen either.

pants2 2 hours ago [-]
Remember that Docusign has 7,000 employees. I think OpenAI is pretty lean for what they're accomplishing.
steamrolled 38 minutes ago [-]
I don't think these comparisons are useful. Every time you look at companies like LinkedIn or Docusign, yeah, they have a lot of staff, but a significant proportion of it is in functions like sales, customer support, and regulatory compliance across a bazillion different markets, along with all the internal tooling and processes you need to support that.

OpenAI is at a much earlier stage in their adventures and probably doesn't have that much baggage. Given their age and revenue streams, their headcount is quite substantial.

shmatt 1 hours ago [-]
If we're making comparisons, it's more like someone selling a $10,000 course on how to be a millionaire.

Not directly from OpenAI - but people in the industry are advertising how these advanced models can replace employees, yet they keep going on hiring tears (including OpenAI). Let's see the first company stand behind their models and replace 50% of their existing headcount with agents. That to me would be a sign these things are going to replace people's jobs. Until I see that, if OpenAI can't figure out how to replace humans with models, then no one will.

I mean, could you imagine if today's announcement was that the chatgpt.com webdev team has been laid off, and all new features and fixes will be completed by Codex CLI + o4-mini? That would mean they believe in the product they're advertising. Until they do something like that, they'll keep trusting those human engineers and try selling other people on the dream.

throwanem 2 hours ago [-]
[flagged]
andrewinardeer 1 hours ago [-]
The US is not a signatory to the International Criminal Court so you won't see Musk on trial there.
throwanem 57 minutes ago [-]
I hope I don't have to link this adjacent reply of mine too many more times: https://news.ycombinator.com/item?id=43709056 Specifically "The venue is a matter of convenience, nothing more," and if you prefer another, that would work about as well. Perhaps Merano; I hear it's a lovely little town.
ein0p 1 hours ago [-]
The closest Elon ever came to anything Hague-worthy is allowing Starlink to be used in Ukrainian attacks on Russian civilian infrastructure. I don't think the Hague would be interested in anything like that. And if his life is worthless, then what would you say about your own? Nonetheless, I commend you on your complete lack of hinges. /s
throwanem 1 hours ago [-]
Oh, I'm thinking more in the sense of the special one-off kinds of trials, the sort Gustave Gilbert so ably observed. The venue is a matter of convenience, nothing more. To the rest I would say the worth of my life is no more mine to judge than anyone else is competent to do the same for themselves, or indeed other than foolish to pursue the attempt.
scarface_74 1 hours ago [-]
Yes and Amazon has 1.52 million employees. How many developers could they possibly need?

Or maybe it’s just nonsensical to compare the number of employees across companies - especially when they don’t do nearly the same thing.

On a related note, wait until you find out how many more employees Apple has than Google, since Apple has so many retail employees.

jsnell 51 minutes ago [-]
Apple has fewer employees than Google (164k < 183k).
solardev 14 minutes ago [-]
Siri must be really good.
kupopuffs 53 minutes ago [-]
What kind of employees does Docusign employ? Surely digital documents don't require physical onsite distribution centers and labor.
scarface_74 46 minutes ago [-]
Just look at their careers page
stavros 2 hours ago [-]
> I'm old enough to remember the mystery and hype before o*/o1/strawberry

So at least two years old?

bananaflag 1 hours ago [-]
I think people expected reasoning to be more than just trained chain of thought (which was known already at the time). On the other hand, it is impressive that CoT can achieve so much.
throwanem 2 hours ago [-]
Honestly, sometimes I wonder if most people these days kinda aren't at least that age, you know? Or less inhibited about acting it than I believe I recall people being last decade. Even compared to just a few years back, people seem more often to struggle to carry a thought, and resort much more quickly to emotional belligerence.

Oh, not that I haven't been as knocked about in the interim, of course. I'm not really claiming I'm better, and these are frightening times; I hope I'm neither projecting nor judging too harshly. But even trying to discount for the possibility, there still seems something new left to explain.

irthomasthomas 2 hours ago [-]
Yeah, I don't know exactly what an AGI model will look like, but I think it would have more than a 200k context window.
kurthr 6 minutes ago [-]
I'd think it would be able to at least suggest which model to use rather than just having 6 for you to choose from.
actsasbuffoon 51 minutes ago [-]
Meanwhile even the highest ranked models can’t do simple logic tasks. GothamChess on YouTube did some tests where he played against a bunch of the best models and every single one of them failed spectacularly.

They’d happily lose a queen to take a pawn. They failed to understand how pieces are even allowed to move, hallucinated the existence of new pieces, repeatedly declared checkmate when it wasn’t, etc.

I tried it last night with Gemini 2.5 Pro and it made it 6 turns before it started making illegal moves, and 8 turns before it got so confused about the state of the board that it refused to play with me any longer.

I was in the chess club in 3rd grade. One of the top ranked LLMs in the world is vastly dumber than I was in 3rd grade. But we’re going to pour hundreds of billions into this in the hope that it can end my career? Good luck with that, guys.

schindlabua 37 minutes ago [-]
Chess is not exactly a simple logic task. It requires you to keep track of 32 things in a 2d space.

I remember being extremely surprised when I could ask GPT-3 to rotate a 3D model of a car in its head and ask it about what I would see when sitting inside, or which doors would refuse to open because they're in contact with the ground.

It really depends on how much you want to shift the goalposts on what constitutes "simple".

leesec 3 hours ago [-]
"haven't actually done much" being popularizing the chat llm and absolutely dwarfing the competition in paid usage
caconym_ 32 minutes ago [-]
Relative to the hype they've been spinning to attract investment, casting the launch and commercialization of ChatGPT as their greatest achievement really is quite a significant downgrade, especially given that they only got there first because they were the first entity reckless enough to deploy such a tool to the public.

It's easy to forget what smart, connected people were saying about how AI would evolve by <current date> ~a year ago, when in fact what we've gotten since then is a whole bunch of diminishing returns and increasingly sketchy benchmark shenanigans. I have no idea when a real AGI breakthrough will happen, but if you're a person who wants it to happen (I am not), you have to admit to yourself that the last year or so has been disappointing---even if you won't admit it to anybody else.

amarcheschi 3 hours ago [-]
I guess it was related to the last period, rather than the full picture
flkenosad 31 minutes ago [-]
What are people expecting here honestly? This thread is ridiculous.
spaceywilly 1 hours ago [-]
They have 500M weekly users now. I would say that counts as doing something.
littlestymaar 2 hours ago [-]
ChatGPT was released two and a half years ago though. Pretty sure that at some point Sam Altman had promised us AGI by now.

The person you're responding to is correct that OpenAI feels a lot more stagnant than other players (like Google, which was nowhere to be seen even a year and a half ago and now has the leading model on pretty much every metric, but also DeepSeek, who built a competitive model in a year that runs much cheaper).

iLoveOncall 3 hours ago [-]
ChatGPT was released in 2022, so OP's point stands perfectly well.
echelon 3 hours ago [-]
They're rumored to be working on a social network to rival X, with the focus being on image generation.

https://techcrunch.com/2025/04/15/openai-is-reportedly-devel...

The play now seems to be less AGI, more "too big to fail" / use all the capital to morph into a FAANG bigtech.

My bet is that they'll develop a suite of office tools that leverage their model, chat/communication tools, a browser, and perhaps a device.

They're going to try to turn into Google (with maybe a bit of Apple and Meta) before Google turns into them.

Near-term, I don't see late stage investors as recouping their investment. But in time, this may work out well for them. There's a tremendous amount of inefficiency and lack of competition amongst the big tech players. They've been so large that nobody else could effectively challenge them. Now there's a "startup" with enough capital to start eating into big tech's more profitable business lines.

hu3 2 hours ago [-]
I appreciate the info and I have a question:

Why would anyone use a social network run by Sam Altman? No offense but his reputation is chaotic neutral to say the least.

Social networks require a ton of momentum to get going.

BlueSky already ate all the momentum that X lost.

flkenosad 30 minutes ago [-]
Social networks have to be the most chaotic neutral thing ever made. It's like, "hey everyone! Come share what ever you want on my servers!"
echelon 31 minutes ago [-]
Most people don't care about techies or tech drama. They just use the platforms their friends do.

ChatGPT images are the biggest thing on social media right now. My wife is turning photos of our dogs into people. There's a new GPT4o meme trending on TikTok every day. Using GPT4o as the basis of a social media network could be just the kickstart a new social media platform needs.

fkyoureadthedoc 2 hours ago [-]
Not surprising. Add comments to sora.com and you've got a social network.
echelon 15 minutes ago [-]
Seriously. The users on sora.com are already trying to. They're sending messages to each other with the embedded image text and upvoting it.

GPT-4o and Sora are incredibly viral and organic, and they're taking over TikTok, Instagram, and all other social media.

If you're not watching casual social media you might miss it, but it's nothing short of a phenomenon.

ChatGPT is now the most downloaded app this month. Images are the reason for that.

refulgentis 2 hours ago [-]
I don't know how anyone could look at any of this and say ponderously: it's basically the same as Nov 2022 ChatGPT. Thus strategically they're pivoting to social to become too big to fail.
echelon 2 hours ago [-]
I mean, it's not fucking AGI/ASI. No amount of LLM flip floppery is going to get us terminators.

If this starts looking differently and the pace picks up, I won't be giving analysis on OpenAI anymore. I'll start packing for the hills.

But to OpenAI's credit, I also don't see how minting another FAANG isn't an incredible achievement. Like - wow - this tech giant was willed into existence. Can't we marvel at that a little bit without worrying about LLMs doing our taxes?

refulgentis 2 hours ago [-]
I don't know what AGI/ASI means to you.

I'm bullish on the models, and my first quiet 5 minutes after the announcement was spent thinking about how many of the people I walked past would have a different day if the computer Just Did It(tm). (I don't think their day would be different, so I'm not bullish on ASI-even-if-achieved, I guess?)

I think binary analysis that flips between "this is a propped up failure, like when banks get bailouts" and "I'd run away from civilization" isn't really worth much.

paul7986 2 hours ago [-]
ChatGPT should be built into my iMessage threads with friends: @chatGPT "Is there an evening train on Thursdays from Brussels to Berlin?" That's something a friend and I were discussing, but we had to exit out of iMessage, use GPT, and then go back to iMessage.

For UX, the GPT info in the thread would be collapsed by default, and both users would have the discretion to click to expand it.

swyx 3 hours ago [-]
seriously. the level of arrogance combined with ignorance is awe inspiring.
buzzerbetrayed 1 hours ago [-]
True. They've blown their absolutely massive lead with power users to Anthropic and Google. So they definitely haven't done nothing.
w10-1 1 hours ago [-]
> This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much

Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about. Subjectively, customers get locked in by feeling they have the inside track, and these small tweaks prove that. Objectively, the small change might make a real difference to the customer's use case.

Similarly, it's important to force development teams to actually ship, and shipping more frequently reduces risk, so this could reflect internal discipline.

As for media buzz, OpenAI is probably trying to tamp that down; they have plenty of first-mover advantage. More puffery just makes their competitors seem more important, and the risk to their reputation of a flop is a lot larger than the reward of the next increment.

As for "a bit much", before 2023 I was thinking I could meaningfully track progress and trade-off's in selecting tech, but now the cat is not only out of the bag, it's had more litters than I can count. So, yeah - a bit much!

sksxihve 1 hours ago [-]
> Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about

Or make important investors happy, they need to justify the latest $40 billion round

mrcwinn 60 minutes ago [-]
Well, in fairness, Anthropic has fewer models because 1) they started later, 2) could learn from competitors' mistakes, 3) focused on enterprise and not consumer, 4) have fewer resources.

The point is taken — and OpenAI agrees. They have said they are actively working on simplifying the offering. I just think it's a bit unfair. We have perfect hindsight today here on HackerNews and also did zero of the work to produce the product.

paxys 2 hours ago [-]
OpenAI isn't selling GPT-4 or o1 or o4-mini or turbo or whatever else to the general public. These announcements may as well be them releasing GPT v12.582.599385. No one outside of a small group of nerds cares. The end consumer is going to chatgpt.com and typing things in the box.
astrange 2 hours ago [-]
They have an enterprise business too. I think it's relevant for that.
M4v3R 17 minutes ago [-]
And that’s exactly why their model naming and release process looks like this right now.
jstummbillig 27 minutes ago [-]
I cannot believe that this is what we feel is most worth talking about here (by visibility). At this point I truly wonder if AI is what will make HN side with the Luddites.
flkenosad 26 minutes ago [-]
It's giving "they took our jerbs"
amarcheschi 3 hours ago [-]
The old Chinese strategy of having 7343 different phone models with almost the same specs to confuse the customer better
kylehotchkiss 3 hours ago [-]
This sounds like recent Dell and Lenovo strategies
whalesalad 2 hours ago [-]
recent? they've been doing this for decades.

person a: "I just got an new macbook pro!"

person b: "Nice! I just got a Lenovo YogaPilates Flipfold XR 3299 T92 Thinkbookpad model number SRE44939293X3321"

...

person a: "does that have oled?"

person b: "Lol no silly that is model SRE44939293XB3321". Notice the B in the middle?!?! That is for OLED.

anizan 14 minutes ago [-]
They should launch o999 and count backwards for each release till they hit oagi
crowcroft 2 hours ago [-]
Most industries, or categories go through cycles of fragmentation and consolidation.

AI is currently in a high-growth expansion phase. This leads to rapid iteration and fragmentation because getting things released is the most important thing.

When the models start to plateau or the demands on the industry are for profit you will see consolidation start.

airstrike 47 minutes ago [-]
having many models from the same company in some haphazard strategy doesn't equate to "industry fragmentation". it's just confusion
irthomasthomas 2 hours ago [-]
That would explain why they all have a knowledge cutoff (likely training date) of ~August 2023.
vunderba 1 hours ago [-]
> This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much.

Did you miss the 4o image generation announcement from roughly three weeks ago?

https://news.ycombinator.com/item?id=43474112

Combining a multimodal LLM+ImageGen puts them pretty significantly ahead of the curve at least in that domain.

Demonstration of the capabilities:

https://mordenstar.com/blog/chatgpt-4o-images

resters 2 hours ago [-]
They do this because people like to have predictability. A new model may behave quite differently on something that’s important for a use case.

Also, there are a lot of cases where very small models are just fine and others where they are not. It would always make sense to have the smallest highest performing models available.

kristofferR 3 hours ago [-]
To use that criticism for this release ain't really fair, as these will replace the old models (o3 will replace o1, o4-mini will replace o3-mini).

On a more general level - sure, but they aren't planning to use this release to add a larger number of models, it's just that deprecating/killing the old models can't be done overnight.

drcongo 2 hours ago [-]
As someone who doesn't use anything OpenAI (for all the reasons), I have to agree with the GP. It's all baffling. Why is there an o3-mini and an o4-mini? Why on earth are there so many models?

Once you get to this point you're putting the paradox of choice on the user - I used to use a particular brand of toothpaste for years until it got to the point where I'd be in the supermarket looking at a wall of toothpaste, all by the same brand, with no discernible difference between the products. Why is one of them called "whitening"? Do the others not do that? Why is this one called "complete" and that one called "complete ultra"? That would suggest that the "complete" one wasn't actually complete. I stopped using that brand of toothpaste as it became impossible to know which was the right product within the brand.

If I was assessing the AI landscape today, where the leading models are largely indistinguishable in day to day use, I'd look at OpenAI's wall of toothpaste and immediately discount them.

tedsanders 2 hours ago [-]
(I work at OpenAI.)

In ChatGPT, o4-mini is replacing o3-mini. It's a straight 1-to-1 upgrade.

In the API, o4-mini is a new model option. We continue to support o3-mini so that anyone who built a product atop o3-mini can continue to get stable behavior. By offering both, developers can test both and switch when they like. The alternative would be to risk breaking production apps whenever we launch a new model and shut off developers without warning.

I don't think it's too different from what other companies do. Like, consider Apple. They support dozens of iPhone models with their software updates and developer docs. And if you're an app developer, you probably want to be aware of all those models and docs as you develop your app (not an exact analogy). But if you're a regular person and you go into an Apple store, you only see a few options, which you can personalize to what you want.

If you have concrete suggestions on how we can improve our naming or our product offering, happy to consider them. Genuinely trying to do the best we can, and we'll clean some things up later this year.

Fun fact: before GPT-4, we had a unified naming scheme for models that went {modality}-{size}-{version}, which resulted in names like text-davinci-002. We considered launching GPT-4 as something like text-earhart-001, but since everyone was calling it GPT-4 anyway, we abandoned that system to use the name GPT-4 that everyone had already latched onto. Kind of funny how our unified naming scheme originally made room for 999 versions, but we didn't make it past 3.

daveguy 1 hours ago [-]
Have any of the models been deprecated? It seems like a deprecation plan and definition of timelines would be extraordinarily helpful.

I have not seen any sort of "If you're using X.122, upgrade to X.123, before 202X. If you're using X.120, upgrade to anything before April 2026, because the model will no longer be available on that date." ... Like all operating systems and hardware manufacturers have been doing for decades.

Side note, it's amusing that stable behavior is only available on a particular model with a sufficiently low temperature setting. As near-AGI shouldn't these models be smart enough to maintain consistency or improvement from version to version?

tedsanders 1 hours ago [-]
Yep, we have a page of API deprecation details here: https://platform.openai.com/docs/deprecations

It's got all deprecations, ordered by date of announcement, alongside shutdown dates and recommended replacements.

Note that we use the term deprecated to mean slated for shutdown, and shutdown to mean when it's actually shut down.

In general, we try to minimize developer pain by supporting models for as long as we reasonably can, and we'll give a long heads up before any shutdown. (GPT-4.5-preview was a bit of an odd case because it was launched as a potentially temporary preview, so we only gave a 3-month notice. But generally we aim for much longer notice.)

mkozlows 2 hours ago [-]
They keep a lot of models around for backward compatibility for API users. This is confusing, but not inherently a bad idea.
louthy 2 hours ago [-]
You could develop an AI model to help pick the correct AI model.

Now you’ve got 18 problems.

skygazer 27 minutes ago [-]
I think you're trying to re-contextualize the old Standards joke, but I actually think you're right -- if a front end model could dispatch as appropriate to the best backend model for a given prompt, and turn everything into a high level sort of mixture of models, I think that would be great, and a great simplifying step. Then they can specialize and optimize all they want, CPU goes down, responses get better and we only see one interface.
louthy 18 minutes ago [-]
> I think you're trying to re-contextualize the old Standards joke

Regex joke [1], but the standards joke will do just fine also :)

[1] Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

petesergeant 2 hours ago [-]
> Why is there an o3-mini and an o4-mini? Why on earth are there so many models?

Because if they removed access to o3-mini — which I have tested, costed, and built around — I would be very angry. I will probably switch to o4-mini when the time is right.

TuxSH 2 hours ago [-]
They just did that, at least for chat
whalesalad 2 hours ago [-]
Model fatigue is a real thing, particularly with their billing model, which is wildly different from model to model and gives you more headroom as you spend more. We spend a lot of time and effort running tests across many models to balance for that cost/performance ratio. When you can run 300k tokens per minute on a shittier model, or 10k tokens per minute on a better model, you want to use the cheaper model, but if the performance isn't there then you gotta pivot. Can I use tools here? Can I use function calling here? Do I use the chat API, the chat completions API, or the responses API? Does either of those work with the model I want to use, or only with other models?

I almost wonder if this is intentional ... because when you create a quagmire of insane inter-dependent billing scenarios you end up with a product like AWS that can generate substantial amounts of revenue from sheer ignorance or confusion. Then you can hire special consultants to come in and offer solutions to your customers in order to wade through the muck on your behalf.

Dealing with OpenAI's API's is a straight up nightmare.

wilg 3 hours ago [-]
There are 9 models in the ChatGPT model picker and they have stated that it's their goal to get rid of the model picker because everyone finds it annoying.
onlyrealcuzzo 2 hours ago [-]
> All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model rather than updating the existing ones.

That's not a problem in and of itself. It's only a problem if the models aren't good enough.

Judging by ChatGPT's adoption, people seem to think they're doing just fine.

erikw 1 hours ago [-]
Interesting... I asked o3 for help writing a flake so I could install the latest Webstorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the Webstorm package, wrote the Flake, calculated the SHA hash that NixOS needs, and wrote a test suite. The test suite indicates that it even did GUI testing- not sure whether that is a hallucination or not though. Nevertheless, it one-shotted the installation instructions for me, and I don't see how it could have calculated the package hash without downloading, so I think this indicates some very interesting new capabilities. Highly impressive.
ai-christianson 39 minutes ago [-]
> Interesting... I asked o3 for help writing...

What tool were you using for this?

peterldowns 45 minutes ago [-]
If it can write a nixos flake it's significantly smarter than the average programmer. Certainly smarter than me, one-shotting a flake is not something I'll ever be able to do — usually takes me about thirty shots and a few minutes to cool off from how mad I am at whoever designed this fucking idiotic language. That's awesome.
ZeroTalent 7 minutes ago [-]
I was a major contributor of Flake. What in particular is so idiotic in your opinion?
yjftsjthsd-h 3 minutes ago [-]
FWIW, they said the language was bad, not specifically flakes. IMHO, nix is super easy if you already know Haskell (possibly others in that family). If you don't, it's extremely unintuitive.
georgewsinger 3 hours ago [-]
Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient the Claude models have been at staying best-in-class for coding.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

jjani 3 hours ago [-]
Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but they don't have an SWE-bench score. Shows that looking at one such benchmark isn't very telling. Main advantage over Sonnet being that it's better at using a large amount of context, which is enormously helpful during coding tasks.

Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.

unsupp0rted 3 hours ago [-]
Main advantage over Sonnet is Gemini 2.5 doesn't try to make a bunch of unrelated changes like it's rewriting my project from scratch.
itsmevictor 3 hours ago [-]
I find Gemini 2.5 truly remarkable and overall better than Claude, which I was a big fan of
enraged_camel 2 hours ago [-]
Still doesn't work well in Cursor unfortunately.
ai-christianson 37 minutes ago [-]
Works well in RA.Aid --in fact I'd recommend it as the default model in terms of overall cost and capability.
bitbuilder 2 hours ago [-]
This was incredibly irritating at first, though over time I've learned to appreciate this "extra credit" work. It can be fun to see what Claude thinks I can do better, or should add in addition to whatever feature I just asked for. Especially when it comes to UI work, Claude actually has some pretty cool ideas.

If I'm using Claude through Copilot where it's "free" I'll let it do its thing and just roll back to the last commit if it gets too ambitious. If I really want it to stay on track I'll explicitly tell it in the prompt to focus only on what I've asked, and that seems to work.

And just today, I found myself leaving a comment like this: //Note to Claude: Do not refactor the below. It's ugly, but it's supposed to be that way.

Never thought I'd see the day I was leaving comments for my AI agent coworker.

TuxSH 1 hours ago [-]
> If I'm using Claude through Copilot where it's "free"

Too bad Microsoft is widely limiting this -- have you seen their pricing changes?

I also feel like they nerfed their models, or reduced context window again.

erikw 1 hours ago [-]
What language / framework are you using? I ask because in a Node / Typescript / React project I experience the opposite- Claude 3.7 usually solves my query on the first try, and seems to understand the project's context, ie the file structure, packages, coding guidelines, tests, etc, while Gemini 2.5 seems to install packages willy-nilly, duplicate existing tests, create duplicate components, etc.
Workaccount2 2 hours ago [-]
Its viable context (the context length where it doesn't fall apart) is also much longer.
zaptrem 2 hours ago [-]
I do find it likes to subtly reformat every single line thereby nuking my diff and making its changes unusable since I can’t verify them that way, which Sonnet doesn’t do.
jdgoesmarching 3 hours ago [-]
Also that Gemini 2.5 still doesn’t support prompt caching, which is huge for tools like Cline.
scrlk 2 hours ago [-]
jdgoesmarching 2 hours ago [-]
Oh, that must’ve been in the last few days. Weird that it’s only in 2.5 Pro preview but at least they’re headed in the right direction.

Now they just need a decent usage dashboard that doesn’t take a day to populate or require additional GCP monitoring services to break out the model usage.

spaceman_2020 58 minutes ago [-]
I feel that Claude 3.7 is smarter, but does way too much and has poor prompt adherence
oofbaroomf 3 hours ago [-]
Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)").[0] OpenAI said they got 69.1% in their blog post.

[0] swebench.com/#verified

georgewsinger 3 hours ago [-]
Yes, however Claude advertised 70.3%[1] on SWE bench verified when using the following scaffolding:

> For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results.

Arguably this shouldn't be counted though?

[1] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...

tedsanders 2 hours ago [-]
I think you may have misread the footnote. That simpler setup results in the 62.3%/63.7% score. The 70.3% score results from a high-compute parallel setup with rejection sampling and ranking:

> For our “high compute” number we adopt additional complexity and parallel test-time compute as follows:

> We sample multiple parallel attempts with the scaffold above

> We discard patches that break the visible regression tests in the repository, similar to the rejection sampling approach adopted by Agentless; note no hidden test information is used.

> We then rank the remaining attempts with a scoring model similar to our results on GPQA and AIME described in our research post and choose the best one for the submission.

> This results in a score of 70.3% on the subset of n=489 verified tasks which work on our infrastructure. Without this scaffold, Claude 3.7 Sonnet achieves 63.7% on SWE-bench Verified using this same subset.
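Roughly, that recipe looks like the sketch below (the helper names are hypothetical stand-ins, not anything from Anthropic's actual harness): sample several candidate patches, reject the ones that break the visible tests, then rank the survivors and submit the best.

```
import random

# Hypothetical stand-ins: in the real setup these would call the model,
# run the repository's visible regression tests, and query a scoring model.
def sample_patch(task: str) -> str:
    return f"candidate patch {random.random():.3f} for {task}"

def passes_visible_tests(patch: str) -> bool:
    return random.random() > 0.3

def score_with_ranking_model(patch: str) -> float:
    return random.random()

def best_of_n_patch(task: str, n: int = 8):
    attempts = [sample_patch(task) for _ in range(n)]             # parallel sampling
    survivors = [p for p in attempts if passes_visible_tests(p)]  # rejection sampling
    # Rank remaining attempts with the scoring model and pick the best one.
    return max(survivors, key=score_with_ranking_model, default=None)
```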

georgewsinger 36 minutes ago [-]
Somehow completely missed that, thanks!

I think reading this makes it even clearer that the 70.3% score should just be discarded from the benchmarks. "I got a 7%-8% higher SWE benchmark score by doing a bunch of extra work and sampling a ton of answers" is not something a typical user is going to have already set up when logging onto Claude and asking it a SWE style question.

Personally, it seems like an illegitimate way to juice the numbers to me (though Claude was transparent with what they did so it's all good, and it's not uninteresting to know you can boost your score by 8% with the right tooling).

awestroke 3 hours ago [-]
OpenAI have not shown themselves to be trustworthy, I'd take their claims with a few solar masses of salt
swyx 3 hours ago [-]
they also gave more detail on their SWEBench scaffolding here https://www.latent.space/p/claude-sonnet
pizzathyme 1 hours ago [-]
The image generation improvement with o4-mini is incredible. Testing it out today, this is a step change in editing specificity even from the ChatGPT 4o LLM image integration just a few weeks ago (which was already a step change). I'm able to ask for surgical edits, and they are done correctly.

There isn't a numerical benchmark for this that people seem to be tracking but this opens up production-ready image use cases. This was worth a new release.

lattalayta 3 hours ago [-]
I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks
emp17344 2 hours ago [-]
That’s exactly what’s happening. I’m not convinced there’s any real progress occurring here.
thefourthchime 2 hours ago [-]
Also, if you're using Cursor AI, it seems to have much better integration with Claude where it can reflect on its own things and go off and run commands. I don't see it doing that with Gemini or the O1 models.
osigurdson 2 hours ago [-]
I have a very basic / stupid "Turing test", which is just to write a base 62 converter in C#. I would think this exact thing would be on GitHub somewhere (thus in the weights), but it has always failed for me in the past (non-scientific / didn't try every single model).

Using o4-mini-high, it actually did produce a working implementation after a bit of prompting. So yeah, today, this test passed which is cool.

sebzim4500 2 hours ago [-]
Unless I'm misunderstanding what you are asking the model to do, Gemini 2.5 pro just passed this easily. https://g.co/gemini/share/e2876d310914
osigurdson 2 hours ago [-]
As I mentioned, this is not a scientific test but rather just something that I have tried from time to time and has always (shockingly, in my opinion) failed, but today it worked. It takes a minute or two of prompting, is boring to verify, and I don't remember exactly which models I have used. It is purely a personal anecdote, nothing more.

However, looking at the code that Gemini wrote in the link, it does the same thing that other LLMs often do, which is to assume that we are encoding individual long values. I assume there must be a GitHub repo or Stack Overflow question in the weights somewhere that is pushing it in this direction, but it is a little odd. Naturally, this isn't the kind of encoder that someone would normally want. Typically it should encode a byte array and return a string (or maybe encode/decode UTF-8 strings directly). Having the interface use a long is very weird and not very useful.

In any case, I suspect with a bit more prompting you might be able to get gemini to do the right thing.
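For reference, a byte-array-oriented interface like the one described above would look roughly like this (a minimal Python sketch rather than C#; the alphabet ordering and the handling of leading zero bytes are arbitrary choices here, not a standard):

```
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 chars

def base62_encode(data: bytes) -> str:
    """Encode an arbitrary byte array as a base62 string."""
    # Preserve leading zero bytes, which would otherwise vanish when the
    # payload is treated as one big integer.
    n_zeros = len(data) - len(data.lstrip(b"\x00"))
    num = int.from_bytes(data, "big")
    chars = []
    while num > 0:
        num, rem = divmod(num, 62)
        chars.append(ALPHABET[rem])
    return ALPHABET[0] * n_zeros + "".join(reversed(chars))

def base62_decode(text: str) -> bytes:
    """Decode a base62 string produced by base62_encode."""
    n_zeros = len(text) - len(text.lstrip(ALPHABET[0]))
    num = 0
    for ch in text:
        num = num * 62 + ALPHABET.index(ch)
    body = num.to_bytes((num.bit_length() + 7) // 8, "big") if num else b""
    return b"\x00" * n_zeros + body
```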

croemer 52 minutes ago [-]
I asked o3 to build and test a maximum parsimony phylogenetic tree builder in Python (my standard test for new models) and it's been thinking for 10 minutes. Still not clear if anything is happening, I have barely seen any code since I asked to test what it produced in the first answer. The thought summary is totally useless compared to Gemini's. Underwhelming so far.

The CoT summary is full of references to Jupyter notebook cells. The variable names are too abbreviated (nbr for neighbor), and the code becomes fairly cryptic as a result; not nice to read. Maybe optimized too much for speed.

Also I've noticed ChatGPT seems to abort thinking when I switch away from the app. That's stupid, I don't want to look at a spinner for 5 minutes.

And the CoT summary keeps mentioning my name which is irritating.

andrethegiant 3 hours ago [-]
Buried in the article, a new CLI for coding:

> Codex CLI is fully open-source at https://github.com/openai/codex today.

ipsum2 3 hours ago [-]
Looks like a Claude Code clone.
jumpCastle 19 minutes ago [-]
But open source like aider
zapnuk 3 hours ago [-]
Surprisingly, they didn't provide a comparison to Sonnet 3.7 or Gemini Pro 2.5—probably because, while both are impressive, they're only slightly better by comparison.

Let's see what the pricing looks like.

Workaccount2 3 hours ago [-]
Looks like they are taking a page from Apple's book, which is to never even acknowledge other products exist outside your ecosystem.
oofbaroomf 3 hours ago [-]
They didn't provide a comparison either in the GPT-4.1 release and quite a few past releases, which is telling of their attitude as an org.
BeetleB 3 hours ago [-]
Pricing is already available:

https://platform.openai.com/docs/pricing

ben_w 3 hours ago [-]
4o and o4 at the same time. Excellent work on the product naming, whoever did that.
stavros 2 hours ago [-]
Oh, that was Altman Sam.
ai-christianson 29 minutes ago [-]
Am Saltman
stavros 27 minutes ago [-]
Enter.
janderson215 2 hours ago [-]
It took me reading your comment to realize that they were different and this wasn’t deja vu. Maybe that says more about me than OpenAI, but my gut agrees with you.
throwuxiytayq 2 hours ago [-]
Just wait until they announce oA and A0.

They jokingly admitted that they’re bad at naming in the 4.1 reveal video, so they’re certainly aware of the problem. They’re probably hoping to make the model lineup clearer after some of the older models get retired, but the current mess was certainly entirely foreseeable.

carlita_express 3 hours ago [-]
> we’ve observed that large-scale reinforcement learning exhibits the same “more compute = better performance” trend observed in GPT‑series pretraining.

Didn’t the pivot to RL from pretraining happen because the scaling “law” didn’t deliver the expected gains? (Or at least because O(log) increases in model performance became unreasonably costly?) I see they’ve finally resigned themselves to calling these trends, not laws, but trends are often fleeting. Why should we expect this one to hold for much longer?

anothermathbozo 1 hours ago [-]
This isn't exactly the case. The trend is a log scale, so a 10x in pretraining should yield a 10% increase in performance. That's not proving to be false per se; rather, they are encountering practical limitations around 10x'ing data volume and 10x'ing available compute.
carlita_express 27 minutes ago [-]
I am aware of that, like I said:

> (Or at least because O(log) increases in model performance became unreasonably costly?)

But, yes, I left implicit in my comment that the trend might be “fleeting” because of its impracticality. RL is only a trend so long as it is fashionable, and only fashionable (i.e., practical) so long as OpenAI is fed an exponential amount of VC money to ensure linear improvements under O(log) conditions.

OpenAI is selling to VCs the idea that some hitherto unspecified amount of linear model improvement will kick off productivity gains greater than their exponentially increasing investment. These productivity gains would be no less than a sizeable percentage of American GDP, which Altman has publicly set as his target. But as the capital required increases exponentially, the gap between linearly increasing model capability (i.e., its productivity) and the breakeven ROI target widens. The bigger model would need to deliver a non-linear increase in productivity to justify the exponential price tag.

jumploops 10 minutes ago [-]
The big step function here seems to be RL on tool calling.

Claude 3.7/3.5 are the only models that seem to be able to handle "pure agent" usecases well (agent in a loop, not in an agentic workflow scaffold[0]).

OpenAI has made a bet on reasoning models as the core to a purely agentic loop, but it hasn't worked particularly well yet (in my own tests, though folks have hacked a Claude Code workaround[1]).

o3-mini has been better at some technical problems than 3.7/3.5 (particularly refactoring, in my experience), but still struggles with long chains of tool calling.

My hunch is that these models were tuned _with_ OpenAI Codex[2], which is presumably what Anthropic was doing internally with Claude Code on 3.5/3.7

tl;dr - GPT-3 launched with completions (predict the next token), then OpenAI fine-tuned that model on "chat completions" which then led GPT-3.5/GPT-4, and ultimately the success of ChatGPT. This new agent paradigm, requires fine-tuning on the LLM interacting with itself (thinking) and with the outside world (tools), sans any human input.

[0]https://www.anthropic.com/engineering/building-effective-age...

[1]https://github.com/1rgs/claude-code-proxy

[2]https://openai.com/index/openai-codex/

testfrequency 3 hours ago [-]
As a consumer, it is so exhausting keeping up with what model I should or can be using for the task I want to accomplish.
1123581321 3 hours ago [-]
I think it can be confusing if you're just reading the news. If you use ChatGPT, the model selector has good, brief explanations and teaches you about newly available options even if you don't visit the dropdown. Anthropic does similarly.
yoyohello13 19 minutes ago [-]
The answer is to just use the latest Claude model and not worry beyond that.
energy123 2 hours ago [-]
Gemini 2.5 Pro for every single task was the meta until this release. Will have to reassess now.
hollerith 2 hours ago [-]
Huh. I use Gemini 2.0 Flash for many things because it's several times faster than 2.5 Pro.
mring33621 25 minutes ago [-]
Agreed.

I pretty much stopped shopping around once Gemini 2.0 Flash came out.

For general, cloud-centric software development help, it does the job just fine.

I'm honestly quite fond of this Gemini model. I feel silly saying that, but it's true.

blueprint 1 hours ago [-]
how do you deal with the fact that they use all of your data for training their own systems and review all conversations
n2d4 3 hours ago [-]
This one seems to make it easier — if the promises here hold true, the multi-modal support probably makes o4-mini-high OpenAI's best model for most tasks unless you have time and money, in which case it's o3-pro.
darioush 3 hours ago [-]
It's becoming a bit like iphone 3, 4... 13, 25...

Ok they are all phones that run apps and have a camera. I'm not an "AI power user", but I do talk to ChatGPT + Grok for daily tasks and use copilot.

The big step function happened when they could search the web but not much else has changed in my limited experience.

refulgentis 3 hours ago [-]
This is a very apt analogy.

It confers to the speaker confirmation they're absolutely right - names are arbitrary.

While also politely, implicitly, pointing out the core issue is it doesn't matter to you --- which is fine! --- but it may just be contributing to dull conversation to be the 10th person to say as much.

tempaccount420 3 hours ago [-]
As another consumer, I think you're overreacting, it's not that bad.
CamperBob2 3 hours ago [-]
I asked OpenAI how to choose the right USB cable for my device. Now the objects around me are shimmering and winking out of existence, one by one. Help
ithkuil 2 hours ago [-]
Lol. But that's nothing. Wait until you shimmer and wink in and out of existence, like llms do during each completion
fkyoureadthedoc 3 hours ago [-]
As a consumer, I read the list and 5 word description of each thing one time and now I know. Don't you guys ever get tired of pretending you can't read and remember like 4 simple things? How do you even do your job?
testfrequency 3 hours ago [-]
I’m assuming when you say “read once”, that implies reading once every single release?

It’s confusing. If I’m confused, it’s confusing. This is UX 101.

sebzim4500 2 hours ago [-]
Aside from anything else, having one model called o4 and one model called 4o is confusing. And I know they haven't released o4 yet but still.
mrits 3 hours ago [-]
Some people don't blindly trust the marketing department of the publisher
fkyoureadthedoc 3 hours ago [-]
Then it doesn't even matter what they name the model since it's just marketing that they wouldn't trust anyway.
czk 3 hours ago [-]
"good at advanced reasoning", "fast at advanced reasoning", "slower at advanced reasoning but more advanced than the good one but not as fast but cant search the internet", "great at code and logic", "good for everyday tasks but awful at everything else", "faster for most questions but answers them incorrectly", "can draw but cant search", "can search but cant draw", "good for writing and doing creative things"
fkyoureadthedoc 3 hours ago [-]
Putting the actual list would have made it too clear that I'm right I see
brap 3 hours ago [-]
Where's the comparison with Gemini 2.5 Pro?
gallerdude 3 hours ago [-]
For coding, I like the Aider polyglot benchmark, since it covers multiple programming languages.

Gemini 2.5 Pro got 72.9%

o3 high gets 81.3%, o4-mini high gets 68.9%

croemer 37 minutes ago [-]
Isn't it easy to train on the specific Exercism exercises that this benchmark uses?
jumpCastle 16 minutes ago [-]
It was a good benchmark until it entered the training set.
vessenes 2 hours ago [-]
where do you find those o3 high numbers? https://aider.chat/docs/leaderboards/ currently has gemini 2.5 pro as the leader at, as you say, 72.9%.
re-thc 2 hours ago [-]
It's in the OpenAI article post (OP) i.e. OpenAI ran Aider themselves.
asadm 3 hours ago [-]
thanks
SweetSoftPillow 2 hours ago [-]
Some sources mention that o3 scores 63.8 on SWE-bench, while Gemini 2.5 Pro scores 69.1.

On most other benchmarks, they seem to perform about the same, which is bad news for o3 because it's much more expensive and slower than Gemini 2.5 Pro, and it also hides its reasoning while Gemini shows everything.

We can probably just stick with Gemini 2.5 Pro, since it offers the best combination of price, quality, and speed. No need to worry about finding a replacement (for now).

kridsdale1 3 hours ago [-]
Exactly.
burke 3 hours ago [-]
It's pretty frustrating to see a press release with "Try on ChatGPT" and then not see the models available even though I'm paying them $200/mo.
TuxSH 3 hours ago [-]
They're supposed to be released today for everyone, and o3-pro for Pro users in a few weeks:

"ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high."

with rate limits unchanged

_bin_ 3 hours ago [-]
I see o4-mini on the $20 tier but no o3.
wilg 3 hours ago [-]
They are all now available on the Pro plan. Y'all really ought to have a little bit more grace to wait 30 minutes after the announcement for the rollout.
drcongo 2 hours ago [-]
Or maybe OpenAI could wait until they'd released it before telling people to use it now.
brcmthrowaway 2 hours ago [-]
Holy crap... thats expensive.
meetpateltech 3 hours ago [-]
o3 is cheaper than o1 (per 1M tokens):

• o3 Pricing:

  - Input: $10.00

  - Cached Input: $2.50

  - Output: $40.00

• o1 Pricing:

  - Input: $15.00

  - Cached Input: $7.50

  - Output: $60.00

o4-mini pricing remains the same as o3-mini.
jawiggins 2 hours ago [-]
In the examples they demonstrate tool use in the reasoning loop. The models pretty impressively recognize they need some external data, and either complete a web search, or write and execute python to solve intermediate steps.

To the extent that reasoning is noisy and models can go astray during it, this helps inject truth back into the reasoning loop.

Is there some well-known equivalent to Moore's Law for token use? We're headed in a direction where LLM control loops can run 24/7, generating tokens to reason about live sensor data and calling tools to act on it.
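A minimal sketch of that kind of control loop (the names and message shapes here are made up for illustration, not any particular vendor's API): the model either returns a final answer or requests a tool, and each tool result is appended to the context before the next reasoning step.

```
import json

def call_model(messages: list) -> dict:
    """Stand-in for a chat-completion call. Assumed to return either
    {"type": "answer", "content": "..."} or
    {"type": "tool_call", "name": "web_search", "arguments": {...}}."""
    raise NotImplementedError("replace with a real API client")

TOOLS = {
    "web_search": lambda args: f"search results for {args['query']}",  # stub
    "run_python": lambda args: repr(eval(args["expression"])),         # toy only
}

def agent_loop(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(messages)
        if step["type"] == "answer":
            return step["content"]
        # Run the requested tool and feed the observed result back into the
        # context, grounding the next round of reasoning in real data.
        result = TOOLS[step["name"]](step["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "step limit reached without a final answer"
```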

evaneykelen 3 hours ago [-]
A suggestion for OpenAI to create more meaningful model names:

{Size}-{Quarter/Year}-{Speed/Accuracy}-{Specialty}

Where:

* Size is XS/S/M/L/XL/XXL to indicate overall capability level

* Quarter/Year like Q2-25

* Speed/Accuracy indicated as Fast/Balanced/Precise

* Optional specialty tag like Code/Vision/Science/etc

Example model names:

* L-Q2-25-Fast-Code (Large model from Q2 2025, optimized for speed, specializes in coding)

* M-Q4-24-Balanced (Medium model from Q4 2024, balanced speed/accuracy)

oofbaroomf 3 hours ago [-]
This is even more incomprehensible to users who don't understand what this naming scheme is supposed to mean. Right now, most power users are keeping track of all the models and know what they are like, so this naming wouldn't help them. Normal consumers don't really know the difference between the models, but this wouldn't help them either - all those letters and numbers aren't super inviting and friendly. They could try just having a linear slider for amount of intelligence and another one for speed.
jsnell 2 hours ago [-]
I think they should name them after fictional characters. Bonus points if they're trademarked characters.

"You gotta try Mickey, it beats the crap out of Gandalf in coding."

LanceJones 2 hours ago [-]
What about using Marvel superhero names (with permission, of course)? The studio keeps giving us stronger and stronger examples...
pembrook 2 hours ago [-]
Thank god we don’t usually let engineers name stuff in the west.

While this is entirely logical in theory this is how you get LG style naming like “THE ALL NEW LG-CFT563-X2”

I mean, it makes total sense, it tells you exactly the model, region, series and edition! Right??

ApolloFortyNine 3 hours ago [-]
Maybe OpenAI needs an easy mode for all these people saying 5 choices of models (and that's only if you pay) is simply too confusing for them.

They even provide a description in the UI of each before you select it, and it defaults to a model for you.

If you just want an answer of what you should use and can't be bothered to research them, just use o3(4)-mini and call it a day.

brokencode 3 hours ago [-]
I personally like being able to choose because I understand the tradeoffs and want to choose the best one for what I’m asking. So I hope this doesn’t go away.

But I agree that they probably need some kind of basic mode to make things easier for the average person. The basic mode should decide automatically what model to use and hide this from the user.

jdross 3 hours ago [-]
The pace of notable releases across the industry right now is unlike any time I remember since I started doing this in the early 2000's. And it feels like it's accelerating
qoez 2 hours ago [-]
Lots of releases but very little actual performance increases
int_19h 2 hours ago [-]
Sonnet and Gemini saw fairly substantial perf increases recently.
mchusma 1 hours ago [-]
Love Sonnet but 3.7 is not obviously an improvement over 3.5 in my real world usage. Gemini 2.5 pro is great, has replaced most others for me (Grok I use for things that require realtime answers)
emp17344 3 hours ago [-]
Not really. We’re definitely in the incremental improvement stage at this point. Certainly no indication that progress is “accelerating”.
Workaccount2 3 hours ago [-]
Integration is accelerating rapidly. Even if model development froze today, we would still probably have ~5 years of adoption and integration before it started to level off.
littlestymaar 1 hours ago [-]
You are both correct. It feels like the tech itself is kinda plateauing but it's still massively under-used. It will take a decade or more before the deployment starts slowing down.
nwienert 3 hours ago [-]
ChatGPT 3 : iPhone 1

A bunch of models later, we're about on the iPhone 4-5 now. Feels about right.

jcynix 3 hours ago [-]
To plan a visit to a dark sky place, I used duck.ai (Duckduckgo's experimental AI chat feature) to ask five different AIs on what date the new moon will happen in August 2025.

GPT-4o mini: The new moon in August 2025 will occur on August 12.

Llama 3.3 70B: The new moon in August 2025 is expected to occur on August 16, 2025.

Claude 3 Haiku: The new moon in August 2025 will occur on August 23, 2025.

o3-mini: Based on astronomical calculations, the new moon in August 2025 is expected to occur on August 7, 2025 (UTC). [...]

Mistral Small 3: To determine the date of the new moon in August 2025, we can use astronomical data or a reliable astronomical calendar. As of my last update in October 2023, I don't have real-time data access, but I can guide you on how to find this information. [...]

I got different answers, mostly wrong. My calendars (both paper and app versions) show August 23 as the date.

And btw, when I asked those AIs which entries in a robots.txt file would block most Chinese search engines, one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."
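(For what it's worth, the usual approach is just a per-crawler Disallow rule keyed on each engine's published user-agent token, something like the sketch below; the exact tokens should be double-checked against each crawler's current documentation.)

```
User-agent: Baiduspider
Disallow: /

User-agent: Sogou web spider
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: 360Spider
Disallow: /
```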

pixl97 1 hours ago [-]
So I asked GPT-o4-mini-high

"On what date will the new moon occur on in August 2025. Use a tool to verify the date if needed"

It correctly reasoned it did not have exact dates due to its cutoff and did a lookup.

"The new moon in August 2025 falls on Friday, August 22, 2025"

Now, I did not specify the timezone I was in, so our 22 vs. 23 discrepancy appears to be just a time zone difference, as it had marked a time of 23:06 PDT per its source.

phoe18 14 minutes ago [-]
Response from Gemini 2.5 Pro for comparison -

``` Based on the search results, the new moon in August 2025 will occur late on Friday, August 22nd, 2025 in the Pacific Time Zone (PDT), specifically around 11:06 PM.

In other time zones, like the Eastern Time Zone (ET), this event falls early on Saturday, August 23rd, 2025 (around 2:06 AM). ```

jcynix 26 minutes ago [-]
"Use a tool to verify the date if needed" that's a good idea, yes. And the answers I got are based on UTC, so 23:06 PDT should match the 23. for Europe.

My reasoning for the plain question was: as people start to replace search engines with AI chat, I thought asking "plain" questions to see how trustworthy the answers might be would be worth it.
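
The PDT-to-UTC arithmetic is easy to double-check with the standard library (Python 3.9+), as a quick sanity check on the "22 vs. 23" confusion above:

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # 23:06 PDT on Aug 22, 2025, converted to UTC and to a European time zone.
    t = datetime(2025, 8, 22, 23, 6, tzinfo=ZoneInfo("America/Los_Angeles"))
    print(t.astimezone(ZoneInfo("UTC")))            # 2025-08-23 06:06:00+00:00
    print(t.astimezone(ZoneInfo("Europe/Berlin")))  # 2025-08-23 08:06:00+02:00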

WhatIsDukkha 3 hours ago [-]
I would never ask any of these questions of an LLM (and I use and rely on LLMs multiple times a day), this is a job for a computer.

I would also never ask a coworker for this precise number either.
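
The "job for a computer" version really is a couple of lines if you're willing to pull in a library; a minimal sketch assuming PyEphem (pip install ephem) is available:

    import ephem

    # First new moon on or after 1 August 2025; PyEphem works in UTC.
    new_moon = ephem.next_new_moon("2025/08/01")
    print(new_moon)                   # ephem.Date, UTC
    print(ephem.localtime(new_moon))  # same instant as a local-time datetime

Per the times quoted elsewhere in the thread, that instant falls late on August 22 in PDT and on August 23 in UTC and Europe.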

jcynix 20 minutes ago [-]
My reasoning for the plain question was: as people start to replace search engines with AI chat, I thought asking "plain" questions to see how trustworthy the answers might be would be a good test. Because plain folks will ask plain questions and won't think about the subtle details. They would not expect a "precise number" either, i.e. not 23:06 PDT, but would like to know whether this weekend would be fine for a trip, or whether the previous or next weekend would be better for booking a "dark sky" tour.

And, BTW, I thought that LLMs are computers too ;-0

stavros 25 minutes ago [-]
First we wanted to be able to do calculations really quickly, so we built computers.

Then we wanted the computers to reason like humans, so we built LLMs.

Now we want the LLMs to do calculations really quickly.

It doesn't seem like we'll ever be satisfied.

achierius 2 hours ago [-]
But it's a good reminder, given that so many enterprises like to claim that hallucinations have "mostly been solved".
WhatIsDukkha 1 hours ago [-]
I agree with you partially, BUT

when is the long list of 'enterprise' coworkers, who have glibly and overconfidently answered questions without doing the math or looking things up, going to be fired?

andrewinardeer 41 minutes ago [-]
"Who was the President of the United States when Neil Armstrong walked on the moon?"

Gemini 2.5 refuses to answer this because it is too political.

croemer 33 minutes ago [-]
xnx 3 hours ago [-]
Gemini gets the new moon right. Better to use one good model than 5 worse ones.
kenjackson 2 hours ago [-]
I think all the full power LLMs will get it right because they do web search. ChatGPT 4 does as well.
thm 3 hours ago [-]
I'm starting to be reminded of the razor blade business.
Jordan-117 3 hours ago [-]
Fuck Everything, We're Doing o5
iamronaldo 3 hours ago [-]
rsanheim 2 hours ago [-]
`ETOOMANYMODELS`

Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for what models to use, in particular for development? Not just openAI, but across the main cloud offerings and feasible local models?

I know there are the benchmarks, and directories like huggingface, and you can get a 'feel' for things by scanning threads here or other forums.

I'm thinking more of something that provides use-case tailored "top 3" choices by collecting and summarizing different data points. For example:

* agent & tool based dev (cloud) - [top 3 models] * agent & tool based dev (local) - m1, m2, m,3 * code review / high level analysis - ... * general tech questions - ... * technical writing (ADRs, needs assessments, etc) - ...

Part of the problem is how quickly the landscape changes everyday, and also just relying on benchmarks isn't enough: it ignores cost, and more importantly ignores actual user experience (which I realize is incredibly hard to aggregate & quantify).

AcerbicZero 1 hours ago [-]
I can't even get ChatGPT to tell me which chatgpt to use.
sbochins 2 hours ago [-]
So far, with my random / coding design question that I asked with o1 last week, it did substantially better with o3. It's more like a mid-level engineer and less like an intern.
fpgaminer 2 hours ago [-]
On the vision side of things: I ran my torture test through it, and while it performed "well", about the same level as 4o and o1, it still fails to handle spatial relationships well, and did hallucinate some details. OCR is a little better it seems, but a more thorough OCR focused test would be needed to know for sure. My torture tests are more focused on accurately describing the content of images.

Both seem to be better at prompt following and have more up to date knowledge.

But honestly, if o3 was only at the same level as o1, it'd still be an upgrade since it's cheaper. o1 is difficult to justify in the API due to cost.

croemer 40 minutes ago [-]
I wonder where o3 and o4-mini will land on the LMarena leaderboard. When might we see them there?
falleng0d 3 hours ago [-]
Maybe they should ask the new models to generate a better name for themselves. It's getting quite confusing.
Sol- 2 hours ago [-]
Interesting that using tools to zoom around the image is useful for the model. I was kind of assuming that these models were beyond such things and could attend to all aspects of the image simultaneously anyway, but perhaps their input is still limited in resolution? Very cool, in any case; spooky progress as always.
littlestymaar 2 hours ago [-]
There's only so much the image encoder can process at once. It's pretty apparent when you give the models a big table in an image.
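
Which is presumably why the tool-driven "zooming" helps: cropping a region and upscaling it before it reaches the encoder preserves detail that whole-image downscaling would destroy. A rough sketch of the idea, assuming Pillow is installed (file names and coordinates are made up):

    from PIL import Image

    img = Image.open("big_table.png")        # hypothetical screenshot containing a large table
    region = (800, 600, 1200, 900)           # left, top, right, bottom of the area of interest
    crop = img.crop(region)
    crop = crop.resize((crop.width * 2, crop.height * 2))  # upscale so small text survives encoding
    crop.save("zoomed_crop.png")             # this crop is what actually gets encoded, not the full image
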
iandanforth 1 hours ago [-]
o3 failed the first test I gave it. I wanted it to create a bar chart of the first 10 Fibonacci numbers using Python (which it did easily), and then use that image as input to generate an infographic of the chart with an animal theme. It failed in two ways: it didn't have access to the visual output from Python, and when I gave it a screenshot of that output, it failed in standard GenAI fashion by producing poor / incomplete text and not adhering exactly to the bar heights, which were critical in this case.

So one failure that could be resolved with better integration on the back end and then an open problem with image generation in general.
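
The first half of that test is trivial to reproduce locally; a minimal matplotlib sketch (it's the second, image-to-infographic step where the models fall over):

    import matplotlib.pyplot as plt

    fibs = [1, 1]
    while len(fibs) < 10:
        fibs.append(fibs[-1] + fibs[-2])

    plt.bar(range(1, 11), fibs)
    plt.xlabel("n")
    plt.ylabel("F(n)")
    plt.title("First 10 Fibonacci numbers")
    plt.savefig("fib_chart.png")  # the image the model then fails to reuse faithfully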

lubitelpospat 40 minutes ago [-]
Sooo... are any of these (or their distillations) getting open-sourced / open-weighted?
WhitneyLand 3 hours ago [-]
So it looks like no increase in context window size since it’s not mentioned anywhere.

I assume this announcement is all 256k, while the base model 4.1 just shot up this week to a million.

kumarm 3 hours ago [-]
Anyone got codex working? After installing and setting up API Key I get this error :

    system
      OpenAI rejected the request (request ID: req_06727eaf1c5d1e3f900760d10ca565a7). Please verify your settings and try again.
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
pton_xd 3 hours ago [-]
This reminds me of keeping up with all the latest JavaScript framework trivia circa the ~2010s
bufferoverflow 3 hours ago [-]
rahimnathwani 3 hours ago [-]

  ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high.
I subscribe to pro but don't yet see the new models (either in the Android app or on the web version).
oofbaroomf 3 hours ago [-]
Same...
oofbaroomf 3 hours ago [-]
It's there now in the web app for me.
rahimnathwani 3 hours ago [-]
I see them in the Android app now.
djohnston 2 hours ago [-]
Any quick impressions of o3 vs o1? We've got one inference in our product that only o1 has seemed to handle well, wondering if o3 can replace it.
sebzim4500 2 hours ago [-]
They are replacing o1 with o3 in the UI, at least for me, so they must be pretty confident it is a strict improvement.
EcommerceFlow 3 hours ago [-]
A very subtle mention of o3-pro, which I'd imagine is now the most capable programming model. Excited to see when I get access to that.

Good thing I stopped working a few hours ago

EDIT: Altman tweeted o3-pro is coming out in a few weeks, looks like that guy misspoke :(

neya 2 hours ago [-]
The most annoying part of all this is they replaced o1 with o3 without any notices or warnings. This is why I hate proprietary models.
sebzim4500 2 hours ago [-]
Meanwhile we have people elsewhere in the thread complaining about too many models.

Assuming OpenAI are correct that o3 is strictly an improvement over o1, I don't see why they'd keep o1 around. When they upgrade gpt-4o they don't let you use the old version, after all.

kgeist 32 minutes ago [-]
>Assuming OpenAI are correct that o3 is strictly an improvement over o1 then I don't see why they'd keep o1 around.

Imagine if every time your favorite SaaS had an update, they renamed the product. Yesterday you were using Slack S7, and today you're suddenly using Slack 9S-o. That was fine in the desktop era, when new releases happened once a year - not every few weeks. You just can't keep up with all the versions.

I think they should just stick with one brand and announce new releases as just incremental updates to that same brand/product (even if the underlying models are different): "the DeepSearch Update" or "The April 2025 Reasoning Update" etc.

The model picker should be replaced entirely with a router that automatically detects which underlying model to use. Power users could have optional checkboxes like "Think harder" or "Code mode" as settings, if they want to guide the router toward more specialized models.
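
As a toy illustration of that router idea (every model name and heuristic below is made up; the point is only the shape of the interface):

    def route(prompt: str, think_harder: bool = False, code_mode: bool = False) -> str:
        """Pick an underlying model; hypothetical names, hypothetical heuristics."""
        if code_mode or "```" in prompt:
            return "reasoning-code-model"   # specialized coding model
        if think_harder or len(prompt) > 4000:
            return "deep-reasoning-model"   # slower, more deliberate
        return "fast-default-model"         # cheap default for everyday chat

    print(route("What's the capital of France?"))               # fast-default-model
    print(route("Refactor this function ...", code_mode=True))  # reasoning-code-model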

dr_kiszonka 1 hours ago [-]
I want to be excited about this, but after chatting with 4.1 about a simple app screenshot and watching it continuously forget and hallucinate, I am increasingly sceptical of OpenAI's announcements. (No coding involved, so the context window was likely < 10% full.)
typs 3 hours ago [-]
I’m not sure I fully understand the rationale of having newer mini versions (eg o3-mini, o4-mini) when previous thinking models (eg o1) and smart non-thinking models (eg gpt-4.1) exist. Does anyone here use these for anything?
sho_hn 3 hours ago [-]
I use o3-mini-high in Aider, where I want a model to employ reasoning but not put up with the latency of the non-mini o1.
drvladb 3 hours ago [-]
o1 is a much larger model that is more expensive to operate on OpenAI's end. Having a smaller, "newer" (roughly equating newer to more capable) model means that you can match the performance of larger, older models while reducing inference and API costs.
simianwords 2 hours ago [-]
I feel like the only reason o3 is better than o1 is the tool usage. With tool use, o1 could be similar to o3.
oofbaroomf 3 hours ago [-]
When are they going to release o3-high? I don't think it's in the API, and I certainly don't see it in the web app (Pro).
wilg 2 hours ago [-]
> We expect to release OpenAI o3‑pro in a few weeks with full tool support. For now, Pro users can still access o1‑pro.

https://openai.com/index/introducing-o3-and-o4-mini/

bratao 3 hours ago [-]
Oh god. I'm Brazilian and can't get the "Verification" using my passport or ID. This is a very frightening future.
spencersolberg 3 hours ago [-]
The Codex CLI looks nice, but it's a shame I have to bring my own API key when I already subscribe to ChatGPT Plus
oofbaroomf 3 hours ago [-]
Finally, a new SOTA model on SWE-bench. Love to see this progress, and nice to see OpenAI finally catching up in the coding domain.
originalvichy 3 hours ago [-]
Is there a non-obvious reason why using something like Python to solve queries requiring calculations wasn't done from day one with LLMs?
planb 3 hours ago [-]
Because it's not a feature of the LLM but of the product that is built around it (like ChatGPT).
rahimnathwani 3 hours ago [-]
It's true that the product provides the tools, but the model still needs to be trained to use tools, or it won't use them well or at the right times.
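
As a concrete sketch of what that training buys you: with the current OpenAI Python SDK (v1+) you hand the model a tool schema, and a tool-trained model emits a structured call instead of guessing the arithmetic. The model name and the toy "evaluate" tool here are placeholders:

    from openai import OpenAI

    client = OpenAI()
    tools = [{
        "type": "function",
        "function": {
            "name": "evaluate",
            "description": "Evaluate an arithmetic expression and return the result",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="o4-mini",  # placeholder; any tool-capable model
        messages=[{"role": "user", "content": "What is 1234 * 5678?"}],
        tools=tools,
    )
    # A tool-trained model responds with a tool call rather than a guessed number.
    print(resp.choices[0].message.tool_calls)
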
ipsum2 3 hours ago [-]
LLMs could not use tools on day one.
firejake308 3 hours ago [-]
Not sure what the goal is with Codex CLI. It's not running a local LLM, right? Just a CLI to make API calls from the terminal?
maheshrijal 3 hours ago [-]
This might be their answer to Claude Code more than anything else.
mpaepper 2 hours ago [-]
Yes, that's exactly what I thought as well. An attempt to get more share in the developer tooling space for the long term.
sho_hn 3 hours ago [-]
Looks more like a direct competitor to Aider.
whitten 2 hours ago [-]
Where do I find out more about Aider ?
stavros 1 hours ago [-]
Just wait a few seconds and there will be a post here with Aider benchmarks for the new model, or https://aider.chat
tailspin2019 1 hours ago [-]
oofbaroomf 3 hours ago [-]
Still a knowledge cutoff of August 2023. That is a significant bottleneck to devs using it for AI stuff.
cryptoz 1 hours ago [-]
I've taken to pasting the latest OpenAI API docs for their Python library into each prompt (via the API; I'm not pasting manually in ChatGPT each time) so that the AI can write code that uses itself! Like, I get it, the training data thing is hard, but OpenAI changed their Python library with breaking changes and their models largely still don't know about it. I haven't tried the 4.1 series yet with their newer cutoff, but the rest of the models, like o3-mini (and I presume these new ones today), still write openai Python library code in the old, broken style. Argh.
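
For anyone who hasn't hit it, the breaking change in question looks roughly like this; the old style at the top is what older-cutoff models keep emitting:

    # Old (openai < 1.0), which older-cutoff models still write:
    #   import openai
    #   resp = openai.ChatCompletion.create(model="gpt-4", messages=msgs)

    # New (openai >= 1.0):
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4.1",  # any current model name
        messages=[{"role": "user", "content": "hello"}],
    )
    print(resp.choices[0].message.content)
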
I_am_tiberius 2 hours ago [-]
What is again the advantage of pro over plus subscriptions?
postmaster 2 hours ago [-]
> We expect to release OpenAI o3‑pro in a few weeks with full tool support. For now, Pro users can still access o1‑pro.
I_am_tiberius 2 hours ago [-]
Ok, so currently they pay for nothing (or is o1-pro superior to o3?).
Topfi 2 hours ago [-]
I have barely found time to gauge 4.1's capabilities, so at this stage I'd rather focus on the ever-worsening names these companies bestow upon their models. To say that the USB-IF has found its match would be an understatement.
ksylvest 3 hours ago [-]
Are these available via the API? I'm getting back 'model_not_found' when testing.
tymscar 1 hours ago [-]
Gave Codex a go with o4-mini and it's disappointing... Here you can see my attempts. It completely fails at something a mid-level engineer can do after getting used to the tools: https://xcancel.com/Tymscar/status/1912578655378628847
eric-p7 2 hours ago [-]
Babe wake up a new LLM just dropped.
davidkunz 3 hours ago [-]
I wish companies would adhere to a consistent naming scheme, like <name>-<params>-<cut-off-month>.
xqcgrek2 3 hours ago [-]
Underwhelming. Cancelled my subscription in favor of Gemini Pro 2.5
taytus 2 hours ago [-]
This is a mess. I do follow AI news, and I don't know if this is "better/faster/cheaper" than 4.1.

Why are they doing this?

pcdoodle 2 hours ago [-]
It seems to be getting better. I used to use my custom "Turbo Chad" GPT based on 4o and now the default models are similar. Is it learning from my previous annoyances?

It has been getting better IMO.

basisword 3 hours ago [-]
The user experience needs to be massively improved when it comes to model choice. How are average users supposed to know which model to pick? Why shouldn't I just always pick the newest or most powerful one? Why should I have to choose at all? I say this from the perspective of a ChatGPT user - I understand the different pricing on the API side helps people make decisions.
morkalork 3 hours ago [-]
If the AI is smart, why not have it choose the model for the user?
zvitiate 3 hours ago [-]
That’s what GPT-5 was supposed to be (instead of a new base or reasoning model), the last time Sam updated his plans, I thought. Did those change again?
Workaccount2 3 hours ago [-]
o4-mini, not to be confused with 4o-mini.
planb 3 hours ago [-]
What is wrong with OpenAI? The naming of their models seems like it is intentionally confusing - maybe to distract from lack of progress? Honestly, I have no idea which model to use for simply everyday tasks anymore.
dabeeeenster 3 hours ago [-]
It really is bizarre. If you had asked me 2 days ago, I would have said unequivocally that these models already existed. Surely, given the rate of change, a date-based numbering system would be more helpful?
xd1936 3 hours ago [-]
Fix coming this summer, hopefully.

https://twitter.com/sama/status/1911906570835022319

sho_hn 3 hours ago [-]
Seems to me like they're somewhat trying to simplify now.

GPT-N.m -> Non-reasoning

oN -> Reasoning

oN+1-mini -> Reasoning but speedy; cut-down version of an upcoming oN model (unclear if true or marketing)

It would be nice if they actually stick to this pattern.

krackers 12 minutes ago [-]
But we have both 4o and 4.1 for non-reasoning. And it's still not clear to me which is better (the comparison on their page was from an older version of 4o).
bogtog 3 hours ago [-]
I suspect that "ChatGPT-4o" is the most confusing part. Absolutely baffling to go with that and then later "oN", but surely they will avoid any "No" models moving forward
jagger27 3 hours ago [-]
Are the oN models built on top of GPT-N.m models? It would be nice to know the lineage there.
i_love_retros 3 hours ago [-]
I tend to look at the lmarena leaderboard to see what to use (or the aider polyglot leaderboard for coding)
behnamoh 3 hours ago [-]
OpenAI be like:

    o1, o1-mini,
    o1-pro, o3,
    o4-mini, gpt-4,
    gpt-4o, gpt-4-turbo,
    gpt-4.5, gpt-4.1,
    gpt-4o-mini, gpt-4.1-mini,
    gpt-4.1-nano, gpt-3.5-turbo
mentalgear 3 hours ago [-]
I have doubts whether the live stream was really live.

During the live-stream the subtitles are shown line by line.

When subtitles are auto-generated, they pop up word by word, which I assume would need to happen during a real live stream.

Line-by-line subtitles are shown when the uploader provides captions themselves for an existing video; the only way OpenAI could provide captions ahead of time is if the "live-stream" isn't actually live.

ipsum2 3 hours ago [-]
All YouTube live streams are like this.
KTibow 3 hours ago [-]
I think this is just a quirk of how Google does live captions.