The latest deepseek v4 pro model is 2-5x cheaper than Claude Sonnet 4.6. Cursor's Compose 2.5 that was just recently released is 6x cheaper than Sonnet.
The state of the art models are going to get better and more expensive and smaller models are going to get cheaper.
There will be a point where the intelligence of the cheap and the state of the art models is indistinguishable to humans, just as I can't tell the difference between Terence Tao and my university math professor.
I don't always need the smartest and most expensive models. I will need them every once in a while and will gladly pay that price if I have to. What I do need is a model that will solve the problem I currently have in a reasonable amount of time.
clhodapp 4 hours ago [-]
I know it comes off as pedantic to point this out but: Those are open weight models not open source models.
Closed weight models are the equivalent of SaaS. Open weight models are the equivalent of binary driver blobs or Windows software. We don't really have actual open source LLMs, which would need to publicly release their training data and technique so you could train a similar model yourself, or use their work as a baseline for your own model.
This distinction matters because an actual open source LLM would be extremely important from an ecosystem point of view, if someone ever actually released one.
NitpickLawyer 3 hours ago [-]
I know this is highly contested, but I'll try explaining it anyway, because I keep seeing this and it's ... wrong.
Your comment is wrong both theoretically and practically.
First, the theory. The idea that model weights are "binary driver blobs" is technically wrong. I don't know why this is so common on a technical site, but anyway. An LLM model consists of 3 main parts: The architecture, the inference code, and some values. All of these, combined, make an LLM.
Another important aspect, which is widely misunderstood and will become relevant later, is that a model is created by deciding the architecture and then initialising it with some values. Those values can be all 0s, all 1s, or random (in practice they're random, but that's irrelevant). Technically, once a model is initialised, that's it. That is a model. If released, that would be, even for the most pedantic absolutists, undoubtedly open source.
Then, that model is adapted. The most important thing to understand here is that this is the preferred way of modifying a model. Actually, the only way. You can't (yet) come along later and decide to change something in the architecture. You can only change the values. That process is called training (pre, mid, post, etc). The technical process itself is the same for the model creators as it is for you. The means, know-how, etc. are different.
Now, what licensing does, and the only thing that licensing can do is to give you rights to inspect, modify and release that model. That's it. A license will never give you (it cannot) the right to have the internal IP, knowledge, know-how or the "why's" on how the model was edited. That's on you. You have the right to modify, but you can't get the right to know how others have modified it, from a license file. Never had, never will.
(A simplified version of this is to think about an algorithm to control a drone. Usually that'd be a PID controller. Imagine someone releases an algorithm under an open source license. That algorithm consists of architecture, loop code, and some values. Even if those values are all set to 0.5 (in which case your drone might crash) or any other values, the values themselves do not change the status of the code. It's still open source, even if the values are fixed, or random, or dreamt up by the original coder, or received from the aliens themselves.)
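To make the drone analogy concrete, here is a minimal Python sketch of a textbook PID loop. The class name, the gain values and the toy plant in `simulate` are assumptions for illustration only; nothing here comes from a real flight controller.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Fixed structure (the "architecture" and loop code) plus tunable gains (the "values")."""
    kp: float = 0.5  # placeholder gains, like the 0.5 values in the analogy
    ki: float = 0.5
    kd: float = 0.5
    _integral: float = 0.0
    _prev_error: float = 0.0

    def step(self, setpoint: float, measured: float, dt: float) -> float:
        """Return the control output for one time step."""
        error = setpoint - measured
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def simulate(controller: PID, steps: int = 200, dt: float = 0.05) -> float:
    """Drive a toy integrator plant toward 1.0 and return its final value."""
    value = 0.0
    for _ in range(steps):
        value += controller.step(1.0, value, dt) * dt
    return value


# Same code, different values: only the gains change between "models".
placeholder = PID()                        # untuned defaults
tuned = PID(kp=2.0, ki=1.0, kd=0.0)        # hypothetical hand-tuned gains
print(simulate(tuned))
```

Retuning the gains changes the behaviour without touching a line of the loop, which is the sense in which the values are the thing you modify.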
I mentioned above that editing the values of a model is the preferred way of modifying the model, and that's exactly what Apache 2.0 defines as "source code".
> "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
----
Now, the practice. In practice, we do have fully open (open data, open training code, open source models) models. Apertus, from Switzerland and Olmo from the US. Don't get me wrong, it's absolutely great that we have these models, they are very important for the community, and they do help inform everyone about what works, what doesn't, and so on. But ... no-one uses them. Because they are not at the top, compared to other models.
And, on a technical note, the idea that "dataset" + training code = bit-for-bit recreation is also not true. Anyone that has done any large scale training can tell you that. Between the randomness inherent in the process, the occasional training run re-starts and so on, you will never get the same model twice (at reasonable scales), even if you'd have the available compute. Which, let's be serious, no-one at home has. So... yeah. It's a pointless aspect to care for anyway.
clhodapp 25 minutes ago [-]
| Technically, once a model is initialised, that's it. That is a model. If released, that would be, even for the most pedantic absolutists, undoubtedly open source.
That is true. But it is not the same model as the LLM created by combining the released weights with the released architecture. The thing that is the "binary blob" is the weights. It is pretty much exactly akin to a Linux driver that depends on linux-firmware. It is wonderful that it exists! But it is only partly open.
| Now, what licensing does, and the only thing that licensing can do is to give you rights to inspect, modify and release that model. That's it. A license will never give you (it cannot) the right to have the internal IP, knowledge, know-how or the "why's" on how the model was edited. That's on you. You have the right to modify, but you can't get the right to know how others have modified it, from a license file. Never had, never will.
| In practice, we do have fully open (open data, open training code, open source models) models. Apertus, from Switzerland and Olmo from the US. Don't get me wrong, it's absolutely great that we have these models, they are very important for the community, and they do help inform everyone about what works, what doesn't, and so on.
You seem to contradict yourself here. That said: I appreciate the correction of my perception that there aren't truly open large language models.
tlb 2 hours ago [-]
There are still things you can't do with an open-weight model without the training data, like modifying the architecture and training from scratch. That's different from true open-source code, where you can do anything the authors could do.
dmbche 3 hours ago [-]
Good read thanks
charcircuit 1 hours ago [-]
The inference code is not part of a LLM and there can be multiple different implementations of it. The model, the code to train the model, and the code to run the model are different things.
NitpickLawyer 40 minutes ago [-]
> The inference code is not part of a LLM
While that might be true in a majority of cases, it's not necessarily universal. Recently model providers have worked with inference libraries to support their models at launch, but say in transformers you can include code for a new architecture, and if you load it with "trust_remote_code=True" it will still work. You can modify the forward pass or whatever you want to do. In that sense, code can be part of a model.
yogthos 3 hours ago [-]
There are absolutely fully open source models. These are not frontier models, but they very much do exist. OLMo is one of the models explicitly mentioned as having passed the OSI's validation phase. Pythia was also validated by the OSI as meeting its requirements for an open-source AI system. Lucie-7B, a multilingual model, is one of the first LLMs compliant with the OSI AI definition. Its creators explicitly state that the training dataset, data preparation code, and model weights are all publicly available under open licenses.
greenmilk 4 hours ago [-]
> The state of the art models are going to get better and more expensive and smaller models are going to get cheaper.
Why do you think this will be true?
Right now I see the major US labs betting on gaining an advantage from having way more compute, and I see Chinese labs competing with one another in a resource-scarce environment, so they place much more emphasis on compute-efficiency.
But the supply chains that feed into the massive data center growth in the US are strained; there are energy, memory, and logistical bottlenecks to name a few.
In the medium-long run, compute capacity will not grow exponentially forever. Somehow it has for decades, but there can be no infinite exponential growth, and that point may be when the planet really starts to cook itself.
Maybe the US labs will become more compute-constrained, and then have to compete on efficiency.
Or maybe things change fundamentally in some other way I'm not thinking of.
nightski 4 hours ago [-]
The labs have a perverse incentive to make things as expensive compute wise as possible. The only thing keeping this somewhat in check is competition, but it's intentionally being gatekept by locking up the supply of computing infrastructure. With 3 players it's pretty easy to collude even if indirectly. They can't burn trillions forever. Nvidia's 75% profit margins are not sustainable forever.
Things will normalize, but it will take time.
gruez 3 hours ago [-]
>The labs have a perverse incentive to make things as expensive compute wise as possible. The only thing keeping this somewhat in check is competition, but it's intentionally being gatekept by locking up the supply of computing infrastructure. With 3 players it's pretty easy to collude even if indirectly.
By all accounts the AI capex boom is justified by actual usage, rather than some nefarious plan for "locking up the supply of computing infrastructure". Just look at people complaining about Claude availability and Anthropic adding various load-shedding measures a few months ago.
nightski 2 hours ago [-]
Right but that could be more evenly distributed. There is a circular trade right now giving these few players near infinite resources that is blocking that from happening.
theodorewiles 3 hours ago [-]
Commoditize your complement - I expect to see this most in consumer AI (after that starts actually working...)
It will be important for Apple to have good enough, cheap local LLM models that run on-device.
If the barrier to performance shifts from fundamental model capability to context collection and management I would expect to see folks focused on that problem continuing to drive open-weight LLM model development in some shape or form.
gruez 3 hours ago [-]
>so they place much more emphasis on compute-efficiency.
Maybe on training, but on inference they use more tokens than comparable western models.
>The latest deepseek v4 pro model is 2-5x cheaper than Claude Sonnet 4.6. Cursor's Compose 2.5 that was just recently released is 6x cheaper than Sonnet.
It's ironic that, in a thread about "AI subsidies", people don't think free model releases from AI labs count as subsidies. Whatever AI winter would cause AI companies to stop subsidizing tokens would probably also cause AI labs to stop doing free model releases. They might not be able to un-release the current crop of open models, but assuming proprietary model development still happens, those models will quickly go obsolete.
zozbot234 3 hours ago [-]
The currently-released models don't really go away. Even if they collectively only release a new model every few years for the sake of influence and public image, that's plenty enough to keep the competitive aspect going.
gruez 3 hours ago [-]
>Even if they collectively only release a new model every few years for the sake of influence and public image, that's plenty enough to keep the competitive aspect going.
This is unpersuasive. Why would AI companies (American or Chinese) stop subsidizing tokens, but keep doing open model releases? At least for the former you can argue it's a lead generation tool for enterprise contracts (e.g. a hobbyist uses a Claude Code personal plan, then asks their company to buy Claude Code enterprise, which is billed at API rates), but what's the business case for doing open model releases? You might get some mindshare, but you are also arming your competitors in the process. Moreover, what makes you think the model releases will be at all competitive with frontier models? Google released Gemma 4 a few weeks ago to acclaim, but it's in no way competitive with even GPT-5.4 or Opus 4.6.
throwa356262 3 hours ago [-]
Are Chinese companies subsidizing tokens?
M2.7 is 230B and was designed to run inference on two (2!) shitty Ascend GPUs (Huawei's first GPU manufactured in China). That's why they can offer a plan at 1/2 the price of Anthropic's and probably still turn a profit.
squidbeak 4 hours ago [-]
Deepseek V4 Flash is far cheaper still, and a better model to compare to Sonnet 4.6. I'm finding it a reliable workhorse.
anonzzzies 4 hours ago [-]
Yep, people who never used it say it is not good.
sometimelurker 3 hours ago [-]
sorry to nitpick (I totally agree with what ur saying btw, I run Ministral-3b on my hardware as my go-to bc I don't usually need the "smartest and most expensive models")
> This is where open source models are important
open-weights, the training data isn't public
lmeyerov 3 hours ago [-]
oss models don't directly matter when multiple at-scale frontier API providers have to compete on price: they are limited in defensible margin
They do matter in that oss researchers enable faster cross-pollination of good inferencing efficiency improvements to help the big boys adapt ideas from the community
Long-term, local AI may matter more, but imo we're not there until models + hw get way better (1-2 years?). Reasoning-grade quality at speed is still $$$: we need fast Opus, not slow Sonnet.
jplusequalt 3 hours ago [-]
>The latest deepseek v4 pro model is 2-5x cheaper than Claude Sonnet 4.6. Cursor's Compose 2.5 that was just recently released is 6x cheaper than Sonnet.
The only way you're running Deepseek V4 with comparable quality/performance is through OpenRouter, at which point you're still susceptible to being price gouged in the future, or by spending >$20k on hardware.
driese 2 hours ago [-]
There is still a difference though. If some company decides to raise prices on OR, you can just switch to any other provider of the same model since there is no moat.
_fat_santa 4 hours ago [-]
I wonder how much of Uber blowing their AI budget and MSFT pulling their claude code licenses can be attributed to "tokenmaxxing".
When Meta announced token leaderboards and others followed, I could see this being the logical conclusion. That whole trend is so dumb because it leads to this.
Company announces they will measure developer performance by how many tokens they burn and constantly talks about how the best developers burn the most tokens. Developers see the message and start burning tokens. And then the company acts surprised when their bills go through the roof.
I personally use my OpenAI subscription pretty heavily, 2-3 agents running practically all day on various tasks but I never even get close to running into limits while I hear about others blowing through limits on multiple accounts in the same time period. I'm convinced that most of those folks and their elaborate workflows aren't really for productivity but for bragging rights about how much they use AI.
bdcravens 4 hours ago [-]
The same here: I haven't come close to hitting any of my CC limits. Even though I'm more productive than I've ever been (as measured by finished, valuable tasks running in production) and I'm clearing out months of backlog, I come to one of two conclusions when I hear about others who suggest they need more:
1. I'm doing it wrong. Apparently I'm supposed to give it a vague paragraph about what the business does, and I can run off and sip margaritas and wake up to a fully fleshed-out business.
2. They don't know what they're doing, and they're sending the LLM off on a wild goose chase that it does a reasonable job of working its way out of, so they consider it a success despite the waste.
cayleyh 4 hours ago [-]
> I personally use my OpenAI subscription pretty heavily, 2-3 agents running practically all day on various tasks but I never even get close to running into limits
Same. But if I was working for an organization that measured token usage, you can bet I would be doing things like creating a cron job that uses claude to create a customized bespoke report update of the current status of all my open assigned tickets and message that to myself 4 times a day... token burn for zero purpose whatsoever.
rirze 3 hours ago [-]
> I'm convinced that most of those folks and their elaborate workflows aren't really for productivity but for bragging rights about how much they use AI.
This is quite the reductive, charged statement. Can I ask what subscription plan you're using?
My personal experience is not like this at all -- I work on ever-expanding codebases so I can easily burn tokens. Not to mention, structured agentic coding with adversarial reviews & task organization is not token-efficient. Additionally, for the problems I'm working on, only xhigh or high reasoning gives me worthwhile results while saving time. There are definitely configurations where default consumption doesn't work.
For reference, I used 15 billion tokens (most of it cached) last month on my day job's enterprise plan. That doesn't include my personal plans' usage.
_fat_santa 3 hours ago [-]
I'm on OpenAI's Pro ($200/mo) plan.
rirze 15 minutes ago [-]
That's a super generous plan. Indeed, it's normally hard to saturate the limits of OpenAI's $200 plan.
ai_fry_ur_brain 4 hours ago [-]
I make like 2 prompts a week to Gemini Flash on the web and get more done than all the people who are exhibiting literal manic behavior in the way they use LLMs.
pydry 3 hours ago [-]
I really wish the management behind these dumb ideas couldn't just quietly pretend they never did it once it goes out of fashion.
The fact that somebody established a leaderboard for tokenmaxxing ought to follow you around like a black cloud for the rest of your career once the collective hallucination lifts and people realize just how monumentally stupid it was.
Alas they do all these stupid things together which makes it seem more defensible and then everybody forgets.
tekacs 3 hours ago [-]
> Anthropic’s CFO testified under oath this March that the company spent $10 billion on compute and made $5 billion in revenue (Ed Zitron has the math). The labs are underwater on inference. They’re raising prices to keep the lights on.
'The labs are underwater on inference' is an absurd thing to say whilst not separating the cost of _compute_ out into training and inference.
JimDabell 3 hours ago [-]
According to Dario Amodei, Anthropic are profitable even when including training costs, as long as you look at it on a per-model basis; it’s just that every model is more expensive to train than the last one.
For instance, if you have already spent $n to train a model and are currently earning $2n selling inference with it; but are concurrently spending $3n training the next model in anticipation of earning $6n with it, then you are already in the hole for $n and are currently also losing $n – but you are doubling your money with each model because your $n investment in the first model returns $2n and your $3n investment in the second model returns $6n.
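The arithmetic in that example can be checked with a few lines of Python. This is only a sketch of the comment's own numbers: the function and key names are made up for illustration, and it assumes, as the example does, that serving costs are ignored and every figure is a multiple of $n.

```python
def model_generation_economics(n: float) -> dict:
    """Reproduce the two-generation example; all amounts are multiples of n."""
    train_cost_gen1 = 1 * n   # already spent on the current model
    revenue_gen1 = 2 * n      # being earned now from the current model
    train_cost_gen2 = 3 * n   # being spent now on the next model
    revenue_gen2 = 6 * n      # anticipated, not yet earned

    return {
        # What the books show today: earning 2n while spending 3n.
        "current_period_cash_flow": revenue_gen1 - train_cost_gen2,
        # Each model, viewed on its own, returns a multiple of its training cost.
        "gen1_return_multiple": revenue_gen1 / train_cost_gen1,
        "gen2_return_multiple": revenue_gen2 / train_cost_gen2,
        # Total position if the second model earns what is anticipated.
        "cumulative_after_gen2": (revenue_gen1 + revenue_gen2)
        - (train_cost_gen1 + train_cost_gen2),
    }


for key, value in model_generation_economics(1.0).items():
    print(f"{key}: {value}")
```

The same company therefore looks like it is losing money in the current period while each model doubles its investment, provided the anticipated revenue actually arrives.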
How is training vs inference any different than other product spaces, where all the costs of bringing a product to market have to be considered for profitability? You can't just look at marginal production cost. You are still underwater if the other development costs are not being recouped by the final sales revenue.
The whole commercial AI enterprise is not economically viable if the inference revenue will not cover both inference and the amortized training costs. Given how fast they are churning through models to compete, you cannot act like the training is an asymptotically low cost.
tekacs 3 hours ago [-]
Saying 'underwater' would have been reasonable, but 'underwater on inference' is a nonsense way to say it.
dismalaf 3 hours ago [-]
I mean, I guess they could just stop training new models and coast, but they ARE training models so you have to include those costs.
extr 4 hours ago [-]
What is the OP talking about? $/unit intelligence is going down rapidly. You can achieve what would have been considered miracles in 2022 with < $10.
bdcravens 4 hours ago [-]
Absolutely, though I think the expectations are being set by those who have watched too many "OpenClaw business on autopilot" videos.
abtinf 4 hours ago [-]
Insofar as I can tell, inference is on a certain path toward becoming "free". The models are now extremely powerful on high-end consumer hardware, and the efficiency trend seems likely to continue.
Here is a recent non-rigorous benchmark I ran against a bunch of models. Qwen3.6 35B A3B fine-tuned with Opus data runs plenty fast on my local machine and produces outstanding results - easily in the top 5, comparable to GPT 5.5 Pro (which is $180/mtok).
I've predicted for years now that the industry will head down the path of the virus scanning vendors: selling subscriptions to be able to download the latest versions of models. I simply don't see how any other business model is remotely viable, except at the very highest end of inference or video gen.
anonzzzies 4 hours ago [-]
That local hardware is not consumer though, but prosumer. Consumer is a $500 laptop running that, and that is not currently the case.
mingqiz 40 minutes ago [-]
DeepSeek dropped their price permanently. Now V4 Pro costs 3.48% of the output token price of Opus 4.7 (now that Opus 4.7 has a more costly tokenizer). Is it fair to say that Anthropic/OpenAI are marching toward bankruptcy as open source models like DeepSeek improve?
Note that DeepSeek also supports 500 concurrent requests per account for any individual using V4 Pro.
koliber 3 hours ago [-]
EDIT: [ IGNORE THIS COMMENT -- IT IS WRONG - I had a "bad math moment" myself ]
The math seems off. How is 7.8 million vs 4 million 95% more expensive? The article makes good points but I doubt the numbers as they don’t add up.
Still agree with the conclusion though.
pikminguy 3 hours ago [-]
7.8 is higher than 4 by 3.8
3.8 is 95% of 4
The price went up by 3.8 or in other words the price went up by 95%.
I'm not sure what math you're not mathing.
koliber 2 hours ago [-]
Absolutely right. I had a "math moment" myself. I was thinking margin instead of markup.
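For anyone else who trips over this, here is a small Python sketch of the two readings of "7.8 million vs 4 million". The function names are just labels for this example: the only difference is which price the increase is divided by.

```python
def markup(old: float, new: float) -> float:
    """Increase measured against the OLD price (how much more expensive it got)."""
    return (new - old) / old


def margin(old: float, new: float) -> float:
    """Increase measured against the NEW price (share of the new price that is increase)."""
    return (new - old) / new


old_price, new_price = 4.0, 7.8  # millions, from the comment above
print(f"markup: {markup(old_price, new_price):.1%}")
print(f"margin: {margin(old_price, new_price):.1%}")
```

The markup reading gives the article's 95%; the margin reading gives a little under 49%, which is why the figure looked too high.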
shay_ker 3 hours ago [-]
In the three options OP presents, I wonder if there's a fourth: BYO model
Customers give vendors metered access to their model. They can budget tokens per vendor. Vendors selling "AI products" can have a cleaner story and win on the margin.
The first step is to iron out a reasonable protocol, basically authorizing an access token, and then the model providers (OpenAI, Anthropic, etc.) do the rate limiting. Theoretically this could be done by OpenRouter too.
But even so - do customers want an "AI product" packaged cleanly, or do they want to manage token capacity? They may be forced to do the latter....
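A rough Python sketch of what the customer-side bookkeeping for such a scheme might look like. Everything here is hypothetical: `TokenBudget`, its methods and the vendor names are invented for illustration, and a real protocol would also need authentication, metering by the model provider and budget resets.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    """Hypothetical ledger of how many tokens each vendor may spend on the customer's model."""
    limits: dict[str, int]
    used: dict[str, int] = field(default_factory=dict)

    def authorize(self, vendor: str, tokens: int) -> bool:
        """Approve and record a request only if it fits the vendor's remaining budget."""
        limit = self.limits.get(vendor, 0)   # unknown vendors get no budget
        spent = self.used.get(vendor, 0)
        if tokens <= 0 or spent + tokens > limit:
            return False
        self.used[vendor] = spent + tokens
        return True

    def remaining(self, vendor: str) -> int:
        """Tokens the vendor can still spend."""
        return self.limits.get(vendor, 0) - self.used.get(vendor, 0)


budget = TokenBudget(limits={"crm-vendor": 1_000, "support-vendor": 5_000})
print(budget.authorize("crm-vendor", 600))   # fits within the budget
print(budget.authorize("crm-vendor", 500))   # would exceed the budget
print(budget.remaining("crm-vendor"))
```

Whether this check lives with the customer, the model provider or a broker like OpenRouter is exactly the open protocol question raised above.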
arnon 3 hours ago [-]
It could happen, but it seems "regressive" almost as most companies are completely not ready to build this muscle.
alligatorplum 3 hours ago [-]
I seldom use my PC anymore since I got a laptop. With the cost per token increasing, along with the random "features" where models will just eat through your tokens in one hour, I have really been tempted to turn my PC into a server to run local models.
mark_l_watson 3 hours ago [-]
> "which use cases earn the inference cost they burn?"
That is the question. I love using OpenCode with paid inference providers and seeing the cost of every little thing I do. On the other hand, right now I am flipping between Antigravity CLI and the two Antigravity apps burning Claude Opus tokens like crazy, knocking off a ton of work. Google must be losing money on me.
ibtheory 59 minutes ago [-]
This will also bring up some good opportunities in the optimization space. Smaller and cheaper models + optimization can bring performance up, especially in certain domain-specific applications of AI.
infecto 4 hours ago [-]
Has this not been true for a long time now? Most companies have had enterprise/business-level pricing that was closely tied to usage for what feels like at least a year.
throwa356262 4 hours ago [-]
This is only true if your world is limited to OpenAI, Anthropic and the like.
There are a whole bunch of companies elsewhere in the world that are getting better and cheaper every month, hardware side included, all without the infinite VC money.
anonymousiam 3 hours ago [-]
Not mentioned in the article/blog was the local alternative. Many applications will run just fine locally and not in the cloud. This is also more secure. Running local will probably eventually become the norm. It makes me wonder about the future of all these VC funded AI companies...
hereme888 2 hours ago [-]
NVIDIA’s published specs imply much larger gains in NVFP4 inference compute and GPU memory bandwidth than in BOM cost.
That said, more intelligence and automation = higher costs.
energy123 3 hours ago [-]
Capex and revenue should not be compared like this, unless revenue is small and not growing.
dtagames 5 hours ago [-]
Some of these coming price increases will move dev work back to dedicated shops and teams when individuals and non-devs won't want to pay the AI bill to finish and ship their projects.
An outside small dev shop or internal dev team can pay these prices and spread the cost over several customers or departments, but the era of giving everyone AI and telling them to dev stuff is about to be over.
xnx 2 hours ago [-]
Lost me at "Ed Zitron has the math"
Havoc 4 hours ago [-]
Inference costs absolutely did fall. And even more so when looking at intelligence it buys you.
E.g. compare, say, GPT-3.5 to the latest DeepSeek: the latter is both cheaper and more capable.
arnon 3 hours ago [-]
but are you using less of them because they're cheaper, or are you using more of the more advanced models and running them harder?
MarkusQ 3 hours ago [-]
This is just wrong.
The pricing so far has been a classic case of loss-leader to build market share and ramp up until you can find a moat. Normally, the huge cost of training would provide such a moat, or the amount of training data required, but both of those seem to have been overcome by enough players to keep the ball in play. The next target to keep out the riffraff seems to be "Gigawatts of Data Center" (gack, I hate that metric!) and you might think that it would hold, given the finite size of the planet.
But in space, no one can hear you bleed cash, so...
pacman1337 4 hours ago [-]
I get similar results for deepseek and opus but opus is way faster. I guess deepseek streams thinking and makes it slower?
plaidfuji 4 hours ago [-]
kind of sobering to realize that whether your job can be profitably automated away comes down to what $/token some hyperscale AI provider can deliver… I suppose it’s nice that this article highlights some upward pressure on that number.
DeathArrow 2 hours ago [-]
I use cheap Chinese models. For all I care, both OpenAI and Anthropic can raise their prices until they'll have no customers left.
kittikitti 2 hours ago [-]
Thank you for sharing this article. I think the graphs in it were useful in understanding the different pricing structures. One thing that I would have included is pricing based on AI that I own, through capital expenditure (CapEx).
However, it's much harder to compare. For one, the cost per token is difficult to measure until a sufficient amount of time has passed so that an extrapolation is more accurate. Also, there are performance considerations where a local solution might be more or less accurate than an equivalent online AI. In addition, the reduced compliance risk is hard to quantify or it makes online AI practically useless.
I don't understand how people got buy-in for a business model that assumed token costs would go down indefinitely. All tech startups follow a blitz-scaling pattern where they practically give away their services for free, trap customers in a moat, and then extort as much money as they can.
This... is not a reliable AI detection method at all.
fallpeak 4 hours ago [-]
You are incorrect. There, now we've both made unsupported assertions. Care to provide any evidence for your position?
For what it's worth, when I provide a Pangram link it's because I can already tell something is AI and I'm attempting to provide objective third-party confirmation so the conversation doesn't just degrade into me asserting that I have superior taste to you.
extr 4 hours ago [-]
Pangram is highly reliable.
arnon 3 hours ago [-]
sorry, but i wrote this myself
anthonypasq 3 hours ago [-]
Guys, we are in the mainframe era of AI. People in the 60's thought computing was expensive too, and the idea of having a computer on every desk, never mind in every pocket, never mind in every single piece of electronics in the world, basically seemed like a complete pipe dream.
If you told someone in the 70's their toaster would have a supercomputer in it, they would think you were crazy. In 10 years your doorknob is going to have a local AI model in it.
This is computing 2.0, not the dot com bubble. 90% of inference will be at the edge in the future. There will still be supercomputers and giant clusters doing cutting edge science and research, but for 90% of use cases you'll just need a tiny local model, for the same reason you don't need a giant GPU in your smart TV.
stephc_int13 3 hours ago [-]
The main issue with this reasoning is that the hardware substrate for AI and good old computing is the same.
It is all governed by Moore's Law. What happened then seems extremely unlikely to happen again: the curve is a sigmoid and we're much closer to the flat end now.
anthonypasq 3 hours ago [-]
I think this has some truth to it, but I'm pretty confident that small models will continue to get better and become runnable on dedicated consumer hardware.
yogthos 3 hours ago [-]
My expectation is that local models will be the default for coding within a year or two. You can already run Qwen 3.6 with MTP at a pretty reasonable speed without needing a huge amount of VRAM. And while it's not as good as current frontier models, it's already quite competent for a lot of tasks.
And there's no sign that people are running out of ideas for how to optimize models further. You see a bunch of papers come out literally every few weeks right now. So, it's entirely plausible to me that we'll see models that are superior to current frontier ones in a year or two that will run on your machine.
Once we get to that point, I don't think it's even going to matter if frontier models keep improving for most people. Being able to run the model on your machine, use it as much as you want in any way you want, without having to worry about it changing from under you or the company changing pricing, and not have to send all your data to the vendor are going to be the deciding factors.
At some point the models are just good enough to do what you need to do. On top of that, I expect tooling around models and coding patterns will evolve as well. That could compensate significantly for the capabilities of the model. We already see this happening with two prime examples here:
> Did we collectively forget second order thinking?
I bought 2x 16GB NVIDIA cards this week because I don’t see hardware getting cheaper anytime soon, and because of that I totally don’t see the point of “waiting until prices go lower for graphics cards”, because that might not happen for a long time yet!
In fact, if you factor in world events (and the ones that haven’t happened yet but eventually will, e.g. China’s long-planned 2027 takeover of Taiwan), then there’s no way graphics card prices are going to be accessible to mere mortals until at least 2028.
But my real reasoning is that you’re going to see a flood of OpenAI and Anthropic users leave because of a) increasing plan prices, and b) impending business laws on the horizon about protecting sovereign data from AI (i.e. data in the cloud being used for training is a no-no).
So what happens when people and companies one by one start leaving the SOTA AI cloud for from-good-enough-to-wow models? RAM and graphics cards become the new toilet paper, which is going to double current prices again.
Upgrade now before it’s too late folks!
YetAnotherNick 4 hours ago [-]
You are comparing two different models. It's like saying the Roadster is more expensive than the Model S. No model's pricing actually increased, and I am using GPT-4o at the same price as before.
You can see price vs. performance on Artificial Analysis, and the Pareto-optimal models are all just 6 months old.
adamesque 4 hours ago [-]
It's hard to take this piece seriously if he's citing _Ed Zitron's_ math, and equally hard to make the blanket statement that flat-rate plans = "the current AI pricing". But yes, those pricing models were pretty silly and unsustainable.
kimixa 4 hours ago [-]
Get back to me when there's an AI company that's actually profitable and we can compare their service and pricing.
Claiming that there's some small subset of their services (like inference per token) that's "profitable" doesn't mean anything when it relies on everything else that company is still paying for. If you could make money from it at current prices - why aren't they?
Otherwise it's just "how much they're willing to subsidize".
fallpeak 3 hours ago [-]
On OpenRouter there are 11 third-party inference providers hosting DeepSeek V4 Pro right now, of which 8 are US-based and 7 of those have zero data-retention policies (which I mention to rule out any claims of "oh they're making up the money by logging all your data"). This is a 1.6T-A49B model, so a bit bigger than Sonnet (~2/3rds the size) and a bit smaller than Opus (~3x as large). These third parties are almost perfectly interchangeable via OpenRouter as a marketplace, so they have no incentive to offer any sort of "growth pricing" below costs, and they serve it at $3.48/Mtok out.
Kimi K2.6 is 1T-A32B with a slightly less computationally efficient architecture, and is served at around $3.50/Mtok out by 9 US ZDR providers.
Unless you think that either the generally accepted size estimates for Anthropic/OpenAI models are wildly off or those companies are a lot worse at serving models efficiently, Anthropic and OpenAI are probably making around 5-8x margins on their API costs.
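The arithmetic behind that estimate can be sketched roughly as follows. This is a back-of-the-envelope illustration, not anyone's actual cost model: it assumes serving cost scales linearly with model size and that the third-party price is an upper bound on the cost of serving the reference model. The function name and the API price are hypothetical placeholders; only the $3.48/Mtok figure and the ~2/3 size ratio come from the comment above.

```python
def implied_markup(api_price: float, commodity_price: float, size_ratio: float) -> float:
    """Return API price divided by an estimated serving cost (all in $/Mtok out).

    Assumptions (not established facts): serving cost scales linearly with
    model size, and the commodity price is an upper bound on the cost of
    serving the reference open-weight model.
    """
    estimated_cost = commodity_price * size_ratio
    return api_price / estimated_cost


DEEPSEEK_V4_PRO_PRICE = 3.48   # $/Mtok out, third-party price quoted in the comment
SONNET_SIZE_RATIO = 2 / 3      # size estimate from the comment, not a confirmed figure
HYPOTHETICAL_API_PRICE = 15.0  # placeholder $/Mtok out, NOT a quoted list price

markup = implied_markup(HYPOTHETICAL_API_PRICE, DEEPSEEK_V4_PRO_PRICE, SONNET_SIZE_RATIO)
print(f"implied markup: {markup:.1f}x")
```

With these placeholder inputs the result lands inside the 5-8x range; swapping in a different API price or size ratio moves it proportionally, which is why the size estimates matter so much to the argument.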
The cost of training new models is of course a major factor not counted here. Depending on how you want to think about that this may or may not make them net profitable. I remember one of those CEOs gave an interview a while back where they described it as a series of independent investments, where each model they train is net-positive in revenue by EOL just from its own inference, but I don't know whether that's still true or not.
Regardless, the point is that if they stopped training new models today, both Anthropic and OpenAI would be making incredibly generous profits on their API inference.
energy123 3 hours ago [-]
The problem with fixating on earnings as you're doing is that it's a bad metric for a growth company. COGS is much more important. What you're doing is setting it up so every growth company is terrible until they've matured into a 20 year old company. That's obviously dumb.
kimixa 3 hours ago [-]
From what I've seen pretty much every company is limited by hardware supply, to the point where there are complaints from current customers that new customer growth is exceeding the providers' ability to serve them properly.
And "growth at all costs" makes sense if there's lock in and you can monetize those "now locked-in users" later - but that doesn't really seem true on the consumer side. It seems pretty trivial to switch out which model and provider on the consumer side.
Any "lock in" then has to be on the model or inference side, and that's still advancing in multiple areas from so many different sources that I'm not sure I'm comfortable saying that will be a "winner takes all" situation either.
My approach is generally "enjoy using it while it's cheap and subsidized, but understand that might not last forever". If it does remain cheap after the subsidies end, great, you can just keep using it. But if it doesn't and you've lost the ability to work without it, you'll be in for a world of hurt.
energy123 3 hours ago [-]
Your concern about their business is that demand for their products is growing so stratospherically that they cannot meet that demand easily? I mean that's like an A+ scorecard for any business. Everyone in business would dream of such a scenario. That's called a luxury problem.
kimixa 3 hours ago [-]
But they're still losing money on each new customer. As I said, subsidized growth only matters if you can recoup those subsidies afterwards - and that's what I'm not sure will be true.
I think the idea of "all growth is good no matter the cost" has been taken to an extreme.
energy123 3 hours ago [-]
Losing money based on what? COGS? It really sounds like you're hating on a growth company for not being an airliner.
dragontamer 3 hours ago [-]
There is probably going to be a quarter or two of profits when the prices dramatically increase. Vibe coding techbros are hooked on the Iron Lung and may not want to get off.
At my work there are multiple developers bragging about overnight AI usage to solve problems hands-off. Yes, they are wasting money and resources, but the fad is here. People be vibe coding for now.
In like 6 months, when all the costs need to be paid and the prices go up, we will see if these companies stay profitable. But I'm of the opinion that the vibe coding tech bros are more than enough to sustain a short or even medium term profit for these companies. Just on fad-energy alone (see OpenClaw).
The fad probably collapses soon after. I hope so anyway; the waste I see is nauseating.
------------
I dunno where this is all going. But I do have faith in human ingenuity still. Things are changing, possibly for the worse, but we need to make the best of it.
The worst of behaviors is wasteful and blatant fraud. There's something useful here though.