When discussing LLM pricing, people are missing the plot. The subscription token price is 10x-40x cheaper than API pricing. Your $90 Claude subscription gives you close to $1000 to $4000 in equivalent API token pricing.
The second issue is that the quality of the model “operator” makes a massive difference in the outcomes. Highly skilled senior devs who know how to prompt and have high agency will outperform teams of people who lack motivation and foundational skills.
Lastly, there is a massive difference in capabilities, determinism, and error handling between 5T SOTA models like Opus and tiny distillations from DeepSeek that perform well only in benchmarks.
simonw 3 hours ago [-]
I learned today that the Anthropic "Enterprise" plan - the one big companies use because they need governance features and audit logs and all of that jazz - is billed at API token rates (plus $20/seat/month).
So large companies are getting billed a lot more than those discount subscription plans.
alexriddle 3 hours ago [-]
Anything over 150 seats means you need to pay at token rates plus the $20/user. My day job is operational (no coding at all) and I'm spending ~$300 a month on a few chats with Claude/Cowork a day over the course of a month.
m_kos 2 minutes ago [-]
$300 is my employer's monthly cap on Claude Enterprise. It lasts me at most a week of moderate use. I would much rather get Codex Pro and Claude Pro or Max, which would cost ≤ $200. For $300, one could also add Gemini Ultra to the mix so I could have all three review each other's code, etc.
Claude can be very good but enterprise prices don't make sense to me.
stymaar 3 hours ago [-]
I hope your company is keeping the input/response pair in case they need to break free at some point.
dd8601fn 51 minutes ago [-]
Wouldn’t people mostly just want any artifacts?
zackify 1 hours ago [-]
We are on it at my job. It saves money due to other parts of the org not using as many tokens.
The real cost-effective way is giving a team $20 Cursor, $20-100 Claude, $20-200 Codex.
I'm spending 1k on Claude enterprise easily and that's with trying to spread it on codex and cursor using pi.
jgreid 2 hours ago [-]
Governance and audit trail are incredibly valuable to large enterprise organizations, especially those working in regulated spaces. Companies will pay a premium if the security/privacy/compliance issues are handled effectively.
datadrivenangel 3 hours ago [-]
I've heard that the $20/seat gets waived if you have large enough committed spend.
isoprophlex 2 hours ago [-]
Would they even care at that scale, if the average employee spends $3000 every month because mgmt mandates slopmaxxing?
stymaar 3 hours ago [-]
> Lastly, there is a massive difference in capabilities, determinism, and error handling between 5T SOTA models like Opus
What's your source for Opus being a 5T model?
> and tiny distillations from DeepSeek that perform well only in benchmarks.
I don't think you know what you're talking about. Local models aren't “distillations from Deepseek”.
And they don't perform well “only in benchmarks”, Qwen 3.6 is a very decent model (obviously it's not Opus, but it's also much faster and speed is a quality of its own).
Sigh, it's year 2026 and there are still people believing something Musk says…
ramesh31 16 minutes ago [-]
People can simultaneously be reprehensible idiots while being a reliable expert on something they have personally invested billions of dollars into and operate at scale.
awkwardpotato 4 minutes ago [-]
He's also invested billions of dollars in SpaceX and Tesla... which he regularly makes wild claims about that are untrue.
While this source's reliability is certainly debatable, the size matches the results of this paper, in which researchers estimated the parameter count from model knowledge. https://01.me/research/ikp/
stymaar 2 hours ago [-]
> While this source's reliability is certainly debatable
Massive understatement. Nowadays it has become hard to find a single Musk statement that doesn't contain at least one lie.
> the size matches the results of this paper, in which researchers estimated the parameter count from model knowledge. https://01.me/research/ikp/
Thanks for the pointer. This estimation has Grok 6 times bigger than Musk claims it is, so maybe that's where the lie is.
(I'm quite skeptical about that number though, it would be quite disappointing for US tech if their flagship models had to be that much larger than the Chinese ones for such a small edge in performance. Because I don't think US labs are incompetent, I'd bet that US flagships aren't more than 2-3 times bigger than Chinese flagships. Otherwise it really doesn't bode well.)
striking 2 hours ago [-]
In tiny gray text right above the table is written "90% PI ≈ ±3.00× either side." Is GPT-5.5-Pro 3.4T or 30.8T in size, or somewhere in between? We just don't know.
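To make that interval concrete: a "±3.00× either side" band is multiplicative, so the bounds are the point estimate divided by 3 and multiplied by 3. A minimal sketch follows. The point estimate of roughly 10.2T is an assumption, back-computed here as the geometric mean of the 3.4T and 30.8T figures quoted above; it is not a number taken from the paper.

```python
import math

def multiplicative_interval(point_estimate, factor):
    """Return (low, high) for an interval of 'factor x either side'."""
    return point_estimate / factor, point_estimate * factor

# Bounds quoted in the comment above, in trillions of parameters.
low_quoted, high_quoted = 3.4, 30.8

# Assumption: the implied point estimate is the geometric mean of the bounds.
implied_point = math.sqrt(low_quoted * high_quoted)

low, high = multiplicative_interval(implied_point, 3.0)
print(f"implied point estimate: {implied_point:.1f}T")
print(f"90% interval: {low:.1f}T to {high:.1f}T (a {high / low:.0f}x spread)")
```

The upper bound is about nine times the lower one, which is why the table cannot distinguish a 3.4T model from a 30.8T one.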
runtime_terror 2 hours ago [-]
> The subscription token price is 10x-40x cheaper than API pricing
This is a temporary phenomenon. Expect either drastic price increases or draconian throttling or both in the coming months.
These companies are operating at huge losses and have hundreds of billions in liabilities and commitments. They need to turn on the money faucet sooner rather than later.
Npovview 2 hours ago [-]
Even with increased prices, AI enables velocity in both development and bug fixing. Would companies want that? If prices are biting the company, I think companies will route all development and bug-fixing requests through a few superperformer developers with complete knowledge of the different components within the company (they will be the Queen Bees holding the company on their heads). The rest of the company will be tasked with requirements gathering, spec cleaning, disambiguation and so on (worker bees).
alfiedotwtf 2 hours ago [-]
Incentives matter…
If prices keep going up, watch for companies to exit frontier models and go to local llama.cpp instances for 6-month-ago SOTA, with the flex of being housed within the office - no more privacy leakage, no more price gouging.
To be honest, I’m not sure why a Y-Combinator backed company hasn’t come out yet flooding the market with highly capable OPAI (pronounced “Oh-pah” as in what Greeks shout as they drink shots), which stands for “On-Prem AI”
… yes, I just made up OPAI right now lol
anthonypasq 2 hours ago [-]
There's recent reporting that Anthropic will be profitable this quarter...
edit: I see in other comments on this thread you think Ed Zitron is a reliable pundit so that explains everything.
xbmcuser 2 hours ago [-]
It's not like the non-frontier models are not improving. If someone can use DeepSeek to get 90% of the work done for $100 and then pay another $100 to Anthropic or OpenAI to complete it, I think they would rather do that than pay Anthropic or OpenAI $1000.
lelanthran 4 hours ago [-]
> When discussing LLM pricing, people are missing the plot. [ ... snipped ...] Your $90 Claude subscription gives you close to $1000 to $4000 in equivalent API token pricing.
And you think it is unreasonable to consider this unsustainable?
z2 3 hours ago [-]
And the direction is definitely towards removing that subsidy really soon. We can see it with OpenAI's shift to API-equivalent pricing for enterprise customers last month. Anecdotally my company saw OpenAI credit usage grow 2x with stable use across the ChatGPT platform, which is pretty terrifying considering just 2% of the company uses Codex.
For context, ChatGPT business subscriptions give you a fixed pool of credits to use, after which you get billed a la carte at inflated 1.75x rates vs API, or, if you don't want to pay, everything except the non-reasoning models gets turned off for the month.
We also tried Claude Enterprise, which was unusable as people blew through their monthly limits in a matter of hours.
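A rough sketch of how the pool-plus-overage billing described above adds up. The function name, the dollar figures, and the idea of measuring the credit pool in API-equivalent dollars are all assumptions for illustration; only the 1.75x overage multiplier comes from the description above.

```python
def monthly_bill(usage_usd, pool_usd, subscription_usd, overage_multiplier=1.75):
    """Estimate one seat's monthly bill.

    usage_usd and pool_usd are measured in API-equivalent dollars
    (an assumption; real plans count credits, not dollars).
    """
    # Only usage beyond the included pool is billed a la carte.
    overage = max(0.0, usage_usd - pool_usd)
    return subscription_usd + overage * overage_multiplier

# Hypothetical numbers: a $30 seat with a $100 pool.
light_user = monthly_bill(usage_usd=60, pool_usd=100, subscription_usd=30)
heavy_user = monthly_bill(usage_usd=300, pool_usd=100, subscription_usd=30)
print(f"light user: ${light_user:.2f}, heavy user: ${heavy_user:.2f}")
```

Under these made-up numbers the heavy user pays more than the same usage would cost at plain API rates, which is the point of the 1.75x complaint.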
wongarsu 3 hours ago [-]
Depends on what their actual costs are. Either they are losing lots of money on subscriptions, or they make absolute bank on API pricing.
Looking at the pricing of 1-2T models like Kimi or DeepSeek on the open market, I'm tempted to assume that inference costs are closer to subscription pricing than to API pricing.
Especially considering that subscriptions a) distribute load over time via rate limits, and b) will include a lot of users who get only a fraction of the possible value, whether they are on a personal account where they are on the rate limit on the weekend but barely use it during the week, or are corporate users who were issued an account they rarely use. Subscription prices are usually measured on the average case, not the most extreme value a power user can get out of it
runtime_terror 2 hours ago [-]
> I'm tempted to assume that inference costs are closer to subscription pricing than to API pricing
So just going on vibes?
While some people don't like his content, Ed Zitron shows a lot of evidence for your assumption being very wrong.
These companies are bleeding cash at ungodly rates. It's likely their API pricing is still subsidized if you look at their overall financial picture.
Related, there's a good reason those API prices keep going up a lot every new version and it's not just because the models are better.
wongarsu 2 hours ago [-]
Selling inference for more than inference costs is not incompatible with bleeding cash at ungodly rates. They do in fact pay ungodly amounts of cash for other things, like training, marketing, etc. Heck, you can bleed cash while being profitable (in the accounting sense)
Also, API prices going up a lot every new version is more an OpenAI thing, and even there it's a recent trend: GPT 5.0 was a big price drop compared to 4.1, and 4.1 was cheaper than 4o, which itself got a price cut at some point and is cheaper than 4. Meanwhile Anthropic's API pricing stayed stable for many versions, then got slashed to a third with the 4.2 release and has stayed at that level since.
Forgeties79 3 hours ago [-]
Considering not one company is in the black yet I don’t really know how we can say anyone is making bank, unless we want to count absurd levels of VC funding (now slowing down) I guess.
wongarsu 3 hours ago [-]
I am conveniently not counting training costs (since they add no marginal costs, selling more tokens doesn't impact them), and hardware and DC costs only amortized
Of course they do have to "make bank" in some way to offset the insane training costs. But whether they go for high prices or high volume, or offer some services as a loss leader to drive profits elsewhere is somewhat orthogonal to that
Let’s see it first. And without omitting training/infrastructure costs at that. Until then my comment is still accurate.
anthonypasq 1 hours ago [-]
its a private company, what exactly do you expect to 'see'?
Forgeties79 10 minutes ago [-]
Anthropic IPOs in less than 5 months and I guarantee you any company that officially is in the black will proudly shout it from the rooftops.
stingraycharles 3 hours ago [-]
Also, your local hardware is in no way capable of running the types of models that the cloud providers do, it’s just not economically feasible, and it never will be.
adrian_b 3 minutes ago [-]
Depends on what you mean by "economically feasible".
Even very cheap mini-PCs and laptops can run any of the models run by cloud providers, albeit at a much lower speed (i.e. with the weights stored on SSDs).
Whether such a low speed is useful, depends on the application. For something like a coding assistant or bug scanning, an instant response is desirable, but certainly not necessary.
bachmeier 2 hours ago [-]
Very much dependent on the situation. For many business tasks, local hardware is good enough. But what a lot of folks overlook when saying these things is that (a) workers do more than run AI models on a piece of hardware, (b) significant computer hardware is already sitting idle outside normal work hours, when it can be running batch jobs, and (c) employees can share local hardware.
zozbot234 3 hours ago [-]
It can run open-weight models that are roughly as capable. It's going to be slow unless you're using actual datacenter hardware, but they'll run.
colonCapitalDee 3 hours ago [-]
"roughly" is doing a lot of heavy lifting there
cortesoft 1 hours ago [-]
NEVER will be is a pretty big leap. Never is a long time.
devmor 2 hours ago [-]
> it never will be.
Giving strong “640k is enough for anyone” vibes here.
cyanydeez 3 hours ago [-]
Isn't the plot that it's like an infinite bikeshed but 10% of the bikesheds are actually trailer parks and when you finally realize it's a trailer park and not a bike shed you're down $10-100 because its token gen is faster than you can actually validate?
Some might say the price wouldn't be great if you could actually process and validate it...
kelseyfrog 3 hours ago [-]
> The quality of the model “operator” makes a massive difference in the outcomes.
My hunch is that this is the source of much of the variability in outcomes upstream of HN commenters claiming extremes of, "This model changes everything!" to "This [same] model is crap."
We haven't operationalized what it means to "be good at prompting," nor developed proxies/heuristics/shibboleths for assessing prompting skill. There's community skepticism over whether prompting skill even exists. Besides, even if prompting skill is real, who wants to hear, "Actually you kinda suck at prompting"?
danielmarkbruce 2 hours ago [-]
It's 100% this. Many people suck at prompting. It's likely that habits from search are ingrained. But in general some people are just so bad at it.
freediddy 4 hours ago [-]
My friend is an exec at a US software company and they are preparing to lay off a few teams of programmers in their Eastern European locations and replace them with a small number of US programmers + AI. He said they are much more productive and produce new features much faster.
repeekad 3 hours ago [-]
I think the article is right about outsourcing but not from cheap offshored contractors, good experts will become more independent and be more enabled to support more clients with AI, meaning small and medium businesses won’t need as many internal engineers, finance, marketing, etc
treis 4 hours ago [-]
I think this misses the forest for the trees. Working with ChatGPT is eerily similar to working with offshore Indian devs back in my enterprise days. Productive if guided explicitly but if let run wild there's lots of WTF moments.
LLMs are likely to replace outsourced devs because your employees that know the context can use LLMs to do what offshore devs did before.
goosejuice 2 hours ago [-]
There are developers outside of your country that are talented, speak your language competently, and willing to work for less pay. There are plenty of reasons to believe that such devs will increase in numbers.
spprashant 4 hours ago [-]
"offshore Indian devs" are no slouches. They have access to the same GPT models and likely cost a tenth of the median US salary. Businesses are always looking to lower marginal cost. They will hire 1 software architect in US to write specs and 10 software developers in India to babysit 100 agents.
ern_ave 2 minutes ago [-]
> "offshore Indian devs" are no slouches.
What evidence is there of the quality of Indian devs specifically?
One signal I'd expect to see, for example, would be success in programming competitions. Here's the list of winners of the IOI competition [1] - India has won 3 times.
Meanwhile, Turkey has won 4 times, Estonia has won 5 times, and Vietnam has won 22 times!
Why should we suspect that there are more or better developers in India than in any of the countries that have produced more winners??
This is short-sighted. The problem with offshore Indian devs is the communication friction/overhead. You're 9 hours offset, with people who have decent-but-not-great English skills and wildly different cultural priors. If the product people/decision makers are in the US, you're getting a ~50% savings to suffer all those issues, while the cost of tokens remains unchanged. That 50% savings doesn't look very impressive when you're taking a 20% productivity hit from comms friction and crossed wires, and 35% of your total cost is from tokens anyhow. Then it comes out to be a very marginal savings, at the cost of a VASTLY worse hiring experience and VERY high variance of outcomes.
Offshore Indian devs make sense when you can have a large Indian division so you can amortize communication infrastructure/process management over a lot of heads, and you're building for international customers so you're not paying an English -> X tax inherently.
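The arithmetic above can be sketched as follows. This is one possible reading, not a definitive model: it assumes the 35% token share refers to the onshore baseline, that the ~50% savings applies to labor only, and that the 20% productivity hit scales output. The function and parameter names are made up for illustration.

```python
def relative_cost_per_output(labor_share, token_share, labor_discount, productivity):
    """Offshore cost per unit of output, relative to an onshore baseline of 1.0."""
    # Token spend is unchanged; only the labor share gets the discount.
    offshore_cost = labor_share * (1 - labor_discount) + token_share
    # Lower productivity means each unit of output costs proportionally more.
    return offshore_cost / productivity

ratio = relative_cost_per_output(
    labor_share=0.65,     # assumed: whatever is not tokens is labor
    token_share=0.35,     # "35% of your total cost is from tokens"
    labor_discount=0.50,  # "~50% savings"
    productivity=0.80,    # "20% productivity hit"
)
net_savings = 1 - ratio
print(f"offshore cost per unit of output: {ratio:.2f}x onshore")
print(f"net savings: {net_savings:.0%}")
```

Under these assumptions the headline 50% shrinks to a net saving in the mid-teens percent, before counting hiring friction or outcome variance.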
mikeocool 3 hours ago [-]
"They will hire 1 software architect in US to write specs and 10 software developers in India" is exactly what everyone said was going to happen in 2004 as software engineering outsourcing really started to gain traction. Thomas Friedman's The World Is Flat basically made the argument that software engineering in the US was going the way of manufacturing.
And outsourcing certainly became a thing though not in the way everyone predicted. There are far more software engineers in the US today than there were in 2004.
runtime_terror 2 hours ago [-]
Obviously this is just anecdotal but over my 20+ year career I've worked with a lot of outsourced teams in India and my experience has nearly always been that they require a frustratingly specific degree of direction to produce anything of quality.
Just recently I asked a dev there for a POC of a feature with decent specificity and ended up with about 8k LOC of spaghetti. I re-wrote it later in a few hundred. This is about in-line with my career experience.
I've had a few standout devs there but it does feel like a lot are putting in the bare minimum or are just working really far outside of their abilities.
gedy 2 hours ago [-]
While people will do what they need for money, that is a miserable type of role and the quality of architect will suffer from that.
zwischenzug 4 hours ago [-]
Certainly tracks with the number of outsourced teams begging for work on LinkedIn.
lumost 4 hours ago [-]
How many of those wtf moments are simply from not “being in the room when it happened?” Most enterprise software is riddled with wtf moments demanded as one compromise or another.
karl_gluck 3 hours ago [-]
At least some, but let me give an example.
Request: “manual step X should not be part of the automated build script”
Fulfilled as: build script is now split in two. X is still done as a manual step in between. Rather than prompting and waiting for it to be done, the documentation and scripts no longer mention X.
Part poorly written requirements, part implementing under pressure, and part lack of engineering discipline.
The main issue is catching stuff like this early enough to course-correct. Differences in time zone, language and cultural norms can make that a challenge, all of which LLMs have the advantage in.
xcskier56 4 hours ago [-]
There's always wtf, why did we add this feature, but at least in my experience, once a week or so I run into something in this category. Me: "AI, please cleanup/refactor/improve this thing" AI: "Roger that! I deleted the file so now it's perfectly clean" ... insert W.T.F.
ofjcihen 2 hours ago [-]
Yeah unfortunately you have to be careful with words like “clean up”.
I’ve had it assume I meant the folder multiple times :/
dboreham 3 hours ago [-]
Never seen that once.
throwaway613746 4 hours ago [-]
[dead]
marcusholt 4 hours ago [-]
[dead]
economistbob 2 minutes ago [-]
Deliberately combining hallucinations with a smaller fund of localized knowledge with which to spot said hallucinations seems like a bad business decision.
ZeroCool2u 41 minutes ago [-]
A crucial factor tech industry folks tend to ignore is how much executives value predictable costs. Cloud migrations got away with this, but still had to argue fiercely, because 'the cloud' and its serverless tech had the potential to significantly decrease overall spend for unpredictable, bursty workloads.
The usual counter-argument is the operational burden, but human capital is also a relatively fixed cost. A dedicated team of 3-5 FTEs could probably handle inference ops for a F500 company.
Meanwhile, the capability delta is shrinking fast. We have more evidence that local open-source is viable with the release of DeepSeek v4, and the industry is only trending further in this direction. Especially as we rely more on test-time compute and task-specific harnesses rather than model size.
So, if you're an executive looking at a marginal but fixed operations cost, added flexibility, and a rapidly closing gap in capability, why wouldn't you just run open-source models on your own infrastructure to get those highly predictable costs? Plus, you decrease the risk of one of the frontier
bitmasher9 37 minutes ago [-]
Do you really want to buy the 3rd or 4th most intelligent AI?
There’s so much uncertainty, it seems like the safe option is to give everyone a Claude or OpenAI subscription/api key until the frontier isn’t changing every six months.
lowbloodsugar 2 minutes ago [-]
If IT is a cost center, then a company has likely already outsourced (and if it's called IT it probably is). If you are a software development company that makes money from software, then a local team of SDEs using whatever AI they want is a competitive advantage vs a local team of SDEs trying to deal with an 11.5 hour gap to India. AI is coming for software developer jobs, and it's coming for: a/ the low skill ones and b/ the high skill ones where turn-around and iteration matters. I've worked with great engineers in India, but the time difference was brutal for our fast moving business.
ecshafer 3 hours ago [-]
I have really been trying to get local models to work. I have tried different harnesses, tooling, skills, prompts, etc. But when I compare claude code with anthropic models or codex with gpt 5.5, vs qwen, glm or gemma and the same harnesses, the frontier models come out massively ahead. I am at the point where I just don't see the point of the non-frontier models, they waste more time than they save.
henry2023 3 hours ago [-]
local models are 3 to 6 months behind SOTA models with the huge benefit of not needing to send all your IP to a shady third party.
If inference cost comes down (as it has been for the last few years) you’ll be able to run today’s SOTA in your laptop by the end of the year.
ghrl 2 hours ago [-]
I would say that is highly unlikely if by SOTA models you are not just referring to coding benchmarks but more general purpose ability and domain-specific knowledge. For example Kimi 2.6, which is comparable to Opus 4.6, is roughly 500+GB large, and I don't see how that would run on consumer hardware anytime soon.
Besides, this is not just about the technical feasibility, but also economically not viable whatsoever. Why should consumer laptops be capable of running such models, when they would be massively underutilized most of the time, when inference providers can produce the same results faster, cheaper and a lot more viable economically?
sourcecodeplz 1 hours ago [-]
It runs right now on 512gb RAM Macs and PCs.
henry2023 1 hours ago [-]
Because privacy has perceived value.
datadrivenangel 3 hours ago [-]
For agentic coding I 100% agree with you, it's worse and slower and more expensive for LARGE coding with local models. Narrow coding (like writing a specific function) is slow but viable. Regular LLM chat usage on high-end consumer hardware is competitive except on cost though.
The hosted frontier models are massively subsidized, right? I think the point of local non-frontier models is just learning at this point, so you’ll be skilled if/when the market starts comparing the actual price of the two different models.
joka88xj 3 hours ago [-]
[dead]
mark_l_watson 2 hours ago [-]
Great article that reinforces my own opinion while adding the clever twist of bringing low-cost human labor into the equation. Nice.
I spent a month comparing Gemini Ultra plan to using much lower cost DeepSeek v4 with open source coding harnesses and, spoiler alert: I was happier using the much cheaper and more environmentally friendly open models: https://marklwatson.substack.com/p/my-evaluation-of-ai-agent...
swader999 47 minutes ago [-]
I'm finding sound judgment, common sense, technical depth and breadth, and a feel for the UX are skills that amplify agentic coding. Deep knowledge of the problem domain and time with the customer (or SMEs or end users) are what build these. Outsourcing this will never work, you can't put someone 12 hours ahead of the timezone you're serving in front of the customer.
zuzululu 2 hours ago [-]
I keep seeing this narrative involving Deepseek as an example of OSS LLMs but they are subsidizing a huge amount of tokens at cost and one can easily understand why they are doing it if one is not lazy and thinks critically.
It's still far too costly and not effective to use local AI that can match what the frontier models can offer, especially when the inference hardware is being heavily restricted due to geopolitical risks. Claims about local LLMs somehow giving these frontier companies a run for their money I find especially doubtful in the long run.
Tokens are getting expensive because they are beginning to corner the market and will use that advantage to limit hardware distribution within and beyond the borders.
It's more likely that some workflows will see more local LLMs but those will never be the ones that require frontier model level or beat the price that a lighter smaller version of frontier model will offer to capture that tail end
throwa356262 1 hours ago [-]
Do you have a source for your first claim?
My impression is that deepseek designed v4 specifically for cheap inference and they are not losing money even at 75% lower price.
sourcecodeplz 2 hours ago [-]
Don't think so, from what I've heard deepseek isn't losing money on inference.
logicchains 1 hours ago [-]
>they are subsidizing a huge amount of tokens at cost
This is absolutely false, because other providers serving the Deepseek models on OpenRouter are also able to offer very low prices, and they don't have the money to subsidize anything.
regexorcist 1 hours ago [-]
I've been saying this for a couple months now since I got decent hardware and started using my local Qwen 3.6 exclusively. I have no doubt the future for individuals and medium-sized companies is local private AI.
jillesvangurp 4 hours ago [-]
I've been pretty happy sticking with codex 5.4 medium. I don't see a good case for switching to 5.5 at the cost of going through my token budget quicker.
There are misaligned incentives here between users just trying to get stuff done and AI companies competing on having the "smartest" model that passes benchmarks and continuously does some Nobel Peace Prize-winning stuff. It's mostly overkill for the more mundane stuff normal people actually do with them. It's nice to have the option when you need that. But defaulting to that is not economical and a bit unnecessary.
There's also a difference between smart models and bigger context windows. Most of the progress in the last year was simply the context windows getting big enough to fit all/most of the stuff needed to solve issues. Before then, you had to carefully manage the context to not run out of space and they wouldn't fit much more than small hobby projects.
With sub agents, the parent agent doesn't need to be a frontier model. It can delegate to smarter agents. And most stuff it delegates shouldn't need a frontier model. Wouldn't it be nice if it could decide on a case by case basis.
The walled gardens offered by OpenAI, Anthropic, and others currently default to one size fits all "frontier" models. This is not sustainable. They should evolve to using smaller and effective models most of the time with complexity based escalation as needed based on either estimated complexity or when the small models fail. I'm guessing some open source based alternatives to these walled gardens are probably already heading that direction.
The irony here is that with a walled garden, these companies are selling a premium experience. But in the current market that boils down to burning billions of investor cash to keep the GPUs going without much hope on profitability. Eventually surviving companies are going to have to compete on quality, cost and margins. The smart approach would be to dynamically adapt token and context window sizes instead of blindly defaulting everything to the best possible. Don't boil the oceans for a simple email summary or a simple web UI. That stuff already worked well enough with models even a few years ago.
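The complexity-based escalation idea above can be sketched roughly like this. Everything here is hypothetical: the `Tier` class, the word-count complexity heuristic, and the stub "models" are stand-ins for illustration, not any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Tier:
    name: str
    max_complexity: float                # highest estimated complexity this tier attempts
    run: Callable[[str], Optional[str]]  # returns an answer, or None on failure

def route(task: str, tiers: list[Tier], estimate: Callable[[str], float]) -> tuple[str, str]:
    """Try tiers from cheapest to most capable, escalating when needed."""
    complexity = estimate(task)
    for tier in tiers:
        if complexity > tier.max_complexity:
            continue  # estimated too hard for this tier, skip straight to the next one
        answer = tier.run(task)
        if answer is not None:
            return tier.name, answer
        # The tier failed, so fall through and escalate.
    raise RuntimeError("no tier could handle the task")

# Stub heuristic and models, purely for demonstration.
def estimate_complexity(task: str) -> float:
    return min(1.0, len(task.split()) / 50)

def small_model(task: str) -> Optional[str]:
    return None if "refactor" in task else f"small: {task}"

def frontier_model(task: str) -> Optional[str]:
    return f"frontier: {task}"

TIERS = [
    Tier("small", 0.5, small_model),
    Tier("frontier", 1.0, frontier_model),
]

print(route("summarize this email", TIERS, estimate_complexity))
print(route("refactor the billing module", TIERS, estimate_complexity))
```

The first task stays on the small tier; the second escalates only after the small tier gives up, so the expensive model is paid for only when it is actually needed.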
prasoon2211 4 hours ago [-]
I used to be on 5.4 high for most of my work. I have switched completely to 5.5 medium now. I would highly recommend trying it out
- 5.5 is significantly more token efficient than 5.4 - the same task takes often a third of the tokens
- because of this, it is also much faster to do the task
- you get high "intelligence" per token even after accounting for token efficiency - 5.5 medium is just under 5.4 pro levels of intelligence (imo). It has found tricky bugs for me that all other models failed at
So overall, ideally you will end up with a more intelligent, faster model for slightly cheaper.
thisisembar 3 hours ago [-]
This is embarrassing but I find 5.4-mini on Low covers a substantial part of my and my colleagues' work.
Back when it became expensive I learned to live with it and I find my "AI skills" (mainly communication) have a substantial impact on the efficiency of the model. Not saying my work is difficult, it's not, but I find there is quite a bit of wiggle room. Smaller models can still perform useful work, but you have to do the heavy lifting yourself. It saves a ton of money.
I used to burn through 75% of my tokens in an hour or two. Now I can work all day and hit maybe 50-60% if I use it heavily.
dawnerd 3 hours ago [-]
We trialed 5.5 and the same queries produced worse results. Not worth the cost increase. Even if there’s a token efficiency gain the higher cost wipes that out.
jmull 2 hours ago [-]
> (Human + an almost frontier LLM) vs Frontier LLM
I'm curious, who/what is operating the frontier LLM in this scenario?
The rest of the article is equally incoherent.
lmeyerov 3 hours ago [-]
Fwiw, the cost per answer, which is what ultimately matters, is going down. In a competitive market with oss and multiple frontier labs, it is hard to maintain a premium long-term.
The big question is how subsidies vs technology improvement will play out. As we saw with Uber, selling at a loss can happen for a very long time, and technology improves relentlessly.
For reference, we publish https://botsbench.com/ that shows time and cost per answer are going down while quality is going up.
domrdy 4 hours ago [-]
For sure true for specialized ones like MedGemma (healthcare). In my testing, the 27b model is at least on the same level as frontier, and in some cases outperforms them. 4B is insanely good too for some lighter workloads. Thanks G for working on this!
rightlane 3 hours ago [-]
I disagree with every part of this.
Local LLMs are great and very useful but if you are claiming that their code quality is in the same ballpark as Claude Code or Codex with their best models I cannot consider you a serious person. I feel like this is analogous to the folks arguing that The Cloud is "someone else's computer." As if billions of dollars of spend gives these companies zero benefit over a Mac mini.
Regarding offshore, at least in my experience, better coding agent output is down to two factors. First, is subject matter expertise. Providing the right context to the coding agent based on the tech you are building for is beyond critical. That's the issue with the Vibe Coded slop projects. No expertise in a technology means no awareness of gotchas, React is the most obvious because the LLM default is to useEffect endlessly.
The bigger issue is that by their very nature LLMs are very sensitive to quality prompting in English. I have seen offshore devs fail endlessly because they don't have the English skills to successfully prompt the machine. That has caused more work for my US based devs to either carefully tune the work ticket so it is basically a coding agent prompt. Or to go through multi day exercises to enforce better prompting.
A single US dev with Claude Code is orders of magnitude better than typical offshore. Adding local models into the mix would make offshore completely useless. I'm sure many companies will see ballooning AI bills and expensive onshore devs and be very tempted to go to TCS or similar. I hope so, because that will give startups plenty of easy targets to disrupt.
AI will become a commodity technology the same way virtual machines are a commodity.
samtheprogram 4 hours ago [-]
$1100/m for an outsourced engineer… am I missing something? That’s far too low. Even juniors in South America tend to ask for at least double that number before factoring in the DeepSeek cost.
Shalomboy 3 hours ago [-]
I thought the same thing. The author's reference point for LCOL developer seems a bit outdated. With what we pay our teammates in Colombia, the model pushed out to 22 months before crossover.
hmokiguess 2 hours ago [-]
I think the biggest pull is yet to come, legislation around sovereignty and the US Cloud Act is sort of a challenge for the US hyperscalers, these local models may have more than just a price advantage against frontier labs but also policy and lobbying.
nyxtom 2 hours ago [-]
I've seen the $1000/mo engineer salary thrown around a bit and I'm not even sure where it comes from.
ianhxu 4 hours ago [-]
>frontier models are more capable than the latest from DeepSeek. But is the capability difference enough to justify a 30x price difference?
The contradiction here is that without frontier models, there'd be no foundation for models like DeepSeek to reference and catch up to. Is there an economic model that captures this kind of dynamic?
aftbit 4 hours ago [-]
Free market competition? This is a pretty classic pattern. Leaders capture market with quality but run into trouble scaling, followers compete on price and availability. Given time, leaders eventually run out of upgrade runway and find themselves swallowed up by followers. Or alternatively, leaders think their lead is inevitable and miss a sea change or iterative upgrade path. Think IBM PCs before Compaq and other cheap clones ate their lunch.
throwa356262 50 minutes ago [-]
Hold on mate, do you realize that a significant number of recent major advances in AI came from deepseek?
bee_rider 3 hours ago [-]
I guess they’d be hoping for very protective IP laws in that case.
alansaber 4 hours ago [-]
Always has been. People pay for the (not so) marginal performance gains.
the_arun 4 hours ago [-]
Premium services need to allow enterprises to self host the services to reduce cost of inference. Another advantage is data doesn't leave the VPNs.
rastrojero2000 3 hours ago [-]
It's particularly funny to me, but a minor point, that this post requires me to go through some kind of cloudflare armed checkpoint to dare read about AI.
A bigger issue is this thing calls AIs better coders than people and I have tried for the past 4 months to get one of the several I looked into to consistently produce a simple event-bus backed Java monorepo going with exactly zero success. Claude even repeatedly wanted to put my login logic at the actual event bus, for some reason.
What does "better coder" _exactly_ mean at this point?
cautiouscat 4 hours ago [-]
The dark mode version of the site makes the tables unreadable.
the_arun 4 hours ago [-]
Agreed, but same data is listed right below the table.
GodelNumbering 4 hours ago [-]
Thanks for flagging, fixed
Stevvo 3 hours ago [-]
I don't see local AI taking off. Memory costs make it impractical. Deepseek API pricing is not a suitable analogue because it's not local.
NitpickLawyer 4 hours ago [-]
> But is the capability difference enough [..]
This is the (m/b)illion dollar question, isn't it? I think there's also a question of what do you think capability is exactly, and how the difference manifests itself.
On the one hand, when something becomes "good enough" that's a clear capability threshold. On the other hand, what's the limit of those capabilities, and equally as important, how does capability reflect on reliability?
We've seen "local models" lately improve on capabilities where they're "good enough" for some tasks. Reliability of solving those tasks is a bit harder to measure/benchmark/test. It'll get better as more people work with those models. But, something I've noticed in the past ~6 months is that the frontier models are gaining a lot in both the breadth of capabilities, as well as the reliability of solving those tasks that they're capable of solving. I think this is where scaling (both compute and data) is showing, and where having more compute is simply better (more parallel exploration, more training data output, more broad data, etc).
There's also the problem of benchmarking true capabilities. The popular ones are getting old, and aren't as reliable as they used to be (not even touching on the subject of benchmaxxing, just thinking about their saturation, even with honest intentions).
So the question then becomes what will users prefer? Do you get the best of the best, or the one that's good enough? There might be a market for both, honestly. Not everyone does SotA stuff. And a lot of what people used to do in a company is probably mundane enough that a "good enough" model with "good enough" reliability can probably handle (w/ some supervision ofc).
What I'm more interested in is if things like Thaalas succeed and they get to provide local hardware that runs models "burned in silicon". That would be interesting, because speed and all the advantages of local models are a "quality" on their own. For example, right now I'd pay ~1k$ for an external hdd-sized block that can run a ~32B model that's popular right now, even knowing that it can only run that model. I have no idea if that's feasible or not, if it makes sense from a financial pov. But I'd buy one. And local inference on dedicated chips doesn't need to be "oss only". I'm sure oAI / etc would probably take the risk of licensing one of their -mini / -lite models provided that the risk of the weights leaking is small enough (and it probably is).
> This keeps a ceiling on how much or how fast the frontier labs can raise prices.
I generally agree, but from a different perspective. Up till now we've seen that the 3 labs influence each other's price points. When gpt5 came out at a radically smaller price, the others lowered them as well. Now with opus being SotA for coding, w/ 5.5 close behind, they've raised them back. Google seems to follow slowly. But there's hope that, being 3 top labs + 2 trailing (xAI & Meta), there'll be pressure once again. If any of those trailing labs manage to get to SotA again, the prices will drop once more. Some people say that open source also provides a pressure here, but I'm not yet convinced of this. There's still a question of who'll serve the models, at what scales, etc.
jqpabc123 6 hours ago [-]
The current closed source frontier models are more capable than the latest from DeepSeek. But is the capability difference enough to justify a 30x price difference?
"Frontier models" are caught in a financial dilemma of their own making --- they have spent such huge sums on development and as a result, they may have inadvertently priced themselves out of the market.
Energy costs are a huge factor for AI. He who has the lowest energy costs will likely be able to dictate market prices. And fossil fuels dependence doesn't look to be advantageous for AI.
treis 4 hours ago [-]
Historically the winners in software have a flywheel that turns faster with more users. With Facebook, the more of your friends were on it, the better the product was. Google tracked how long users were on pages to improve search.
The frontier models are going to win that way. They won't feed your code back into the system but they will track which code you keep and what code gets a "try again claude".
They're not going to lose on price. No consumer software ever has because ultimately it's not that expensive relative to salary and the marginal cost is 0.
aftbit 4 hours ago [-]
The marginal cost of AI is not 0. That's one of the big differences between this and older SaaS software. Inference costs a lot of money. Even if you're looking at just capital depreciation, it's quite expensive. I suppose it's more accurate to say marginal cost is stepwise - adding 1 new user is 0 cost if and only if your existing inference hardware covers that user's usage. As soon as you need a new server, adding _that_ new user costs ~$20k/year (assuming a $100k server and 5 year depreciation).
This is true for traditional SaaS too, but the number of concurrent users that could be served by one machine and the cost of the hardware were both at least an order of magnitude better.
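To make the stepwise point concrete, here is a rough sketch of that arithmetic. The $100k server price and 5 year depreciation are the figures from above; `users_per_server` is a made-up capacity number purely for illustration, and the function names are hypothetical.

```python
import math


def annual_hardware_cost(users, users_per_server,
                         server_price=100_000, depreciation_years=5):
    """Yearly depreciation for the servers needed to cover `users`.

    `users_per_server` is an assumed capacity figure, not a measured one.
    """
    servers = math.ceil(users / users_per_server)
    return servers * server_price / depreciation_years


def marginal_cost_of_next_user(users, users_per_server, **kwargs):
    """Extra yearly hardware cost caused by adding one more user."""
    before = annual_hardware_cost(users, users_per_server, **kwargs)
    after = annual_hardware_cost(users + 1, users_per_server, **kwargs)
    return after - before


if __name__ == "__main__":
    # Assumption: one server comfortably covers 50 users.
    print(marginal_cost_of_next_user(49, 50))  # still fits on the first server
    print(marginal_cost_of_next_user(50, 50))  # forces a second server
```

The 50th user is free and the 51st costs a whole server's yearly depreciation, which is the "stepwise" shape. Power, staffing, and utilization are ignored here, so treat it as a lower bound at best.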
jqpabc123 3 hours ago [-]
> The marginal cost of AI is not 0.
In other words, AI is not your daddy's software. Comparing AI with old school software markets simply does not compute.
throwfaraway4 4 hours ago [-]
>They're not going to lose on price. No consumer software ever has
Lists examples of software that are free to the users
Npovview 2 hours ago [-]
I want AI to go the way of Linux. I hope we see that future.
treis 4 hours ago [-]
Go on...
benfortuna 4 hours ago [-]
httpd
Npovview 2 hours ago [-]
Exactly the CC sessions flywheel is a treasure trove of data and they all know that. The reason we went to stackoverflow was because there was data (upvotes/downvotes, comments, workarounds) discussed under the answers. That is a very high quality signal from the field.
Aurornis 4 hours ago [-]
> they may have inadvertently priced themselves out of the market.
Last week we were all talking about how Anthropic has too much demand, how they had to rent a data center from a competitor, and how the limits they’ve put on their service to deal with the demand are making users angry.
DeepSeek is cheap because they’re working hard to attract users.
The open weights models released for free weren’t free to train. It’s a loss leader to get attention to try to sell you something in the future.
The prices we pay for tokens right now are set by supply and demand, with some being sold at high premiums and others at a loss. Some models are given away for free after the companies spent money on researchers and compute.
aftbit 3 hours ago [-]
Yes and no. Just take a look at the OpenRouter providers page:
Deepseek v4 Pro is much cheaper when provided by Deepseek itself, likely as a combination of the loss leader strategy you mention and the desire to have more data flow through their pipeline for training. However, the same open weights model, provided by other providers, is somewhere in the $2-3/1M output-tokens range. Compare Opus 4.7 at $25/1M output-tokens.
Unless you mean that releasing open weights models is the loss leader, in which case, you might be right but I hope you're wrong. We've seen some of this from Qwen at least - their latest model is closed only. I hope there's always someone willing to make this bet and release better and better open models.
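For a sense of scale, a quick sketch of the price gap using only the per-token numbers above ($2-3 vs $25 per 1M output tokens). The monthly token volume is a hypothetical figure, input-token pricing is ignored, and the helper names are made up for this example.

```python
def output_cost(tokens, price_per_million):
    """Dollar cost of `tokens` output tokens at a $/1M-token price."""
    return tokens / 1_000_000 * price_per_million


def price_ratio(expensive_per_million, cheap_per_million):
    """How many times more the expensive model costs per output token."""
    return expensive_per_million / cheap_per_million


if __name__ == "__main__":
    monthly_tokens = 200_000_000  # hypothetical team-wide output volume

    for label, price in [("open weights, low end", 2.0),
                         ("open weights, high end", 3.0),
                         ("frontier", 25.0)]:
        print(f"{label}: ${output_cost(monthly_tokens, price):,.0f}/month")

    print(f"ratio: {price_ratio(25.0, 3.0):.1f}x to {price_ratio(25.0, 2.0):.1f}x")
```

So against third-party hosts the gap is closer to 8-12x on output tokens, not the 30x you get when comparing against Deepseek's own pricing.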
Aurornis 3 hours ago [-]
> Unless you mean that releasing open weights models is the loss leader, in which case, you might be right but I hope you're wrong.
This is specifically what I meant.
DeepSeek’s official service is trying to recoup some of the training and engineering costs too.
The other providers only have to recoup their hardware costs and the cost of a team to run it.
Even though DeepSeek’s official service is more expensive per token, they’re running at a lower profit than the OpenRouter providers because they had to pay for the R&D.
This is a deliberate choice. We already see it with Qwen splitting their releases between open weight and hosted only models. The open weights are a loss leader to get attention. Without them you’d almost never hear about their hosted models.
Sebb767 3 hours ago [-]
> I hope there's always someone willing to make this bet and release better and better open models.
What would this bet be? Training is expensive and open weights mean that for hosting you compete on price with people that don't have this item on their bill.
aftbit 3 hours ago [-]
"Attention is all you need" - the larger bet is that by releasing your models open-weight, you'll get more attention and mindshare than if you tried to jump in to compete with the major closed providers, and the value of that attention will outweigh the cost of the training run.
So far, it's really only the Chinese labs (and FAIR or whatever Meta's project is called now) that are doing this. Oh yeah, and Google's Gemma.
At the moment, this is all massively distorted by the prestige and investment money flowing into the space. None of the labs have to charge the real cost of inference let alone the marginal cost of training because they are instead lighting investment money on fire to cover that.
One imagines (though I have not investigated in detail) that there's a degree of national prestige work going on too. The Chinese labs are trying to show that they can build better and more efficient models and are releasing open to undercut the US labs.
GodelNumbering 4 hours ago [-]
> lowest energy costs will likely be able to dictate market prices
This is a good insight. I think everyone has seen that chart of China's electricity generation going parabolic vs the US. That combined with cheaper yet equally good talent means at least in that segment, the closed labs won't catch up anytime soon
rgbrenner 4 hours ago [-]
> China's electricity generation going parabolic
Even if we all switch to Chinese models, the west isn't going to be running the model on Chinese servers... and the majority of costs are from inference.
> cheaper yet equally good talent
China has tech talent, but this isn't a 3rd world developing nation. Chinese AI researchers are getting paid $10M+ USD/year salaries.
Also they're equally good, but somehow consistently behind?
CuriouslyC 4 hours ago [-]
Training models is as much art as science at this point. There's no gap in scientific acumen at Chinese labs, but the US has more real world experience in the art of training large models, and the US has the capital allocation lead.
Npovview 2 hours ago [-]
Yes but when the Heads of CCP make something their target they chase it with all their might. Read the recent news of the fact that Chinese AI researchers can't leave China. China is now going after the Diamond industry of India.
andsoitis 4 hours ago [-]
> the closed labs won't catch up anytime soon
Which closed labs won’t catch up to whom?
GodelNumbering 4 hours ago [-]
I should have expanded, but basically, the OSS models becoming more and more capable to solve all day to day SWE coding needs will take a cut from frontier labs revenue.
Not to say that frontier labs won't make progress, but the bar for a sufficiently capable agent is all the OSS models need to meet to make this happen. I imagine a lot of hybrid setups where something like Opus is used only for planning/architecture, and anecdotally, the real token consuming part is implementation not architecture.
frank_nitti 4 hours ago [-]
Not my comment, but I’d venture to guess they’re referring to the likes of DeepSeek et al, who are/will be able to host their top-tier inference infra more efficiently
seniorivn 4 hours ago [-]
right now the most likely outcome is that they are going to host locally produced much more power hungry chips, and even if the lead on electricity production will stay, it will be eaten by inefficiency of the hardware.
CuriouslyC 4 hours ago [-]
Unlikely. We have a big lead in terms of general computing devices, but China can leapfrog us with ASICs. They might still lag in the training space for a while but in terms of serving inference, USA is absolutely COOKED at the low-mid end.
narrator 4 hours ago [-]
[dead]
Aboutplants 4 hours ago [-]
I’ve been on this issue for a while now, models are not going to matter as much in the future. Pure energy cost will be the determining factor in who is most successful. The US just cannot build cheap energy the way China can and at the scale that China will build it. 10 years from now it will be seen as the single source of advantage
tpolm 3 hours ago [-]
> The US just cannot build cheap energy
Nuclear power anyone?
dboreham 3 hours ago [-]
Cheap.
tpolm 3 hours ago [-]
What is expensive in nuclear energy? The reason there aren't more nuclear reactors is not the cost, it is regulation. Regulation can be changed (it also seems to already have, recently, IIRC - starting 2024 NRC law changes by Biden admin and later by Trump admin)
dboreham 3 hours ago [-]
Same as bitcoin then.
SpicyLemonZest 4 hours ago [-]
If the cost of software development falls so precipitously that energy costs are a driving factor, that implies so many other changes that I don't know how we can trust any analysis of what would happen.
gentleman11 4 hours ago [-]
You mean coal?
pjmlp 4 hours ago [-]
Energy costs and privacy.
Currently the projects I am involved require devs to use approaches like Ollama, Foundry Local and co if they happen to have good enough hardware, picking the best alternatives out of https://www.canirun.ai.
burnte 4 hours ago [-]
> "Frontier models" are caught in a financial dilemma of their own making --- they have spent such huge sums on development and as a result, they may have inadvertently priced themselves out of the market.
I feel it'll wind up like the dotcom/fiber bubble. Way too much money poured into it, lots of expensive bankruptcies or write-offs, and a readjusted market sea level.
wongarsu 4 hours ago [-]
Absolutely. We are in a phase of "free money" for AI. Just as with the dotcom bubble that leads to 1) lots of experimentation, and 2) lots of infrastructure buildout (which includes AI model training). Once the money dries up, some infrastructure (including models) will turn out to be profitable, most won't. And some experiments will turn out to be successful, most won't. Lots of useful things will come out of that, both the failed and the successful attempts. Just as the dotcom boom paid real dividends 5-10 years later and laid the groundwork for the world we have today.
EGreg 4 hours ago [-]
This sounds to me like the Bitcoin bros. Yes, the first-gen technology was very energy-heavy, but afterwards people (bitcoin maxis and people who held the bag) kept insisting that all new technology is “shitcoins” and that everyone should just buy bitcoin.
Actually, platforms that serve many customers can bring down the costs tremendously through caching, and don’t need the AI credits as much: https://safebots.ai/costs.html
Hamuko 4 hours ago [-]
Bitcoin is a poor analogue for much anything since it's very much designed to be energy-heavy.
iwontberude 4 hours ago [-]
Bitcoin is a good analog because the goal was to create durable trust. The energy utilization is just a means to an end of fairly distributing new tokens to members of the network. There are many other schemes they could use and have considered adopting. The energy use is not necessary, it’s sufficient.
EGreg 4 hours ago [-]
Oh, and neural networks doing a huge number of floating point operations per word is not energy-heavy?
Training these neural networks every few months isn’t energy-heavy?
Both Bitcoin and these large models weren’t “designed to be energy-heavy”. It was a consequence of first-gen design decisions to solve a specific problem. Then as time went on, costs went down and they became a huge outlier in terms of energy. The question is whether the bagholders (the AI companies that invested untold amounts into the initial training) will fight to keep people using their tech and fearmonger about everything else.
Groxx 4 hours ago [-]
Bitcoin is pretty much explicitly designed to use as much electricity as the market will allow, without becoming any more useful. If you removed 99% of the miners from the current system, Bitcoin will still be exactly the same - it won't be any faster or slower, and the same number of transactions will flow through. The cost of electricity serves only as a lower bound on the expected value of a coin.
Neural nets on the other hand generally show more capability as you add more compute power. There's a point where it's less valuable than the cost increase, so people don't do more than that, but it isn't constant value like Bitcoin.
EGreg 1 hours ago [-]
It wouldn’t be exactly the same, because if you had all that mining capacity and 99% magically took a holiday, there is now enough mining power to take over the network anytime. It’s not secure.
Same with AI. Now that the Mythos and other models are finding exploits in every code base and anyone can run them, you can’t afford anymore not to keep burning credits securing your code base. It’s like proof of work red queen theory. You have to run faster and faster just to stay in place. Great business model.
crimsoneer 4 hours ago [-]
I think this is a compelling argument, but I see 2 issues:
1. I remain unconvinced LocalAI can work well for the majority of businesses. It looks vaguely comparable on benchmarks, but it tends to be fragile and a lot of management overhead in reality.
2. Similarly, while Deepseek is comparable to Opus/Codex on benchmarks, for agentic work at scale I definitely notice the difference. That's not to say it's not economical, just that I definitely miss the big boys when I swap.
I kind of wish this was true, because the UK would be in a great place to compete with the US. But somehow people are happy to pay 3x the salary for an engineer in SF.
hobofan 4 hours ago [-]
> It looks vaguely comparable on benchmarks, but it tends to be fragile and a lot of management overhead in reality.
I'm working on a self-hostable LLM (web) UI[0] that aims to provide a comparably good UX to e.g. ChatGPT, and you are right that there is a decent amount of fragility involved, and more management overhead than most people would expect.
However, we usually find that those details happen a lot more in e.g. the harness (= our application), or some prompt tuning that's required for each of the models, rather than model quality itself. We have seen customers using self-hosted LLMs with similar user satisfaction across their organization to other customers that heavily lean on latest GPT-5 models on Azure. Especially given that you have to do some level of tuning and setup anyways, you might as well invest it in "local"/self-hosted AI (if you can make the financials of the inference cost work out for you).
I think it should also be noted that the inference providers on hyperscalers also tend to be quite fragile, each in their own way (e.g. Google with a horrible rate limit system or Azure with almost weekly intermittent 500-error incidents).
Fair points. I used to think that until some months ago but the latest generation of OSS models are surprisingly good. Plus maybe it is the way I work, but I find myself constantly overriding the decisions of frontier LLMs (because they start degenerating towards god objects and spaghettification) so most use I have gotten out of the AI agents is really their ability to code quickly and syntactically correctly.
Also worth noting that it doesn't have to be full either-or, there can be a two tier enterprise deployment that routes to locally hosted vs frontier model, over time more and more usecases could get routed to local LLM
aftbit 4 hours ago [-]
I wish Deepseek could read images. I've been having good luck guiding it around on personal projects, but anything that needs to render to a screen really needs to be looked at to see bugs.
dyauspitr 4 hours ago [-]
Only if you don’t allow construction of local data centers
rgbrenner 4 hours ago [-]
US has over 10x the number of data centers as China; and produces 2x more energy per capita than China.
chrisweekly 4 hours ago [-]
what about energy consumption per capita?
aftbit 4 hours ago [-]
What about it? Energy production basically has to equal energy consumption in the medium term, so if the grandparent comment is correct, it is 2x per capita.
Dunno how trustworthy this source is, but it says ~35 MWh/person in China and 77 MWh/person in USA.
I'm spending 1k on Claude enterprise easily and that's with trying to spread it on codex and cursor using pi.
What's your source for Opus being a 5T model?
> and tiny distillations from DeepSeek that perform well only in benchmarks.
I don't think you know what you're talking about. Local models aren't “distillations from Deepseek”.
And they don't perform well “only in benchmarks”, Qwen 3.6 is a very decent model (obviously it's not Opus, but it's also much faster and speed is a quality of its own).
Probably Elon Musk: https://eu.36kr.com/en/p/3760679047267075
Elon Musk tweeted that Grok is 0.5T or 1/10th the size of Opus. https://xcancel.com/elonmusk/status/2042123561666855235#m
While this source's reliability is certainly debatable, the size matches the results of this paper, in which researchers estimated the parameter count from model knowledge. https://01.me/research/ikp/
Massive understatement. Nowadays it has become hard to find a single Musk statement that doesn't contain at least one lie.
> the size matches the results of this paper, in which researchers estimated the parameter count from model knowledge. https://01.me/research/ikp/
Thanks for the pointer. This estimation has Grok 6 times bigger than Musk claims it is, so maybe that's where the lie is.
(I'm quite skeptical about that number though, it would be quite disappointing for the US tech if their flagship models had to be that much larger than the Chinese ones for such a small edge in performance. Because I don't think US labs are incompetent, I'd bet that US flagships aren't more than 2-3 times bigger than Chinese flagships. Otherwise it really doesn't bode well.)
This is a temporary phenomenon. Expect either drastic price increases or draconian throttling or both in the coming months.
These companies are operating at huge losses and have hundreds of billions in liabilities and commitments. They need to turn on the money faucet sooner rather than later.
If prices keep going up, watch for companies to exit frontier models and go to local llama.cpp instances for 6-month-ago SOTA, with the flex of being housed within the office - no more privacy leakage, no more price gouging.
To be honest, I’m not sure why a Y-Combinator backed company hasn’t come out yet flooding the market with highly capable OPAI (pronounced “Oh-pah” as in what Greeks shout as they drink shots), which stands for “On-Prem AI”
… yes, I just made up OPAI right now lol
edit: I see in other comments on this thread you think Ed Zitron is a reliable pundit so that explains everything.
And you think it is unreasonable to consider this unsustainable?
For context, ChatGPT business subscriptions give you a fixed pool of credits to use, after which you either get billed a la carte at inflated 1.75x rates vs API, or, if you don't want to pay, you lose access to everything but the non-reasoning models for the rest of the month.
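A rough sketch of how that overage math plays out. The 1.75x multiplier is the figure above; treating both the included pool and the usage as API-equivalent dollars is a simplifying assumption on my part, and the function name and example amounts are hypothetical.

```python
def overage_bill(usage_api_dollars, included_pool_dollars, multiplier=1.75):
    """Extra charge once usage exceeds the included credit pool.

    Assumes usage and pool are both expressed in API-equivalent dollars,
    and that everything past the pool is billed at `multiplier` times the
    API rate.
    """
    overage = max(0.0, usage_api_dollars - included_pool_dollars)
    return overage * multiplier


if __name__ == "__main__":
    pool = 100.0  # hypothetical included credits per seat
    for usage in (80.0, 100.0, 300.0):
        print(f"usage ${usage:.0f} -> extra ${overage_bill(usage, pool):.2f}")
```

Under those assumptions a heavy user pays more for the overflow than they would have on plain API billing, which is why the "fixed pool" only looks cheap for light users.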
We also tried Claude Enterprise, which was unusable as people blew through their monthly limits in a matter of hours.
Looking at the pricing of 1-2T models like Kimi or DeepSeek on the open market, I'm tempted to assume that inference costs are closer to subscription pricing than to API pricing.
Especially considering that subscriptions a) distribute load over time via rate limits, and b) will include a lot of users who get only a fraction of the possible value, whether they are on a personal account where they are on the rate limit on the weekend but barely use it during the week, or are corporate users who were issued an account they rarely use. Subscription prices are usually measured on the average case, not the most extreme value a power user can get out of it
So just going on vibes?
While some people don't like his content, Ed Zitron shows a lot of evidence for your assumption being very wrong.
These companies are bleeding cash at ungodly rates. It's likely their API pricing is still subsidized if you look at their overall financial picture.
Related, there's a good reason those API prices keep going up a lot every new version and it's not just because the models are better.
Also, API prices going up a lot every new version is more an OpenAI thing, and even there it's a recent trend: GPT 5.0 was a big price drop compared to 4.1, and 4.1 was cheaper than 4o, which itself got a price cut at some point and is cheaper than 4. Meanwhile Anthropic's API pricing stayed stable for many versions, then got slashed to a third with the 4.2 release and have stayed at that level since.
Of course they do have to "make bank" in some way to offset the insane training costs. But whether they go for high prices or high volume, or offer some services as a loss leader to drive profits elsewhere is somewhat orthogonal to that
Even very cheap mini-PCs and laptops can run any of the models run by cloud providers, albeit at a much lower speed (i.e. with the weights stored on SSDs).
Whether such a low speed is useful, depends on the application. For something like a coding assistant or bug scanning, an instant response is desirable, but certainly not necessary.
Giving strong “640k is enough for anyone” vibes here.
Some might say the price wouldn't be great if you could actually process and validate it...
My hunch is that this is the source of much of the variability in outcomes upstream of HN commenters claiming extremes of, "This model changes everything!" to "This [same] model is crap."
We haven't operationalized what it means to "be good at prompting," nor developed proxies/heuristics/shibboleths for assessing prompting skill. There's community skepticism over whether prompting skill even exists. Besides, even if prompting skill is real, who wants to hear, "Actually you kinda suck at prompting"?
LLMs are likely to replace outsourced devs because your employees that know the context can use LLMs to do what offshore devs did before.
What evidence is there of the quality of Indian devs specifically?
One signal I'd expect to see, for example, would be success in programming competitions. Here's the list of winners of the IOI competition [1] - India has won 3 times.
Meanwhile, Turkey has won 4 times, Estonia has won 5 times, and Vietnam has won 22 times!
Why should we suspect that there are more or better developers in India than in any of the countries that have produced more winners??
[1] https://stats.ioinformatics.org/countries/?sort=medals_desc
Offshore Indian devs make sense when you can have a large Indian division so you can amortize communication infrastructure/process management over a lot of heads, and you're building for international customers so you're not paying an English -> X tax inherently.
And outsourcing certainly became a thing though not in the way everyone predicted. There are far more software engineers in the US today than there were in 2004.
Just recently I asked a dev there for a POC of a feature with decent specificity and ended up with about 8k LOC of spaghetti. I re-wrote it later in a few hundred. This is about in-line with my career experience.
I've had a few standout devs there but it does feel like a lot are putting in the bare minimum or are just working really far outside of their abilities.
Request: “manual step X should not be part of the automated build script”
Fulfilled as: build script is now split in two. X is still done as a manual step in between. Rather than prompting and waiting for it to be done, the documentation and scripts no longer mention X.
Part poorly written requirements, part implementing under pressure, and part lack of engineering discipline.
The main issue is catching stuff like this early enough to course-correct. Differences in time zone, language and cultural norms can make that a challenge, all of which LLMs have the advantage in.
I’ve had it assume I meant the folder multiple times :/
The usual counter-argument is the operational burden, but human capital is also a relatively fixed cost. A dedicated team of 3-5 FTEs could probably handle inference ops for a F500 company.
Meanwhile, the capability delta is shrinking fast. We have more evidence that local open-source is viable with the release of DeepSeek v4, and the industry is only trending further in this direction. Especially as we rely more on test-time compute and task-specific harnesses rather than model size.
So, if you're an executive looking at a marginal but fixed operations cost, added flexibility, and a rapidly closing gap in capability, why wouldn't you just run open-source models on your own infrastructure to get those highly predictable costs? Plus, you decrease the risk of one of the frontier
There’s so much uncertainty, it seems like the safe option is to give everyone a Claude or OpenAI subscription/api key until the frontier isn’t changing every six months.
If inference cost comes down (as it has been for the last few years) you’ll be able to run today’s SOTA on your laptop by the end of the year.
0 - https://www.williamangel.net/blog/2026/05/17/offline-llm-ene...
I spent a month comparing Gemini Ultra plan to using much lower cost DeepSeek v4 with open source coding harnesses and, spoiler alert: I was happier using the much cheaper and more environmentally friendly open models: https://marklwatson.substack.com/p/my-evaluation-of-ai-agent...
It's still far too costly and not effective to use Local AI that can match what the frontier models can offer, especially when the inference hardware is being heavily restricted due to geopolitical risks. I find claims about local LLMs somehow giving these frontier companies a run for their money especially doubtful in the long run.
Tokens are getting expensive because they are beginning to corner the market and will use that advantage to limit hardware distribution within and beyond the borders.
It's more likely that some workflows will see more local LLMs, but those will never be the ones that require frontier-level capability, nor will they beat the price that a lighter, smaller version of a frontier model will offer to capture that tail end.
My impression is that DeepSeek designed v4 specifically for cheap inference and they are not losing money even at a 75% lower price.
This is absolutely false, because other providers serving the Deepseek models on OpenRouter are also able to offer very low prices, and they don't have the money to subsidize anything.
There are misaligned incentives here between users just trying to get stuff done and AI companies competing on having the "smartest" model that passes benchmarks and continuously does some Nobel Peace Prize winning stuff. It's mostly overkill for the more mundane stuff normal people actually do with them. It's nice to have the option when you need that. But defaulting to that is not economical and a bit unnecessary.
There's also a difference between smart models and bigger context windows. Most of the progress in the last year was simply the context windows getting big enough to fit all/most of the stuff needed to solve issues. Before then, you had to carefully manage the context to not run out of space and they wouldn't fit much more than small hobby projects.
With sub agents, the parent agent doesn't need to be a frontier model. It can delegate to smarter agents. And most stuff it delegates shouldn't need a frontier model. Wouldn't it be nice if it could decide on a case by case basis.
The walled gardens offered by OpenAI, Anthropic, and others currently default to one size fits all "frontier" models. This is not sustainable. They should evolve to using smaller and effective models most of the time, escalating as needed based on either estimated complexity or the small models failing. I'm guessing some open source based alternatives to these walled gardens are probably already heading that direction.
The irony here is that with a walled garden, these companies are selling a premium experience. But in the current market that boils down to burning billions of investor cash to keep the GPUs going without much hope on profitability. Eventually surviving companies are going to have to compete on quality, cost and margins. The smart approach would be to dynamically adapt token and context window sizes instead of blindly defaulting everything to the best possible. Don't boil the oceans for a simple email summary or a simple web UI. That stuff already worked well enough with models even a few years ago.
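The escalation idea above (try a small model first, fall back to a bigger one only when it fails) can be sketched in a few lines. This is an illustration under stated assumptions, not how any vendor actually routes requests: `small_model` and `frontier_model` are hypothetical stand-ins, and "failure" is modelled as the callable returning `None`.

```python
from typing import Callable, Optional, Sequence, Tuple

# A model is any callable that returns an answer, or None when it gives up.
Model = Callable[[str], Optional[str]]


def answer_with_escalation(prompt: str,
                           tiers: Sequence[Tuple[str, Model]]) -> Tuple[str, str]:
    """Try each tier from cheapest to most expensive; return (tier_name, answer)."""
    for name, model in tiers:
        result = model(prompt)
        if result is not None:
            return name, result
    raise RuntimeError("no tier produced an answer")


# Hypothetical stand-ins for real model calls.
def small_model(prompt: str) -> Optional[str]:
    # Pretend the small model only copes with short prompts.
    return f"small: {prompt}" if len(prompt) < 40 else None


def frontier_model(prompt: str) -> Optional[str]:
    return f"frontier: {prompt}"


tiers = [("small", small_model), ("frontier", frontier_model)]
print(answer_with_escalation("summarize this email", tiers))
```

In a real harness the failure signal would be something like a failed test run or a low-confidence self-check rather than prompt length; the routing loop itself stays the same.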
- 5.5 is significantly more token efficient than 5.4 - the same task takes often a third of the tokens
- because of this, it is also much faster to do the task
- you get high "intelligence" per token even after accounting for token efficiency - 5.5 medium is just under 5.4 pro levels of intelligence (imo). It has found tricky bugs for me that all other models failed at
So overall, ideally you will end up with a more intelligent, faster model for slightly cheaper.
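The arithmetic behind "slightly cheaper" is that cost per task is tokens used times price per token, so a model that needs a third of the tokens stays cheaper per task until its per-token price is more than three times higher. A small sketch; the token counts and prices here are made-up illustrative numbers, not real list prices for either model.

```python
def cost_per_task(tokens: int, price_per_million_usd: float) -> float:
    """Cost of one task given tokens consumed and a per-million-token price."""
    return tokens * price_per_million_usd / 1_000_000


# Hypothetical figures: the newer model uses a third of the tokens
# but charges 2.5x as much per token.
old_cost = cost_per_task(300_000, 10.0)
new_cost = cost_per_task(100_000, 25.0)

# Price multiplier at which the two models cost the same per task.
break_even_multiplier = 300_000 / 100_000

print(old_cost, new_cost, break_even_multiplier)
```

With these numbers the newer model comes out at $2.50 per task against $3.00, despite the higher per-token price.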
Back when it became expensive I learned to live with it and I find my "AI skills" (mainly communication) have a substantial impact on the efficiency of the model. Not saying my work is difficult, it's not, but I find there is quite a bit of wiggle room. Smaller models can still perform useful work, but you have to do the heavy lifting yourself. It saves a ton of money.
I used to burn through 75% of my tokens in an hour or two. Now I can work all day and hit maybe 50-60% if I use it heavily.
I'm curious, who/what is operating the frontier LLM in this scenario?
The rest of the article is equally incoherent.
The big question is how subsidies vs technology improvement will play out. As we saw with Uber, selling at a loss can happen for a very long time, and technology improves relentlessly.
For reference, we publish https://botsbench.com/ that shows time and cost per answer are going down while quality is going up.
Local LLMs are great and very useful but if you are claiming that their code quality is in the same ballpark as Claude Code or Codex with their best models I cannot consider you a serious person. I feel like this is analogous to the folks arguing that The Cloud is "someone else's computer." As if billions of dollars of spend gives these companies zero benefit over a Mac mini.
Regarding offshore, at least in my experience, better coding agent output is down to two factors. First is subject matter expertise. Providing the right context to the coding agent based on the tech you are building for is beyond critical. That's the issue with the Vibe Coded slop projects. No expertise in a technology means no awareness of gotchas; React is the most obvious example, because the LLM default is to useEffect endlessly.
The bigger issue is that by their very nature LLMs are very sensitive to quality prompting in English. I have seen offshore devs fail endlessly because they don't have the English skills to successfully prompt the machine. That has caused more work for my US based devs, who either have to carefully tune the work ticket so it is basically a coding agent prompt, or go through multi day exercises to enforce better prompting.
A single US dev with Claude Code is orders of magnitude better than typical offshore. Adding local models into the mix would make offshore completely useless. I'm sure many companies will see ballooning AI bills and expensive onshore devs and be very tempted to go to TCS or similar. I hope so, because that will give startups plenty of easy targets to disrupt.
AI will become a commodity technology the same way virtual machines are a commodity.
The contradiction here is that without frontier models, there'd be no foundation for models like DeepSeek to reference and catch up to. Is there an economic model that captures this kind of dynamic?
A bigger issue is this thing calls AIs better coders than people, and I have tried for the past 4 months to get one of the several I looked into to consistently get a simple event-bus backed Java monorepo going, with exactly zero success. Claude even repeatedly wanted to put my login logic at the actual event bus, for some reason.
What does "better coder" _exactly_ mean at this point?
This is the (m/b)illion dollar question, isn't it? I think there's also a question of what do you think capability is exactly, and how the difference manifests itself.
On the one hand, when something becomes "good enough" that's a clear capability threshold. On the other hand, what's the limit of those capabilities, and equally as important, how does capability reflect on reliability?
We've seen "local models" lately improve on capabilities where they're "good enough" for some tasks. Reliability of solving those tasks is a bit harder to measure/benchmark/test. It'll get better as more people work with those models. But, something I've noticed in the past ~6months is that the frontier models are gaining a lot in both the breadth of capabilities, as well as the reliability of solving those tasks that they're capable of solving. I think this is where scaling (both compute and data) is showing, and where having more compute is simply better (more parallel exploration, more training data output, more broad data, etc).
There's also the problem of benchmarking true capabilities. The popular ones are getting old, and aren't as reliable as they used to be (not even touching on the subject of benchmaxxing, just thinking about their saturation, even with honest intentions).
So the question then becomes what will users prefer? Do you get the best of the best, or the one that's good enough? There might be a market for both, honestly. Not everyone does SotA stuff. And a lot of what people used to do in a company is probably mundane enough that a "good enough" model with "good enough" reliability can probably handle (w/ some supervision ofc).
What I'm more interested in is if things like Thaalas succeed and they get to provide local hardware that runs models "burned in silicon". That would be interesting, because speed and all the advantages of local models are a "quality" on their own. For example, right now I'd pay ~1k$ for an external hdd-sized block that can run a ~32B model that's popular right now, even knowing that it can only run that model. I have no idea if that's feasible or not, if it makes sense from a financial pov. But I'd buy one. And local inference on dedicated chips doesn't need to be "oss only". I'm sure oAI / etc would probably take the risk of licensing one of their -mini / -lite models provided that the risk of the weights leaking is small enough (and it probably is).
> This keeps a ceiling on how much or how fast the frontier labs can raise prices.
I generally agree, but from a different perspective. Up till now we've seen that the 3 labs influence each other's price points. When gpt5 came out at a radically smaller price, the others lowered them as well. Now with opus being SotA for coding, w/ 5.5 close behind, they've raised them back. Google seems to follow slowly. But there's hope that, being 3 top labs + 2 trailing (xAI & Meta), there'll be pressure once again. If any of those trailing labs manage to get to SotA again, the prices will drop once more. Some people say that open source also provides a pressure here, but I'm not yet convinced of this. There's still a question of who'll serve the models, at what scales, etc.
"Frontier models" are caught in a financial dilemma of their own making --- they have spent such huge sums on development and as a result, they may have inadvertently priced themselves out of the market.
Energy costs are a huge factor for AI. He who has the lowest energy costs will likely be able to dictate market prices. And fossil fuels dependence doesn't look to be advantageous for AI.
The frontier models are going to win that way. They won't feed your code back into the system but they will track which code you keep and what code gets a "try again claude".
They're not going to lose on price. No consumer software ever has because ultimately it's not that expensive relative to salary and the marginal cost is 0.
This is true for traditional SaaS too, but the number of concurrent users that could be served by one machine and the cost of the hardware were both at least an order of magnitude better.
In other words, AI is not your daddy's software. Comparing AI with old school software markets simply does not compute.
Lists examples of software that are free to the users
Last week we were all talking about how Anthropic has too much demand, how they had to rent a data center from a competitor, and how the limits they’ve put on their service to deal with the demand are making users angry.
DeepSeek is cheap because they’re working hard to attract users.
The open weights models released for free weren’t free to train. It’s a loss leader to get attention to try to sell you something in the future.
The prices we pay for tokens right now are set by supply and demand, with some being sold at high premiums and others at a loss. Some models are given away for free after the companies spent money on researchers and compute.
https://openrouter.ai/deepseek/deepseek-v4-pro/providers
Deepseek v4 Pro is much cheaper when provided by Deepseek itself, likely as a combination of the loss leader strategy you mention and the desire to have more data flow through their pipeline for training. However, the same open weights model, provided by other providers, is somewhere in the $2-3/1M output-tokens range. Compare Opus 4.7 at $25/1M output-tokens.
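To make the gap in those quoted prices concrete, here is a quick comparison at a fixed monthly volume. The per-million prices are the ones quoted above; the 50M output tokens per month is a hypothetical workload, and input tokens and caching discounts are ignored.

```python
def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of generating `output_tokens` at a per-million-token price."""
    return output_tokens * price_per_million_usd / 1_000_000


MONTHLY_OUTPUT_TOKENS = 50_000_000  # hypothetical workload

# Prices per 1M output tokens, as quoted in the comment above.
prices = {
    "third-party DeepSeek v4 Pro (low end)": 2.0,
    "third-party DeepSeek v4 Pro (high end)": 3.0,
    "Opus 4.7": 25.0,
}

for label, price in prices.items():
    print(f"{label}: ${output_cost_usd(MONTHLY_OUTPUT_TOKENS, price):,.2f}")

# How many times more expensive Opus is at each end of the range.
ratio_low, ratio_high = 25.0 / 3.0, 25.0 / 2.0
print(f"Opus costs {ratio_low:.1f}x to {ratio_high:.1f}x more per output token")
```

At that volume the bill is $100 to $150 a month against $1,250, roughly an 8x to 12x difference on output tokens alone.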
Unless you mean that releasing open weights models is the loss leader, in which case, you might be right but I hope you're wrong. We've seen some of this from Qwen at least - their latest model is closed only. I hope there's always someone willing to make this bet and release better and better open models.
This is specifically what I meant.
DeepSeek’s official service is trying to recoup some of the training and engineering costs too.
The other providers only have to recoup their hardware costs and the cost of a team to run it.
Even though DeepSeek’s official service is more expensive per token, they’re running at a lower profit than the OpenRouter providers because they had to pay for the R&D.
This is a deliberate choice. We already see it with Qwen splitting their releases between open weight and hosted only models. The open weights are a loss leader to get attention. Without them you’d almost never hear about their hosted models.
What would this bet be? Training is expensive and open weights mean that for hosting you compete on price with people that don't have this item on their bill.
So far, it's really only the Chinese labs (and FAIR or whatever Meta's project is called now) that are doing this. Oh yeah, and Google's Gemma.
At the moment, this is all massively distorted by the prestige and investment money flowing into the space. None of the labs have to charge the real cost of inference let alone the marginal cost of training because they are instead lighting investment money on fire to cover that.
One imagines (though I have not investigated in detail) that there's a degree of national prestige work going on too. The Chinese labs are trying to show that they can build better and more efficient models and are releasing open to undercut the US labs.
This is a good insight. I think everyone has seen that chart of China's electricity generation going parabolic vs the US. That, combined with cheaper yet equally good talent, means at least in that segment, the closed labs won't catch up anytime soon.
Even if we all switch to Chinese models, the west isn't going to be running the model on Chinese servers... and the majority of costs are from inference.
> cheaper yet equally good talent
China has tech talent, but this isn't a 3rd world developing nation. Chinese AI researchers are getting paid $10M+ USD/year salaries.
Also they're equally good, but somehow consistently behind?
Which closed labs won’t catch up to whom?
Not to say that frontier labs won't make progress, but the bar for a sufficiently capable agent is all the OSS models need to meet to make this happen. I imagine a lot of hybrid setups where something like Opus is used only for planning/architecture, and anecdotally, the real token consuming part is implementation not architecture.
Nuclear power anyone?
Currently the projects I am involved in require devs to use approaches like Ollama, Foundry Local and co if they happen to have good enough hardware, picking the best alternatives out of https://www.canirun.ai.
I feel it'll wind up like the dotcom/fiber bubble. Way too much money poured into it, lots of expensive bankruptcies or write-offs, and a readjusted market sea level.
Actually, platforms that serve many customers can bring down the costs tremendously through caching, and don’t need the AI credits as much: https://safebots.ai/costs.html
Training these neural networks every few months isn’t energy-heavy?
Both Bitcoin and these large models weren’t “designed to be energy-heavy”. It was a consequence of first-gen design decisions to solve a specific problem. Then as time went on, costs went down and they became a huge outlier in terms of energy. The question is whether the bagholders (the AI companies that invested untold amounts into the initial training) will fight to keep people using their tech and fearmonger about everything else.
Neural nets on the other hand generally show more capability as you add more compute power. There's a point where it's less valuable than the cost increase, so people don't do more than that, but it isn't constant value like Bitcoin.
Same with AI. Now that the Mythos and other models are finding exploits in every code base and anyone can run them, you can’t afford anymore not to keep burning credits securing your code base. It’s like proof of work red queen theory. You have to run faster and faster just to stay in place. Great business model.
1. I remain unconvinced LocalAI can work well for the majority of businesses. It looks vaguely comparable on benchmarks, but it tends to be fragile and carry a lot of management overhead in reality.
2. Similarly, while Deepseek is comparable to Opus/Codex on benchmarks, for agentic work at scale I definitely notice the difference. That's not to say it's not economical, just that I definitely miss the big boys when I swap.
I kind of wish this was true, because the UK would be in a great place to compete with the US. But somehow people are happy to pay 3x the salary for an engineer in SF.
I'm working on a self-hostable LLM (web) UI[0] that aims to provide a comparably good UX to e.g. ChatGPT, and you are right that there is a decent amount of fragility involved, and more management overhead than most people would expect.
However, we usually find that those details happen a lot more in e.g. the harness (= our application), or some prompt tuning that's required for each of the models, rather than model quality itself. We have seen customers using self-hosted LLMs with similar user satisfaction across their organization to other customers that heavily lean on latest GPT-5 models on Azure. Especially given that you have to do some level of tuning and setup anyways, you might as well invest it in "local"/self-hosted AI (if you can make the financials of the inference cost work out for you).
I think it should also be noted that the inference providers on hyperscalers also tend to be quite fragile, each in their own way (e.g. Google with a horrible rate limit system or Azure with almost weekly intermittent 500-error incidents).
[0]: https://github.com/EratoLab/erato
Also worth noting that it doesn't have to be a full either-or. There can be a two tier enterprise deployment that routes to a locally hosted or frontier model, and over time more and more use cases could get routed to the local LLM.
Dunno how trustworthy this source is, but it says ~35 MWh/person in China and 77 MWh/person in USA.
https://ourworldindata.org/grapher/per-capita-energy-use