They will be, and that moment is not that far off. The progression is already in place: first, only large data centers could run performant LLMs; we are now firmly in "a bunch of servers with a couple of H100s each" territory, slowly moving into "128 GB of VRAM on a MacBook Pro or a Strix Halo". Within the next year, the pattern of "expensive remote LLM for planning, local slow-but-faster-than-human LLM for execution" will become the norm for companies, gradually shifting to "using a local LLM for everything is good enough". And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed. The question will be how much of the current compute-capacity craze local hosting gives the kiss of death to, and what that means for the market.
reisse 12 hours ago [-]
> They will be, and that moment is not that far off.
It's here, right now. I'm running quantized Qwen and Gemma on a decent but three-year-old gaming rig (think RTX 3080 12GB and 32 GB RAM). Yes, it's slow, and it has a small context window. But it can (given a proper harness) run through my trip photos and categorize them. It can OCR receipts and summarize spending. It can answer simple questions, analyze code and even write code when little context is required. I could probably get a half-decent autocomplete out of it if I bothered with VS Code integration. "128 GB VRAM on a MacBook Pro or a Strix Halo" is already a minimum viable setup for agentic coding, I think.
> And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed.
Currently, it works exactly the other way. The cloud versions are orders of magnitude cheaper than self-hosting, because sharing can utilize servers much more efficiently. A company can spend half a million bucks on a rig running GLM 5.1 and get data security, flexibility and lack of censorship, but oh, it's so expensive compared to Anthropic per-seat plans.
pbgcp2026 2 hours ago [-]
I'm sorry to spoil it for you, but a Perl script was able to do all of that like... 10 years ago? Out of the box, Shotwell manages photos quite well without any intelligence. The problem, as people mentioned above, is SOTA models' cognitive and tooling abilities. Also, have you noticed how top-end Mac Studios got downgraded recently? They don't want you to have access to frontier models. And you will not have it. See Mythos as Exhibit A.
jclardy 10 minutes ago [-]
The Mac Studio's disappearance is related to the fact that people now want them for the purpose of running local models. Supply and demand. Add to that the fact that Apple doesn't shift prices for released products, and it essentially became underpriced when large RAM quantities exploded in price. For the price of 512GB of RAM alone you could get an M3 Ultra with 512GB of unified memory in a nice, quiet, and power-efficient package. With the bare RAM you'd still need to spend a few thousand more on CPU/GPU, power supply, storage and a case.
There's also the fact that an M5 version is coming, and they likely know they are going to sell out on day one (I expect we'll see a price correction from Apple for higher-end configs of M5 Studios; the base price will probably stay the same), so they need to build up stock reserves.
tjoff 11 minutes ago [-]
Do we even have decent OCR nowadays? Any free solutions?
JonGretarB 29 minutes ago [-]
Huh? Why would Apple not want you to be able to run local models? They have very deliberately stayed the hell away from this space.
Hamuko 1 hours ago [-]
>Also, have you noticed how top-end Mac Studios got downgraded recently? They don't want you to have access to frontier models. And you will not have it.
Isn't that a function of RAM supply not being available now?
ubercore 1 hours ago [-]
The conspiracy angle here is not really relevant. RAM is expensive and they're gearing up for M5 Studios. It's not the Illuminati keeping better LLM models out of your hands.
digitaltrees 10 hours ago [-]
I built my own IDE and run my own model specifically to have private agentic coding. I can still access model APIs but I can be purely local if I want to. It's amazing.
manmal 3 hours ago [-]
Curious, why did Zed with ACP not work for you?
Fokamul 1 hours ago [-]
I'm just guessing, but an IDE that uses 3D acceleration just so a stupid UI can run "smoothly" is ridiculous.
Who runs an IDE with LLM agents accessing their local filesystem on bare metal?
Or am I the only one who runs everything LLM-related in a VM just for development work?
Then, because of Zed's genius decision, you need to share your GPU with the VM, and some important features, like snapshots, will not work. So you also need a workaround for that, etc.
Too much hassle, Zed is not for me.
But I'm anti-Apple, so maybe that's the reason :)
Btw, even "ImHex" devs realized this and they're providing version without acceleration for VM use.
They're using ImGui. Using it for local desktop app UI is also ridiculous, imho. Whatever.
DrewADesign 8 hours ago [-]
Multiple gazillion-dollar companies each seem to be spending to ensure that they alone pretty much dominate all knowledge work, with customers eating up their tokens like Cookie Monster. I wonder if any of them could survive as LLM providers if they not only failed to do that, but the entire industry ended up selling what the current Cookie Monster would call a "sometimes snack," for very special occasions?
datadrivenangel 11 hours ago [-]
In my experience, once you get to ~30 gigs of RAM for a model like Gemma4, the rest of the 128GB of memory is simply nice to have. The speed and costs are what make it tough, though, because it's slower and more expensive than the same model served on a big accelerator card, and it's going to be worse than a frontier model.
digitaltrees 10 hours ago [-]
I wonder if it really needs to be worse. I am playing with the idea of fine tuning a model on my exact stack and coding patterns. I suspect I could get better performance by training “taste” into a model rather than breadth.
andy_ppp 6 hours ago [-]
Fine tuning these models (at least with PPO or equivalent) requires even more VRAM than inference does, potentially 2-3 times more.
epicureanideal 6 hours ago [-]
I also wonder about JS only, Python only, etc models.
Maybe the future is a selection of local, specific stack trained models?
andy_ppp 6 hours ago [-]
These models' ability to generalise at coding will likely get worse if you remove high-quality training data like all of Python.
dust1n 2 hours ago [-]
Can you share how you use it to categorize trip photos!
fennecfoxy 2 hours ago [-]
>It's here, right now.
I mean I've been forcing my good old 1080ti to run local models since a short while after llama was first leaked.
But I wouldn't say "local models are here" in the same way as "year of the Linux desktop!111"
Until someone can just go out and buy some sort of "AI pod" that they can take home, plug in and hit one button on a mobile app to select a model (or even just hide models behind various personas) then I wouldn't say it's quite there yet.
It's important that the average consumer can do it, I think the limitations for that are: things are changing too quickly, ram+compute components are exceedingly expensive now, we're still waiting on better controls/harnesses for this stuff to stop consumers not just from shooting themselves in the foot, but blowing their foot clean off.
Would be interesting to see a Taalas-like chip in a product, although there are so many changes going on atm with diffusion-based models and Google's Turboquant (which, as someone who has almost always had to run quantized models, makes a lot of sense to me).
winocm 10 hours ago [-]
Perhaps I am the odd one out here, but a small part of me wants to see what happens when you run a proprietary SOTA model on a laptop.
pianopatrick 6 hours ago [-]
Currently I'm testing something like this just to see what happens. I have an old laptop with 4GB of RAM. I attached a USB drive with Gemma 4 31B model (which is 32.6 GB). Currently the laptop is running llama.cpp and trying to respond to a prompt by streaming the model from disk.
The USB drive light is flickering, showing something is happening. It's been about 8 hours since I entered the prompt and I've gotten about 10 tokens back so far. I'm going to leave it running overnight and see what happens.
stuaxo 2 hours ago [-]
Nice.
What did you use to do this, something standard like llama.cpp, something else like vLLM, or your own contraption?
amelius 3 hours ago [-]
You burn your lap?
reisse 10 hours ago [-]
Nothing special?
I mean, the inference engine might need some tweaks to support whatever compute is available. But then, if you add a few terabytes of disk for swap and replace the RAM with bigger sticks if possible, it should work? Slowly, of course, but there is no reason it shouldn't.
reverius42 10 hours ago [-]
The big difference will be measuring seconds per token instead of tokens per second.
martijnvds 6 hours ago [-]
Seconds per token is just fractional tokens per second ;)
degamad 3 hours ago [-]
> fractional
Reciprocal?
yfw 10 hours ago [-]
You can if you have enough ram slots?
SilentM68 6 hours ago [-]
Not sure if this is exactly the scenario you envision, but I run ComfyUI on an Acer Helio 300 laptop from four years ago. It has 16GB RAM and an NVIDIA GeForce RTX 2060 with 6144MiB of VRAM, and I have generated a few images using the "NetaYumev35_pretrained_all_in_one.safetensors" checkpoint @ 10.6GB (well beyond the 6GB capacity of the RTX 2060 card). That being said, it takes more than 10 minutes to complete the task. Of course, I have to turn off all other apps and browser tabs, or hibernate them. If I don't, the laptop's fans begin to spin up like an airplane propeller. It's worth mentioning that I've tried to do this with other IDEs and all seem to fail with some error or another, usually an out-of-VRAM issue. I've only gotten it to work with ComfyUI.
I use an Anaconda environment on Linux (though I would have preferred a "uv" environment) and automate the startup sequence using the following script (start_comfy.sh) from the terminal, rather than manually starting the environment from that same terminal:
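Roughly, the launcher just has to activate the environment and start ComfyUI. A minimal Python equivalent of that idea (the env name "comfy", the ~/ComfyUI path and the --lowvram flag are assumptions, not my actual settings):

    # start_comfy.py - hypothetical launcher sketch; assumes a conda env
    # named "comfy" and ComfyUI checked out at ~/ComfyUI.
    import subprocess
    from pathlib import Path

    COMFY_DIR = Path.home() / "ComfyUI"   # assumed install location

    # "conda run" executes a command inside the named environment without
    # having to source the activation scripts by hand.
    subprocess.run(
        ["conda", "run", "-n", "comfy", "python", "main.py",
         "--lowvram"],                    # keep VRAM use low on a 6GB card
        cwd=COMFY_DIR,
        check=True,
    )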
I'm hesitant to increase the sizes of the renders as that will surely stress my laptop's components.
t_mahmood 4 hours ago [-]
I'm not running local for exactly the same reason: to not stress my components. It seems we are in for a long haul due to this AI bubble (can't wait for it to pop), so I need to make sure I survive this madness, as I certainly can't afford to replace anything right now.
antidamage 10 hours ago [-]
This is my exact setup as well, and dear lord, Gemma is absolutely batshit insane. I'm trying to get a self-reflection and confidence loop going now, but it does feel like it's not the local resources, it's the limits of the training. Dedicated coding or dedicated real-world task models would be a good optimisation.
yieldcrv 8 hours ago [-]
I need to see these proper harnesses
I tried oMLX and OpenCode a few weeks ago and the 65k context window was useless; it tried to analyze a very small codebase before going full-on agentic and ran out of context window immediately.
I don't have time to tweak 1,000 permutations of settings just to re-prove that it's not as smart as Opus 4.6.
I need out-of-the-box multimodal behavior as simple as typing claude in the command line, and it's so not there yet.
but I'm open to seeing what people's workflows are
phamilton 7 hours ago [-]
I'm running opencode with qwen3.6-35b-a3b at a 3-bit quant. I also have qwen3.5-0.8b used for context compaction. I run with 128k context.
It's usable. I set it loose on the postgres codebase, told it to find or build a performance benchmark for the bloom filter index and then identify a performance improvement. It took a long time (overnight), but eventually presented an alternate hashing algorithm with experimental data on false positive rate, insertion speed and lookup speed. There wasn't a clear winner, but it was a reasonable find with rigorous data.
Balinares 4 hours ago [-]
Do you encounter looping issues at such low quants? How do you deal with those?
I gave it the reference C implementation, the LTFS spec from SNIA, and asked it to use the C implementation to verify the correctness of the Go code.
LTFS is a pretty straightforward spec, so it made a very reasonable port within about 2 days. It's now working on implementing the iSCSI initiator (client) to speak with my tape drive directly, without involving the kernel.
Edit: the model is Qwen3.6-35B
nullsanity 8 hours ago [-]
Hey man, you can just say "I'm lazy, so I'm staying with the cloud. if I wanted to use my brain, I wouldn't be using AI, gosh" - it's much shorter.
fennecfoxy 2 hours ago [-]
Personal attacks are against the rules, by the way.
root_axis 9 hours ago [-]
You are greatly underestimating the hardware requirements for productive local LLMs. Research consistently shows that parameter count sets the practical ceiling for a model's reliability. Quantized models with double-digit-billion parameter counts will never be reliable enough to achieve results in the realm of something like Opus 4.6.
thot_experiment 5 hours ago [-]
Flat wrong. Q6 Gemma 31b feels a lot like Opus 4.5 to me when run in a harness so it can retrieve information and ground itself. The gap is not that big for a lot of use cases. Qwen MoE is fast as fuck locally for things that are oneshottable. I have subscriptions to all the major providers right now, and since Gemma 4 and Qwen 3.6 came out I haven't hit limits a single time. I'm actually super surprised by the number of things I try with Gemma 4 with the intent of seeing how it fails, and then having Claude do it, only to come away with something perfectly usable from the local model.
cbg0 5 hours ago [-]
Your n=1 might not be very relevant outside your personal use. In less contaminated benchmarks Gemma 4 is way below Sonnet 4.5, let alone Opus models: https://swe-rebench.com/
thot_experiment 3 hours ago [-]
Benchmarks only give you the roughest idea of how models compare in real-world use. They're essentially useless beyond maybe classifying models into a few buckets. The only way you gain an understanding of something as complex as how an LLM integrates with your workflow is by doing it and measuring across many trials. I've been running Opus 4.7 in Claude Code and Gemma 4 31b in parallel on projects for hours a day this past week. Opus 4.7 is definitely better, but for many things they are roughly equivalent; there are some things on the edge that are just up to chance, where either model may stumble across the solution, and there are some areas of my work that reliably trip up both models, where I get better mileage out of writing code the old-fashioned way. I understand that I'm just one data point, but I'm not writing CRUD apps here, I'm doing DSP and weird color math in shaders. I don't think any of it is hard, and the stuff that I think is hard none of the models are good at yet, but idk, they just don't seem that extremely disparate from one another.
FWIW I think Gemma 4 31b is more likely to be of use to me than Sonnet, idfk, maybe it's a skill issue but I love Opus 4.7, undisputed king, but Sonnet seems borderline useless and I basically think of it as on the same level as Qwen 35b MoE.
cbg0 2 hours ago [-]
"essentially useless" is a gross overstatement. Your personal benchmarks will always provide you with the most value, but disregarding standardized benchmarks because you care more about vibes is not exactly scientific.
thot_experiment 1 hours ago [-]
Sorry, "essentially useless in the context of local model availability". It's a fine model, but its tier of inference is fully fungible.
larodi 4 hours ago [-]
I'm building a pipeline and testing against Gemma4 and Gemini's 3.1 Flash. Both are very good on certain tasks, and even n-way clustering works almost perfectly almost always.
But they diverge greatly on other particular ones, whenever the ViT tower and a priori knowledge of the world are crucial. I wish Gemma were on par, but both Google and I know it's not.
onion2k 4 hours ago [-]
You do need to ask whether or not Sonnet or Opus are overkill for a lot of work though. If Gemma4 with some human effort can achieve the same result as Sonnet then it's arguably a lot more cost effective as you're paying for the person to operate each one regardless.
thot_experiment 3 hours ago [-]
I 100% agree with your philosophy but I wanna note that I genuinely find Gemma 4 31b to be better than Sonnet. To be clear, this makes NO sense to me, so I'm probably just high and making stuff up or just biased by a small sample size since I don't use Sonnet that often. I find that Gemma 4 makes the sort of "dumb AI" mistakes Sonnet makes less often, especially in agentic mode. I genuinely don't know how that can be true but Sonnet feels much more like "autocomplete" and Gemma 4 feels like "some facsimile of thought".
stuaxo 2 hours ago [-]
What harness are you using ?
I'm going to switch to local LLMs for most stuff soon.
thot_experiment 2 hours ago [-]
Overall, using screen time as the metric (derived from some imperfect logging and vibes), it's about 50% OpenCode, 15% Continue, 15% my homebrew bullshit, 13% Claude Code and 7% Cline. I've been deep on agentic stuff lately (1.3 wks, aka 3 months of AI time); there are only so many hours in the day to duplicate work and A/B test. In the past I've sworn by Qwen Coder + llama.vim, and I still enjoy that workflow for deep work far more than I like prompting agents, but there's a lot of dross I'm learning to delegate.
root_axis 4 hours ago [-]
Sorry but you're just seeing what you want to see. The idea that a 31b model is anywhere even in the ballpark of something like Opus 4.5 is just absurd on its face.
thot_experiment 3 hours ago [-]
False. The absolute capability is irrelevant; with the proper harness, 31b is more than adequate for a very large portion of the tasks I ask AI to do. The metric isn't how good the model is at Erdős problems, it's how reliably it can remove drudgery from my life. It just autonomously reverse-engineered a Bluetooth protocol with minimal intervention; its ability to react to data and ground itself is constantly impressive to me. I do a ton of testing with these models. Today I had Gemma answer a physics problem that Opus 4.7 gave up on. With a decent harness and context, the set of tasks where their capabilities are both good enough is very surprising. The tasks I have that stump Gemma often also stump Opus 4.7.
diordiderot 53 minutes ago [-]
Maybe reaching for an analogy would be helpful here.
Thot_experiment is saying that his 2016 Toyota Prius is a great and reliable car for his daily commute and running errands.
Whereas everyone else is screeching about its capability gap with a Lockheed Martin F-35 Lightning.
amelius 3 hours ago [-]
This is like saying that 640kB is enough for anybody.
thot_experiment 3 hours ago [-]
No, it isn't. I am saying that the set of tasks that can be completed by Opus 4.7 has a surprisingly large overlap with the set of tasks that can be completed by Gemma 31B. It is meaningfully equivalent in many cases.
(Of course, if I'm being honest, 640kB is fine; I'm sure tons of the world's commerce is handled by less, for example. The delta between a system with 640kB of RAM and a modern one is near nil for many people: the UX on a PoS terminal does not require more than that, for example, and the Hacker News UX could also be roughly the same.)
BoredomIsFun 3 hours ago [-]
It would be true, if model providers did not throttle their models. I do not have definitive proof they do but the rumors are abundant.
alfiedotwtf 5 hours ago [-]
I’m guessing Qwen3.6 for agentic coding and Gemma4 for non-coding stuff?
thot_experiment 4 hours ago [-]
No, exactly the opposite actually. Qwen3.6 is too imprecise for long-running agentic tasks. It doesn't have the same ability to check itself as Gemma does in my testing. I keep Qwen MoE in VRAM by default because there are tons of tasks I trust it to oneshot and its 90 tok/sec is unparalleled; for anything where I don't want to have to intervene too much, it can't be trusted.
KurSix 25 minutes ago [-]
[flagged]
wincy 9 hours ago [-]
Won't these H100s drop in price in a few years? With the data center build-out, surely these will become 1/10th the price and you'll be able to set up a local LLM as good as Opus 4.7. Even if frontier models become more advanced and memory hungry, you could use the same power as your oven to run a current-day frontier model as needed? If I could drop $10,000 to have an effectively permanent Opus 4.7 subscription today, I would.
root_axis 9 hours ago [-]
> Won’t these H100s drop in price in a few years
Doubtful. The increase in demand is greatly outpacing supply, and all signs point to a continued acceleration in demand
> If I could drop $10,000 to have an effectively permanent opus 4.7 subscription today, I would.
lol well obviously, but realistically that price point is going to be closer to $100k, with a perpetual $1k a month in power costs.
wincy 8 hours ago [-]
Cool, thanks for the information. I guess they drive prices down by massively parallelizing requests on, say, an 8x H100 array? So the cost is spread across users. So if I wanted to use it for, say, 8 hours a day in my theoretical world, it'd be too expensive. My work definitely wouldn't pay $100,000 for a server farm even if it gave an AI to all our employees; you'd have to have engineers, a colocation space, basically all the problems that companies didn't like and went to AWS to avoid.
root_axis 7 hours ago [-]
Well $100k was a generous guesstimate for some time in the future where something like an Opus 4.7 is old news.
If we think about the near future, something like Kimi2.6 is within the realm of Opus 4.6 today, but requires closer to $700k in hardware to run.
Galanwe 1 hours ago [-]
Kimi 2.6 is very close to the Opus family in my experience. Also, it absolutely does not require $700k to be able to run locally in an interactive fashion. We are talking more in the range of $10k for a slow Q2 with degraded perplexity, to ~$35k for an acceptably fast 200k-context Q4 (quasi-lossless perplexity).
dyauspitr 4 hours ago [-]
Why? These models are going to keep drastically improving and given all the new data centers token prices will probably drop a lot in the future. Seems shortsighted given the absurd timelines these things have been improving on.
aaronblohowiak 5 hours ago [-]
taalas!!!
33MHz-i486 8 hours ago [-]
opus 4.7 caliber models are trillions of params, and a single instance would likely run on multiple h200s. $100k of hardware. not coming to your laptop anytime soon.
segmondy 9 hours ago [-]
Joke's on you. We are already running Deepseekv4Flash, Mimo2.5, MiniMax2.7, Qwen3-397B locally on very affordable hardware. These models are in the realm of Opus 4.6. For those of us a bit crazy, we are running KimiK2.6, GLM5.1 and more ...
root_axis 8 hours ago [-]
I have two A100s and have been playing with local models for years. There's definitely moments where they are quite impressive, but small context sizes and unreliability become immediately obvious.
> For those of us a bit crazy, we are running KimiK2.6, GLM5.1
Yes, those can compare to Opus, but you can't run those unquantized for less than $400k in hardware.
doctorpangloss 8 hours ago [-]
Two Mac Studio M3 Ultra 512GB and 1 USB cable can run all those models - maybe about $30,000 in hardware - and based on my benchmarks, those Mac Studios were twice as fast as the A100s on Deepseek v4 Flash, which has a quantization but not really a lossy one.
root_axis 8 hours ago [-]
That cannot run KimiK2.6 or GLM5.1, i.e. models within the ballpark of anything offered by frontier companies.
Galanwe 1 hours ago [-]
Yes it can, but the experience is not great.
A single maxed-out M3 can run a Q2 Kimi 2.6, though that's with a hardly degraded perplexity.
2x M3s with RDMA can run a lossless Kimi2.6 at Q4, but with CPU only you would get okayish decode but horrible (1m+) TTFT; that wouldn't be a great _interactive_ experience.
binyu 8 hours ago [-]
They all still definitely fall short of Opus 4.6, though. They are good but fail on extremely complex tasks, in contrast with a frontier model that will keep on trying until it succeeds or exhausts the solution space.
julianlam 7 hours ago [-]
Not by much, and moving goalposts makes for a bad comparison. Local open weight models are already more powerful than frontier models from only a year back.
If you believe what you read here, the gap is closing fast.
stubish 5 hours ago [-]
It depends on what you mean by 'productive'. The article mainly seems to be about targeting consumer-level hardware, such as the Neural Processing Unit you need for a 'Copilot PC'. Windows Recall is (was?) one such local AI application. If Microsoft get their way and my next PC has one, I look forward to using it for 'productive' purposes such as playing games and handling natural language stuff, leaving my GPU free for GPUing.
CuriouslyC 9 hours ago [-]
Parameter size gets you world knowledge and better persistence of behavior as context grows. Both of those things can be engineered around to a large degree, and the latest Qwen models show that small models can be quite smart in narrow domains and short time windows.
alfiedotwtf 4 hours ago [-]
… maybe we should just teach models how to get their world knowledge from a local Postgres connection! Then the model can be tiny, and it can query to its little heart's desire AND run on commodity hardware TODAY!
byzantinegene 9 hours ago [-]
i would argue we don't need anything near Opus to be productive. Sonnet is plenty productive enough
root_axis 9 hours ago [-]
I use Opus 4.6 as an example because it's the LLM that has been widely recognized by the public as being reliably capable of doing real work across many domains. However, the same logic applies to Opus 4.5 and even previous generations. These models have huge parameter counts and large context sizes, there's no training technique that can compensate for those qualities in small and quantized models.
JumpCrisscross 9 hours ago [-]
> we don't need anything near Opus to be productive. Sonnet is plenty productive enough
For niche applications, sure. For general use, I think the tendency towards the best model being used for everything will–to the model publishers' delight–continue. It's just much easier to get a feel for Opus and then do everything with it, versus switch back and forth and keep track of how Haiku came up with novel ways to dumbfuck this Sunday evening.
josteink 4 hours ago [-]
> You are greatly underestimating the current hardware requirements for productive local LLMs.
Fixed that for you. Right now most models produced are based on floating point maths and probabilities, which is "expensive" to do math on.
Microsoft has researched 1-bit LLMs which can run much more efficiently, and on much cheaper hardware[1].
If this research is reproducible and reusable outside their research models, this means the cost of running self-hosted LLMs will be reduced by an order of magnitude once this hits mainstream.
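To give a feel for the idea (a toy sketch in the spirit of the BitNet-style "1.58-bit" papers, not Microsoft's actual implementation): weights are restricted to {-1, 0, +1} plus a per-matrix scale, so the expensive floating-point multiplies by weights conceptually become additions and subtractions.

    # Toy sketch of ternary ("1-bit"-style) weight quantization.
    # Not the real BitNet code; it just illustrates the principle.
    import numpy as np

    def quantize_ternary(w: np.ndarray):
        scale = np.mean(np.abs(w)) + 1e-8        # absmean scale
        q = np.clip(np.round(w / scale), -1, 1)  # weights in {-1, 0, +1}
        return q.astype(np.int8), scale

    def ternary_matvec(q: np.ndarray, scale: float, x: np.ndarray):
        # Conceptually just adds/subtracts activations, then rescales.
        return scale * (x @ q.T)

    w = np.random.randn(4, 8)        # a small dense layer
    x = np.random.randn(8)           # an activation vector
    q, s = quantize_ternary(w)
    print(w @ x)                     # full-precision result
    print(ternary_matvec(q, s, x))   # coarse ternary approximation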
I think it's inevitable that access to good enough LLM models will be democratised.
However that's not the real battle here. The real battle is control of information to operate over.
While I might have access to a decent model - I don't have the huge integrated databases of everything that companies like Google have, and increasingly governments will accumulate.
As a citizen, AI operating over these large datasets is where the concern should be.
pier25 10 hours ago [-]
How fast do you reckon most people will be able to afford 128-256GB of RAM?
Schiendelman 10 hours ago [-]
Other than this recent spike, it's been trending cheaper continuously for decades. In a few years 128GB will be as affordable as 12GB (what flagship phones have now) is today.
pier25 10 hours ago [-]
I'm sure it will happen but I don't think it will be soon.
10 years ago I was using 16GB in my MBP and today it's 48GB. It's just a 3x increase during mostly a bonanza period.
DennisP 9 hours ago [-]
For most of that time, I don't think many people had much use for more ram than that. If demand picks up, companies will provide it.
And the Mac Studio was available with 512GB until ram got scarce and they cut the max in half recently.
pier25 8 hours ago [-]
The Mac Studio is a high end computer that the majority can't afford or justify its expense.
There's plenty of demand for RAM right now. We'll see how this turns out.
amelius 3 hours ago [-]
That "spike" could be a wall ...
fennecfoxy 1 hours ago [-]
Nope.
Because late stage capitalism demands endless growth in order to pay executives and shareholders (especially those late to the train) more and more YoY.
And those requirements for growth mean that cost cutting is needed. Over the past few decades costs _have_ been cut: building things more efficiently, components becoming cheaper, larger volumes in mass manufacturing.
But we have already reached a point where there are no other places to cut than the quality of the product itself. Look to shrinkflation in food and other places - look at how "live action" versions are being made of previously animated movies, how game franchises from 2 decades ago are being brought back from the dead, the huge influx of remasters etc.
Why? Because it's cheaper to revive/reuse an existing IP than it is to create a new one + it guarantees success with the drooling consumer masses. And cheaper = more Ferraris for the multi millionaire/billionaire execs.
See how much the Mario movie made? Just wait... bet you there'll be a live action version. ;)
cpt_sobel 3 hours ago [-]
Their prices are currently so unreachable because of the big players hoarding every chip they can get their hands on, but if/when the market realizes that locally deployed LLMs are the way to go, maybe (hopefully?) then more chips will be available to the consumers for lower prices.
Arn_Thor 54 minutes ago [-]
The only way that'll happen is if deep-pocketed corporate buyers exit the market almost entirely, and therefore stop being the highest-available bidder. Even in a scenario where it's obvious to everyone that consumer-side hardware is a viable option, it's still not in the big AI providers' interest to abandon the effort to push/pull everyone to their cloud. They'll keep buying as long as there's liquidity to fund them and the will to do so, and we're a ways off that collapsing. I'm quite pessimistic. Prices will probably come down in the next 12-18 months, but not to where they were before this
discordance 8 hours ago [-]
“Gradually, then suddenly”
emadb 5 hours ago [-]
Do you think small models will arrive? I mean, if I need to write a web application in TypeScript, why should I use a model that knows all the programming languages and is able to reply to questions about almost anything? I just need a small, performant model that knows how to write web applications in TypeScript. That could be very helpful and easy to run on my laptop.
driese 5 hours ago [-]
For the same reason that a human who is fluent in five languages can probably express themselves better in any one of them than a human who only speaks one, while also having a more nuanced understanding of general grammar.
From what I know, learning on a more diverse set makes a model better overall.
amelius 3 hours ago [-]
This might be an interesting research question: can you train a model on many languages, and then extract a much smaller model that knows only one language without much loss of quality?
thot_experiment 5 hours ago [-]
Depending on your laptop: if it's a Strix Halo or a MacBook with a decent amount of RAM, that day arrived about 6 months ago, and today if you can run Gemma 31b, you're golden for your basic workslop code. You can do most of it with local models. Heck, for a lot of the tier of programming you might encounter in the average job, Qwen 35b MoE is good enough and it can hit 100 tok/s on decent hardware.
elbasti 9 hours ago [-]
> The question will be: how much of the current compute capacity craze will local hosting give the kiss of death to and what that means for the market.
This will depend on how much inference happens for consumer (desktop, local) vs enterprise ("cloud"), vs consumer mobile (probably also cloud).
I would assume that the proportion of "consumer, local" is small relative to enterprise and mobile.
stubish 5 hours ago [-]
I think the proportion is small because someone has to pay for the cloud services. When phones, PCs and Desktops ship with NPUs whole new markets open up for all that stuff people want but not enough to pay for.
inf3cti0n95 8 hours ago [-]
Certainly, I don't think data centers are the way here.
I guess it'll most likely be AI doing the processing, with everything else becoming an API.
In the case of the GPTs and Claudes of the world, they'll just be using indexing APIs and a KB on top of their LLMs.
RataNova 13 hours ago [-]
The biggest impact of local models may simply be that they prevent remote inference from becoming the only game in town
dnnddidiej 8 hours ago [-]
Except you will want the frontier to compete. Local models are useful, but you will always need $$$ to be in the same order of magnitude as the frontier. And also $$$ for the same token speed.
The question is: would you choose to save $10 a day if it causes your inference to slow down 10x and you waste 2 hours a day waiting on stuff?
dakolli 13 hours ago [-]
This is simply delusional. It costs $20-30k a month to run Kimi 2.6. The tokens are sold for $3 per million.
To sell tokens profitably you'd need to be able to run inference at 150 tokens per second for less than $1,000 USD a month.
I don't think people realize how expensive it is to host decently capable models and how much their use of capable models is subsidized.
You can only squeeze so many parameters onto consumer-grade hardware (that's actually affordable; two 4090s is not consumer grade and neither are 128GB MacBooks, this is incredibly expensive for the average person, and the models you can still run are not "good enough", they are still essentially useless).
People are betting their competency on a future where billionaires are forever generous, subsidizing inference at a 10-1, 20-1 loss ratio. Guess what, that WILL end, and probably soon. This idea that companies can afford to give you access to $2 million in GPUs for 5 hours a day at a rate of $200.00 a month is simply unsustainable.
Right now they are trying to get you hooked, DON'T FALL FOR IT. Study, work hard, sweat and you'll reap the benefits. The guy making handmade watches, one a month, in Switzerland makes a whole lot more than the guy running a manufacturing line making 50k in China. Just write your own fkin code people.
Don't bet your future on having access to some billionaire's thinking machine. Intelligence, knowledge and competency isn't fungible, the llm hype is a lie to convince you that it is.
zozbot234 13 hours ago [-]
No one runs SOTA models 24/7 for individual use or even for a single household or small business, whereas you can run your own hardware basically 24/7 for AI inference.
With the new DeepSeek V4 series and its uniquely memory-light KV cache you can even extend this to parallel inference in order to hide memory bandwidth bottlenecks and increase compute intensity.
This is perhaps not so useful on a 128GB or 96GB RAM Apple Silicon device (I've seen recent reports of DS4 runs with even one agent flow hitting serious thermal and power limits on these devices, so increasing compute intensity will probably not be helpful there) but it will become useful with 64GB devices or lower that have to stream from a slow disk, or with things like the DGX Spark or to a lesser extent Strix Halo, that greatly overprovision compute while being bottlenecked on memory bandwidth.
doctorpangloss 8 hours ago [-]
deepseek v4 flash on mlx at 1m context runs at 20 t/s decode on a mac studio m3 ultra with 512gb of RAM
alfiedotwtf 4 hours ago [-]
What is everyone running DeepSeek v4 Flash with?!
It’s currently unsupported on Llama.cpp and vllm doesn’t support GPU+CPU MoE, so unless all of you have an array of DGX Sparks in your bedroom, what’s the secret sauce?!
Just because you read it on a github repo doesn't make it true, it also doesn't take into account cpu temps and inevitable throttling you'll encounter.
doctorpangloss 7 hours ago [-]
i ran it on my own device haha
i don't comprehend why people are in such disbelief at how much better this stuff runs on a mac studio than on NVIDIA hardware with 1/5th the VRAM. look, what can i say? NVIDIA is a bigger rip off than Apple is!
platevoltage 7 hours ago [-]
Which is good, because Nvidia pulling a Micron and ceasing consumer hardware production is right around the corner.
NitpickLawyer 13 hours ago [-]
API prices are most likely not subsidised. A brief look at openrouter can tell you that. There are plenty of providers that have 0 reason to subsidise that sell models at roughly the same average price. So the model works for them (or they wouldn't do it otherwise).
ai_fry_ur_brain 12 hours ago [-]
They are subsidized, heavily. This is simple math, and there are lots of reasons to subsidize. Please go look up the hardware requirements to run your favorite model at a given tok/s, then multiply that by 86400 (seconds in a day), divide that by 1 million and multiply by the $ per million tokens, then ask yourself if there's any possibility they could be profitable or even close to break-even.
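To spell out that napkin math (every number below is a placeholder assumption for illustration, not a measured figure):

    # Napkin math for single-stream inference economics.
    # Every input here is an illustrative assumption.
    tokens_per_second = 150          # assumed single-stream decode speed
    seconds_per_month = 86_400 * 30
    price_per_million = 3.0          # assumed $ per 1M tokens sold

    tokens_per_month = tokens_per_second * seconds_per_month
    revenue = tokens_per_month / 1_000_000 * price_per_million
    hardware_cost = 20_000           # assumed monthly cost of the rig

    print(f"revenue:  ${revenue:,.0f}/month")        # ~$1,166
    print(f"hardware: ${hardware_cost:,}/month")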
You are going off vibes alone, this is easily verified, please go verify.
What makes you think they have zero reason to subsidize? Because the providers aren't household names, you assume they wouldn't operate at a loss? What's your logic here? You make no sense.
gpugreg 2 hours ago [-]
Serving a single user is likely not profitable, but total throughput rises a lot when serving many concurrent users, because the same weights can be used to generate tokens for all users at once, which increases efficiency.
Also, a lot of money is being made on input tokens and cached tokens, which are much cheaper to compute.
The amounts of API tokens many large companies are using through, say, AWS Bedrock are quite high. We've seen leaks of the bills for real-world use cases. It's not unreasonable to see normal individual subscriptions as possibly subsidized... but do we think someone like Anthropic is going to be subsidizing 7, 8, or even 9 figure monthly bills from megacorps? Because said megacorps will swap out to a competitor immediately, so your subsidy is unlikely to lead to loyalty or anything.
If Anthropic and OpenAI are subsidizing the metered API usage, their model is going to end up just as successful as MoviePass. They are burning enough money on the training costs already.
dakolli 8 hours ago [-]
Large companies are paying an arm and a leg, but I'm still certain that even at $15.00 per million tokens they are not profitable.
If you have a machine running at 150 tok/s you can only make $5,820 a month at $15 per 1M tokens running 24/7. It costs a hell of a lot more than $6k a month to run Claude 4.7 at 150 tok/s on that machine 24/7.
This math is a bit off, because you have input tokens too, but regardless it's still not profitable, especially for how long it takes to turn around a request, and the caching is probably not all that profitable.
NitpickLawyer 6 hours ago [-]
You are all over this thread, but you have no idea how inference works, and it's obvious. Your napkin math is off because you don't know what to add up, you lack the necessary background. And yet you persist and reply all over this thread. I don't get it.
Serving models on dedicated hardware is not the same as your at-home 150 t/s thing. Inference is measured in thousands of tokens/s in aggregate (i.e. for all the sessions in parallel). That's how they make money.
CuriouslyC 9 hours ago [-]
Anthropic and OpenAI make money on API calls, margins have been reported in public filings. Subs are subsidized.
dakolli 8 hours ago [-]
That's not possible, read my comment above. These are private companies, there are no public filings regarding their profitability in any sense. You're just making things up.
If you have a machine running at 150 tok/s you can only make $5,820 a month at $15 per 1M tokens running 24/7. It costs a hell of a lot more than $6k a month to run Claude 4.7 at 150 tok/s on that machine 24/7.
This math is a bit off, because you have input tokens too, but regardless it's still not profitable, especially for how long it takes to turn around a request, and the caching is probably not all that profitable.
mtone 7 hours ago [-]
You're forgetting a critical factor: concurrency. If a given piece of hardware serves a single request at 150 tokens/s, it can also serve 20-30 requests at 100 tokens/s each. Suddenly your $5K becomes $100K/month, enough to recoup the cost of the hardware in a year or so.
The reason it works: each time you read the model (memory bound) to calculate the next token, you can also update multiple requests (compute bound) while at it. It's also much more energy-efficient per token.
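A quick sketch of that effect, reusing the numbers above as illustrative assumptions:

    # Single-stream vs batched serving revenue (figures are illustrative).
    price_per_million = 15.0             # assumed $ per 1M tokens
    seconds_per_month = 86_400 * 30

    def monthly_revenue(streams: int, tok_per_s_per_stream: float) -> float:
        aggregate_tok_per_s = streams * tok_per_s_per_stream
        tokens = aggregate_tok_per_s * seconds_per_month
        return tokens / 1_000_000 * price_per_million

    print(monthly_revenue(1, 150))    # ~$5,832  - one request at a time
    print(monthly_revenue(25, 100))   # ~$97,200 - batched serving

Same hardware, same weights in memory; the batch just keeps the compute units busy while each token's weights are being read.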
Interesting, I didn't know about this, but it makes sense after reading the article. They are benchmarking on a single GPU with a 20B-param model. Does it scale across 60 H100s over NVLink/NVSwitch? I would be interested to see those benchmarks.
The idea that everyone is spinning up a $2 million in GPUs to scan their email inbox, search the web or avoid learning something is still ridiculous to me regardless.
CamperBob2 13 hours ago [-]
It cost 20-30k a month to run Kimi 2.6. The tokens are sold for $3 per mm.
Not if you're OK with 4-bit quantization. More like $30K-$50K one time.
RTX 6000 Pro retails for $10k so an 8x is $80k before anything else in the computer, and long-context will have... pretty bad performance (20+ seconds of waiting before any tokens come out), but it's true it technically works.
I don't think cloud models are going away; the hardware for good perf is expensive and higher param count models will remain smarter for a looong time. Even if the hardware cost for kind-of-usable perf fell to only $10k, cloud ones will be way faster and you'd need a lot of tokens to break even.
zozbot234 12 hours ago [-]
> I don't think cloud models are going away; the hardware for good perf is expensive
I think local AI will win in its niche by repurposing users' existing hardware, especially as cloud hardware itself gets increasingly bottlenecked in all sorts of ways and the price of cloud tokens rises. You don't have to care about "bad" performance when you've got dedicated hardware that runs your workloads 24/7. Time-critical work that also requires the latest and greatest model can stay on the cloud, but a vast amount of AI work just isn't that critical.
reissbaker 8 hours ago [-]
Users do not have an existing $80k of hardware, are not going to buy $80k of hardware for worse performance than paying $100/month, and models are continuing to grow in size while memory grows in price.
zozbot234 3 hours ago [-]
You said you need $80k in hardware for "good performance". I'm saying the local AI inference workflow will be a lot more flexible about performance than that, and can get away with something vastly cheaper and in line with what the user owns already.
otabdeveloper4 5 hours ago [-]
> paying $100/month
There will not ever be a monthly subscription for LLM tokens. The economics isn't there.
Local tokens will always be cheaper.
entrope 7 minutes ago [-]
What's the basis for saying local tokens will always be cheaper? As others have outlined, LLMs serving one user at a time are pretty expensive, but concurrent users become much more cost-effective (assuming there's enough RAM for the contexts). If "local" to you means ~10 hours daily use by a team of employees, the company still has to balance against cloud services that can amortize non-recurring costs over 24 hours of service per day.
ai_fry_ur_brain 12 hours ago [-]
"I think"
Well your thinking is completely vibes based and not cemented in any reality I exist in.
CamperBob2 9 hours ago [-]
Other sites beckon.
otabdeveloper4 5 hours ago [-]
> higher param count models will remain smarter for a looong time
They're not smarter, they just know more stuff.
You probably don't need knowledge about Pokemon or the Diamond Sutra in your enterprise coding LLM.
The "smarts" comes from post-training, especially around tool use.
anon7725 5 hours ago [-]
If the smarts came from post-training, we could show significant gains by doing that post-training again for previous generations of models. But we know that isn’t happening - effective post training is necessary but not sufficient for model performance.
alfiedotwtf 4 hours ago [-]
If 8 x RTX 6000 is getting you 20s before initial token, how are cloud vendors doing this?
zozbot234 13 hours ago [-]
4-bit quantization is native for Kimi 2.x series.
CamperBob2 13 hours ago [-]
You're right, I was thinking of Qwen. K2.6 will run at UD-Q2_K_XL precision on 4x RTX6000 boards, but I have no idea if it's worthwhile.
hparadiz 13 hours ago [-]
Posts like this are so funny to me. I'm staring at a mountain of old hardware right now that cost about $20k ten years ago. I have to pay someone now to come haul it away. What makes you think the current new hardware won't end up with the same fate?
> Just write your own fkin code people
Bro is nostalgic for googling random stack overflow threads for 10 days to figure out a bug the agent fixes in an hour.
HWR_14 5 hours ago [-]
Do you have any old laptop ram?
hparadiz 5 hours ago [-]
It's old rack mounts. Only one of them has some ECC DDR4 worth something.
cindyllm 13 hours ago [-]
[dead]
dakolli 13 hours ago [-]
I'm just saying that the agent that can fix your bugs actually costs $100-150 an hour to run, and you're getting it essentially for $200.00 a month.
The cost of cloud compute actually hasn't gone down for old hardware all that much; it still costs $500.00 a year to rent a 4-core i7-7700K that's 10 years old. Don't expect much more valuable hardware, like modern GPUs, to deflate in price all that quickly.
There are 3 fabs in the world that make DDR7 and they aren't going to be selling their stock to consumers going forward; it will be purchased by datacenters almost entirely and stay in them until EOL.
Your brain is going to atrophy (this is proven), they'll raise the price to something that's closer to break-even, and you'll be forced to pay it because you no longer have those muscles.
hparadiz 13 hours ago [-]
The architectural problems I deal with day in day out leave no room for atrophy. This is just cope.
platevoltage 7 hours ago [-]
You're going to see major cope once that bargain $200/month plan goes away, and every person or company that has embedded these services into their workflows gets to see their actual costs.
hparadiz 6 hours ago [-]
Have you actually tried this stuff or are you just saying stuff you hear on the internet?
nullc 13 hours ago [-]
> two 4090s is not consumer grade
I think that is a very narrow perspective. Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"?
I agree with your view that cheap tokens on SOTA are a trap-- people should use local AI or no AI.
ac29 12 hours ago [-]
> Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"?
$50k is a median priced car in the US. I'd guess >99.9% of people do not own $4000 of GPUs. I consider myself a computer person and I don't think I even own $4000 of computer hardware in total.
swiftcoder 4 hours ago [-]
> I consider myself a computer person and I dont think I even own $4000 of computer hardware in total
A top-spec MacBook Pro is >$4k, so I assure you that plenty of computer people do own $4k of computer hardware.
Hell, most tech folks are wandering around with a ~$1k smartphone in their pocket too.
janalsncm 5 hours ago [-]
Fwiw you can finance a car over something like 7 years now. So a lot of people will be paying like $750 per month, not $50k lump sum.
zozbot234 12 hours ago [-]
Plenty of gamers own serious GPU rigs that are reusable (at least to some extent) for local AI inference. That's almost certainly more than 0.1% of the population.
nullc 11 hours ago [-]
I guess I wasn't clear-- I wasn't so much making the point that people do own $4000 in GPUs (though I suspect you are massively underestimating the number who do; also, before the current market conditions this would have been more like $2500 in GPUs...), but that they certainly could, per the evidence of car ownership.
A car is super useful, so is an AI. But even if we decide cars are incomparably more useful a great many people pay much more than $4000 over the minimum viable car, and that's money that could be deployed to secure access to private, secure, and autonomous AI facilities. A few thousand dollars in computing is consumer hardware, or at least could easily be with more reason and awareness driving adoption.
People spend a LOT of money on things less useful than a local copy of qwen3.6-27b can be.
dakolli 13 hours ago [-]
I would still question what usefulness there is in a local model even with $10k in GPUs. I certainly haven't seen any great uses myself for these smaller models (<500B parameters), except claims from people who are totally enamored with AI and are basically impressed by anything output from an LLM, like a toddler who's entertained by the sound their velcro shoes make.
robot-wrangler 12 hours ago [-]
Probably you're focused on coding agents? I bet someone could use that kind of hardware to filter snarky comments
nullc 12 hours ago [-]
Here is an example-- I'm running hermes + qwen3.6-27b on a workstation GPU (an older RTX A6000 which gets 55tok/s, though people run this model on more limited hardware).
I instructed the agent to read the URL, implement the technique in C++ for 32-bit registers, then make a SIMD version that interleaves several extractors in parallel for better performance. It implemented it (not hard since there was an implementation there that it read), then wrote more extensive tests. Then it vectorized it. It got confused a few times during debugging because the algorithm uses some number theory tricks so that overflows of intermediate products don't matter and it was obviously trained a lot on ordinary code were such overflows are usually fatal. I instructed it to comment the code explaining why the overflows are fine and had it continue which mostly solved its confusion.
It successfully got the initial 12MB/s scalar implementation to about 48MB/s. Then I told it to keep optimizing until it reaches 100MB/s. I came back the next day and it had stopped after 6 hours when it achieved just over 100MB/s. Reading what it did: it went off looking at disassembly, figured out what hardware it was running on, and reading microarch timing tables online and made some better decisions, tried a lot of things that didn't work, etc. (And of course, the implementation is correct).
I'm pretty skeptical about AI and borderline hateful of many people who (ab)use it and are deluded by it-- but I think this experience shows that a small local model can be objectively useful.
(oh and this experience was also while I only had the model running at 19tok/s)
Running the model in a loop where it can get feedback from actually testing stuff allows you to make progress in spite of making many mistakes.
I could have done this work myself but I didn't have to and I certainly spent less time checking in and prodding it than it would have taken me to do it. In my case I wondered how much faster parallel extractors using SIMD might be-- an idle curiosity that would have gone unanswered if not for the AI.
ai_fry_ur_brain 12 hours ago [-]
This is maybe the first time Ive seen someone claim to do something useful with such a small model.
Congrats, but you're in the 0.0001% that's not just frying their brains, fapping to their local models or doing various magic tricks like a toddler entertained by playing with velcro.
At the end of the day you lost an opportunity to improve yourself and exercise your brain; maybe the opportunity cost is worth it, idk, but I'm going to keep taking things slow.
Sounds like you're coping for the vendor lock-in you cornered yourself into.
nullc 11 hours ago [-]
This is a change that's been happening gradually over time-- I don't think I could have done this on a local model that could run on a consumer class gpu a couple months ago.
There are plenty of other uses that people have been making for a long time-- e.g. I know someone who uses a fine tuned local model to sort their incoming email and scan their outgoing messages for accidental privacy leaks.
I don't agree with your assessment on an opportunity lost-- I got my reps in on the original work, the AI gave an incremental step forward which made the whole exercise somewhat more valuable to me with minimal additional cost. I think this improves the cost vs benefit in a way that makes me more likely to try other pointless activities, knowing that when I run out of gas I can toss it to AI to try some variations.
Sometimes you're also 27 steps deep into a nested subproblem and you're really just trying to solve something. Even in fine craftsmanship, not every step needs to be about maximum craftsmanship. :) Sometimes it's just good to get something done.
I think this is much like any other tool. One can carve furniture using only hand tools, but the benefits of a router are hard to dispute. Both approaches exist in the world and sometimes both are used in concert.
As far as people frying their brains with AI -- you don't need local models for that, plenty of people are driving themselves into deep personally and socially destructive delusion just using the chat interfaces.
ai_fry_ur_brain 11 hours ago [-]
I do think post-training smaller open source models for very narrow tasks is largely overlooked and there'll be lots of value there if one puts in the effort. However, in a lot of cases we're just completing a circle back to deterministic behavior at 1000x the memory/compute requirements just to avoid writing regex.
I agree with you, there's a way to use them responsibly like your router anology, I just think most aren't doing this correctly and its a slippery slope. I'll contend that you probably have used them responsibly in your example.
KurSix 15 minutes ago [-]
[flagged]
0xbadcafebee 9 hours ago [-]
Here's some things you can do right now with local models on a consumer device:
- text-to-speech
- speech-to-text
- dictionary
- encyclopedia
- help troubleshooting errors
- generate common recipes and nutritional facts
- proofread emails, blog posts
- search a large trove of documents, find information, summarize it (RAG)
- manipulate your terminal/browser/etc
- analyze a picture or video
- generate a picture or video
- generate PDFs, documents, etc (code exec)
- simple programming
- financial analysis/planning
- math and science analysis
- find simple first aid/medical information
- "rubber ducking" but the duck talks back
A quarter of those don't need more than a gig of RAM, the rest benefit from more RAM. Technically you don't even need a GPU, it just makes it faster. I do half that stuff on my laptop with local models every day.
That said, it really doesn't need to be local. I like the idea that I can do all that stuff offline if I'm traveling, but I usually have cell service, and the total tokens is pretty cheap (like $2/month for all my non-coding AI use).
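As a concrete sketch of what one of these looks like in practice, here's a minimal example that proofreads an email against a local Ollama server (the model name and prompt are placeholders; this assumes Ollama is running locally with a small model already pulled):

    # Minimal sketch: proofread an email with a locally hosted model via
    # Ollama's HTTP API. Assumes `ollama serve` is running and a small
    # model such as "gemma3" has been pulled; adjust names to taste.
    import json
    import urllib.request

    draft = "Hi team, please find attach the report for last weeks numbers."

    payload = {
        "model": "gemma3",  # placeholder local model name
        "prompt": f"Proofread this email and return a corrected version:\n\n{draft}",
        "stream": False,
    }

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])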
acidhousemcnab 7 minutes ago [-]
RAG on every machine, and the means for corporations / shadowy powers to query it.
satvikpendem 8 hours ago [-]
Please add double new lines as your formatting for the bullet point list makes it all one paragraph.
fennecfoxy 1 hours ago [-]
Tbf I've always hated that about HN formatting as it's not very clear at all that that's how it works.
If there's a newline in my comment, why not retain it? Whyyyyy?!
xigoi 1 hours ago [-]
Because of the 6 people who write HN comments in Vim with hard wrapping turned on.
acidhousemcnab 11 minutes ago [-]
We need better GUI and OS integrations with sandboxed local LLMs, before this is thrust on everyone and rolled out as the default in commercial OSes. Here in Berlin, I was functionally surrounded and hounded out of a local meetup, due to confrontation over the naive pushing of OS-level and network access agentic AI, done in the mode of mystical powers and artistic possibilities, which due to recent experiences, comes off as string-pulling, to produce a threat or danger that then must be observed and kept tabs on, according to Goodhart's Law.
adamtaylor_13 9 hours ago [-]
Cool, well let me know when Opus 4.5 level performance is available locally, at speeds that serve everyday use, and 100% I'm right there with you.
Until then, I'm going to keep sending my JSON to the server farm in Virginia because it's the only place that can serve me a model that actually works for my uses.
am17an 3 hours ago [-]
Local models embody the hacker spirit, constant Claude glazing is spiritually incompatible with tinkering. Don't upload your spirit to the cloud.
Aurornis 8 hours ago [-]
I experiment a lot with local models, and I agree.
I have a lot of fun with the local models and seeing what they can do.
I appreciate the SOTA models even more after my local experiments. The local models are really impressive these days, but the gap to SOTA is huge for complex tasks.
janalsncm 5 hours ago [-]
Reasoning over a large codebase is only one use case for large models. For the use cases in the article (summarizing, classifying, basic text rewrites) most phones can handle them just fine.
agnishom 5 hours ago [-]
The article is not about those use cases. There are plenty of use cases for which local models are already pretty good
binyu 8 hours ago [-]
DeepSeek V4 with a 1 million token context window is pretty powerful, although still not there. There's hope that Opus 4.5-level performance locally is not that far away.
Aurornis 8 hours ago [-]
Running DeepSeek V4 without extreme quantization locally requires a lot of hardware.
The IQ2 quants that fit into 128GB machines are very degraded.
binyu 8 hours ago [-]
That is true, it is a 1.6T-parameter model so it requires a great deal of memory. I also heard there's a 2-bit quantization that works well on Apple Metal.
tuananh 8 hours ago [-]
From what I read, DS V4 is very close to Opus 4.6 in performance.
DeathArrow 5 hours ago [-]
The full model is, not the quantized versions.
tuananh 4 hours ago [-]
yeah, that goes without saying. How can an open-weight, quantized version beat SOTA :)
5 hours ago [-]
thefounder 8 hours ago [-]
Next year there will be Opus 4.5-level open source models, so theoretically you may be able to run them locally, but in reality it will be too expensive (e.g. maybe 2 x Mac Studio with 512GB RAM each) for “normal” users.
storus 8 hours ago [-]
Depending on a task, there are already models matching Opus 4.5. Just not in everything. But you can always swap a local model for a particular task.
bugglebeetle 8 hours ago [-]
The frontier Chinese open source models are already at this level, GLM-5.1 and Kimi K2.6 specifically.
DeathArrow 5 hours ago [-]
But you can't run them locally at full quality. And the quantized versions you can run locally are a far cry from Opus 4.6.
bugglebeetle 5 hours ago [-]
Anthropic serves quantized versions of their models and you can run q8 locally.
nicce 3 hours ago [-]
I don't even use Sonnet anymore. The current one feels worse than Claude 3.5 did a couple of years ago. Have they quantized it that much? Switched to GPT 5.5; let's see how long it stays good.
gkcnlr 7 hours ago [-]
It seems like everybody is focused on "LLM"s, a.k.a. Large Language Models. One interesting addition to that is fine-tuned, small-parameter, distilled, context-dependent small language models that:
1- Do a particular task with great capability (due to its constrained, limited scope)
2- Do it in such a way that it integrates gracefully into your workflow without ever requiring you to know you are using an LM.
There is a difference between outsourcing your workflow to AI and actually utilizing it.
Eh I think the small model thing is kind of a no-go.
The reason is that many workloads for AI are dynamically mixed, where training from multiple subjects comes into play and you just can't know exactly what mix will be required for each task ahead of time.
I was hoping loras would do this for us as well but they don't really seem to have worked out for llms (compared to in the image/video diffusion space).
Perhaps some future model will have some sort of "core" that can load/unload portions of itself dynamically at runtime. Like go for a very horizontal architecture/hundreds of MoE and unload/load those paths/weights once a parent value meets or exceeds some minimum, hmmm.
tzm 5 hours ago [-]
People want local AI, but only if UX is good. Tooling/harness quality may matter as much as model quality.
I think the future will probably be a hybrid of:
1. local AI for simple, private, everyday tasks
2. online AI for very hard or long tasks
anemoknee 4 hours ago [-]
The Clippy app someone made and posted here a while back is the perfect average person LLM interface;
local LLMs build tools that do exactly what the user wants, how they want it, which is the best UX
this becomes AI literacy
LLMs already nicely bridge the gap from "I want this" to "here's a local page that does it".
examples of tools i have built that require very little tech knowledge:
* push a button on my phone to take screenshot in my mac (when i watch videos)
* help me exercise, gamify it for me
* "help me track time spent online to how it impacts what i do in real life, built a tool that rewards and me points me towads things that make me DO things online"
* i want to improve my writing, give me exercises and build addiitonal tools (leading to an "append only" digital keyboard i use to exercise )
local AI can already create these tools, and no external company is ever going to beat me/the-user because instead of getting features i don't want, or that almost do what i want, or that do something that advantages the company they just do what I want
Repositories of tools-as-ideas created by others are quite often just index.html and ... that's all? manage data in localstorage, end of it?
Online inferences is still needed for large data (audio/video/images) processing. For now? we don't know, history suggests we'll have the capabilities to do that locally "soon". Or maybe not :)
The main issue is "online for collaboration". Not same user across different devices, that is easy.
MeteorJS-style approaches (making local copies of part of the DBs, reconciling to remote/origin) seem to be an interesting possibility at small scale, since once you have the right primitives in place you can go horizontally everywhere.
Gud 4 hours ago [-]
The UI is already great.
I can’t wait to run my models locally. The sooner I can do my shit without some American mega corp gulping down all my data, the better.
nicce 3 hours ago [-]
I fear that the easier it gets to run models locally, the more expensive all the hardware gets. So at the same time it moves further and further away. You should have bought the hardware yesterday.
worldsayshi 3 hours ago [-]
The more expensive it gets, the higher the incentive for more competition in the hardware space.
nicce 3 hours ago [-]
The thing is that it takes such a long time. E.g. why is Taiwan still so important?
rmunn 7 hours ago [-]
For image generation, this has already happened. To what degree, I can't tell, as I don't do image generation much so I don't have numbers on Midjourney subscriptions or any other image-AI-as-a-service sites. But civitai.com has become a place where people share their models, based off of Stable Diffusion or other similar bases, with various fine-tunings to achieve desired results. You name it, you can find a model for it at Civitai, and people are doing some very creative things with them. (And also a lot of the obvious things, but it's the Internet, what did you expect?)
I haven't seen a text-based model sharing site spring up yet (perhaps they already have and I don't know about it yet). Civitai, being focused on image-generation, has the obvious advantage that it's easy to show off impressive results from the model on the front page of the website, and judging what someone's home-grown fine-tuned LLM will produce is a lot harder. But at some point I expect a Civitai equivalent site for text models, especially code-based ones, to become popular. That will seriously undercut Anthropic, OpenAI, et al, and will probably force them to find a price equilibrium.
Because once you're competing with "I spend $2,500 up front on a powerful video card, download an open-source model for free, and then I get pretty much everything I need for free" (additional power cost of running that video card isn't nothing, but probably not noticeable in your power bill compared to what you're already using)... then suddenly $200/month means your customers are thinking "after one year I would have been better off with the homegrown solution". The only way they'll continue to pay $200/month is if Claude/GPT/Gemini/whoever is truly head-and-shoulders above the "pay upfront once for hardware then use it for free afterwards" models available. And that's going to be doable, perhaps, but tough.
janalsncm 5 hours ago [-]
> I haven't seen a text-based model sharing site spring up yet (perhaps they already have and I don't know about it yet)
Huggingface.
The reason HF doesn’t also compete for image gen is probably some combination of momentum from Civit AI and HF not wanting to deal with the moderation headache.
peab 7 hours ago [-]
Civitai is like 99% porn though. Most production usage of image gen is Google or OpenAI, as they are by far the best.
rmunn 7 hours ago [-]
As I said, a lot of the obvious things. And if you're scrolling through the front page without being logged in (i.e., so the default "no mature content" filter is on), there's some really creative stuff being done. I personally like the looks of the RPGv5 (or is the guy up to v6 by now? I forget) model, and plan to use it eventually to create custom portraits of characters in my tabletop roleplay campaigns. (Not running any right now, due to having basically zero free time at the moment, but eventually my current situation will change and I'll have the occasional weekend open again).
But for a site sharing code-generation models, it's a very different scenario. I'm curious to see what will happen in that space.
janalsncm 5 hours ago [-]
Google and OpenAI are good for one-offs but if you want a consistent style you need to use a LoRA.
TheJCDenton 15 hours ago [-]
For the mainstream audience, the sentiment around local AI today is the same as they had around open source a few decades ago. For a few products, some paid solutions were so much more advanced that open source was very often completely overlooked. Why bother? And the like. Then we had captive SaaS and other platforms, and now it's obviously wrong for most of us.
The dependency we have on Anthropic and OpenAI for coding, for instance, is insane. Most accept it because either they don't care, or they just hope the Chinese will never stop releasing open weights. The business model of open weights is very new, includes some power play between countries and labs, and moves an absurd amount of money without any concrete oversight from most people.
It's a very dangerous gamble. Today incredible value is available for nearly everyone. But it may stop without any warning, for reasons outside our control.
apublicfrog 13 hours ago [-]
> It's a very dangerous gamble. Today incredible value is available for nearly everyone. But it may stop without any warning, for reason outside our control.
What stops you from running the best open-weight LLMs currently available on consumer grade hardware for the rest of time? They're good enough for 95% of use cases, and they don't have a use-by date. From what I can see, the "danger" is not having the next tier that comes out, but the impact of that is very low.
giobox 13 hours ago [-]
> they don't have a use-by date
For quite a lot of use cases, the current systems arguably do get worse over time if not continually updated. The knowledge cutoff date will start to hurt more and more as the weights age in a hypothetical scenario where you are stuck with them forever.
Coding, one of the most popular use cases today, would not be great if it, say, only understood Java as of a version from years ago, etc.
One solution is not to advance anything of course. I'm not even joking, is there going to be a successor to React? I suspect not, with the vast amount of training data for React now, it's going to look silly to move to something else with less support. What is the last new popular programming language, rust? Will there be another one? I suspect not. Same reasoning. The irony of all this AI acceleration talk is it'll work best if we don't accelerate the underlying tech at all.
WarmWash 11 hours ago [-]
There probably won't be new stuff so much as trends in how stuff is done, and updates around optimizing those trends.
jvm___ 11 hours ago [-]
Will programming languages evolve into less human-oriented written code and more just calls to a trusted AI?
Or will human-readable code be less and less of a thing as AI learns its own, more terse language to talk to other AIs?
digitaltrees 9 hours ago [-]
Yes. I am seeing a big push to use vanilla JS for single-file HTML apps that are easy to build, deploy and distribute because they have no build step. I could see component libraries emerging that make it easier to build from chat interfaces with less ceremony.
byzantinegene 8 hours ago [-]
i'm not sure the tradeoff in code readability is worth it as of now.
hadlock 10 hours ago [-]
Name/post content combo on point
Spooky23 11 hours ago [-]
A lot of the language work is scratching the itch of engineers and developers. I think you’re correct and React is the new COBOL.
apsurd 10 hours ago [-]
Humans are notoriously bad at predicting the future. Toward that end, your prediction is laughable. React is the end all be all of UI… lol
melagonster 10 hours ago [-]
Programmers won't be allowed to exist in the future. Vibe coding is the final resolution people can apply.
rrvsh 12 hours ago [-]
Nobody is unaware of the knowledge cutoff, and sharing the Wikipedia article is not helping anyone. Your point is easily rebutted by taking whatever open weights/source model has an outdated cutoff and training or fine tuning it on more data, which is again always going to be viable given a modicum of compute
tcp_handshaker 12 hours ago [-]
You could learn how to code...a whole generation did it before...
mrtesthah 11 hours ago [-]
>Coding, one of the most popular use cases today, would not be great if it, say, only understood Java as of a version from years ago, etc.
This LLM trained only and entirely on pre-1930s texts was able to code Python programs when given only a short example:
Small models are more useful for "doing stuff" than "knowing stuff" to begin with. Add in an agentic harness and a small model can happily read more current information on demand (including from e.g. a local wikipedia snapshot).
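A minimal sketch of that kind of harness, for the curious: the model is told it can ask for a lookup, and the loop feeds it an entry from a local snapshot when it does. The endpoint, model tag, and snapshot format are all placeholder assumptions, not any particular project's API.

    import requests

    SNAPSHOT = {  # stand-in for a local wiki/docs snapshot on disk
        "python 3.13": "Python 3.13 was released in October 2024 ...",
    }

    def ask(messages):
        """One round-trip to an assumed local OpenAI-compatible endpoint."""
        resp = requests.post(
            "http://localhost:11434/v1/chat/completions",
            json={"model": "qwen2.5:3b", "messages": messages},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    def answer(question: str) -> str:
        """Let the model either answer directly or request one lookup from the snapshot."""
        messages = [
            {"role": "system", "content":
                "If you need facts newer than your training data, reply with exactly "
                "'LOOKUP: <topic>'. Otherwise answer the question directly."},
            {"role": "user", "content": question},
        ]
        reply = ask(messages)
        if reply.strip().startswith("LOOKUP:"):
            topic = reply.split(":", 1)[1].strip().lower()
            messages += [
                {"role": "assistant", "content": reply},
                {"role": "user", "content":
                    f"Snapshot entry for '{topic}': {SNAPSHOT.get(topic, 'not found')}. "
                    "Now answer the original question."},
            ]
            reply = ask(messages)
        return reply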
turtlebits 13 hours ago [-]
FOMO. A new model comes out weekly and the HN crowd debates the minutiae of changes.
Pockets are too deep, it will only change once everyone is out of money.
3eb7988a1663 7 hours ago [-]
What is really amusing to me is how N months ago, the latest SOTA was incredible, but now utterly unusable. Feels like there is a model reality-distortion field in play where people can only acknowledge the flaws in retrospect.
lxgr 12 hours ago [-]
They’re really not good enough, unless you consider 64 GB of memory or more consumer grade.
steve_adams_86 12 hours ago [-]
I’m pretty happy with what a 32GB Mac Studio can do for a lot of tasks. They’re the things I’d throw a model like Haiku at, but still genuinely useful. We don’t have an answer to frontier models in the consumer range yet, but we’re not totally trapped.
Side note though, it’s the speed that bothers me more than the reasoning. Qwen 3.5 is awesome, but my Claude subscription can tear through similar workloads an order of magnitude faster than my local LLM can when using Haiku. That’ll matter a lot to some people.
datadrivenangel 11 hours ago [-]
Yeah, this is the real killer. Slower and more expensive is tough.
root_axis 8 hours ago [-]
> They're good enough for 95% of use cases
They're not at all, not even close. Especially when you consider the use cases for people who are paying for LLM services today.
nightski 12 hours ago [-]
Hardware. Frontier labs are driving up demand so much that it's priced significantly above cost making it far less affordable. Just look at Nvidia's profit margins.
suika 13 hours ago [-]
The use cases in the future will be nothing like the use cases from today.
apublicfrog 6 hours ago [-]
Maybe. The use cases people primarily use LLMs for (documents, coding, design, research) existed decades ago with different tooling. Who knows if the future will have a slew of new problems that require new models or will continue to be similar?
avazhi 10 hours ago [-]
> What stops you from running the best open weighted LLMs currently available on consumer grade hardware for the rest of time?
Uh… the hardware requirements? And stop acting like some dog shit 8B model the average Joe can run on a laptop is even close to being comparable to what Claude or even Codex can currently do.
I have pretty good hardware and I’ve tinkered with the best sub-150B models you can use and they are awful compared to Anthropic/OAI/Grok.
apsurd 10 hours ago [-]
What if the harness and loops get sufficiently better though? CC is using Haiku for code-base grepping and such; you don't see a local commodity model being "good enough" for the 80% case when matched with better harnesses and tool calls?
honest question, i'm very interested in this, but too casual as of now to know any better.
byzantinegene 7 hours ago [-]
vast majority of average users don't use llms for coding, and for those purposes, local llms with low param count are a far cry from SOTA models.
apublicfrog 6 hours ago [-]
> And stop acting like some dog shit 8B model the average Joe can run on a laptop is even close to being comparable to what Claude or even Codex can currently do.
I'm not, you've actually illustrated my point. LLMs in 2022 were very impressive. By 2024 the general public was finding them an acceptable replacement for many research-driven tasks and massive shortcuts for other tasks (coding, image work, document preparation, etc).
Those models are absolutely runnable on consumer hardware now, and we were extremely happy with the results. It's no different from how we used to think CRTs or early smartphones were amazing, but going back now they seem awful.
We're long past "danger". If what we have is the best we'll ever have open source, we're already in an excellent position.
avazhi 3 hours ago [-]
> LLMs in 2022 were very impressive.
No they weren't. They were a gimmick - it is only in the past 6 or so months that frontier models have started to do stuff beyond mere gimmicks when it comes to coding, and you could make the argument that Mythos has been the first 'Holy shit' moment that we've had that has stepped us beyond 'Yeah that's really neat but...'
> Those models are absolutely runnable on consumer hardware now,
A sub 50B model is awful and can't even write proper English sentences half the time, to say nothing of how bad its world knowledge is. Try the 32B Gemma 4 local model for a week and then go back to Claude and then get back to me.
> We're long past "danger". If what we have is the best we'll ever have open source, we're already in an excellent position.
Not sure what to tell you other than that you and I have very different standards. What we have locally right now is barely more than a glorified autocomplete, and it feels worse than using ChatGPT 2 years ago because the context window is smaller and it doesn't have good webhooks on consumer setups. Another thing I'd say is that you clearly have no clue what 'consumer hardware' means, or what consumers who can even get this stuff running locally would have to do to get it to even rival the frontier models in terms of usability and flow (most consumers aren't going to just boot into Ubuntu and run this thing from a command line), to say nothing of the hardware requirements. I'd love to never use Claude or Gemini or ChatGPT again for both privacy and money reasons, but the quality of outputs and the depth of thinking and writing ability of even the very best local models you can run right now are many orders of magnitude below what you get using distributed frontier models, and those 'very best' local models require a top-of-the-line machine that 99.9999% of consumers don't have and would never consider buying. The cloud models all have like a trillion(!) parameters now. It isn't even close.
I sure hope the local side of things massively improves over the next 2-3 years, but based on how this has gone my guess is that in 3 years you'll be lucky, if you have very top of the line hardware, to get benchmark performance that we had 6 months ago with the frontier models. The distributed hardware/memory gap is just too big.
ai_fry_ur_brain 11 hours ago [-]
95% of use cases. What are you smoking.
selcuka 9 hours ago [-]
There are very good open weight models (such as DeepSeek v4 Flash) that can run on consumer level hardware.
Note that we are talking about 95% of everyone's use cases, not your specific use cases (which could require better models all the time).
10 hours ago [-]
oytis 14 hours ago [-]
What is the business model of open weight AI? I don't think there is any. At best it can serve as an advertisement for the more advanced models you sell.
The huge difference to open source is that you can't just train an LLM with free time and motivation. You need lots of data and a lot of compute.
I sure want to be wrong on that, I definitely like the open-weight version of the future more
wood_spirit 14 hours ago [-]
Meta released Llama just when OpenAI was so hot and its valuation was going through the roof. Speculating, but Meta probably thought the model not competitive enough to keep as a secret weapon but good enough to commercially damage OpenAI, who were a sudden competitor for most-valued-company?
In the same way you can imagine the Chinese government pushing the release of deepseek etc to make sure no one thinks the US has “won” and to keep everyone aware that a foreign model might leapfrog in the short term future etc.
At some point though if OpenAI/Antropic/Google plateau or go bust then the open source sponsorship becomes less likely, as making it open source was a weapon not a principle.
2ndorderthought 13 hours ago [-]
I disagree. I think deepseek, qwen, and kimi earn a lot of trust open sourcing their models. While still profiting.
Effectively they are saying "yeah, don't crowd our data centers with small queries, go ahead and send your frontier questions to our frontier models. Oh, btw, those US models? You can run something about as good for free from us if you want, hah." It's a power and marketing move. It's also insanely smart to keep up with it to remain sustainable as a brand, especially given how small their investments into this are.
Look at Anthropic's growing pains. DeepSeek has other hosts spreading their brand for free while they grow. Brilliant, honestly. In my opinion it makes Anthropic and OpenAI look clueless on a lot of levels.
China is playing a different game here. To them this is commoditizing their complement and building goodwill. The Chinese economy doesn't teeter on the brink of collapse to deliver frontier-grade LLMs. Nope, Alibaba just made Qwen because it needs it. It needs efficient models. Similarly, China manufactures and automates so much more than the US ever could. LLMs to them are a topping, not the whole meal like they are in the US.
WarmWash 11 hours ago [-]
The Chinese labs don't have to make money or be profitable. They are funded by the state to achieve the state's goals, and the global praise of their open models just serves as Chinese soft power.
They're state companies, not some kind of ethical VC charity fund project.
2ndorderthought 10 hours ago [-]
The fun part is, they are making money and have way less to pay off than the US companies do, despite the hundreds of billions in donations.
Spooky23 10 hours ago [-]
Is it so different?
If the US’s fascist experiment continues past the current president, we’ll absolutely be nationalizing frontier companies or exerting equivalent control.
treis 10 hours ago [-]
Yes, China is very different from the US.
ThunderSizzle 9 hours ago [-]
Sigh. Obama and Biden were every bit as "fascist" as Trump.
I'm glad I get reminded that TDS is real, but everyone forgets that Bush, Obama, and Biden all did things with executive power that Congress ignored or provided little real oversight for. And Congress has proven over the last several decades that their oversight is rather meaningless for the goals of American voters rather than special interests.
But it's all Trump's fault is much more convenient.
watwut 3 hours ago [-]
> Sigh. Obama and Biden were as every bit "fascist" as Trump.
Absolutely not. There is a huge difference in their behaviors.
> But it's all Trump's fault is much more convenient.
It is not just Trump's fault. Trump is the logical consequence of what the conservative party became. J.D. Vance and Miller are as much fascists, if not more. The whole party worked for this for years and created this.
> And Congress has proven over the last several decades that their oversight is rather meaningless for the goals of American voters rather than special interests.
Of course Congress in general is not the place to stop the Republican party from its fascist goals, because Republicans in Congress support Trump 100%. They stand by Project 2025 100%. They are doing oversight all right when it comes to blocking Democrats.
The idea that the party that made Trump big, promoted the ideas he built on and created Project 2025 is supposed to be a counterbalance to itself is absurd.
platevoltage 7 hours ago [-]
Certainly Biden and Obama check off a few of the 14 points of fascism, but are we really being serious here? "TDS" is just a thought-terminating cliché.
try-working 12 hours ago [-]
Correct. Open source is a PR and marketing strategy for new labs, regardless of origin.
Interesting article, but Qwen does seem to be closing off. They don't release big variants anymore, and I'm not sure that the fact the local-LLM community keeps praising it actually increases the number of people using their API.
It did work for Deepseek for sure and it seems to move the needle for Xiaomi's MiMo; but will it be enough for Qwen and Gemma? Those are the models you can actually run without going all-in on AI (but only with gaming GPUs and such).
try-working 9 hours ago [-]
Definitely. Open releases will accelerate this year, including from Qwen because they're behind in adoption.
HDBaseT 11 hours ago [-]
You can still make money on open weight models.
The compute required to run these models is still very far out of reach for the average consumer, or even the keen enthusiast, so they still sell inference, whilst also getting consumer goodwill for providing open weights.
datadrivenangel 11 hours ago [-]
And the efficiency! Big accelerator cards are ~100x the throughput per watt in terms of raw processing power.
mystraline 12 hours ago [-]
That's because the USA has really nothing big to export. Yay, designs.
China? I'm getting ready to watch the URKL (universal robot knockout league) go on. The USA is dicking around with failed robot dogs.
The USA has been a failed country, coasting on massive inertia. But the tech avenues from an article I can't find showed the USA excelling in 8/64 areas. China was excelling in 56/64.
WarmWash 10 hours ago [-]
China is an advanced 2nd world country with pockets of first world.
Smart people in China design fast manufacturing lines for $25k/yr.
Smart people in the US design bond hedging strategies or ad-pixel trackers for $250k/yr.
China is in the stage the US was in 60 years ago, and eventually those high paying, high impact jobs will suck the intelligence out of all the "blue collar" work. Just like it did in the US.
2ndorderthought 12 hours ago [-]
I believe it. The US intentionally lacks accountability to prop up the already wealthy in almost all of its ventures. Which socializes losses and capitalizes gains. It's an economic model that guarantees deterioration and stagnation.
Dodging politics, the power structures in us industry need serious revamping.
mrleinad 11 hours ago [-]
China is going to be the next Germany: a loser in the new world without globalization
watwut 3 hours ago [-]
> Thats because the USA has really nothing big to export. Yay, designs.
The USA exports, and has exported, services, especially in IT. And a lot of them. "The USA has nothing to export" is true only if you intentionally ignore the stuff the USA exports.
sillysaurusx 11 hours ago [-]
If this is true, then why are most of the companies that change the world founded in the US?
try-working 12 hours ago [-]
Open sourcing models is a marketing strategy. Chinese labs and small international labs have no awareness or distribution, so unless they become a hot topic for a while, nobody is going to bother trying out their models. Open source gets them that, and is essentially a tax on newcomers. When you start out you simply have no other option but to open source your models.
So, the business model of open models is the same as closed models: Sell inference. Open source is marketing for that inference.
China’s long term goal might just be to own the chip layer alongside everything else, and outproduce the US in data centers.
Frontier US labs could still have an advantage for a long time, but many use cases would start gravitating towards Chinese models if they 10x the data centers and provide similar quality inference for a third of the cost.
js8 13 hours ago [-]
What is the business model of Wikipedia? I don't think there is any.
Not everything good in our society needs to have a "business model". People still work on it. It's FINE.
sroussey 13 hours ago [-]
> What is the business model of Wikipedia?
Donations. Have you donated lately?
Wikipedia is cheap compared to creating and training models.
I don’t think donations will suffice at all.
As an example, we had millions of web developers download and install Firebug before browsers shipped their own dev tools. Donations over the course of multiple years would have paid my salary for a month if I were not a volunteer.
But from the “it’s fine” point of view, models will be baked into your OS.
Then later, models will be embedded into hardware. Likely only the OS makers' models.
selcuka 9 hours ago [-]
> Wikipedia is cheap compared to creating and training models.
DeepSeek said it spent $5.6M [1] on training V3, which doesn't sound like too much for a near-SOTA model.
An open source entity can come up with a hybrid business model, such as requiring a small fee from those who want to host the model as a business for the first n months following the release of a new model, but making it fully free for individuals.
Ultimately, information is a public good: it is non-excludable (you can’t stop people from using it) and it is non-rival (we can all use it at the same time). Public goods are often very useful, and because they are non-excludable and non-rival, they ultimately can’t have a market-based business model. I would class open-weights AI models as public goods, and would support government expenditure to produce them.
phainopepla2 13 hours ago [-]
Training AI models is capital intensive, though. Unless there's some sort of mega-crowdfunding effort for open weight model training there needs to be a way to recoup that money on the other end. Either that or state sponsorship I guess
PAndreew 14 hours ago [-]
Perhaps you can create a compelling UX around it and sell it as a subscription. "Normies" will not be able/willing to build it. You can then patch the model/ship new features around it as it evolves. For example, I have built an ambient todo list / health data extractor using Gemma 4 2EB and Whisper. Nothing to brag about, but it does a fairly decent job even in foreign languages.
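A rough sketch of that sort of pipeline, in case it helps: local speech-to-text with the openai-whisper package, then structured extraction with a small local model behind an OpenAI-compatible endpoint. The model tags, the endpoint, and the JSON-only prompt are illustrative assumptions; small models occasionally return non-JSON, so treat it as a starting point rather than anyone's actual implementation.

    import json
    import requests
    import whisper  # pip install openai-whisper

    def extract_todos(audio_path: str) -> list[str]:
        # 1) local speech-to-text
        transcript = whisper.load_model("base").transcribe(audio_path)["text"]
        # 2) structured extraction with a small local LLM (assumed endpoint/model tag)
        resp = requests.post(
            "http://localhost:11434/v1/chat/completions",
            json={"model": "gemma2:2b",
                  "messages": [
                      {"role": "system",
                       "content": "Extract action items from the transcript. "
                                  "Reply with a JSON array of strings and nothing else."},
                      {"role": "user", "content": transcript},
                  ]},
            timeout=300,
        )
        resp.raise_for_status()
        # note: small models sometimes add prose around the JSON; handle that in real use
        return json.loads(resp.json()["choices"][0]["message"]["content"])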
karussell 14 hours ago [-]
> What is the business model of open weight AI?
This is what I do not understand as well; advertising the knowledge and the more advanced models is also the only thing that comes to my mind.
For the past month I have been using Gemma 4 locally on a MBP M2, successfully, for many search queries (Wikipedia-style questions), and it is really good, fast enough (30-40 t/s), and feels nice as it keeps these queries private. But I don't understand why Google does this, and so I think "we" need to find a better solution where the entire pipeline is open and the compute somehow crowdfunded. Because there will be a time when these local models will become more closed, like Android is closing down. One restriction they might enforce in the future could be to cripple the models for "sensitive" topics like cybersecurity or health. Or the government could even feel the need to force them to do so.
2ndorderthought 14 hours ago [-]
Why would you want to try to support all users' simple queries on your AI data center if they could run them on their own computers?
It also builds goodwill, and it shows research prowess.
For China it's different. They need to show Americans, who don't trust them at all because of propaganda, that they have no tricks up their sleeve. It also doesn't hurt when Chinese companies drop models for free that people can run at home and that are about as good as Sonnet. Serious mic drop.
TheJCDenton 13 hours ago [-]
Very good point on using local AI to avoid data center costs.
Running AI models on local hardware was exploratory at first, and if it's so easy today it's thanks to open source. It's a little bit coincidental that we have this today, and that mainstream hardware has this capability. The fact that a phone can run very small models is exploratory, or some kind of marketing opportunity at best.
Why would hardware companies ship cards with more AI capabilities (like more VRAM) in the foreseeable future? On what grounds will the marketing for on-device AI keep generating interest? For something this important, it's very uncertain. But above all, it should not depend on these brittle justifications.
Showing goodwill in distribution and research prowess today is positive communication, but it can be exactly the opposite if/when an attack using those small models reaches a high-value target.
For China, the cultural difference is so huge it's difficult to say. I would think they first and foremost need to show everyone inside and outside of China that they match American models. Second, I would say that where Americans prefer a few very powerful companies from the get-go, because they can leverage a lot of capital rapidly to industrialize, China will prefer leveraging a lot of smaller companies exploring a lot of things simultaneously (so doing a lot of research), THEN creating legislation to let only the best (or a few) survive effectively. In the end it's the same result (monopoly or oligopoly), but China may have a stronger core (research) and America may have stronger productive capital, which may prove obsolete... In the long run, on either side it's a gamble, again.
2ndorderthought 12 hours ago [-]
They have already shown that their models match or excel over American ones in different cases. For cheaper too.
I disagree on the second point. I think most Americans don't prefer less competition; that's a bit antithetical to the free market.
I doubt the Chinese government cares as much about controlling a few companies as you think they do.
China has a few things going for it beyond research. They are mission driven, they actually have needs for this technology, their needs will forward their entire economy as they are the world's largest manufacturers. They are also huge exporters and have buckets of customer support for various languages.
China also has considerably stronger infrastructure for electricity, etc. Even with an Nvidia embargo they are doing more than showing up.
I don't think it's a matter of who "wins". There is no winning. I think China stands to gain far more from LLMs than the US does, and they have proven they don't need the US to do it, even with the US trying to sabotage its every move into the space. The game is already more or less over in my mind.
If anything I see LLMs as having a huge market in China, and now the US can't even sell it to them.
All I care about is, if I have to use this technology, let me run it locally to avoid the surveillance capitalism aspect. That seems to be the real reason the US has propped up its economy in anticipation of this technology. Yet it doesn't benefit the US, or me, in the long term.
codebje 10 hours ago [-]
I'd expect unified memory architectures (Apple M-series, AMD Ryzen AI series, etc) to be the future of local inference, not GPU cards.
2ndorderthought 10 hours ago [-]
Time will tell. Depends on small model architecture trends and hardware availability. I wouldn't be surprised if something came slightly out of left field. Considering Taiwan is trapped into producing the same chips for the next 2 years, I wouldn't be surprised if a new player emerged.
karussell 14 hours ago [-]
Indeed cost can be another factor. Maybe also the main reason why Chrome added an offline model.
2ndorderthought 14 hours ago [-]
That, and it's lucrative for Android/Chrome to have a text summarizer model embedded on your phone, probably for government contracts and data exfil, but we won't go there.
14 hours ago [-]
majormajor 14 hours ago [-]
> What is the business model of open weight AI? I don't think there is any. At best it can serve as an advertisement for the more advanced models you sell.
I don't think local will necessarily be open-weight. And then it's not that different from personal computing: you're giving up the big lucrative corporate mainframe, thin-client model for "sell copies to a ton of individuals."
So it'd be someone else (an Apple, or the next-year equivalent of 1976 Apple) who'd start eating into that. There are a few on-device things today, but not for much heavy lifting. At first it's a toy, could maybe become more realized in a still-toy-like basis like a fully-local Alexa; in the future it grows until it eats 80-90% of the OpenAI/Anthropic use cases.
Incumbents would always rather you pay a subscription or per-use forever, but if the market looks big enough, someone will try to disrupt it.
treis 12 hours ago [-]
Compute has gone back and forth from mainframe/thin client to fat client a few times already. LLMs will probably follow at some point but I think it's going to take a long time.
The cost to transmit text is basically free and instantaneous. The rent (i.e. a GPU in a data center) vs. buy question is going to favor rent until buy is a trivial expense, like the 50-100 range.
Even then, an LLM that just works is easier than dealing with your own.
majormajor 8 hours ago [-]
Storage has moved back and forth but I don't think compute has ever really gone back to thin client. Even Gmail, Google Docs, etc. are running a buttload of JavaScript on the user device. Various attempts at avoiding that (remote .NET or JVM stuff on early "smart-ish" phones) crashed and burned.
Video game streaming is the closest thing, and it's never really taken off. (And this, IMO, is a good comparison because it's a pretty similar magnitude up-front-cost, $500-$4000.)
Once local AI is good enough (Sonnet level for a lot of basic tasks, say) for a $1k up-front investment, the appeal of having something that can chew on various tasks 24/7 w/o rate limits, API token budget concerns, etc, is going to unlock a lot of new approaches to problems. Essentially more fully-baked line-of-business OpenClaw-type things. Or the smart home automation bot of Siri's dreams. You can more easily make that all private and secure when all the compute is local: don't give any outside network access. Push data into the sandbox periodically via boring old scripts-on-cronjobs, vs giving any sort of "agentic" harness external access. Have extremely limited data structures for getting output/instructions back out. I'd never want to pass info about my personal finances into a third-party remote model; but I'd let a local one crunch numbers on it.
Even if you need Opus/Mythos/whatever level for certain tasks, if 95% of everything else you'd pay Anthropic or OpenAI for can now be done on things you own w/o third party risk... what does that do to the investment appeal of building better AI appliances to sell end users vs building better centralized models?
I think "what if today's LLM performance, but running entirely under your control and your own hardware" opens up a LOT of interesting functionality. Crowdsource the whole world's creativity to figure out what to do with it, vs waiting for product managers and engineers at 3 individual companies to release features.
treis 8 hours ago [-]
There was a time where people ran software on their computer with limited connectivity. Late 90s/early 2000s most of what you did was running locally on your machine. Your emails would be downloaded and there'd be a shared drive but otherwise all local.
Anyways, who's spending $1k on an LLM machine when they can spend $20 (or $0) on a subscription? And who's having an LLM crunching away 24/7 anyway? Anyone who is going to do something like that probably wants a cutting edge model.
It'll (probably) get to a point where the hardware is cheap enough and advancement levels off. But we're a ways from that and even then when a data center is 20ms away why not offload heavy compute that's mostly text in text out.
zozbot234 12 hours ago [-]
Except that buy is a trivial expense because the hardware has been bought already. You've got a whole lot of iGPU and dGPU silicon that's currently sitting idle as part of consumer devices and could be working on local AI inference under the end user's control.
thefounder 6 hours ago [-]
Cloud providers have incentives to release open source models, but for some reason this happens only in China. Amazon, Azure, and Google benefit from open source models because people run them on their hardware.
worldsayshi 14 hours ago [-]
It should be feasible to crowdfund training runs, right?
dmd 14 hours ago [-]
A training run costs somewhere in the neighborhood of a billion dollars. That’s a thousand millions.
How many crowdfunded projects do you know that have raised even one percent of that? Who’s going to be in charge of collecting that scale of money? Perhaps some sort of company formed for the benefit of humanity, which will promise to be a non-profit? Some sort of “Open” AI?
Oh, wait.
derektank 10 hours ago [-]
It’s well within the capabilities of governments in developed countries. If Mistral did not already exist, I would definitely expect the French government to invest in a national LLM, if only because of how defensive they are of the French language.
iugtmkbdfil834 14 hours ago [-]
<< That’s a thousand millions.
I can't say that you are lying, and you are not exactly exaggerating either. It is true that a new SOTA model -- from literal scratch -- would be expensive.
But, and it is not a small but, is the starting point really zero?
sumeno 13 hours ago [-]
If a local model hits critical mass the business model is to use it to shape opinions in a way that is advantageous for the company/owners.
Much like the current Twitter model, being able to put your thumb on the scale of "truth". Bake a stronger bias towards their preferred narrative directly into the model. Could be as "benign" as training it to prefer Azure over AWS. Could be much worse.
dleslie 13 hours ago [-]
This is where government funding can play a role.
Sometimes there are things where the public good is best served with public expenditure.
CamperBob2 12 hours ago [-]
"Government funding" these days would mean that Trump pays Elon Musk (or more likely vice versa) to make Grok 4.20 the only legal LLM for use by Americans.
dleslie 12 hours ago [-]
Outside of the USA it would not look like a wealth transfer to an oligarch.
Not every country is in a crypto-libertarian race to hoard power and wealth.
CamperBob2 10 hours ago [-]
Not every country is in a crypto-libertarian race to hoard power and wealth.
Meanwhile, in the EU, the model would be collectively financed, trained by a competent, neutral agency... and then completely lobotomized in the name of "the children," "safety," "IP rights," "correct speech," dozens of individual countries' legal and regulatory requirements, and any number of additional vocal, noncontributing NGOs.
So no one would get rich off of the public model, but no one would get much of anything else out of it, either.
As another reply suggests, there's a reason why things happen in the USA first. Even when they don't, the prime movers move here as soon as they can. Or at least they used to.
fragmede 14 hours ago [-]
The business model is the total lack of attention to Qwen and Kimi that would happen if their models weren't downloadable. Before releasing the weights, there was basically zero attention paid in the western hemisphere to them, for whatever reason. By releasing the weights, they're relevant in the western world. The business model is to get people in the West to pay to use their platform hosting their AI, that otherwise would never have heard of them. As you said, advertising/marketing, essentially.
codebje 10 hours ago [-]
Baidu have a lot of services I've never heard of, that are highly successful in China. The lack of interest in expanding into Western audiences doesn't seem to matter there - what's different about inference?
digitaltrees 10 hours ago [-]
Exactly this. The assumption that your access will last is very risky. And the assumption that Chinese companies will keep trying to erode the economic viability of American models by open sourcing reverse-engineered models forever is naive.
ios-contractor 12 hours ago [-]
I don't think it should be local vs. cloud AI. I think local AI should be treated as a separate product: local AI should do the things that really don't need cloud AI, and cloud AI should be used as a fallback. That would reduce a lot of costs.
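A rough sketch of that routing idea: try a small local model first for easy requests, and fall back to a hosted one when the task is hard or the local server is unavailable. The endpoints, model names, and the CLOUD_API_KEY environment variable are placeholder assumptions, not a real product's API.

    import os
    import requests

    LOCAL_URL = "http://localhost:11434/v1/chat/completions"   # assumed local server
    CLOUD_URL = "https://api.example.com/v1/chat/completions"  # hypothetical hosted endpoint

    def _chat(messages, url, model, headers=None):
        resp = requests.post(url, json={"model": model, "messages": messages},
                             headers=headers or {}, timeout=120)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    def ask(messages, hard_task: bool = False) -> str:
        """Prefer the cheap, private local model; fall back to the cloud for hard tasks or outages."""
        if not hard_task:
            try:
                return _chat(messages, LOCAL_URL, "qwen2.5:7b")
            except requests.RequestException:
                pass  # local server down or timed out; fall through to the cloud
        key = os.environ["CLOUD_API_KEY"]  # placeholder env var
        return _chat(messages, CLOUD_URL, "big-hosted-model",
                     headers={"Authorization": f"Bearer {key}"})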
slicktux 13 hours ago [-]
I’m just waiting for the US Government to implement their own local AI. Which will eventually lead to them open sourcing it, because it's taxpayer funded, and given that the NSA has decades' worth of internet data to train on, the open weights would be just as good as any company's…
fragmede 5 hours ago [-]
with this administration?
beloch 10 hours ago [-]
Keep the Silicon Valley pattern in mind:
1. Innovate, create, and offer it all at sweetheart prices to the public while you rack up debt.
2. Shovel in more money and either buy out or outlast the competition. Become dominant. Lock in your users any which way you can.
3. Enshittify and cash in.
The deals Anthropic, OpenAI, etc. offer won't stay this good much longer. Don't let them lock you in. Failing that, you should budget more for the same service. You're going to need it. Having an open alternative running on your own hardware offers non-negligible peace of mind.
aabhay 14 hours ago [-]
Disagree with this. When cost becomes an important factor, or the free-but-worse option becomes compelling and accessible (e.g. an on-device agent with Apple-style UX), there has been a significant shift in user behavior towards local. Think about stuff like removing backgrounds from photos or OCR on PDFs: who uses paid services for casual usage of these things?
furyofantares 12 hours ago [-]
What's the gamble here exactly? What agency do we have in it right now?
iLoveOncall 13 hours ago [-]
The mainstream audience does not have the faintest idea that "local AI" is even a thing.
CamperBob2 13 hours ago [-]
Just as their counterparts in 1975 had no idea that "personal computers" were even a thing.
Read through a 1970s-era issue of Popular Electronics or Byte, and then spend some time surfing /r/LocalLlama. You'll get a sense of real-time deja vu, like you're watching history unfold again.
irishcoffee 13 hours ago [-]
I own 2 5070TI cards in a rig I would gladly donate time to for a distributed training model effort. The kicker is the training data. I would want to gate the data to anything before 2022. I don’t know how to coordinate that, but I would really like to be involved in something like this. SETI, for LLMs.
AlexCoventry 12 hours ago [-]
Bandwidth is the killer, in distributed LLM training.
irishcoffee 11 hours ago [-]
What’s the rush?
codebje 10 hours ago [-]
It depends on the purpose for the model. AFAIK LLMs aren't particularly capable at researching answers, relying more on having 'truth' baked in to their weights, so if it takes 12 months to train up a crowd-trained LLM it'll be 12 months behind the times.
How serious a risk is poisoned weights?
Can we leverage the cryptobros into using LLM training as a proof of work?
MarsIronPI 8 hours ago [-]
What? I use Qwen 3.5 35B-A3B and it definitely knows how and when to do web searches to fill in gaps in its knowledge.
codebje 6 hours ago [-]
Does Qwen3.5 know it needs to do this because the API in question has had loads of churn and much of its training data is on obsolete versions, or do you need to prompt it? How well does it handle having an API reference with sample code in its context window?
Having an LLM use a web search tool isn't the same thing as researching a topic, IMO, because it's so ephemeral and needs constant reinforcement. LLMs aren't learning machines, they're static ones.
irishcoffee 25 minutes ago [-]
How many facts change over time to create obsolete data? Unless you’re researching current events, I contend it’s a moot point.
michaelje 12 hours ago [-]
[dead]
RataNova 13 hours ago [-]
[dead]
throawayonthe 11 minutes ago [-]
it's not going to happen with LLMs unless RAM + storage get several orders of magnitude cheaper, like, yesterday
informatics aren't magic, you'll never be able to compress """knowledge""" into a small model in a way equivalent to the 1.5 TB model
kilroy123 4 minutes ago [-]
I agree. But I also think the future is some kind of hybrid approach where agents run locally what they can, and then call out to the cloud for what they can't.
acidhousemcnab 9 minutes ago [-]
This will happen, but reconfiguring the infrastructure of the entire planet to train LLMs and run them over networks might be the "bubble", the megalomania.
Guillaume86 13 hours ago [-]
I think we should separate the private AI discussion from the local AI discussion.
The pragmatic choice to run big LLMs is one/several big servers online, but that doesn't mean private companies should be the only ones to run them.
A self-hosted inference solution that offers good tenant isolation guarantees (ideally zero trust) and is easy enough to deploy and maintain (think Plex for AI) would be my choice for privacy. Now, to be honest, I have done zero research about this and have zero idea how feasible that is; maybe it already exists and there are some Discord servers I should join?
Edit: I don't need to mention it here but what's incredible is that open models are in the ballpark of the best commercial models so supposedly, the hardest part by far is already solved.
FrasiertheLion 12 hours ago [-]
Another option is verifiably private inference with open source models running inside secure enclaves on the cloud (using NVIDIA confidential computing), and the enclave code is open source and verified via remote attestation upon connection, cryptographically proving that the inference provider cannot see any data. Tinfoil: https://tinfoil.sh/ is a good example of this (disclaimer: i'm the cofounder). You can read more about how this works here: https://docs.tinfoil.sh/verification/verification-in-tinfoil
>that open models are in the ballpark of the best commercial models
This is basically true for certain tasks. As an example, chat interfaces are not well poised to take advantage of higher model intelligence than what the best open source models already provide. But coding harnesses still benefit from greater model intelligence and even more so, the reinforcement learning that tightly interlinks the provider's coding harness (claude-code, codex) with the model's tool calling interfaces is another reason for discrepancy in effectiveness even when controlled for model intelligence. The opencode founder (open source coding harness that supports different model providers) was recently complaining about the challenges making the harness work well with different providers: https://x.com/thdxr/status/2053290393727324313
supermdguy 9 hours ago [-]
Interesting to see this after the recent post about Chrome’s on-device model using up 4gb of storage, which frustrated a lot of people [1].
I agree local models are great, and it’s cool that Apple has models built in now. But I feel like it basically has to be an OS level feature or users are going to get upset. I’d certainly rather have a small utility call out to OpenAI than download its own model.
I get the sentiment for self hosting. But there are a few counter arguments:
- Self hosting is expensive. It involves expensive machines with GPUs that cost hundreds per month if you use cloud based ones. You might need multiple of those. And you need people to mind those machines and they are even more expensive per month.
- If you run stuff on your laptop, it consumes a lot of resources and energy. I have Qwen running on my laptop. Even minimal usage turns my laptop into a radiator. Nice as a demo, but I can't have it this hot all the time. It would run out of battery, and it's probably not great for the longevity of components in the laptop.
- Models are evolving quickly and the self-hosted smaller ones aren't as good when it comes to things like tool usage, reasoning, etc. Being able to switch to the latest model is valuable.
- It's easier to get your use case working with one of the top models than with one of the smaller self hosted ones.
- If you get the wrong hardware, it might not be able to run the latest models very soon.
- Self hosting models is mostly a cost optimization. It only becomes relevant if you hit a certain scale.
- You have alternatives in the form of hosted models via a wide range of service providers. Some of those are EU based and offer all the things you'd be looking for if you are offering your services there. Including legal requirements.
- Reinventing what these companies do in house is technically challenging and possibly more expensive than self hosting models because now you need a lot of engineering capacity dedicated to that. And legal. And all the rest.
If, like most companies/people, you are at the experimenting stage, the cheapest and fastest is just getting an API key from an API provider of your choice. You can take it from there if your experiment actually works. And then it's mostly about optimizing cost. If your API usage goes to the thousands per month or worse, it becomes a cost/quality trade off.
mgrund 3 hours ago [-]
I really really want to like local AI, but I highly doubt it will see wide adoption for a long time.
The additional up-front cost for hardware designed to run an LLM in addition to normal workload is unlikely to be accepted by most consumers.
The scale will be very constrained (like Apple's on-device models, which are small, heavily quantized, and have a small 4K token context window). It’s also terrible for battery life.
AI as it is implemented today is simply just computationally expensive and unless you put in dedicated hardware (like the ANE) for only this purpose - a large cost driver - I don’t really see it getting large scale adoption.
Companies will probably need a server-backed solution as a fallback if they want a reasonable user experience, so why even invest in diverse hardware support?
teiferer 59 minutes ago [-]
Every reply here forgets/overlooks the main reason why this is not going to happen: the astronomical AI data center investments currently underway. Those places are not just for training. They are for inference too, and they are the way all those investments are expected to eventually pay off. The whole AI sector of our industry depends on running models in these places.
zozbot234 45 minutes ago [-]
These astronomical AI data centers will be used for high-value inference with smarter models that really are too large for running locally. The investments will be fine once they pivot to that use. Currently available open models are not in that range.
ninjahawk1 4 hours ago [-]
In my opinion, this is similar to the earlier internet and computers. Few households or individuals had access to state of the art computers; it was primarily researchers or more well-off individuals. Most random people didn’t really know what it was and certainly didn’t use one.
Now today, AI is very expensive and not readily accessible to most people without paying a good amount.
The early internet became "now you can just get a free phone from phone companies so long as you take their extras." Then you get a ton of subscriptions and add-ons, but you don’t have to spend money; you could just use YouTube with ads, etc.
Local AI would similarly shift this dynamic to paying for access to plug-ins and tools for your local AI to use, like how the subscription model works right now.
With local model advancements, such as specifically Qwen 3.6 35B A3B, this future is becoming more likely by the year IMO.
robot-wrangler 13 hours ago [-]
Entrenched interests are going to do everything to stop local, but there's at least a few technical reasons to believe small and specialized models could be the norm eventually. If that does happen, local will follow.
TFA is focused on whether big models are necessary for what users want. There's some evidence they may never actually be reliable enough unless (a) mechanistic interpretability matures far enough or (b) our multi-agent systems all become multi-model.
For (a), advancement in MI might fix problems with big models, but would also mean we can maybe get unified representations, and just slice and dice the useful stuff out of huge models, getting only what we need without the junk. Ability to isolate problems won't really come without bringing the ability to isolate functional subsystems. Only want logic? Only vision? Just cut it out of the big monster and enjoy reduced costs and surface area for problems.
For (b), just look at stuff like the evil vector, or the category of hallucinations specific to tool-use. Without a complete solution for helpful/honest/harmless alignment, it seems likely that creativity and rigor (and many other things) are fundamentally at odds. If you start to need many models for everything anyway, why do we need the huge expensive do-everything ones? So specialization also becomes a pressure to shrink everything towards minimal reliable experts.
wrxd 13 hours ago [-]
The example in the post confirms my theory that for local models to succeed they need to be "good enough"; they don't need to be big enough to compete with frontier models.
They need to be able to do a small task well and they need to be able to run reasonably on consumer-class devices. Even better if they can run on mobile phones.
In my experiments with local LLMs I noticed that, while increasing the size of the model is nice, the real thing that turns a barely usable model into something useful is the ability to use tools.
Giving my models the ability to search the web and fetch web pages did way more to solve hallucinations than getting a bigger model. And it doesn't have a training cutoff.
Sure, the bigger model is probably better at using tools but I often find the smaller models to be good enough.
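Roughly what that loop looks like in practice, as a minimal sketch: a local OpenAI-compatible server (llama.cpp's llama-server, Ollama, etc., assuming it supports the OpenAI tools API) is given a single fetch_page tool so answers can be grounded in a live page instead of stale training data. The base_url and model name are placeholders.

```python
import json
import urllib.request
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

def fetch_page(url: str) -> str:
    """Fetch a web page and return (truncated) raw text for the model to read."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")[:8000]

tools = [{
    "type": "function",
    "function": {
        "name": "fetch_page",
        "description": "Download a web page and return its raw contents.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

messages = [{"role": "user", "content": "What does https://example.com say?"}]
reply = client.chat.completions.create(model="qwen-small", messages=messages, tools=tools)
msg = reply.choices[0].message

# If the model asked for the tool, run it locally and feed the result back.
if msg.tool_calls:
    call = msg.tool_calls[0]
    result = fetch_page(**json.loads(call.function.arguments))
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
    final = client.chat.completions.create(model="qwen-small", messages=messages, tools=tools)
    print(final.choices[0].message.content)
else:
    print(msg.content)
```

Even a small model does noticeably better when the answer is sitting in its context instead of having to be recalled from its weights.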
Gigachad 7 hours ago [-]
Will there even be a web to search in the future? These days public access blogs are dying and being replaced with hallucinated AI websites. Sites with original research like Reddit and YouTube are being locked up to prevent 3rd party indexing.
Knowledge and clean data sets are becoming increasingly valuable, and free community knowledge is drying up. The next big programming language won’t have years of Stack Overflow posts to train on.
Maybe we will see some kind of licensing deals where owners of good datasets charge you a fee to let your AI search them.
gregjw 2 hours ago [-]
Is there a place to learn more about local AI specifically, and maybe even more specifically about models for bespoke purposes, or curating them yourself for more specific uses? Feels like there's a lot of fat you can trim off because you don't need generic use, but I don't understand where to even begin there.
StevenWaterman 2 hours ago [-]
/r/localllama is one of the most useful places
worthless-trash 6 minutes ago [-]
How long till we have distributed AI, where different people run/understand different parts of problems and pass off work to different nodes across the internet?
almogodel 2 hours ago [-]
Remember nodes and graphs? A ComfyUI-style user interface allows pretty incredible wiring among models; local AI is like Eurorack. The current graph skews heavily towards a pair of small dense models collaborating with the large heavyweights selectively. It's Qwen 3.6 27B with Gemma 4 31B, both unquantized (bf16/fp16), with Phi 14B, Nemotron Cascade 2, and then those large heavyweights: R1 and subsequent DeepSeek models including Speciale, GPT-OSS 120B, GLM, MiniMax, Kimi, Command R, the Mistrals, everybody, up in one graph, all those LLM nodes patched and interconnected. Slow, resource intense, better than non-local AI. I used Matteo's graphllm for inspiration, plus ComfyUI (and ST), and used the models to roll a new imgui node/graph model compositor. Now what?!
gpugreg 1 hours ago [-]
> Slow, resource intense, better than non local ai
Why should connecting small models to big models result in higher output quality than just running the big models without the small models?
tomelders 2 hours ago [-]
I do think local models are the future, but there's still the question of cost to be answered. Even if there's some slew of efficiency improvements that mean an LLM can run locally on consumer-level hardware on an affordable budget (and that's a big "if"), there's still the cost of training the models to consider.
Assuming we end up in a future where people pay to run multiple smaller models on their machines for specific tasks (e.g. A summariser model, a python coding model, or however fine grained/macro you want to go), the people training those models will need to turn a profit.
So how much will that cost? And how often will consumers have to pay? Models have a very short shelf life. Say you have a dedicated Python coding model: that needs re-training every time there's a significant update to the language itself, any popular packages, or related technologies (e.g. servers, cloud infra, etc). So how often will users need to "upgrade" to the latest version? It's going to be "frequently".
And it still needs the language stuff on top of that. Users aren't going to interact with a Python coding model by writing Python; they're going to use natural language. So the model needs all that stuff. And they're going to give it problems to solve. What if you asked the model "Write me a Bezier curve function"? It needs to know about Bezier curves, which have nothing to do with Python. So where do these LLM providers draw the line on what makes it into the training data and what doesn't?
And if an LLM doesn't know what a Bezier curve is, that's not going to stop it from just hallucinating an answer. If a significant proportion of prompts resulted in a response that said "Sorry, I don't know what you're talking about", then people would just stop using it. The utility of these things will be quickly overshadowed by the frustrations.
The way these frontier models have been introduced and promoted has set unrealistic expectations, and there's no putting the genie back in the bottle.
rufasterisco 2 hours ago [-]
> the question of cost to be answered.
Commoditizing complements.
If Anthropic/OpenAI/etc. is eating your lunch, make your product work with cheap local LLMs: you can beat them on price by having local inference you don't pay for (nor need data centers for), and try to keep your (user/data) moat.
The more Anthropic/OAI disrupt, the more likely this is to happen. If they don't disrupt enough (i.e. grow as an ecosystem to defend against incentives to commoditize), then yes, those incentives are removed, but they also leave money on the table, which they need.
Not only at the business level, but also geopolitical (to a lesser extent? or not, since lots of open-weight models come from China).
tomelders 2 hours ago [-]
What are you talking about Willis?
revolvingthrow 15 hours ago [-]
A local Answer Machine is the dream, especially when the internet is decaying and generally on its last legs, but the hardware requirements seem like a huge mountain to climb. Things are progressing tremendously - deepseek v4 flash is very good for what it is - but even that goes beyond any reasonable local setup, which imo is 128 GB RAM + 16 GB VRAM. 4 RAM slots on a consumer board craters RAM speed, 256 GB Macs are too expensive, and even then the inference is ungodly slow.
On the other hand… the v4 flash model is actual magic compared to what was available 2 years ago. If the rate of improvement stays as is, we'll get similar performance in a ~120B model in a year, which is viable (if expensive) for everyman hardware. Possibly you'll be able to run its equivalent on a ~$1200 laptop by 2028, which for me-in-2020 would sound straight out of a sci-fi movie. A good harness that lets the model fetch data from other sources, like a local Wikipedia copy from Kiwix, could do a lot for factual knowledge too; there's only so much you can encode in the model itself, but even a cheapish (pre-current-prices) 2TB drive can hold an immense amount of LLM-accessible data.
Big caveat: I don’t see local models for programming or generally demanding agentic tasks being worth it anytime soon. You likely want bleeding edge models for it, and speed is far more important. Chat at 20tok/s is fine; working on even a small codebase at 20tok/s, especially on a noticeably weaker model, is just a waste of time. Maybe it’s a PEBKAC but I have no idea how people make any meaningful use out of qwen 3.6.
zozbot234 13 hours ago [-]
> and even then the inference is ungodly slow.
This is the wrong way of putting it. Local inference with SOTA models is all about slowing down compute for the sake of fitting on bespoke repurposed hardware. You don't need to go fast if you have the whole machine to yourself 24/7. Cloud AI vendors can't match that kind of economics.
As OP says, it shines in constrained environments where the model is transforming user-owned data. Definitely less useful for anything more open-ended.
2ndorderthought 13 hours ago [-]
Yea, I do not recommend treating Chrome's Prompt API as a good example of local LLMs. It's fine and stuff, but it's really weak. 8B models from a year ago are better in some ways, and a lot of the recent model drops are meaningfully better.
scriptsmith 13 hours ago [-]
It's based on a Gemma 3n model, and yeah it's not the best. But if you have a use case that needs constrained JSON output for example, it's pretty neat.
Maybe it would do better with the new Gemma 4 models, which the Chrome devs have been hinting at moving to. And why the API doesn't let you introspect / pick the model, I'm still not sure.
robot-wrangler 12 hours ago [-]
> I've got some demos of what the new Prompt API can do:
> Use surrounding context to rewrite your ad copy:
Yup, that's the plan. No local model, no webpage; more, better and cheaper adtech extortion/surveillance for vendors while everyone else pays for the juice and hardware degradation.
dakolli 13 hours ago [-]
So you're running an LLM to do data transformation that deterministic processes would be much better suited for, and running a 1,000-watt power supply to do so. Wild.
khoury 2 hours ago [-]
Agree with the sentiment, but: "We are building applications that stop working the moment the server crashes or a credit card expires."
This has been the case for way longer than OpenAI and Anthropic have been around, with services like AWS, Cloudflare, etc.
manyatoms 7 hours ago [-]
It just depends how quickly models become "good enough" that we don't care about SOTA
julianlam 7 hours ago [-]
Arguably, some of the things HN readers ask for can be capably completed by a local open weight model for free.
Tepix 3 hours ago [-]
I'm pretty sure that AI assistants will become widespread.
I consider it to be very careless to entrust your emails, your chats, your calendar, your notes, your calls, your pictures, your contacts, your location history, your waking hours, your files, your TODO list, i.e. stuff including your health data to the for-profit AI companies. The temptation to earn money with your data is just too great, plus the risk of the data being stolen and sold illegally.
Local AI should be the default. For everyone who can't do local AI, we need confidential compute. Yes, it has been hacked before, but it makes attacks a lot harder.
pjerem 3 hours ago [-]
> I consider it to be very careless to entrust your emails, your chats, your calendar, your notes, your calls, your pictures, your contacts, your location history, your waking hours, your files, your TODO list, i.e. stuff including your health data to the for-profit AI companies.
Still, we all do it with Google. (I don't do it anymore, but I did it for mostly two decades, so I include myself.)
maxdo 56 minutes ago [-]
The start of the argument is already broken. OK, slapping an API on is bad, so you stand up an API that mimics your provider's, install some Chinese LLM that will never obey any lawsuit in your country, install 500 packages to do so, each of them a potential security risk. How is that better?
Oh yeah, it feels independent and not lazy, sure.
duchenne 7 hours ago [-]
Cloud models can use batch processing, which is significantly more efficient. A local model runs with a batch size of basically one, which takes about as much time to process as a batch of 100, because the GPU is memory bound and spends most of its time streaming the model from VRAM to the GPU cache while the GPU cores sit idle. With a batch of 100, the model-loading time and compute time are roughly comparable. So local models start with roughly a 100x efficiency penalty. Secondly, local models are idle most of the time waiting for the user to write a prompt, so the efficiency gap is probably more like 1000x.
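A back-of-envelope sketch of that argument: at batch size 1, decode speed is bounded by how fast the GPU can stream the weights from VRAM, and that read is "wasted" on a single token; a shared server amortizes the same read over many concurrent requests until compute becomes the limit. The numbers below are illustrative assumptions, not measurements.

```python
WEIGHTS_GB = 14            # e.g. a ~7B model at fp16
VRAM_BW_GBPS = 900         # assumed memory bandwidth of a consumer GPU
COMPUTE_TOKS_PER_S = 4000  # rough compute ceiling once memory stops being the limit

def decode_tokens_per_s(batch: int) -> float:
    # One forward pass streams all weights once and yields `batch` tokens.
    passes_per_s = VRAM_BW_GBPS / WEIGHTS_GB
    return min(passes_per_s * batch, COMPUTE_TOKS_PER_S)

for b in (1, 8, 64, 128):
    total = decode_tokens_per_s(b)
    print(f"batch {b:3d}: {total:6.0f} tok/s total, {total / b:5.1f} tok/s per request")
```

With these assumptions a batch of 64 delivers roughly 60x the total throughput of a batch of 1, while each individual request still decodes at a similar speed, which is where the efficiency gap comes from.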
DrScientist 1 hours ago [-]
And what if your local computer essentially has a model chip with dedicated memory where the model stays loaded 100% of the time?
r0b05 5 hours ago [-]
It's an interesting point, but local GPU efficiency is not something I think about when I'm being rate limited or when my subscription costs keep rising.
vb-8448 14 hours ago [-]
> Use cloud models only when they’re genuinely necessary.
The problem is that it's much easier to use the SOTA models (especially if they are subsidized) instead of spending time tweaking the knobs on the local one.
I just realized this with coding agents: yeah, you probably shouldn't always use the latest version at xhigh, but you will end up doing it because you get the job done in less time, with less "effort", and basically at the same price.
I guess we'll see a real effort for local AI only when major vendors start billing based on actual token usage.
lelanthran 13 hours ago [-]
> The problem is that it's much easier to use the SOTA models (especially if they are subsidized) instead of spending time tweaking the knobs on the local one.
That's not a problem, that's a feature; I have something like 8 tabs open to different free-tier providers. ChatGPT, Claude and Gemini are the SOTA ones.
I have no problem maxing one out, then moving to the next. I can do this all day, having them implement specific functions (or classes) in my code. The thing is, because I actually know how to write and design software, I don't need to run an agent in a loop to produce everything in a day; I can use the web chatbots with copy/paste to literally generate thousands of lines of code per hour while still having a strong mental model of the code, so that I can go in and change whatever I need to.[1]
---------------------
[1] Just did that this morning on a Python project: because I designed what I needed, each generation was me prompting for a single function. So when I needed to add something this morning I didn't even bother asking a chatbot to do it; I just went directly to the correct place and did it.
You can't do that if you generate the entire thing from specs.
vb-8448 13 hours ago [-]
We are speaking about local AI, and having all these SOTA models basically for free is blocking the progress of local or independent third-party setups.
lelanthran 13 hours ago [-]
Maybe I should have clarified what the feature is (after re-reading my post, I see that I basically just ended after adding the footnote).
The feature of using all these SOTAs to exhaustion on the free tiers is burning their VC money!
The more I use for free, the more of their money I burn, the closer we'll get to actual 3rd-party and independent setups (local or otherwise).
RataNova 13 hours ago [-]
The path of least resistance usually wins, especially when the pricing hides the real cost
Analemma_ 14 hours ago [-]
I'm also just not seeing good performance from local models. Every time a thread about LLMs comes up, there are tons of people in the comments insisting that they're getting just as good results from the latest DeepSeek/qwen/whatever as with Opus, and that just hasn't been my experience at all: open-source models just fall over completely compared to Claude when asked to do anything remotely complicated.
I have a sneaking suspicion this is kinda like the situation with Linux in the 90s, where it kinda worked but it reeeeeally wasn't ready for the home user, but you had a lot of people who would insist to your face everything was fine, mostly for ideological reasons.
kgeist 14 hours ago [-]
It depends a lot on how you run those models, and I think a lot of the disagreement comes from that. A lot of people run local models with incredibly small context windows (which makes an agentic LLM go in circles), use very small quants (like 4-bit => huge degradation), don't set the recommended parameters (like top-p/temperature), or download GGUFs with broken chat templates. And then they claim model X is bad :)
I'm currently running both Sonnet 4.6 and Qwen 3.6-27b on the same codebase (via OpenCode, the parameters were carefully tuned to have a good quality/context size ratio), and on this project, they both struggle with complex non-trivial tasks, and both work flawlessly otherwise. Sonnet 4.6 understands the intent better if my task is ambiguously formulated, but otherwise the gap is pretty small for coding under a harness.
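For the record, a minimal sketch of what "running it properly" tends to mean, using llama-cpp-python: a decent quant, a context window large enough for agentic use, and the sampling parameters the model card recommends. The file name, context size and sampling values below are placeholders; take the real ones from the specific model's documentation.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="qwen-27b-q5_k_m.gguf",  # prefer Q5/Q6 quants over tiny 4-bit ones if memory allows
    n_ctx=32768,                        # small default contexts make agents loop; raise it
    n_gpu_layers=-1,                    # offload every layer that fits onto the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this diff: ..."}],
    temperature=0.7,   # use the vendor-recommended values, not library defaults
    top_p=0.8,
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```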
lelanthran 13 hours ago [-]
> Every time a thread about LLMs comes up, there are tons of people in the comments insisting that they're getting just as good results from the latest DeepSeek/qwen/whatever as with Opus, and that just hasn't been my experience at all: open-source models just fall over completely compared to Claude when asked to do anything remotely complicated.
Different usage patterns - you want to issue a single spec then walk away and come back later (when it has consumed $10k worth of API tokens inside your $200/m subscription) to a finished product.
Many people issue a spec for a single function, a single class or similar. When you break it down like that, the advantages of SOTA models shrinks.
vb-8448 13 hours ago [-]
My experience is that in medium/big codebases, even with single functions, going with the xhigh model is basically better from a user perspective (faster to get the result, and you can trust it), while with lower models (e.g. Sonnet instead of Opus) you always have to carefully review the output, because 1 time in 10 it will hallucinate, you won't catch it immediately, and at some point it will bite you.
lelanthran 13 hours ago [-]
> My experience is that in medium/big codebases, even with single functions, going with the xhigh model is basically better from a user perspective (faster to get the result, and you can trust it), while with lower models (e.g. Sonnet instead of Opus) you always have to carefully review the output, because 1 time in 10 it will hallucinate,
What do you mean "trust it"? It sounds like you want to vibe-code (never look at the output), and maybe for that you need SOTA, but like I said in a different comment, I can easily generate 1000s of lines of code per hour just prompting the chatbots.
I don't, because I actually review everything, but I can, and some of those chatbots are actually SOTA anyway.
vb-8448 12 hours ago [-]
With SOTA models I can just set up the instructions (even a little bit fuzzy), go away for 10 or 15 minutes, come back and just check result and adjust when necessary (and most of the time small adjustment are necessary, but the overall work is pretty good).
With subpar models I must be more careful about providing instructions and check things step by step, because the path it chooses may be wrong, it may do things I didn't ask for, or the agent gets stuck in a loop somewhere.
catlifeonmars 9 hours ago [-]
A lot of people aren’t using agents that way. Not saying that it's not a legitimate use or anything, just that I think the use cases are different. And yeah, maybe for your specific use case, SOTA hosted models are the right choice.
bilbo0s 13 hours ago [-]
This.
I’ve begun to suspect that most people are probably running different hardware. Sure, if you run the latest deep flash on your brand new M5 with 128GB, maybe you get acceptable performance?
But honestly, how many people have an extra $9000 laying around these days?
Right now, running with acceptable performance is kind of a luxury. I wish the people who always say - “This is great!” - would realize that not everyone has their hardware.
vb-8448 13 hours ago [-]
Actually, even with $9k of hardware you won't get good enough performance. There is an interesting video from antirez on trying to run deepseek v4 flash at 2 bits on an M3 Max 128GB... and the result is kind of disappointing: as soon as the context starts growing you are at around 20 tokens/s.
zozbot234 13 hours ago [-]
Prefill performance used to be the real bottleneck on antirez's DS4 and that's been greatly improved by now, it doesn't perceivably slow down with growing context.
hyfgfh 13 hours ago [-]
Local LLMs are the only viable thing, and probably the only thing that will remain once the hype dies down.
A smaller, cheaper local model can deliver most of the value for coding, while we still use some services for code review and security compliance.
Once the VC money runs out and they start to charge the real price, the C-level will have to impose budgets or limits. The current pissing contest over who can spend the most tokens is both ridiculous and shortsighted.
QuadrupleA 6 hours ago [-]
Not sure how excited I feel about visiting your website and having it auto-download an 8GB model with GPT-3.5-level hallucinations, and then probably crash because I only have 6GB of VRAM. My dad won't be able to use it, nor will anyone else without a bleeding-edge device. On a powerful enough "neural engine" device the battery will drain quickly, while the heatsink burns a hole in my lap.
dgb23 2 hours ago [-]
Local could also mean self hosted.
The obvious optimization for the case presented would be to generate all the summaries on a server instead of in the client. Then the total compute used would scale with the number of articles instead of the number of users.
cl0ckt0wer 47 minutes ago [-]
If they do then hardware costs will explode even more
jjordan 15 hours ago [-]
It feels like we're one technological breakthrough away from all of these data centers going up to be deemed irrelevant.
Lalabadie 15 hours ago [-]
The cynical take is getting more and more to be the only rational one:
The promised mega-data center deals are meant to boost valuations today, not serve tons of customers three years from now.
_heimdall 14 hours ago [-]
It seems pretty clearly in line with the dotcom bubble to me. Every company claims to be a leading AI company, those building infrastructure are promising the moon and getting 1/3 of the way there, and no one knows how to monetize it to justify the hype or expense.
jjordan 14 hours ago [-]
oof, this bubble popping is gonna be brutal.
krupan 13 hours ago [-]
It took us only, what 70-ish years of computer and AI research to get to this point, so yeah, probably just one little thing and then we'll have it </sarcasm>
Seriously. I have never ever seen so many people so willingly drink the marketing kool-aid from companies selling their product before. It's scarier to me than any threats of AI actually disrupting society (because it is so far from being capable of doing that).
i_love_retros 15 hours ago [-]
What would that breakthrough be?
Waterluvian 15 hours ago [-]
Magic math and computer science that allows us to get the same quality response for a fraction of the GPU.
intothemild 14 hours ago [-]
That's already happening. Qwen3.6 and Gemma4.
Basically small and medium models that are crazy well trained for their sizes.
Then we have a lot of speculative decoding stuff like MTP and others coming to speed up responses, and finally better quantisation to use less memory.
Local LLM is the future, and the larger labs know that the open models will eat their lunch once people realise that the gap is only a few months. If we were good with LLMs a couple months ago, we're good with the open models now.
krupan 13 hours ago [-]
And how were those models developed and trained?
lelanthran 13 hours ago [-]
> And how were those models developed and trained?
That's irrelevant to my decision to use local or not.
krupan 13 hours ago [-]
That's not what this thread is about? We're saying some new breakthrough is needed, someone said it already has happened, and I'm asking if it really has. Has it? I don't think so, those models are not in some way fundamentally different than other LLMs
lelanthran 13 hours ago [-]
> We're saying some new breakthrough is needed, someone said it already has happened, and I'm asking if it really has.
I didn't read "and how were those models trained" as "Are we there yet?"
intothemild 5 hours ago [-]
There's a percentage of people who love to question how the open models were trained.. they are almost always going to try and make some argument about using the closed frontier models for distillation as some form of theft.
Just totally forgetting that the frontier models themselves stole an insane amount to get to where they are.
It's theft all the way across the board, and when someone tries to make the argument that open models theft is bad, but Altman or Amodei's theft is good.. they are revealing a lot about themselves
YZF 14 hours ago [-]
The current LLMs are also "magic" so anything is possible. AFAIK there is no proof that the current architecture is optimal. And we have our brains as a pretty powerful local thinking machine as a counter-example to the idea that thinking has to happen in data centers.
_heimdall 14 hours ago [-]
I want to ask what makes them magic, but even those building LLMs don't really know what happens when they run inference...
I have to assume current architectures aren't optimal though, the idea that we stumbled into the one and only optimal solution seems almost impossible.
toufka 14 hours ago [-]
I mean, the most cutting-edge iPhones, iPads and MacBook Pros _today_ are quite capable of running today's high-end local LLMs in real time.
If you project out that hardware just a couple of years, and the trained models out a couple of years, you end up in a place where it makes so much more sense to run them locally, for all sorts of latency, privacy, efficacy, and domain-specific reasons.
Not all that different from the old terminal & mainframe->pc shifts.
Finally - hardware has seemingly gotten out ahead of software that most folks use - watching YouTube, listening to music, playing a game or two. There was a time when playing an mp3 or watching a 4k video really taxed all but the nicest systems. Hardware fixed that problem, like it very well could this one.
sofixa 14 hours ago [-]
> I mean, the most cutting edge of iPhones, iPads and MacBook Pros _today_ are quite capable of running in realtime today’s high-end local LLMs
Definitely not the high end local LLMs. The small ones, yes, absolutely.
> If you project out that hardware just a couple of years
One of the biggest bottlenecks for LLMs is memory capacity and bandwidth. With the current scramble for memory, it's unlikely we'll see big advances in the average memory available, or its bandwidth, on regular (not super-high-end) devices in the coming years.
Alternatively, it's possible we get dedicated SMLs for e.g. phone specific use cases, that are optimised and run well.
_heimdall 14 hours ago [-]
I'd assume its a totally different architecture that isn't based on storing a compressed dataset of all digital human text.
h05sz487b 4 hours ago [-]
I really want this to be true. For me, getting all models to run to the best of my hardware's ability, and the CLI tool to also make best use of the model, is still a headache. I had coding models not being able to do a search-and-replace depending on the tool through which they were called, visible <thinking> elements in my message flow, agents doing a task, failing at the linter, then reverting everything so the linter is happy and presenting the result as a "good compromise".
Right now it feels like we have all the pieces but nobody integrating all that into an amazing experience.
dgb23 2 hours ago [-]
I'm surprised at the presented dichotomy between JSON formatting and what the Apple SDK provides to parse output into structs.
Based on what I understand about how the former works, I would assume that the latter has the same properties and failure modes.
continueops_com 2 hours ago [-]
Opus' 1M context window and lightning-fast response times are hard to compete with. Even if you run a local A100, local models are just not as good at tool calling, long-running tasks, and avoiding hallucinations.
twoodfin 2 hours ago [-]
It was hard for an Apple ][ to compete with an IBM mainframe at enterprise data processing, but the power of personal ownership & commodity economics was disruptive enough that 30 years later 99%+ of enterprise data processing was taking place on descendants of the original personal computers.
reshef316 3 hours ago [-]
Not saying I disagree with the general statement, but there need to be options; not everyone has a machine capable of the kind of lifting required to properly run a local version. So what, if my machine is older I'll be locked out? Restricted? Forced to pay?
diwank 8 hours ago [-]
In order for us to get there, I think we need a standardized API at the OS layer for local models, so that the OS can optimize, batch and safely allocate resources. Something like an analog of Chrome's local-model "Prompt" API, but provided and managed by the OS itself. The user can choose which model they want to primarily use and so on, but all of the heavy lifting and continuous batching is done automatically by the OS.
holtkam2 14 hours ago [-]
I wish I could upvote this twice. We (devs) really REALLY need to consider on-device compute before going to the cloud for LLM inference.
mattlondon 14 hours ago [-]
Yet there is another post a few rows down where people are losing their shit that Chrome has a local LLM model that uses a couple of GB of space for local-inference.
Damned if they do, damned if they don't.
dlcarrier 14 hours ago [-]
Maybe don't use gigabytes of bandwidth and storage space, without asking.
hparadiz 13 hours ago [-]
Easy. Stop using Chrome.
userbinator 12 hours ago [-]
If I want a model I'll go download one. (And I did, not long ago, to play around with image generation.)
bytecauldron 14 hours ago [-]
This is a bit disingenuous. People aren't losing their shit about a local model being installed. It's the lack of user autonomy. Just give the option to download a model instead of a silent install. It's not that hard. This is how every other local option works.
wmf 13 hours ago [-]
AFAIK Apple and MS auto-download local models.
FridgeSeal 10 hours ago [-]
The former has made a big deal about local inference and marketed that as an OS level feature.
You can also…turn it off.
Chrome silently opted people into it _and_ downloaded the model without asking, because they decided that's something they (Chrome) fancied doing.
The difference should be pretty obvious.
bytecauldron 10 hours ago [-]
Sorry, I should have been more specific. This is how every *good* local option works.
aabhay 14 hours ago [-]
This is a weird take. If it's not opt-in, or you're shoehorning it into a browser, then that sucks. Nobody is getting enraged that an app for running local LLMs downloads data to do so.
avadodin 13 hours ago [-]
Although you can opt out, and in some cases even disable the download feature when you build them, most of the local LLM tools are too download-happy by default.
fg137 14 hours ago [-]
You might want to read the comments to understand what people are actually complaining about.
This comment is quite dishonest about the nature of the discussion.
themafia 14 hours ago [-]
If it was such a good and laudable idea, why didn't they tell me about it before they activated it? It seems to me like they avoided it in the hope that I wouldn't notice, because, presumably, if I had, I would have IMMEDIATELY disabled it.
Also, why doesn't their task manager show that it's actually the one downloading? Why does it go out of its way to hide this activity?
Since I have conky on my desktop I could catch this immediately, and take the action I preferred with my own computer, which was to _immediately_ disable it.
StilesCrisis 14 hours ago [-]
I'm guessing you immediately close the What's New Chrome tab when you update?
They have absolutely not been shy about any of this.
themafia 14 hours ago [-]
I've never had a "What's new" tab ever open because I disable the customized home page where that's displayed. I'm guessing you're not aware that's an option.
Please show me where in either of those documents it explains it's going to download a 4GB model.
crazygringo 13 hours ago [-]
I use an extension that gives me a customized homepage, but I still always get the "what's new" tab on every major version upgrade.
It's a totally separate tab that opens. It's got nothing to do with what you use as your homepage.
themafia 2 hours ago [-]
Thank you for going out of your way to deny my exact experience. Do you think I'm doing this to rag on Google? And you're this eager to defend them?
I'm on Gentoo. I have to update Chrome manually. I updated it. On update I _never_ get a "what's new" page. I've had this profile for more than a decade so I have no actual idea why, but I can absolutely tell you, I do *not* get one. After the update it started consuming all my bandwidth. This usage did not show in its task manager. I have a metered connection. This is a problem for me. I worried it was a compromised plugin. I had to spend 10 minutes in Firefox discovering why Chrome was doing this, then going to the configuration and disabling this.
This was a disappointing experience. I'm sorry you feel differently; other than stating the obvious, I seriously have no idea what you and the other corporate defense squad members are trying to achieve with this gaslighting nonsense.
ekjhgkejhgk 14 hours ago [-]
You don't understand the difference between "I run a local LLM because I chose to" vs "The browser chose to run a local LLM and I have no say"? You don't understand?
Not to mention that the LLM that I choose to run requires a monster machine and is infinitely more capable than whatever google chose to put on their browser?
I mean, none of this affects me because I don't use chrome, obviously, but you don't see the difference? Bewildering.
StilesCrisis 14 hours ago [-]
Did you opt into WebGPU? QUIC? Canvas 2D? Brotli? Browsers don't work that way.
za_creature 13 hours ago [-]
The size difference between the local LLM and all of the above is about... the size of the local LLM.
hackermanai 6 hours ago [-]
> “But Local Models Aren’t As Smart”
This is what makes me continuously doubt and rewrite the local-first approach to inline chat in my editor. Next edit / code completion makes more sense due to the latency advantage. But chat is hard.
It's fast and feels good to run locally, but the output quality is just not ChatGPT et al.
z3t4 4 hours ago [-]
We are experimenting with local LLMs and opencode at work, and the quality is not as good as Claude Code et al., but it's not far off, and local speed is actually faster. We got 3 of Nvidia's latest AI GPUs, which was not cheap. It's not good enough to train our own models, but we can run the biggest open models with some tweaking.
timeattack 15 hours ago [-]
My problem with LLMs (apart from philosophical aspects and economic impact) is that it would be unlikely for any of us to be able to train something functional locally (toy-like LLMs -- sure, but something really useful -- no). Apart from requiring immense computing power, it also requires a dataset which is for the most part obtained illegally.
kibwen 14 hours ago [-]
This seems overly pessimistic.
I may personally be of modest intelligence, but to acquire the intelligence that I do have, I did not need to train on every book ever written, every Wikipedia article ever written, every blog post ever written, every reference manual ever written, every line of code ever written, and so on. In fact, I didn't train on even 1% of those materials, or even 0.00000000001% of those. The texts themselves were demonstrably not a prerequisite for intelligence.
At minimum, given that it only took me about 20 years of casual observation of my surroundings to approximate intelligence, this is proof positive that the only "dataset" you need is a bunch of sensors and the world around you.
And yes, of course, the human brain does not start from zero; it had a few million years of evolution to produce a fertile plot for intelligence to take root. But that fundamental architecture is fairly generic, and does not at all seem predicated on any sort of specific training set. You could feasibly evolve it artificially.
krupan 13 hours ago [-]
What does this even have to do with the parent? Your capabilities have nothing to do with LLM capabilities. The two work in completely different ways. The reason LLMs work is because they are huge and have been trained on vast amounts of data, full stop. Sure, there's potential someday to get something useful using less data, but we aren't there.
avadodin 12 hours ago [-]
You are right on the limitations of the architecture but I wouldn't call LLMs huge. Flagship models maybe but that's just because they don't scale very well.
A universal translator with image and voice recognition and a decent breadth of encyclopedic knowledge, in only a small fraction of an English Wikipedia dump (6GB vs 20+GB), is not "huge".
It is probably closer to the theoretical limit than anyone could have expected.
_heimdall 14 hours ago [-]
You're also embodied and experiencing the world around you with more senses than only the ability to read text.
rogerrogerr 14 hours ago [-]
> the only "dataset" you need is a bunch of sensors and the world around you.
dlcarrier 14 hours ago [-]
Not the whole thing, at least with current technology, but LoRAs are really good at fine-tuning, and can be generated in a few hours on a high-end gaming computer, so as long as the base model is in your language, you likely have enough spare computing power, in whatever electronics you own, to train a few LoRAs a month.
In the future, when regular home computers have the capabilities of modern servers, we'll be able to train the entire LLM at home.
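For a sense of what that looks like, here is a rough sketch of a LoRA fine-tune using the Hugging Face peft library: only small adapter matrices get trained, which is why a single gaming GPU can handle it. The base model, target modules and hyperparameters are assumptions to adjust for whatever you actually fine-tune.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-3B-Instruct"  # any small open-weight base model
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

config = LoraConfig(
    r=16,                                 # adapter rank: the main size/quality knob
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections are the usual targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the base model's weights

# From here, train with your usual Trainer/dataset; only the LoRA adapter
# (a few tens of MB) needs to be saved and shipped.
```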
pronik 14 hours ago [-]
There is so much technology that we are unable to reproduce locally; I don't think LLMs are in any way different. There will be large LLM manufacturers, small LLM manufacturers, LLM artisans, LLM enthusiasts and of course LLM consumers, just like with everything else.
krupan 13 hours ago [-]
And this is important because even though you are running a model locally, it's still a proprietary model. You have no say in what it was trained on, how that training data is labeled, what the guardrails are, what biases it might have, none of that.
Ucalegon 15 hours ago [-]
Depends on the domain. There are plenty of different use cases where the data needed for training is available for personal, or non-commercial, use. At that point, it does come down to compute/time to do the training, which if you are willing to wait, consumer grade hardware is perfectly capable of developing useful models.
woah 12 hours ago [-]
Can you make your own CPU, locally?
RataNova 13 hours ago [-]
That's a fair concern, but I'd separate training from inference here
cyanydeez 15 hours ago [-]
That sounds like government. So your problem is mostly that you expect to have a collective social effort, but not enough to pay for it as a public good.
nezhar 5 hours ago [-]
For me, building with open weights models sounds like the right approach — you are able to switch providers, and you can control where the server is running.
You don't have any guarantees in terms of data, that's true, you rely on the provider. But this is similar to a database or other services where you don't have the knowledge or resources to run them yourself. Hardware cost is an additional factor here.
If on the other hand your idea works out and the model fits the use case, you can always decide to move to a dedicated infrastructure later.
nate 9 hours ago [-]
I've been fooling with the Apple Foundation model for AlliHat, so you can chat with it from a Safari sidebar instead of just Claude. It's passable for some basic things like summarizing a page, but it really reminds me of Claude from like 3 years ago. I was trying to get it to generate synonyms for me and it would only generate about 10, with some duplicates. And when I asked for more, it said it would be a waste of resources to generate more. It has some kind of "act responsible" thing that Claude seemed to have. I also asked it to help me come up with synonyms for the game Pimantle, and it decided Pimantle was related to the adult industry, and no matter how many times I said "it's just a game" or "I think you've misunderstood", it was stuck on not helping me with anything related to adult websites. And recommended I play Wordle instead.
All of this being said, it seems Claude gave up this "constitution" it used to train on? I remember trying to get it to help me code some video editing tools, and it was convinced I was pirating videos and so wouldn't help me anymore in that session.
stuaxo 2 hours ago [-]
Harnesses seem to be a big part of what makes stuff good or not.
I tried Cline and couldn't get it working well; part of this was that, at the time, it expected OpenAI's output format.
Animats 14 hours ago [-]
Question: for software development, how much of an AI do you need for local development? Can it be run locally? Can someone train something that knows a lot about software but lacks comprehensive coverage of history, politics, and popular culture?
A specialist handrolls a cut-down framework to power a 1 or 2 bit quantised version of a cut-down sort-of-frontier model.
It can be yours if you have 128GB or 256GB of RAM.
dd8601fn 14 hours ago [-]
The ones that are good for more than elaborate auto-complete are pretty hefty, but it can be done. They're still not Opus behind Claude Code.
testfrequency 4 hours ago [-]
Local AI is definitely going to be the future as these models continue to advance at the rapid pace they already are.
This is why I believe OAI and Anthropic have been so aggressive at offering services beyond their pure models, like Claude Design. This is what will stay competitive and keep people subscribed.
october8140 4 hours ago [-]
They will never let us have enough RAM ever again. RAM will be kept behind locked doors in the name of national security, and only trusted corporations will be allowed to run AIs, "safely" run them in the cloud, and sell them to us.
j3th9n 4 hours ago [-]
I’ll make my own RAM, with the help of AI.
yuppiepuppie 4 hours ago [-]
Is this a conspiracy?
vivzkestrel 5 hours ago [-]
- Can we get suggestions from people on what the equivalent would be for Android?
- And for web / JavaScript / Svelte applications?
- Suggestions for local OCR for bulk images?
kajman 4 hours ago [-]
I hope there's no web equivalent for a while. I usually hate app lock-in, but any hasty API for this is going to be a DoS or fingerprinting nightmare.
imnes 8 hours ago [-]
I'm going through a similar exercise right now in an app I'm building. No server dependencies, for features that have traditionally used server side APIs, moving those capabilities onto the device. And also utilizing the on-board AI features provided by Android and iOS. So far it's been a very positive experience, and the capabilities provided on these devices have been more than capable for my needs. Working on providing apps that don't have ongoing operation costs of running server side infrastructure, so I can offer them as "pay once, run it forever" instead of ongoing subscription costs for the user.
imrozim 3 hours ago [-]
I use the Claude API for my startup and the billing and rate limiting hurt. But local models can't do what I need yet. Wish they could.
Aleesha_hacker 4 hours ago [-]
To what extent is this strategy currently feasible for Windows or Android development? I am interested in how portable local-first AI is across platforms, but so far it seems promising mainly on Apple devices.
hydra-f 5 hours ago [-]
Unless there's a breakthrough or a transition to diffusion models, it's hard to imagine them becoming an affordable commodity.
Small models are still in their infancy, and there's still much to sort out about and around them as well.
deivid 11 hours ago [-]
Sounds great, but if you didn't cave to Apple/Google (e.g. GrapheneOS, LineageOS), models are not built in. Every app needs to ship its own models, and they are not tiny.
Is there a solution for this?
I'm currently just making users download onnx models if they want a feature, but it's not smooth UX
hackyhacky 13 hours ago [-]
I would like a standardized API for local AI to exist outside of the Apple ecosystem. The Prompt API in Chrome is halfway there.
* What is the answer to local AI for native apps on Windows?
* What is the answer to local AI for Linux?
This is a big opportunity for Linux, given the high quality of open-weight models. I hope some answer emerges before designs fracture and we get a dozen mutually incompatible answers.
GLM 5.1 is very impressive; I wouldn't be surprised if we get to a point where it can live in ~48GB with reliable speed/quality.
antidamage 10 hours ago [-]
The roadblock to this is that you seem to have to build it yourself. I've noted that none of the current cloud models are very good at building a replacement for themselves, and there's significant work that needs to be done to make a local LLM reliable in any way. I haven't found a single standalone package that makes setting them up easy. Sure, I can run Hermes Agent and a model, but getting the self-reflection loop in and all of the other stuff they need to actually be good? I'm still at it, trying to get anything to work reliably and factually.
DonsDiscountGas 9 hours ago [-]
Could be an opportunity for a business? Except nobody ever wants to pay for software
manlymuppet 12 hours ago [-]
People are trying to “make the best software”, though.
I think the Quixotic accelerationists of AI are more or less a vocal minority of the people who make software, and the choice of online APIs over local systems is largely a choice made for users, rather than developers' laziness.
You can do more and better with private AI today than with local models. There is no getting around that. Even if local AIs get better, being on the cutting edge of LLM performance is often a very worthy investment.
Most people won’t settle for a product if it’s not the very best and incredibly convenient. That’s a high bar, and local AI often doesn’t meet those standards.
HN’s insistence on treating all users like they are open-source, privacy-first, self-hosted Linux fanatics is painfully corny.
jdub 9 hours ago [-]
> Most people won’t settle for a product if it’s not the very best and incredibly convenient.
... uh?
manlymuppet 9 hours ago [-]
That is, excluding Microsoft users.
try-working 9 hours ago [-]
I'm building a protocol and router runtime for hybrid local/cloud AI.
The goal is that you would assign roles to models based on tasks, capabilities and observed performance. The router would then take care of model selection in the background.
It's tricky though. Probably have another two weeks before I can release the runtime.
You can follow me on Twitter if you want updates (see profile)
ksec 13 hours ago [-]
While I agree that would be the goal, we are too early for that. Just like how speech recognition used to require many servers in a datacenter, with your data sent over to be processed, and it now runs completely on device.
We are at least 5 years away from that. And DRAM needs a substantial breakthrough in cost reduction.
RyanZhuuuu 4 hours ago [-]
I’m skeptical that local AI will work well with today’s technology. Running capable models consumes too many resources on end-user devices.
eldenring 9 hours ago [-]
This article makes 0 sense. It's not up to billing or computer systems or ease of use or anything else that matters. The question is whether the scaling laws, which in the asymptote are likely the laws of physics, hold up in converting energy to smarter models. It's not really up to anyone, the labs or the developers, to choose whether local or remote models will be the norm.
Galanwe 15 hours ago [-]
I would love for local inference to be possible, but from my experience, Kimi 2.6 is the only model that would be worth it, and it's a $10k (max-spec M3 Ultra, 30s TTFT, so kind of slow) to $30k (RTX 6000 / 700GB+ DDR5) upfront investment, noise and power consumption aside.
mft_ 15 hours ago [-]
You're maybe missing the article's point, which is to use local models appropriately:
> “But Local Models Aren’t As Smart”
> Correct.
> But also so what?
> Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.
> And for those tasks, local models can be truly excellent.
Galanwe 15 hours ago [-]
This is a bit naive IMHO...
I have tried quite a bunch of local models, and the reality is that it's not just a matter of "it's a small model that should be hostable easily". It's also a matter of what your acceptable prefill TTFT and decode t/s are.
All the local models I used, on a _consumer grade_ server (32GB DDR5, AMD Ryzen), have been mostly unusable interactively (no decent use as a coding agent possible), and even for things like classification, context size is immediately an issue.
I say that with 6 months' experience running various local models for classifying and summarizing my RSS feeds. Just offline summarizing and tagging the HN articles published on the front page barely keeps the queue sustainable and not growing continuously.
mft_ 14 hours ago [-]
1) Again, I suspect you're missing the point of the article. The iPhone's on-device LLM is (apparently) ~3 Bn parameters - and runs well/fast enough to be used in the manner described. Of course, the iPhone has its GPU to leverage.
2) It's probably not the time/place to trouble-shoot your "consumer grade server" LLM experience, but if you're running on CPU (you don't mention a GPU) then yeah, your inference speed will be slow.
3) Counterpoint: my consumer-grade Macbook Pro (M1 Max, 64GB) runs Qwen3.6-35B-A3B fast enough to be very usable for regular interactive coding support. (And it would fly with smaller models performing simpler tasks.)
mikrl 15 hours ago [-]
One of my hobbyist workflows involved transcribing ETF prospectuses into YAML for an optimizer to optimize over.
Used to take me maybe 10-20 minutes per sheet.
Then I got codex to whip up a script that sends each sheet to a fairly low parameter locally running LLM and I have the yaml in a couple seconds.
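The shape of that script, as a minimal sketch: each sheet goes to a locally hosted model behind an OpenAI-compatible endpoint (Ollama's default port is assumed here) and the reply is parsed as YAML. The endpoint, model name, and expected fields are placeholders for illustration.

```python
from pathlib import Path
import yaml
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

PROMPT = (
    "Extract the fund name, expense ratio, and asset-class weights from the "
    "prospectus text below. Reply with YAML only, no commentary.\n\n{text}"
)

def sheet_to_yaml(path: Path) -> dict:
    resp = client.chat.completions.create(
        model="qwen2.5:7b-instruct",
        messages=[{"role": "user", "content": PROMPT.format(text=path.read_text())}],
        temperature=0.0,  # extraction, not creativity
    )
    return yaml.safe_load(resp.choices[0].message.content)

for sheet in Path("prospectuses").glob("*.txt"):
    print(sheet.name, sheet_to_yaml(sheet))
```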
My dream is to bootstrap myself to local productivity with providers… I know I’ll never get there because hedonic treadmill etc, but I do feel there’s lots more juice to squeeze. I just need to invest more time into AI engineering…
knlam 8 hours ago [-]
You know what the hard part about local AI is? Supporting it cross-platform. The OP gets it easy by playing in the Apple ecosystem, but when you need to support local AI on both iOS and Android the approach is completely different. Even getting users to download the smallest models can be a challenge.
mercurialsolo 11 hours ago [-]
Not your weights not your brain. Owning your own action and decision model is super important as these models emulate more of our decisions, thinking and learning. Built claudectl - a local brain for coding agents
https://github.com/mercurialsolo/claudectl
everlier 12 hours ago [-]
There was never a better time to run LLMs locally. It's just a few commands from zero till a fully working LLM homelab.
```
# Open WebUI -> llama.cpp + SearXNG for Web RAG + OpenTerminal as sandbox
harbor up searxng webui llamacpp openterminal
```
That's it, it's already better than Claude's or ChatGPT's app.
FrasiertheLion 12 hours ago [-]
Overall I'm bullish on standardized local APIs that ship with the browser or platform. Far more tractable than expecting end users to stand up their own local model instances, though r/LocalLLaMA is a fantastic community to follow if you want to go that route.
A useful framing over “local vs cloud AI” can be split along two axes: does the task touch private data, and does it need frontier intelligence? You can use frontier models for developing the software (doesn’t touch data), but open-source models running locally for ops: maintenance, debugging and monitoring (touches data). If you need to fall back to frontier intelligence at some point for a particularly hard to resolve problem, you can still rely on local models for pre-transforming and filtering input in a way that's privacy-preserving or satisfies some constraint before it’s sent off to the cloud for processing. OpenAI's privacy filter is a good example of a model that can be used to mask PII and secrets and that can run locally: https://openai.com/index/introducing-openai-privacy-filter/, before sending any data externally for processing.
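A minimal sketch of that hybrid pattern: a local pass masks obvious PII before anything leaves the machine, and only the redacted text goes to a frontier cloud model. The regexes here stand in for whatever local redaction model or filter you actually run; the endpoints and model names are assumptions.

```python
import re
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def redact_locally(text: str) -> str:
    # Cheap deterministic pass first; a small local model handles the fuzzier cases.
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)
    resp = local.chat.completions.create(
        model="local-redactor",  # placeholder name for a local privacy-filter model
        messages=[{"role": "user",
                   "content": f"Replace any remaining personal data with [REDACTED]:\n{text}"}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

def ask_frontier(question: str, context: str) -> str:
    safe_context = redact_locally(context)          # nothing raw leaves the machine
    resp = cloud.chat.completions.create(
        model="gpt-4.1-mini",                       # placeholder frontier model
        messages=[{"role": "user", "content": f"{question}\n\n{safe_context}"}],
    )
    return resp.choices[0].message.content
```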
Another framing for local vs frontier closed which the article mentions is whether the task saturates model capability. With certain tasks like PDF processing or voice or summarization, adding more intelligence isn't necessarily useful. Arguably we've approached that point for chat interfaces already with frontier open-source models. But for coding and ops through well structured tool use inside a coding capable harness, we're still a ways away.
Tangentially, a contrarian take here is that AI can actually enable more privacy preserving software if you’re so inclined. You can just build personalized software and it lowers the barrier to entry and the effort required to self host. SaaS complexity often comes from scaling and supporting features for all types of customers, and if you're building software for personal use, you don't need all that additional complexity. Additionally, foundational and infra software that is harder to vibecode with AI is often already open source.
msteffen 14 hours ago [-]
> One of the current trends in modern software is for developers to slap an API call to OpenAI or Anthropic for features within their app.
Well there’s your problem, control needs to go the other way. If you want your app to be AI-enabled, you need to make it easy for AI to control your app. Have you used OpenClaw? It’s awesome!
daishi55 13 hours ago [-]
> We are building applications that stop working the moment the server crashes or a credit card expires
Isn’t this true of any application that accesses anything not running on your computer? This is just describing what it means to add an API call to your app. Nothing to do with AI (?)
simonkagedal 12 hours ago [-]
Furthermore, for the example given, it would have made a lot of sense to me to generate those article summaries on the backend. Once and for all: no need to burden each client device (which is going to need to download the content anyway), no need to tie yourself to a specific provider (Apple in this case), and you can have the same experience everywhere. Of course, the backend could use a local (to itself) model.
Not saying it’s _wrong_ either – maybe it doesn’t use a backend of its own (the client downloads content directly from some predefined set of sites), maybe there is functionality to adjust how the summaries work that benefit from doing it on device, etc. Just doesn’t convince me that ”local AI should be the norm”.
RataNova 13 hours ago [-]
I mostly agree, though I think local AI will need better UX around failure modes. Cloud models are often used not just because developers are lazy, but because they are more capable and easier to support consistently across devices.
karmasimida 7 hours ago [-]
How? Memory prices are sky high; that is the choke hold the monopoly will not let go of.
rarisma 10 hours ago [-]
I think with turbo quant forks eventually being merged, it's becoming more feasible on mid-tier consumer hardware.
Don't quite think it's ready yet.
krupan 13 hours ago [-]
Here I was hoping that this was some plea for us to get away from proprietary solutions that we have no control over and go back to open source, but no, not that at all.
TechSquidTV 12 hours ago [-]
Local AI will catch up. Unless we can't get our hands on hardware anymore, which is a legitimate concern I have.
anArbitraryOne 8 hours ago [-]
Just let me turn it off to preserve battery life
vegabook 15 hours ago [-]
>> years ago I launched "The Brutalist Report"
proceeds to brutalise the reader with an 88-point headline font.
1a527dd5 13 hours ago [-]
Consumer/private needs to be local.
Work? I don't want it local at all. I want it all cloud agent.
rduffyuk 13 hours ago [-]
agree with the article, but the limitation for local llm usefulness is the limited scope, from my experiments. eventually context-heavy data pipelines require larger models, which consumer hardware can't deal with yet. the local model for summary on a page like you describe could be done via code as well; i've found using an llm isn't always the right choice. for example i use ner tagging in my md docs for better indexing and llm search capabilities. this is purely code based and not via an llm. i tried it with an llm and the results were a lot worse. augmenting tools to make the llm produce better outputs gives better results.
eyk19 14 hours ago [-]
Apple stock is going to skyrocket
baal80spam 14 hours ago [-]
Maybe. What about NVDA?
Salgat 12 hours ago [-]
Local models are much less energy efficient right?
HDBaseT 11 hours ago [-]
It's a good question, although I think hard to quantify.
If you are simply measuring energy cost per token, you are missing the mark drastically. You have to measure quality of output per watt.
It sounds reasonably difficult to benchmark this, maybe I'm wrong though.
tuananh 8 hours ago [-]
local llm doesn't need to match SOTA performance in order to be useful.
prometheus1992 12 hours ago [-]
Agreed, but the way ram prices are going, I don't think we would be able to afford hardware that can run any useful model.
agentifysh 15 hours ago [-]
Until the hardware is economical and powerful enough, local AI that can compete with frontier models today is still far off.
If we could even get something like GPT 5.5 running locally that would be quite useful.
alfiedotwtf 5 hours ago [-]
This would be nice, but unfortunately the norm at the moment is: release a rushed model that doesn’t work with llama.cpp, and if it does, make sure that the chat template is broken. And even if it did have a perfect chat template, let the model loop endlessly, rewriting the same file with the same content for hours on end.
It would be nice if model makers could at minimum embrace test harnesses, and, as a stretch goal, if they’re going to change underlying formats, at least land compatible readers in the big engines (e.g. llama.cpp and vllm).
ChoGGi 12 hours ago [-]
Who can afford local AI?
m463 12 hours ago [-]
Who can afford to backup their own photos?
who can afford a house?
hypfer 15 hours ago [-]
Same as local compute.
Welcome back to 2014. Let us now continue yelling at the cloud.
refulgentis 13 hours ago [-]
The shitty thing here is that either everyone's shipping at least 800 MB with their binary, or you have to rely on the platform vendor anyway. I'm hoping there's enough external pressure that the OS vendors turn it into more of a repository than a blessed-model-garden.
wrxd 13 hours ago [-]
To be fair, the author of the post is using the model Apple provides with the OS, so it doesn't add any extra binary size.
wilg 14 hours ago [-]
Two issues -
1. Local models are likely to be more power-expensive to run (per-"unit-of-intelligence") than remote models, due to datacenter economies of scale. People do not like to engage with this point, but if you have environmental concerns about AI, this is a pretty important one.
2. Using dumb models for simple tasks seems like a good idea, but it ends up being pretty clear pretty quick that you just want the smartest model you can afford for absolutely every task.
manc_lad 14 hours ago [-]
I think using the best model for every task makes sense when these models are subsidised. When the prices go up (assuming they do) this could trigger a more varied approach, assuming the model doesn't self-select for you.
dana321 14 hours ago [-]
"NO AI" needs to be the norm, we should be working on better ways of sharing information and better documentation instead of fighting with computers for substandard results.
shmerl 15 hours ago [-]
Depending on some remote AI provider is a major lock-in pitfall. But it's exactly what those AI providers want you to do.
williamtrask 15 hours ago [-]
I wonder if a popularization moment for local AI will ultimately be the pin-prick that pops the AI bubble. Like the deepseek or openclaw moments but bigger/next.
gdulli 14 hours ago [-]
That's like wondering if enough people discovering local media streaming will disrupt commercial streaming services. It's not going to happen. Most people are not ambitious and will let themselves be controlled by the services of least resistance.
And you can't take comfort in knowing that you, personally, will remain in control of your own computing. The majority will let the range and direction of their thoughts and output be determined by the will of the tech giant whose AI they adopt. And that will shape society.
HDBaseT 11 hours ago [-]
I like the analogy of streaming services vs local media streaming, although I don't think it holds up when looking at history.
Streaming Services are getting worse and more expensive. I don't see a single report suggesting piracy is decreasing, it seemingly is only increasing now.
When costs increase and quality decreases, people look for alternatives. The advent of faster broadband enabled Napster and MP3 sharing. I think this could have a resurgence if the pieces align correctly (a new BitTorrent client, a new torrent site, something to break the status quo).
How this relates to AI, I don't know, although I wouldn't be set on the idea that we will never have local AI as the norm. There is a lot more movement in this space than there is for local streaming imo.
williamtrask 13 hours ago [-]
Yeah... probably right. I do hold out hope that this is mostly a timeframe thing. Like, the library, printing press, etc. all had their moments of centralization. But eventually they federated.
DoctorOetker 8 hours ago [-]
One advantage of local AI is continual learning.
When I say 'moat' I don't mean moat specific to a company vis-a-vis other companies, but 'moat' specific to the set of inference providers vis-a-vis self-hosted local inference.
The moat consists primarily of being able to batch inference requests.
If we pretend people weren't interested in long context lengths, there would be a moat for inference providers, who can batch many requests so that streaming the model weights (regardless of whether it's from system RAM to GPU RAM, or from GPU RAM to GPU SRAM cache) can be amortized over multiple requests.
However people do want longer memory than the native context length.
One approach is continual learning (basically continue training by using the past conversation as extra corpus material; interspersed with training on continuations from the frozen model, so it doesn't drift or catastrophically forget knowledge / politeness / ...).
However this is very expensive for inference providers, since they would have to multiply model weight storage by the number of users. For a single user the memory cost of continual learning is much lower, since they only need to support that one user, and they get back some of the memory cost through elimination of KV caches, plus higher-quality answers compared to subquadratic approximations of quadratic attention.
An advantage of continual learning is that the conversation / code base / context is continuously rebaked into model weights, and so doesn't need KV caches! It doesn't need imperfect approximations to quadratic attention, it attends through working knowledge being updated.
Nothing prevents local LLM users from implementing this and benefiting from the dropped requirements of KV caches and enjoying true quadratic attention implicitly over the whole codebase, or many overlapping projects indeed.
The only remaining moat of inference providers vis-a-vis continual learning local LLM's is the batching advantage, plus the gradient update costs for continual learning minus the KV storage and compute costs, minus the performance loss due to inexact approximations to quadratic attention.
This points towards a stronger incentive for local hosting than is currently realized. None of the popular local LLM tools currently support continual learning; once this genie is out of the bottle it will be a permanent decrease of the inference-provider moat, the cost of which can't be expressed merely in hardware or energy, since it is difficult to quantify the financial loss from inexact approximations to quadratic attention, from limited effective context length, and from the concomitant loss in quality of the result.
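A minimal sketch of the continual-learning loop described above, assuming a HuggingFace-style causal LM and optimizer; the function and argument names are illustrative, not any specific library's API:
import torch

def continual_update(model, frozen_model, tokenizer, new_turns, replay_prompts, optimizer):
    # Fold recent conversation / code into the weights, interleaved with "replay"
    # continuations sampled from a frozen copy of the base model so the model
    # doesn't drift or catastrophically forget general knowledge.
    model.train()
    batches = [tokenizer(t, return_tensors="pt") for t in new_turns]
    with torch.no_grad():
        for p in replay_prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            batches.append({"input_ids": frozen_model.generate(ids, max_new_tokens=128)})
    for batch in batches:
        out = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()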
DrScientist 1 hours ago [-]
Anybody know of good real world examples for continual learning?
Does it really work?
QuadrupleA 6 hours ago [-]
This is just emotional rhetoric. Pretty much any app in the last 20 years has depended on a server somewhere, or a cloud provider. Like an AI provider, they can go down, they can turn off if you don't pay your bill, etc.
And local inference requires fairly beefy hardware, that is FAR from ubiquitous across today's userbases. Local models are also still far dumber than what frontier labs can serve.
Weird that this is getting such a tidal wave of upvotes.
j45 8 hours ago [-]
It’s easier to say 32 gb ram needs to be the norm to start getting movement on this
krupan 13 hours ago [-]
If you don't need a lot of smarts, do you even need an LLM? Aren't older machine learning techniques just as good, or like, you know, old-school algorithms?
holoduke 14 hours ago [-]
We need computers with 128 GB or maybe even 192 GB of memory before local use makes sense. From my own experience, 32B LLMs are the absolute minimum for proper tool use and decent output quality. But for local AI you also want vision models and maybe even various LLMs. Plus some memory for the system, of course.
On my 36 GB M3 the 24B Gemma model is nice. But the entire system gets allocated to that thing.
jmyeet 13 hours ago [-]
I've been looking into options for this and we are getting close. There are two main constraints: memory and memory bandwidth.
NVidia segments the market by limiting the amount of memory on GPUs. It currently tops out at 32GB (on a 5090) but with excellent memory bandwidth (~1.8TB/s). If you want more than that, you need to buy an RTX Pro (eg RTX 6000 Pro w/ 96GB for ~$10K) or you get into high-end solutions like H100, H200, etc that have significantly more memory and even higher bandwidth on HBM memory (eg 3.2TB/s+).
NVidia has released the DGX Spark w/ 128GB of memory for ~$4k. The problem is the memory bandwidth. It's only 273GB/s, which is less than the M5 Pro (307GB/s) but more than the M5. You can buy a 16" Macbook Pro with an M5 Max and 128GB of memory for $6k and it has a bandwidth of 614GB/s. So the DGX Spark is a joke, really.
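A rough back-of-the-envelope for why bandwidth dominates single-user decode speed, assuming a dense model whose active weights must all be read once per generated token (MoE, batching and speculative decoding change the picture):
def max_decode_tps(bandwidth_gb_s, active_params_billion, bits_per_weight=4):
    # Every generated token requires streaming all active weights through compute once.
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

for name, bw in [("DGX Spark", 273), ("M5 Max", 614), ("RTX 5090", 1800)]:
    # Ceiling for a hypothetical 70B dense model at 4-bit: roughly 8, 18 and 51 tok/s.
    print(name, round(max_decode_tps(bw, 70), 1))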
In case it wasn't clear, Apple is interesting in this space because it has a shared memory architecture so the GPU can use all the memory.
Many, myself included, expect there to be no refresh to the 5000 series consumer GPUs this year, which would otherwise happen based on product cycles. So no 5080 Super, for example. And I wouldn't expect a 6090 before 2028 realistically.
One thing Apple hasn't done yet is release the M5 Mac Studios, which are widely expected in Q3 this year. They are interesting because, for example, the M3 Ultra has a memory bandwidth of 819GB/s and previously had a max spec of 512GB but that got discontinued (and the 256GB version also got discontinued more recently).
So many expect an M5 Max Mac Studio with 1TB/s+ bandwidth and specs up to 256GB or 512GB, probably for ~$10k later this year.
You really have to use this hardware almost 24x7 for it to be economical because otherwise H100 computer hours are probably cheaper.
But what happens to the trillions in AI DC investment when the next generation of GPUs comes out? It's going to halve in value. That's over $1 trillion in capex that will disappear overnight, effectively.
I think Apple is the dark horse here because they have no interest in NVidia's pseudo-monopoly. I'm just waiting for them to realize it.
Now CUDA is an issue here still but I think as time goes on it's going to be less of an issue. Memory is still a huge constraint both in terms of price and just general supply because NVidia can justify paying way more for it than you can, probably.
It's still sad to see that 128GB (2x64GB) DDR5 kits are almost $2k now and were $400 a year ago. Expect that to continue until this bubble pops (which IMHO it will) and we're likely in a global recession.
So the other issue is models. OpenAI and Anthropic are built on proprietary models. Their entire valuation depends on this moat. I don't think this will last, so both companies are doomed, because open source models are going to be sufficiently good.
We can already do some reasonably cool stuff on local hardware that isn't that expensive and even more so once you get to $5-10k hardware. That's going to be so much better in 2 years that I'm hesitant to spend any amount of money now.
Plus the code for running these things is getting better. Just in the last month there have been huge speed ups in local LLMs with MTP.
zozbot234 12 hours ago [-]
> So the DGX Spark is a joke, really.
Not at all sure about that. They have really good compute, and DeepSeek V4 (with antirez's 2-bit expert layer quant) may be able to leverage that compute via parallel inference - the jury is still out on that. Now if you had said Strix Halo/Strix Point or perhaps the Intel close equivalents, that would've been a slightly stronger case.
regexorcist 12 hours ago [-]
> So many expect an M5 Max Mac Studio with 1TB/s+ bandwidth and specs up to 256GB or 512GB, probably for ~$10k later this year.
This is what I'm really waiting for. It will enable models comparable to current SOTA at the enthusiast price range.
artursapek 15 hours ago [-]
I'm someone who is trying to build a subscription-based business to cover underlying LLM costs, and very hopeful I can one day just sell a permanent license to the software instead with customers using local LLMs to power it.
sgt 17 hours ago [-]
I guess Google got that memo!
cubefox 14 hours ago [-]
Local AI is a bit like wind parks. Everyone is in favor, except when they're in your own backyard. There was recently a huge outcry when Chrome shipped a local 4 GB AI model:
https://news.ycombinator.com/item?id=48019219
I have to conclude that people would like to have powerful local AI but it should at the same time only be a tiny model. In which case it wouldn't be powerful.
barrkel 14 hours ago [-]
Local models are extraordinarily expensive if you're not maximizing throughput, and you're not going to be maximizing it.
Local models need to be resident in expensive RAM, the kind that has fat pipes to compute. And if you have a local app, how do you take a dependency on whatever random model is installed? Does it support your tool calling complexity? Does it have multimodal input? Does it support system messages in the middle of the conversation or not? Is it dumb enough to need reminders all the time?
Spend enough time building against local models and you'll see they're jagged in performance. You need to tune context size, trade off system message complexity with progressive disclosure. You simply can't rely on intelligence. A bunch of work goes into the harness.
Meanwhile, third party inference is getting the benefits of scale. You only need to rent a timeslice of memory and compute. It's consistent and everybody gets the same experience. And yes, it needs paying for, but the economics are just better.
LPisGood 14 hours ago [-]
> And if you have a local app, how do you take a dependency on whatever random model is installed?
Reading the tea leaves here, it will probably be common for OS’s to have built in models that can be accessed via API. Apple already does this.
bheadmaster 14 hours ago [-]
> And if you have a local app, how do you take a dependency on whatever random model is installed?
Why not ship your own model? In the age of Electron apps, 10GB+ apps are not unheard of.
_heimdall 14 hours ago [-]
Personally I wouldn't want a couple dozen apps installed all with their own model.
It seems easier to have industry specs that define a common interface for local models.
I also assume the OS can, or would need to, be involved in providing the models. That may not be a good thing depending on your views of OS vendors, but sharing a single local model does seem more like an OS concern.
alex7o 14 hours ago [-]
I mean, the OpenAI API is the industry standard for allowing apps to communicate with models: llama-server has it, oMLX has it, ollama has it, vLLM has it, lmstudio as well. I don't think this is such a hard thing to do, but it requires people to set it up.
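For illustration, a minimal sketch of what that looks like in practice; the port, model name and prompt are assumptions, and whichever local server is running (llama-server, vLLM, LM Studio, ollama, etc.) just has to expose the same OpenAI-compatible endpoint:
from openai import OpenAI

# Point the standard client at a local OpenAI-compatible server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="local-model",  # whatever identifier the local server exposes
    messages=[{"role": "user", "content": "Summarize this receipt: ..."}],
)
print(resp.choices[0].message.content)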
_heimdall 14 hours ago [-]
I don't know enough about that API surface to know if it's a particularly good one for the use cases we'd have, but yes, defining a universal spec for all implementors to support wouldn't be a big lift and is done in plenty of other areas already.
alex7o 14 hours ago [-]
There is no other way than shipping your own model, because you will want an abstracted API over the inference, and you don't know what the user has installed. Also, you can ship a 9B fp4 model, but it all just depends.
_heimdall 14 hours ago [-]
Knowing what's installed would have to be an OS API, with LLMs providing a standard API surface to the OS, likely including metadata related to feature support.
LPisGood 14 hours ago [-]
You can know what the user has installed if the OS developer offers something.
crazygringo 13 hours ago [-]
I don't know why you are being downvoted. These are precisely the facts that advocates for local models completely ignore.
Local models are absolutely going to be the future for things like simple automation and classification tasks that run occasionally and don't need to rely on internet access.
But for all of the serious stuff where you are doing knowledge work, the models will simply continue to be too big, and too slow to run locally.
The article says:
> Use cloud models only when they’re genuinely necessary.
But at least for me, they're genuinely necessary for 99+% of my LLM usage.
At the end of the day, the constraint here really is efficiency and cost.
Privacy can be ensured with the legal system, the same way that businesses that compete with Google still have no problem storing their data in Google Workspace and Google Cloud. The contractual guarantees of privacy are ironclad, and Google would lose its entire cloud business overnight as its customers fled if it ever violated those contractual agreements (on top of whatever penalties they allow for).
barrkel 2 hours ago [-]
Downvotes: IKR? It's a signal that there's a lot of motivated reasoning.
I don't think that many people have built apps against these models.
I mean, I use a heavily quantized version of qwen3 for image classification, caption generation, prompt expansion etc. for image generation, instruction-driven edits, and so on. You can go a long way when you don't need a lot.
A model that can do tool calls - any tool calls at all - can look reasonably cool once you put it in a harness where there's enough immediate context to take action. You can get carried away by anything happening at all. But golly gosh it's a long way short of intelligence available in the bigger models.
And the lighter you make your harness, giving the model more free rein, more autonomy, the bigger the jump in capability, combined with a big jump in failure modes when the model is dumb.
Isn't that a function of RAM supply not being available now?
Who runs an IDE with LLM agents accessing your local filesystem, on bare metal?
Or am I alone in running everything LLM-related in a VM, just for development work? Then, because of Zed's genius decision, you need to share your GPU with the VM, so some important features will not work, like snapshots. So you also need a workaround for this, etc.
Too much hassle, Zed is not for me.
But I'm anti-Apple, so maybe that's the reason :)
Btw, even the "ImHex" devs realized this and they're providing a version without acceleration for VM use. They're using ImGui. Using it for local desktop app UI is also ridiculous, imho. Whatever.
Maybe the future is a selection of local, specific stack trained models?
I mean I've been forcing my good old 1080ti to run local models since a short while after llama was first leaked.
But I wouldn't say "local models are here" in the same way as "year of the Linux desktop!111"
Until someone can just go out and buy some sort of "AI pod" that they can take home, plug in and hit one button on a mobile app to select a model (or even just hide models behind various personas) then I wouldn't say it's quite there yet.
It's important that the average consumer can do it. I think the limitations for that are: things are changing too quickly, RAM and compute components are exceedingly expensive now, and we're still waiting on better controls/harnesses for this stuff to stop consumers not just from shooting themselves in the foot, but from blowing their foot clean off.
Would be interesting to see a Taalas-like chip in a product, albeit there are so many changes going on atm with diffusion-based models, Google's TurboQuant (which, as someone who has almost always had to run quantized models, makes a lot of sense to me).
The USB drive light is flickering, showing something is happening. It's been about 8 hours since I entered the prompt and I've gotten about 10 tokens back so far. I'm going to leave it running overnight and see what happens.
What did you use to do this: something standard like llama.cpp, something else like vLLM, or your own contraption?
I mean, the inference engine might need some tweaks to support whatever compute is available. But then, if you add a few terabytes of disk for swap, and replace the RAM with bigger sticks if possible, it should work? Slowly, of course, but there is no reason it shouldn't.
Reciprocal?
I use an anaconda environment on Linux (though I would have preferred a "uv" environment) and automate the startup sequence with the following script (start_comfy.sh), run from the terminal rather than manually activating the environment each time:
#!/bin/bash
# temporary shell version of the ComfyUI startup sequence
# make `conda activate` usable from a non-interactive script
eval "$(conda shell.bash hook)"
conda activate comfy-env
# low-VRAM mode, keeping the VAE on the CPU
comfy launch -- --lowvram --cpu-vae
Here are some of the images: https://imgbox.com/nqjYhdx3 https://imgbox.com/93vSWFic https://imgbox.com/qs1898dz
I'm hesitant to increase the sizes of the renders as that will surely stress my laptop's components.
I tried oMLX and OpenCode a few weeks ago and the 65k context window was useless; it tried to analyze a very small codebase before going fully agentic and ran out of context window immediately.
I don't have time to tweak 1,000 permutations of settings just to re-prove that it's not as smart as Opus 4.6.
I need out-of-the-box multimodal behavior as close as possible to typing claude in the command line, and it's so not there yet.
But I'm open to seeing what people's workflows are.
It's usable. I set it loose on the postgres codebase, told it to find or build a performance benchmark for the bloom filter index and then identify a performance improvement. It took a long time (overnight), but eventually presented an alternate hashing algorithm with experimental data on false positive rate, insertion speed and lookup speed. There wasn't a clear winner, but it was a reasonable find with rigorous data.
I gave it the reference C implementation, the LTFS spec from SNIA, and asked it to use the C implementation to verify the correctness of the Go code.
LTFS is a pretty straightforward spec, so it made a very reasonable port within about 2 days. It's now working on implementing the iSCSI initiator (client) to speak with my tape drive directly, without involving the kernel.
Edit: the model is Qwen3.6-35B
FWIW I think Gemma 4 31b is more likely to be of use to me than Sonnet, idfk, maybe it's a skill issue but I love Opus 4.7, undisputed king, but Sonnet seems borderline useless and I basically think of it as on the same level as Qwen 35b MoE.
But they diverge greatly on other particular ones whenever the ViT tower and a priori knowledge of the world are crucial. I wish Gemma was on par, but both me and Google know it's not.
I'm going to switch to local LLMs for most stuff soon.
Thot_experiment is saying that his 2016 Toyota Prius is a great and reliable car for his daily commute and running errands.
Whereas everyone is screeching about its capability gap with a Lockheed Martin F35 lightning.
(of course if i'm being honest 640kB is fine, i'm sure tons of the world's commerce is handled by less for example, the delta between a system with 640kb of ram and a modern one is near nil for many people, the UX on a PoS terminal does not require more than that for example, the hacker news UX could also be roughly the same)
Doubtful. The increase in demand is greatly outpacing supply, and all signs point to a continued acceleration in demand
> If I could drop $10,000 to have an effectively permanent opus 4.7 subscription today, I would.
lol well obviously, but realistically that price point is going to be closer to $100k, with a perpetual $1k a month in power costs.
If we think about the near future, something like Kimi2.6 is within the realm of Opus 4.6 today, but requires closer to $700k in hardware to run.
> For those of us a bit crazy, we are running KimiK2.6, GLM5.1
Yes, those can compare to Opus, but you can't run those unquantized for less than $400k in hardware.
A single maxed-out M3 can run a Q2 Kimi 2.6, though that's with hardly degraded perplexity.
2x M3s with RDMA can run a lossless Kimi2.6 at Q4, but with CPU only you would get okayish decode but horrible (+1m) TTFT, which wouldn't be a great _interactive_ experience.
If you believe what you read here, the gap is closing fast.
For niche applications, sure. For general use, I think the tendency towards the best model being used for everything will–to the model publishers' delight–continue. It's just much easier to get a feel for Opus and then do everything with it, versus switch back and forth and keep track of how Haiku came up with novel ways to dumbfuck this Sunday evening.
Fixed that for you. Right now most models produced are based on floating point maths and probabilities, which is "expensive" to do math on.
Microsoft has researched 1-bit LLMs which can run much more efficiently, and on much cheaper hardware[1].
If this research is reproducible and reusable outside their research models, it means the cost of running self-hosted LLMs will be reduced by an order of magnitude once it hits mainstream.
[1] https://github.com/microsoft/BitNet
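A toy sketch of the underlying idea: ternary (roughly 1.58-bit) weights obtained by absolute-mean scaling, so matmuls collapse into additions and subtractions. This is an illustration of the concept under those assumptions, not the BitNet implementation itself:
import torch

def ternary_quantize(w: torch.Tensor):
    # Scale by the mean absolute value, then round each weight to {-1, 0, +1}.
    scale = w.abs().mean()
    q = (w / (scale + 1e-8)).round().clamp(-1, 1)
    return q, scale

w = torch.randn(4, 8)          # a toy weight matrix
q, scale = ternary_quantize(w)
x = torch.randn(8)
y_approx = (q @ x) * scale     # cheap ternary "matmul", rescaled once per tensor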
However that's not the real battle here. The real battle is control of information to operate over.
While I might have access to a decent model - I don't have the huge integrated databases of everything that companies like Google have, and increasingly governments will accumulate.
As a citizen, AI operating over these large datasets is where the concern should be.
10 years ago I was using 16GB in my MBP and today it's 48GB. It's just a 3x increase during mostly a bonanza period.
And the Mac Studio was available with 512GB until ram got scarce and they cut the max in half recently.
There's plenty of demand for RAM right now. We'll see how this turns out.
Because late stage capitalism demands endless growth in order to pay executives and shareholders (especially those late to the train) more and more YoY.
And those requirements for growth mean that cost cutting is needed. Over the past few decades cost _have_ been cut, building things more efficiently, components becoming cheaper, larger volumes in mass manufacturing.
But we have already reached a point where there are no other places to cut than the quality of the product itself. Look to shrinkflation in food and other places - look at how "live action" versions are being made of previously animated movies, how game franchises from 2 decades ago are being brought back from the dead, the huge influx of remasters etc.
Why? Because it's cheaper to revive/reuse an existing IP than it is to create a new one + it guarantees success with the drooling consumer masses. And cheaper = more Ferraris for the multi millionaire/billionaire execs.
See how much Mario movie made? Just wait...bet you there'll be a live action version. ;)
This will depend on how much inference happens for consumer (desktop, local) vs enterprise ("cloud"), vs consumer mobile (probably also cloud).
I would assume that the proportion of "consumer, local" is small relative to enterprise and mobile.
I guess it'll most likely be AI doing the processing, with everything else becoming an API.
In the case of the GPTs and Claudes of the world, they'll just be using indexing APIs and a KB on top of their LLMs.
The question is would you choose to save $10 a day if it causes your inference to slow down 10x and waste 2 hours a day waiting on stuff.
To sell tokens profitably you'd need to be able to run inference at 150 tokens per second for less than $1,000 USD a month.
I don't think people realize how expensive it is to host decently capable models and how much their use of capable models is subsidized.
You can only squeeze so many parameters onto consumer-grade hardware that's actually affordable (two 4090s is not consumer grade and neither is a 128GB MacBook; this is incredibly expensive for the average person), and the models you can still run are not "good enough", they are still essentially useless.
People are betting their competency on a future where billionaires are forever generous, subsidizing inference at a 10-to-1 or 20-to-1 loss ratio. Guess what, that WILL end, and probably soon. This idea that companies can afford to give you access to $2 million in GPUs for 5 hours a day at a rate of $200.00 a month is simply unsustainable.
Right now they are trying to get you hooked, DON'T FALL FOR IT. Study, work hard, sweat and you'll reap the benefits. The guy making handmade watches, one a month, in Switzerland makes a whole lot more than the guy running a manufacturing line making 50k in China. Just write your own fkin code people.
Don't bet your future on having access to some billionaire's thinking machine. Intelligence, knowledge and competency aren't fungible; the LLM hype is a lie to convince you that they are.
With the new DeepSeek V4 series and its uniquely memory-light KV cache you can even extend this to parallel inference in order to hide memory bandwidth bottlenecks and increase compute intensity.
This is perhaps not so useful on a 128GB or 96GB RAM Apple Silicon device (I've seen recent reports of DS4 runs with even one agent flow hitting serious thermal and power limits on these devices, so increasing compute intensity will probably not be helpful there) but it will become useful with 64GB devices or lower that have to stream from a slow disk, or with things like the DGX Spark or to a lesser extent Strix Halo, that greatly overprovision compute while being bottlenecked on memory bandwidth.
It’s currently unsupported on Llama.cpp and vllm doesn’t support GPU+CPU MoE, so unless all of you have an array of DGX Sparks in your bedroom, what’s the secret sauce?!
i don't comprehend why people are in such disbelief at how much better this stuff runs on a mac studio than on NVIDIA hardware with 1/5th the VRAM. look, what can i say? NVIDIA is a bigger rip off than Apple is!
You are going off vibes alone, this is easily verified, please go verify.
What makes you think they have zero reason to subsidize, because the providers aren't a household names you assume they wouldn't operate at a loss? Whats your logic here? You make no sense.
Also, a lot of money is being made on input tokens and cached tokens, which are much cheaper to compute.
DeepSeek published their math for serving the V3/R1 models. They were 535% profitable: https://github.com/deepseek-ai/open-infra-index/blob/main/20...
If Anthropic and OpenAI are subsidizing the metered API usage, their model is going to end up just as successful as MoviePass. They are burning enough money on the training costs already.
If you have a machine running at 150 tok/s you can only make about $5,820 a month at $15 per 1M tokens running 24/7. It costs a hell of a lot more than $6k a month to run Claude 4.7 at 150 tok/s on that machine 24/7.
This math is a bit off, because you have input tokens too, but regardless it's still not profitable, especially for how long it takes to turn around a request, and the caching is probably not all that profitable.
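A quick sanity check of that revenue ceiling, under the stated assumptions (output tokens only, flat-out 24/7 at 150 tok/s, $15 per million tokens):
tok_per_s = 150
seconds_per_month = 60 * 60 * 24 * 30
tokens_per_month = tok_per_s * seconds_per_month      # 388,800,000 tokens
revenue = tokens_per_month / 1_000_000 * 15           # roughly $5,832/month, close to the figure above
print(tokens_per_month, revenue)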
Serving models on dedicated hardware is not the same as your at home 150t/s thing. Inference is measured in thousands of tokens / s in aggregate (i.e. for all the sessions in parallel). That's how they make money.
The reason it works: each time you read the model (memory bound) to calculate the next token, you can also update multiple requests (compute bound) while at it. It's also much more energy-efficient per token.
[1] https://aimultiple.com/gpu-benchmark
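A toy illustration of that amortization, with made-up numbers: a single request might be capped by memory bandwidth at ~20 tok/s, but each sweep over the weights can serve many requests at once until compute becomes the bottleneck:
def aggregate_tps(single_stream_tps, compute_bound_tps, batch_size):
    # One pass over the weights now feeds `batch_size` requests,
    # until the (assumed) raw compute cap takes over.
    return min(single_stream_tps * batch_size, compute_bound_tps)

for b in (1, 8, 64, 256):
    print(b, aggregate_tps(20, 3000, b))   # 20, 160, 1280, 3000 tok/s in aggregate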
The idea that everyone is spinning up $2 million in GPUs to scan their email inbox, search the web or avoid learning something is still ridiculous to me regardless.
Not if you're OK with 4-bit quantization. More like $30K-$50K one time.
Spring for 8 RTX6000s instead of 4, and you can use the full-precision K2.6 weights ( https://github.com/local-inference-lab/rtx6kpro/blob/master/... ).
I don't think cloud models are going away; the hardware for good perf is expensive and higher param count models will remain smarter for a looong time. Even if the hardware cost for kind-of-usable perf fell to only $10k, cloud ones will be way faster and you'd need a lot of tokens to break even.
I think local AI will win in its niche by repurposing users' existing hardware, especially as cloud hardware itself gets increasingly bottlenecked in all sorts of ways and the price of cloud tokens rises. You don't have to care about "bad" performance when you've got dedicated hardware that runs your workloads 24/7. Time-critical work that also requires the latest and greatest model can stay on the cloud, but a vast amount of AI work just isn't that critical.
There will not ever be a monthly subscription for LLM tokens. The economics isn't there.
Local tokens will always be cheaper.
Well your thinking is completely vibes based and not cemented in any reality I exist in.
They're not smarter, they just know more stuff.
You probably don't need knowledge about Pokemon or the Diamond Sutra in your enterprise coding LLM.
The "smarts" comes from post-training, especially around tool use.
> Just write your own fkin code people
Bro is nostalgic for googling random stack overflow threads for 10 days to figure out a bug the agent fixes in an hour.
The cost of cloud compute actually hasn't gone down for old hardware all that much; it still costs $500.00 a year to rent a 4-core i7-7700K that's 10 years old. Don't expect much more valuable hardware, like modern GPUs, to deflate in price all that quickly.
There are 3 fabs in the world that make DDR7, and they aren't going to be selling their stock to consumers going forward; it will be purchased by datacenters almost entirely and stay in them until EOL.
Your brain is going to atrophy (this is proven), they'll raise the price to something that's closer to break-even, and you'll be forced to pay it because you no longer have those muscles.
I think that is a very narrow perspective. Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"?
I agree with your view that cheap tokens on SOTA are a trap-- people should use local AI or no AI.
$50k is a median priced car in the US. I'd guess >99.9% of people do not own $4000 of GPUs. I consider myself a computer person and I don't think I even own $4000 of computer hardware in total.
A top-spec MacBook Pro is >$4k, so I assure you that plenty of computer people do own $4k of computer hardware.
Hell, most tech folks are wandering around with a ~$1k smartphone in their pocket too.
A car is super useful, so is an AI. But even if we decide cars are incomparably more useful a great many people pay much more than $4000 over the minimum viable car, and that's money that could be deployed to secure access to private, secure, and autonomous AI facilities. A few thousand dollars in computing is consumer hardware, or at least could easily be with more reason and awareness driving adoption.
People spend a LOT of money on things less useful than a local copy of qwen3.6-27b can be.
A friend an I had previously worked on an entropy extraction scheme and he recently got around to making a writeup about our work: https://wuille.net/posts/binomial-randomness-extractors/
I instructed the agent to read the URL, implement the technique in C++ for 32-bit registers, then make a SIMD version that interleaves several extractors in parallel for better performance. It implemented it (not hard since there was an implementation there that it read), then wrote more extensive tests. Then it vectorized it. It got confused a few times during debugging because the algorithm uses some number theory tricks so that overflows of intermediate products don't matter, and it was obviously trained a lot on ordinary code where such overflows are usually fatal. I instructed it to comment the code explaining why the overflows are fine and had it continue, which mostly solved its confusion.
It successfully got the initial 12MB/s scalar implementation to about 48MB/s. Then I told it to keep optimizing until it reaches 100MB/s. I came back the next day and it had stopped after 6 hours when it achieved just over 100MB/s. Reading what it did: it went off looking at disassembly, figured out what hardware it was running on, and reading microarch timing tables online and made some better decisions, tried a lot of things that didn't work, etc. (And of course, the implementation is correct).
I'm pretty skeptical about AI and borderline hateful of many people who (ab)use it and are deluded by it-- but I think this experience shows that a small local model can be objectively useful.
(oh and this experience was also while I only had the model running at 19tok/s)
Running the model in a loop where it can get feedback from actually testing stuff allows you to make progress in spite of making many mistakes.
I could have done this work myself but I didn't have to and I certainly spent less time checking in and prodding it than it would have taken me to do it. In my case I wondered how much faster parallel extractors using SIMD might be-- an idle curiosity that would have gone unanswered if not for the AI.
Congrats, but you're in the 0.0001% that's not just frying their brains, fapping to their local models or doing various magic tricks like a toddler entertained by playing with velcro.
At the end of the day you lost an opportunity to improve yourself and exercise your brain. Maybe the opportunity cost is worth it, idk, but I'm going to keep taking things slow.
Handmade Swiss watches > mass-manufactured imitations. Handmade clothes > Walmart clothes.
There are plenty of other uses that people have been making for a long time-- e.g. I know someone who uses a fine tuned local model to sort their incoming email and scan their outgoing messages for accidental privacy leaks.
I don't agree with your assessment on an opportunity lost-- I got my reps in on the original work, the AI gave an incremental step forward which made the whole exercise somewhat more valuable to me with minimal additional cost. I think this improves the cost vs benefit in a way that makes me more likely to try other pointless activities, knowing that when I run out of gas I can toss it to AI to try some variations.
Sometimes you're also 27 steps deep on a nested subproblem and you're really just trying to solve something. Even in fine craftsmanship not every step needs to be about maximum craftsmanship. :) Sometimes it's just good to get something done.
I think this is much like any other tool. One can carve furniture using only hand tools, but the benefits of a router are hard to dispute. Both approaches exist in the world and sometimes both are used in concert.
As far as people frying their brains with AI -- you don't need local models for that, plenty of people are driving themselves into deep personally and socially destructive delusion just using the chat interfaces.
I agree with you, there's a way to use them responsibly, like your router analogy; I just think most aren't doing this correctly and it's a slippery slope. I'll contend that you probably have used them responsibly in your example.
- text-to-speech
- speech-to-text
- dictionary
- encyclopedia
- help troubleshooting errors
- generate common recipes and nutritional facts
- proofread emails, blog posts
- search a large trove of documents, find information, summarize it (RAG)
- manipulate your terminal/browser/etc
- analyze a picture or video
- generate a picture or video
- generate PDFs, documents, etc (code exec)
- simple programming
- financial analysis/planning
- math and science analysis
- find simple first aid/medical information
- "rubber ducking" but the duck talks back
A quarter of those don't need more than a gig of RAM, the rest benefit from more RAM. Technically you don't even need a GPU, it just makes it faster. I do half that stuff on my laptop with local models every day.
That said, it really doesn't need to be local. I like the idea that I can do all that stuff offline if I'm traveling, but I usually have cell service, and the total tokens is pretty cheap (like $2/month for all my non-coding AI use).
If there's a newline in my comment, why not retain it? Whyyyyy?!
Until then, I'm going to keep sending my JSON to the server farm in Virginia because it's the only place that can serve me a model that actually works for my uses.
I have a lot of fun with the local models and seeing what they can do.
I appreciate the SOTA models even more after my local experiments. The local models are really impressive these days, but the gap to SOTA is huge for complex tasks.
The IQ2 quants that fit into 128GB machines are very degraded.
1. Do a particular task with great capability (due to its constrained, limited scope)
2. Do it in such a way that it integrates gracefully into your workflow without ever requiring you to know you are using an LM.
There is a difference between outsourcing your workflow to AI and actually utilizing it.
Check this: https://www.distillabs.ai/blog/we-benchmarked-12-small-langu...
The reason being that many workloads for AI are dynamically mixed, where training from multiple subjects comes into play and you just can't know exactly what mix will be required for each task ahead of time.
I was hoping loras would do this for us as well but they don't really seem to have worked out for llms (compared to in the image/video diffusion space).
Perhaps some future model will have some sort of "core" that can load/unload portions of itself dynamically at runtime. Like go for a very horizontal architecture/hundreds of MoE and unload/load those paths/weights once a parent value meets or exceeds some minimum, hmmm.
I think the future will probably be a hybrid of:
1. local AI for simple, private, everyday tasks
2. online AI for very hard or long tasks
https://felixrieseberg.github.io/clippy/
local LLMs build tools that do exactly what the user wants, how they want it, which is the best UX
this becomes AI literacy
LLMs already nicely bridge the gap from "I want this" to "here's a local page that does it".
examples of tools i have built that require very little tech knowledge:
* push a button on my phone to take a screenshot on my mac (when i watch videos)
* help me exercise, gamify it for me
* "help me track time spent online and how it impacts what i do in real life": built a tool that rewards me with points and points me towards things that make me DO things online
* i want to improve my writing: give me exercises and build additional tools (leading to an "append only" digital keyboard i use to exercise)
local AI can already create these tools, and no external company is ever going to beat me/the-user, because instead of getting features i don't want, or that almost do what i want, or that do something that advantages the company, they just do what I want
Repositories of tools-as-ideas created by others are quite often just index.html and ... that's all? manage data in localstorage, end of it?
Online inference is still needed for large data (audio/video/images) processing. For now? We don't know; history suggests we'll have the capabilities to do that locally "soon". Or maybe not :)
The main issue is "online for collaboration". Not same user across different devices, that is easy. MeteorJS-style approaches (making local copies of part of dbs, reconcile to remote/origin) seems to be an interesting possibility at small scale, since once you have the right primitives in place you can go horizontally everywhere.
I can’t wait to run my models locally. The sooner I can do my shit without some American mega corp gulping down all my data, the better.
I haven't seen a text-based model sharing site spring up yet (perhaps they already have and I don't know about it yet). Civitai, being focused on image-generation, has the obvious advantage that it's easy to show off impressive results from the model on the front page of the website, and judging what someone's home-grown fine-tuned LLM will produce is a lot harder. But at some point I expect a Civitai equivalent site for text models, especially code-based ones, to become popular. That will seriously undercut Anthropic, OpenAI, et al, and will probably force them to find a price equilibrium.
Because once you're competing with "I spend $2,500 up front on a powerful video card, download an open-source model for free, and then I get pretty much everything I need for free" (additional power cost of running that video card isn't nothing, but probably not noticeable in your power bill compared to what you're already using)... then suddenly $200/month means your customers are thinking "after one year I would have been better off with the homegrown solution". The only way they'll continue to pay $200/month is if Claude/GPT/Gemini/whoever is truly head-and-shoulders above the "pay upfront once for hardware then use it for free afterwards" models available. And that's going to be doable, perhaps, but tough.
Huggingface.
The reason HF doesn’t also compete for image gen is probably some combination of momentum from Civit AI and HF not wanting to deal with the moderation headache.
But for a site sharing code-generation models, it's a very different scenario. I'm curious to see what will happen in that space.
The dependency we have on Anthropic and OpenAI for coding, for instance, is insane. Most accept it because either they don't care, or they just hope the Chinese will never stop releasing open weights. The business model of open weights is very new, includes some power play between countries and labs, and moves an absurd amount of money without any concrete oversight from most people.
It's a very dangerous gamble. Today incredible value is available for nearly everyone. But it may stop without any warning, for reasons outside our control.
What stops you from running the best open-weight LLMs currently available on consumer-grade hardware for the rest of time? They're good enough for 95% of use cases, and they don't have a use-by date. From what I can see, the "danger" is not having the next tier that comes out, but the impact of that is very low.
For quite a lot of use cases, the current systems arguably do get worse over time if not continually updated. The knowledge cutoff date will start to hurt more and more as the weights age in a hypothetical scenario where you are stuck with them forever.
Coding, one of the most popular use cases today, would not be great if the model, say, only understood Java as of a version from years ago, etc.
https://en.wikipedia.org/wiki/Knowledge_cutoff
Or will human-readable code be less and less of a thing as AI learns its own, more terse language to talk to other AIs?
This LLM trained only and entirely on pre-1930s texts was able to code Python programs when given only a short example:
https://talkie-lm.com/introducing-talkie
Pockets are too deep, it will only change once everyone is out of money.
Side note though, it’s the speed that bothers me more than the reasoning. Qwen 3.5 is awesome, but my Claude subscription can tear through similar workloads an order of magnitude faster than my local LLM can when using Haiku. That’ll matter a lot to some people.
They're not at all, not even close. Especially when you consider the use cases for people who are paying for LLM services today.
Uh… the hardware requirements? And stop acting like some dog shit 8B model the average Joe can run on a laptop is even close to being comparable to what Claude or even Codex can currently do.
I have pretty good hardware and I’ve tinkered with the best sub-150B models you can use and they are awful compared to Anthropic/OAI/Grok.
honest question, i'm very interested in this, but too casual as of now to know any better.
I'm not, you've actually illustrated my point. LLMs in 2022 were very impressive. By 2024 the general public was finding them an acceptable replacement for many research-driven tasks and massive shortcuts for other tasks (coding, image work, document preparation, etc).
Those models are absolutely runnable on consumer hardware now, and we were extremely happy with the results. It's no different to how we used to think CRTs were amazing or early smartphones, but going back now they seem awful.
We're long past "danger". If what we have is the best we'll ever have open source, we're already in an excellent position.
No they weren't. They were a gimmick - it is only in the past 6 or so months that frontier models have started to do stuff beyond mere gimmicks when it comes to coding, and you could make the argument that Mythos has been the first 'Holy shit' moment that we've had that has stepped us beyond 'Yeah that's really neat but...'
> Those models are absolutely runnable on consumer hardware now,
A sub 50B model is awful and can't even write proper English sentences half the time, to say nothing of how bad its world knowledge is. Try the 32B Gemma 4 local model for a week and then go back to Claude and then get back to me.
> We're long past "danger". If what we have is the best we'll ever have open source, we're already in an excellent position.
Not sure what to tell you other than that you and I have very different standards. What we have locally right now is barely more than a glorified autocomplete, and it feels worse than using ChatGPT 2 years ago, because the context window is smaller and it doesn't have good webhooks on consumer setups. Another thing I'd say is that you clearly have no clue what 'consumer hardware' means, or what consumers who can even get this stuff running locally would have to do to get it to even rival the frontier models in terms of usability and flow (most consumers aren't going to just boot into Ubuntu and run this thing from a command line), to say nothing of the hardware requirements. I'd love to never use Claude or Gemini or ChatGPT again, for both privacy and money reasons, but the gap in output quality and depth of thinking and writing ability between even the very best local models you can run right now and the distributed frontier models is many orders of magnitude, and those 'very best' local models require a top-of-the-line machine that 99.9999% of consumers don't have and would never consider buying. The cloud models all have like a trillion(!) parameters now. It isn't even close.
I sure hope the local side of things massively improves over the next 2-3 years, but based on how this has gone my guess is that in 3 years you'll be lucky, if you have very top of the line hardware, to get benchmark performance that we had 6 months ago with the frontier models. The distributed hardware/memory gap is just too big.
Note that we are talking about 95% of everyone's use cases, not your specific use cases (which could require better models all the time).
The huge difference to open source is that you can't just train an LLM with free time and motivation. You need lots of data and a lot of compute.
I sure want to be wrong on that, I definitely like the open-weight version of the future more
In the same way you can imagine the Chinese government pushing the release of deepseek etc to make sure no one thinks the US has “won” and to keep everyone aware that a foreign model might leapfrog in the short term future etc.
At some point though if OpenAI/Antropic/Google plateau or go bust then the open source sponsorship becomes less likely, as making it open source was a weapon not a principle.
Effectively they are saying "yea don't crowd our data centers with small queries, go ahead and send your frontier questions to our frontier models. Oh btw those us models? You can run something about as good for free from us if you want hah." It's a power and marketing move. It's also insanely smart to keep up with it to remain sustainable as a brand. Especially given how small their investments into this are.
Look at Anthropic's growing pains. Deepseek has other hosts spreading their brand for free while they grow. Brilliant, honestly. In my opinion it makes Anthropic and OpenAI look clueless on a lot of levels.
China is playing a different game here. To them this is commoditizing their complement and building goodwill. The Chinese economy doesn't teeter on the brink of collapse to deliver frontier-grade LLMs. Nope, Alibaba just made Qwen because it needs it. It needs efficient models. Similarly, in China they manufacture and automate so much more than the US ever could. LLMs to them are a topping, not the whole meal like they are in the US.
They're state companies, not some kind of ethical VC charity fund project.
If the US’s fascist experiment continues past the current president, we’ll absolutely be nationalizing frontier companies or exerting equivalent control.
I'm glad I get reminded that TDS is real, but everyone forgets that Bush, Obama, and Biden all did things with executive power that Congress ignored or provided little real oversight for. And Congress has proven over the last several decades that their oversight is rather meaningless for the goals of American voters rather than special interests.
But it's all Trump's fault is much more convenient.
Absolutely not. There is a huge difference in their behaviors.
> But it's all Trump's fault is much more convenient.
It is not just Trump's fault. Trump is the logical consequence of what the conservative party became. J.D. Vance and Miller are as much fascists, if not more. The whole party worked for this for years and created this.
> And Congress has proven over the last several decades that their oversight is rather meaningless for the goals of American voters rather than special interests.
Of course Congress in general is not the place to stop the Republican party from its fascist goals, because Republicans in Congress support Trump 100%. They stand by Project 2025 100%. They are doing oversight all right when it comes to blocking Democrats.
The idea that the party that made Trump big, promoted the ideas he built on, and created Project 2025 is supposed to be a counterbalance to itself is absurd.
https://try.works/#why-chinese-ai-labs-went-open-and-will-re...
It did work for Deepseek for sure and it seems to move the needle for Xiaomi's MiMo; but will it be enough for Qwen and Gemma? Those are the models you can actually run without going all-in on AI (but only with gaming GPUs and such).
The compute required to run these models is still very far out of reach for the average consumer, or even the dedicated enthusiast, so they still sell inference, whilst also getting consumer goodwill for providing open weights.
China? I'm getting ready to watch the URKL (universal robot knockout league) go on. The USA is dicking around with failed robot dogs.
The USA has been a failed country, coasting on massive inertia. But the tech breakdown from an article I can't find showed the USA excelling in 8 of 64 areas. China was excelling in 56 of 64.
Smart people in China design fast manufacturing lines for $25k/yr.
Smart people in the US design bond hedging strategies or ad-pixel trackers for $250k/yr.
China is in the stage the US was in 60 years ago, and eventually those high paying, high impact jobs will suck the intelligence out of all the "blue collar" work. Just like it did in the US.
Dodging politics, the power structures in US industry need serious revamping.
The USA exports, and has exported, services, especially in IT, and a lot of them. "The USA has nothing to export" is true only if you intentionally ignore the stuff the USA does export.
So, the business model of open models is the same as closed models: Sell inference. Open source is marketing for that inference.
https://try.works/#why-chinese-ai-labs-went-open-and-will-re...
The Open Source AI Definition (OSAID) is quite ridiculous, I prefer the Debian ML policy for defining freedoms around AI.
https://salsa.debian.org/deeplearning-team/ml-policy/
Frontier US labs could still have an advantage for a long time, but many use cases would start gravitating towards Chinese models if they 10x the data centers and provide similar quality inference for a third of the cost.
Not everything good in our society needs to have a "business model". People still work on it. It's FINE.
Donations. Have you donated lately?
Wikipedia is cheap compared to creating and training models.
I don’t think donations will suffice at all.
As an example, we had millions of web developers download and install Firebug before browsers shipped their own dev tools. Donations over the course of multiple years would have paid my salary for a month if I were not a volunteer.
But from the “it’s fine” point of view, models will be baked into your OS.
Then later, models will be embedded into hardware. Likely only the OS makers' models.
DeepSeek said it spent $5.6M [1] on training V3, which doesn't sound too much for a near-SOTA model.
An open source entity can come up with a hybrid business model, such as requiring a small fee from those who want to host the model as a business for the first n months following the release of a new model, but making it fully free for individuals.
[1] https://arxiv.org/pdf/2412.19437
This is what I don't understand either, and advertising their know-how and more advanced models is the only explanation that comes to my mind.
For the past month I've been using Gemma 4 locally on an MBP M2, successfully, for many search queries (Wikipedia-style questions), and it is really good, fast enough (30-40 t/s), and feels nice as it keeps these queries private. But I don't understand why Google does this, and so I think "we" need to find a better solution where the entire pipeline is open and the compute somehow crowdfunded. Because there will come a time when these local models get more closed, like Android is closing down. One restriction they might enforce in the future could be crippling the models for "sensitive" topics like cybersecurity or health. Or the government could even feel the need to force them to do so.
It builds goodwill. It also shows research prowess.
For China it's different. They need to show Americans who don't trust them at all because of propaganda that they have no tricks up their sleeve. It also doesn't hurt when Chinese companies drop models for free people can run at home that are about as good as sonnet. Serious mic drop.
Running AI models on local hardware was exploratory at first, and if it's so easy today, it's thanks to open source. It's a little bit coincidental that we have this today, and that mainstream hardware has this capability. The fact that a phone can run very small models is exploratory, or some kind of marketing opportunity at best.
Why would hardware companies ship cards with more AI capabilities (like more VRAM) in the foreseeable future? On what grounds will the marketing for on-device AI keep generating interest? For something this important, it's very uncertain. But above all, it should not depend on these brittle justifications.
Showing goodwill in distribution and research prowess today is positive communication, but it can be exactly the opposite if/when an attack using those small models reaches a high-value target.
For China the cultural difference is so huge, it's difficult to say. I would think they first and foremost need to show everyone inside and outside of China that they match American models. Second, I would say that where Americans prefer a few very powerful companies from the get-go, because they can leverage a lot of capital rapidly to industrialize, China will prefer leveraging a lot of smaller companies exploring a lot of things simultaneously (so doing a lot of research), THEN creating legislation to let only the best (or a few) survive. In the end it's the same result (monopoly or oligopoly), but China may have a stronger core (research) and America may have stronger productive capital, which may prove obsolete... In the long run, on either side it's a gamble, again.
I disagree on the second point. I think most Americans don't prefer less competition; that's a bit antithetical to the free market.
I doubt the Chinese government cares as much about controlling a few companies as you think they do.
China has a few things going for it beyond research. They are mission driven, they actually have needs for this technology, their needs will forward their entire economy as they are the world's largest manufacturers. They are also huge exporters and have buckets of customer support for various languages.
China also has considerably stronger infrastructure for electricity, etc. Even with an Nvidia embargo they are doing more than showing up.
I don't think it's a matter of who "wins". There is no winning. I think China stands to gain far more from LLMs than the US does, and they have proven they don't need the US to do it, even with the US trying to sabotage its every move into the space. The game is already more or less over in my mind.
If anything I see LLMs as having a huge market in China, and now the US can't even sell it to them.
All I care about is, if I have to use this technology, let me run it locally to avoid the surveillance capitalism aspect. That seems to be the real reason the US has propped up its economy in anticipation of this technology. Yet it doesn't benefit the US, nor me, in the long term.
I don't think local will necessarily be open-weight. And then it's not that different from personal computing: you're giving up the big lucrative corporate mainframe, thin-client model for "sell copies to a ton of individuals."
So it'd be someone else (an Apple, or the next-year equivalent of 1976 Apple) who'd start eating into that. There are a few on-device things today, but not for much heavy lifting. At first it's a toy, could maybe become more realized in a still-toy-like basis like a fully-local Alexa; in the future it grows until it eats 80-90% of the OpenAI/Anthropic use cases.
Incumbents would always rather you pay a subscription or per-use forever, but if the market looks big enough, someone will try to disrupt it.
The cost to transmit text is basically free and instantaneous. The rent (i.e. a GPU in a data center) vs buy decision is going to favor rent until buy is a trivial expense. Like the $50-100 range.
Even then, an LLM that just works is easier than dealing with your own.
Video game streaming is the closest thing, and it's never really taken off. (And this, IMO, is a good comparison because it's a pretty similar magnitude up-front-cost, $500-$4000.)
Once the local-AI-is-good-enough (Sonnet level for a lot of basic tasks, say) for a $1k up-front investment the appeal of having something that can chew on various tasks 24/7 w/o rate limits, API token budget charge concerns, etc, is going to unlock a lot of new approaches to problems. Essentially more fully-baked line-of-business OpenClaw-type things. Or the smart home automation bot of Siri's dreams. You can more easily make that all private and secure when all the compute is local: don't give any outside network access. Push data into the sandbox periodically via boring old scripts-on-cronjobs, vs giving any sort of "agentic" harness external access. Have extremely limited data structures for getting output/instructions back out. I'd never want to pass info about my personal finances into a third party remote model; but I'd let a local one crunch numbers on it.
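Roughly what that pattern could look like, as a hedged sketch (paths, endpoint, and model name are made up; it assumes a llama.cpp/Ollama-style OpenAI-compatible server already running on localhost, fed by a cron job that drops exported CSVs into a folder):

```python
#!/usr/bin/env python3
"""Cron-driven local summarizer: nothing here needs outbound network access."""
import csv
import json
from pathlib import Path

import requests

LOCAL_ENDPOINT = "http://127.0.0.1:8080/v1/chat/completions"  # local-only server
INBOX = Path.home() / "finance" / "inbox"     # filled by a boring cron export job
OUTBOX = Path.home() / "finance" / "reports"  # plain-text results come back out

def summarize(rows: list[dict]) -> str:
    prompt = (
        "Summarize spending by category and flag anything unusual:\n"
        + json.dumps(rows[:200])  # keep context small for a local model
    )
    resp = requests.post(LOCAL_ENDPOINT, json={
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    OUTBOX.mkdir(parents=True, exist_ok=True)
    for f in INBOX.glob("*.csv"):
        with f.open() as fh:
            rows = list(csv.DictReader(fh))
        (OUTBOX / f"{f.stem}.txt").write_text(summarize(rows))
```

The point being: the "harness" is just a script and a directory, and the only thing with network access is the cron export, not the model.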
Even if you need Opus/Mythos/whatever level for certain tasks, if 95% of everything else you'd pay Anthropic or OpenAI for can now be done on things you own w/o third party risk... what does that do to the investment appeal of building better AI appliances to sell end users vs building better centralized models?
I think "what if today's LLM performance, but running entirely under your control and your own hardware" opens up a LOT of interesting functionality. Crowdsource the whole world's creativity to figure out what to do with it, vs waiting for product managers and engineers at 3 individual companies to release features.
Anyways, who's spending $1k for an LLM machine when they can spend $20 (or 0) on a subscription? And who's having an LLM crunching away 24/7 anyways? Anyone who is going to do something like that probably wants a cutting-edge model.
It'll (probably) get to a point where the hardware is cheap enough and advancement levels off. But we're a ways from that, and even then, when a data center is 20ms away, why not offload heavy compute that's mostly text in, text out?
How many crowdfunded projects do you know that have raised even one percent of that? Who’s going to be in charge of collecting that scale of money? Perhaps some sort of company formed for the benefit of humanity, which will promise to be a non-profit? Some sort of “Open” AI?
Oh, wait.
I can't say that you are lying, and you are not exactly exaggerating either. It is true that a new SOTA model -- from literal scratch -- would be expensive.
But, and it is not a small but, is the starting point really zero?
Much like the current Twitter model, being able to put your thumb on the scale of "truth". Bake a stronger bias towards their preferred narrative directly into the model. Could be as "benign" as training it to prefer Azure over AWS. Could be much worse.
Sometimes there are things where the public good is best served with public expenditure.
Not every country is in a crypto-libertarian race to hoard power and wealth.
Meanwhile, in the EU, the model would be collectively financed, trained by a competent, neutral agency... and then completely lobotomized in the name of "the children," "safety," "IP rights," "correct speech," dozens of individual countries' legal and regulatory requirements, and any number of additional vocal, noncontributing NGOs.
So no one would get rich off of the public model, but no one would get much of anything else out of it, either.
As another reply suggests, there's a reason why things happen in the USA first. Even when they don't, the prime movers move here as soon as they can. Or at least they used to.
1. Innovate, create, and offer it all at sweetheart prices to the public while you rack up debt.
2. Shovel in more money and either buy out or outlast the competition. Become dominant. Lock in your users any which way you can.
3. Enshittify and cash in.
The deals Anthropic, OpenAI, etc. offer won't stay this good much longer. Don't let them lock you in. Failing that, you should budget more for the same service. You're going to need it. Having an open alternative running on your own hardware offers non-negligible peace of mind.
Read through a 1970s-era issue of Popular Electronics or Byte, and then spend some time surfing /r/LocalLlama. You'll get a sense of real-time deja vu, like you're watching history unfold again.
How serious a risk is poisoned weights?
Can we leverage the cryptobros into using LLM training as a proof of work?
Having an LLM use a web search tool isn't the same thing as researching a topic, IMO, because it's so ephemeral and needs constant reinforcement. LLMs aren't learning machines, they're static ones.
informatics aren't magic, you'll never be able to compress """knowledge""" into a small model in a way equivalent to the 1.5 TB model
A self-hosted inference solution that offers good tenant isolation guarantees (ideally zero trust) and is easy enough to deploy and maintain (think Plex for AI) would be my choice for privacy. Now, to be honest, I have done zero research about this and have zero idea how feasible that is; maybe it already exists and there are some Discord servers I should join?
Edit: I don't need to mention it here but what's incredible is that open models are in the ballpark of the best commercial models so supposedly, the hardest part by far is already solved.
>that open models are in the ballpark of the best commercial models
This is basically true for certain tasks. As an example, chat interfaces are not well poised to take advantage of higher model intelligence than what the best open source models already provide. But coding harnesses still benefit from greater model intelligence and even more so, the reinforcement learning that tightly interlinks the provider's coding harness (claude-code, codex) with the model's tool calling interfaces is another reason for discrepancy in effectiveness even when controlled for model intelligence. The opencode founder (open source coding harness that supports different model providers) was recently complaining about the challenges making the harness work well with different providers: https://x.com/thdxr/status/2053290393727324313
I agree local models are great, and it’s cool that Apple has models built in now. But I feel like it basically has to be an OS level feature or users are going to get upset. I’d certainly rather have a small utility call out to OpenAI than download its own model.
[1]: https://news.ycombinator.com/item?id=48019219
- Self hosting is expensive. It involves expensive machines with GPUs that cost hundreds per month if you use cloud based ones. You might need multiple of those. And you need people to mind those machines and they are even more expensive per month.
- If you run stuff on your laptop, it consumes a lot of resources and energy. I have Qwen running on my laptop. Even minimal usage turns my laptop into a radiator. Nice as a demo, but I can't have it this hot all the time. It would run out of battery, and it's probably not great for the longevity of components in the laptop.
- Models are evolving quickly, and the self-hosted smaller ones aren't as good when it comes to things like tool usage, reasoning, etc. Being able to switch to the latest model is valuable.
- It's easier to get your use case working with one of the top models than with one of the smaller self hosted ones.
- If you get the wrong hardware, it might not be able to run the latest models very soon.
- Self hosting models is mostly a cost optimization. It only becomes relevant if you hit a certain scale.
- You have alternatives in the form of hosted models via a wide range of service providers. Some of those are EU based and offer all the things you'd be looking for if you are offering your services there. Including legal requirements.
- Reinventing what these companies do in house is technically challenging and possibly more expensive than self hosting models because now you need a lot of engineering capacity dedicated to that. And legal. And all the rest.
If, like most companies/people, you are at the experimenting stage, the cheapest and fastest is just getting an API key from an API provider of your choice. You can take it from there if your experiment actually works. And then it's mostly about optimizing cost. If your API usage goes to the thousands per month or worse, it becomes a cost/quality trade off.
The additional up-front cost for hardware designed to run an LLM in addition to normal workload is unlikely to be accepted by most consumers.
The scale will be very constrained (like Apple's on-device models, which are small, heavily quantized, and have a small 4K-token context window). It's also terrible for battery life.
AI as it is implemented today is simply just computationally expensive and unless you put in dedicated hardware (like the ANE) for only this purpose - a large cost driver - I don’t really see it getting large scale adoption.
Companies will probably need a server-backed solution as fallback if they want reasonable user experience, so why even invest in diverse hardware support.
Now today, AI is very expensive and not readily accessible to most people without paying a good amount.
The early internet turned into "now you can just get a free phone from phone companies so long as you take their extras." Then you get a ton of subscriptions and add-ons, but you don't have to spend money; you could just use YouTube with ads, etc.
Local AI would similarly shift this dynamic to paying for access to plug-in’s and tools for your local AI to be able to use. Like how the subscription model works right now.
With local model advancements, such as specifically Qwen 3.6 35B A3B, this future is becoming more likely by the year IMO.
TFA is focused on whether big models are necessary for what users want. There's some evidence they may never actually be reliable enough unless a) mechanistic interpretability matures far enough or b) our multi-agent systems all become multi-model.
For (a), advancement in MI might fix problems with big models, but would also mean we can maybe get unified representations, and just slice and dice the useful stuff out of huge models, getting only what we need without the junk. Ability to isolate problems won't really come without bringing the ability to isolate functional subsystems. Only want logic? Only vision? Just cut it out of the big monster and enjoy reduced costs and surface area for problems.
For (b), just look at stuff like the evil vector, or the category of hallucinations specific to tool-use. Without a complete solution for helpful/honest/harmless alignment, it seems likely that creativity and rigor (and many other things) are fundamentally at odds. If you start to need many models for everything anyway, why do we need the huge expensive do-everything ones? So specialization also becomes a pressure to shrink everything towards minimal reliable experts
They need to be able to do a small task well and they need to be able to run reasonably on consumer-class devices. Even better if they can run on mobile phones.
In my experiments with local LLMs I noticed that while increasing the size of the model is nice, the real thing that turns a barely usable model into something useful is the ability to use tools. Giving my models the ability to search the web and fetch web pages did way more to solve hallucinations than getting a bigger model did. And it doesn't have a training cutoff. Sure, the bigger model is probably better at using tools, but I often find the smaller models to be good enough.
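For anyone who hasn't wired this up, a rough sketch of the loop (the endpoint is assumed to be an OpenAI-compatible local server that supports the `tools` field, and `web_search` here is a stub you'd back with SearXNG, a local index, or plain HTTP fetching):

```python
"""Minimal tool-use loop against a local OpenAI-compatible server."""
import json
import requests

ENDPOINT = "http://127.0.0.1:8080/v1/chat/completions"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return the top results as text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def web_search(query: str) -> str:
    # Placeholder: wire this up to SearXNG, a local wiki dump, etc.
    return f"(search results for: {query})"

def ask(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        r = requests.post(ENDPOINT, json={
            "model": "local", "messages": messages, "tools": TOOLS,
        }, timeout=300).json()
        msg = r["choices"][0]["message"]
        messages.append(msg)
        calls = msg.get("tool_calls") or []
        if not calls:          # model answered directly; we're done
            return msg["content"]
        for call in calls:     # otherwise run each tool and feed results back
            args = json.loads(call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": web_search(**args),
            })
```

The small model only has to decide when to call the tool; the freshness comes from whatever the tool returns.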
Knowledge and clean data sets are becoming increasingly valuable, and free community knowledge is drying up. The next big programming language won’t have years of Stack Overflow posts to train on.
Maybe we will see some kind of licensing deals where owners of good datasets charge you a fee to let your AI search them.
Why should connecting small models to big models result in higher output quality than just running the big models without the small models?
Assuming we end up in a future where people pay to run multiple smaller models on their machines for specific tasks (e.g. A summariser model, a python coding model, or however fine grained/macro you want to go), the people training those models will need to turn a profit.
So how much will that cost? And how often will consumers have to pay? Models have a very short shelf life. Say you have a dedicated Python coding model - that needs re-training every time there's a significant update to the language itself, any popular packages, or related technologies (e.g. servers, cloud infra etc). So how often will users need to "upgrade" to the latest version? It's going to be "frequently".
And it still needs the language stuff on top of that. Users aren't going to interact with a python coding model by writing python. They're going to use natural language. So the model needs all that stuff. And they're going to give it problems to solve. What if you asked the model "Write me a Bezier curve function". It needs to know about bezier curves, which have nothing to do with Python. So where do these LLM providers draw the line on what makes it into the training data and what doesn't?
And if an LLM doesn't know what a Bezier curve is, that's not going to stop it from just hallucinating an answer. If a significant proportion of prompts resulted in a response that said "Sorry, I don't know what you're talking about", then people would just stop using it. The utility of these things will be quickly overshadowed by the frustrations.
The way these frontier models have been introduced and promoted has set unrealistic expectations, and there's no putting the genie back in the bottle.
Commoditizing complements. If Anthropic/OpenAI/etc. is eating your lunch, make it work with cheap local LLMs; you can beat them on price by having local inference you don't pay for (nor need data centers for), and try to keep your (user/data) moat.
The more Anth/OAI disrupt, the more likely this is to happen. If they don't disrupt enough (i.e. grow as an ecosystem to defend against incentives to commoditize), then yes, those incentives are removed, but they also leave money on the table, which they need.
Not only at the business level, but also geopolitical (to a lesser extent? or not, since lots of open-weight models come from China?).
On the other hand… the v4 flash model is actual magic compared to what was available 2 years ago. If the rate of improvement stays as is, we'll get similar performance in a ~120B model in a year, which is viable (if expensive) for everyman hardware. Possibly you'll be able to run its equivalent on a ~$1200 laptop by 2028, which for me-in-2020 would sound straight out of a sci-fi movie. A good harness that lets the model fetch data from other sources, like a local Wikipedia copy from Kiwix, could do a lot for factual knowledge, too; there's only so much you can encode in the model itself, but even a cheapish (pre-current-prices) 2TB drive can hold an immense amount of LLM-accessible data.
Big caveat: I don’t see local models for programming or generally demanding agentic tasks being worth it anytime soon. You likely want bleeding edge models for it, and speed is far more important. Chat at 20tok/s is fine; working on even a small codebase at 20tok/s, especially on a noticeably weaker model, is just a waste of time. Maybe it’s a PEBKAC but I have no idea how people make any meaningful use out of qwen 3.6.
This is the wrong way of putting it. Local inference with SOTA models is all about slowing down compute for the sake of fitting on bespoke repurposed hardware. You don't need to go fast if you have the whole machine to yourself 24/7. Cloud AI vendors can't match that kind of economics.
As OP says, it shines in constrained environments where the model is transforming user-owned data. Definitely less useful for anything more open-ended.
Maybe it would do better with the new Gemma 4 models, which the Chrome devs have been hinting at moving to. And why the API doesn't let you introspect / pick the model, I'm still not sure.
Yup, that's the plan. No local model, no webpage; more, better and cheaper adtech extortion/surveillance for vendors while everyone else pays for the juice and hardware degradation.
This has been the case for way longer than OpenAI and Anthropic have been around, with services like AWS, Cloudflare, etc.
I consider it to be very careless to entrust your emails, your chats, your calendar, your notes, your calls, your pictures, your contacts, your location history, your waking hours, your files, your TODO list, i.e. stuff including your health data to the for-profit AI companies. The temptation to earn money with your data is just too great, plus the risk of the data being stolen and sold illegally.
Local AI should be the default. For everyone who can't do local AI, we need confidential compute. Yes, it has been hacked before. But it makes it a lot harder.
Still, we all do it with Google. (I don't do it anymore but i did it for mostly two decades so I include myself)
Oh yeah, it feels independent and not lazy, sure.
The problem is that it's much easier to use the SOTA models (especially if they are subsidized) instead of spending time fixing the knobs with the local one.
I just realized this with coding agents: yeah, you probably shouldn't always use the latest version at xhigh, but you will end up doing it because you do the job in less time, with less "effort", and basically at the same price.
I guess we'll see a real effort for local AI only when major vendors will start billing based on actual token usage.
That's not a problem, that's a feature; I have something like 8 tabs open to different free-tier providers. ChatGPT, Claude and Gemini are the SOTA ones.
I have no problem maxing one out, then moving to the next. I can do this all day, have them implement specific functions (or classes) in my code. The thing is, because I actually know how to write and design software, I don't need to run an agent in a loop to produce everything in a day; I can use the web chatbots with copy/paste to literally generate thousands of lines of code per hour while still having a strong mental model of the code that I can go in and change whatever I need to.[1]
---------------------
[1] Just did that this morning on a Python project: because I designed what I needed, each generation was me prompting for a single function. So when I needed to add something this morning I didn't even bother asking a chatbot to do it, I just went ahead directly to the correct place and did it.
You can't do that if you generate the entire thing from specs.
The feature of using all these SOTAs to exhaustion on the free tiers is burning their VC money!
The more I use for free, the more of their money I burn, the closer we'll get to actual 3rd-party and independent setups (local or otherwise).
I have a sneaking suspicion this is kinda like the situation with Linux in the 90s, where it kinda worked but it reeeeeally wasn't ready for the home user, but you had a lot of people who would insist to your face everything was fine, mostly for ideological reasons.
I'm currently running both Sonnet 4.6 and Qwen 3.6-27b on the same codebase (via OpenCode, the parameters were carefully tuned to have a good quality/context size ratio), and on this project, they both struggle with complex non-trivial tasks, and both work flawlessly otherwise. Sonnet 4.6 understands the intent better if my task is ambiguously formulated, but otherwise the gap is pretty small for coding under a harness.
Different usage patterns - you want to issue a single spec then walk away and come back later (when it has consumed $10k worth of API tokens inside your $200/m subscription) to a finished product.
Many people issue a spec for a single function, a single class or similar. When you break it down like that, the advantages of SOTA models shrinks.
What do you mean "trust it"? It sounds like you want to vibe-code (never look at the output), and maybe for that you need SOTA, but like I said in a different comment, I can easily generate 1000s of lines of code per hour just prompting the chatbots.
I don't, because I actually review everything, but I can, and some of those chatbots are actually SOTA anyway.
With subpar models I have to be more careful about providing instructions and check things step by step, because the path they choose is wrong, or isn't what I asked for, or the agent gets stuck in a loop somewhere.
I've begun to suspect that most people are probably running different hardware. Sure, if you run the latest deep flash on your brand new M5 with 128GB, maybe you get acceptable performance?
But honestly, how many people have an extra $9000 laying around these days?
Right now, running with acceptable performance is kind of a luxury. I wish the people who always say - “This is great!” - would realize that not everyone has their hardware.
A smaller, cheaper local model can deliver most of the value for coding, while we still use some services for code review and security compliance.
Once the VC money runs out and they start to charge the real price, the C-level will have to impose budgets or limits. The current pissing contest over who can expend the most tokens is both ridiculous and shortsighted.
The obvious optimization for the case presented would be to generate all the summaries on a server instead of in the client. Then the totally used compute would scale with the number of articles instead of number of users.
The promised mega-data center deals are meant to boost valuations today, not serve tons of customers three years from now.
Seriously. I have never ever seen so many people so willingly drink the marketing kool-aid from companies selling their product before. It's scarier to me than any threats of AI actually disrupting society (because it is so far from being capable of doing that).
Basically small and medium models that are crazy well trained for their sizes.
Then we have a lot of speculative decoding stuff like MTP and others coming to speed up responses, and finally better quantisation to use less memory.
Local LLM is the future, and the larger labs know that the open models will eat their lunch once people realise that the gap is only a few months. If we were good with LLMs a couple months ago, we're good with the open models now.
That's irrelevant to my decision to use local or not.
I didn't read "and how were those models trained" as "Are we there yet?"
Just totally forgetting that the frontier models themselves stole an insane amount to get to where they are.
It's theft all the way across the board, and when someone tries to make the argument that the open models' theft is bad but Altman's or Amodei's theft is good... they are revealing a lot about themselves.
I have to assume current architectures aren't optimal though, the idea that we stumbled into the one and only optimal solution seems almost impossible.
If you project out that hardware just a couple of years, and the trained models out a couple of years, you end up in a place where it makes so much more sense to run them locally, for all sorts of latency, privacy, efficacy, and domain-specific reasons.
Not all that different from the old terminal & mainframe->pc shifts.
Finally - hardware has seemingly gotten out ahead of software that most folks use - watching YouTube, listening to music, playing a game or two. There was a time when playing an mp3 or watching a 4k video really taxed all but the nicest systems. Hardware fixed that problem, like it very well could this one.
Definitely not the high end local LLMs. The small ones, yes, absolutely.
> If you project out that hardware just a couple of years
One of the biggest bottlenecks for LLMs is memory capacity and bandwidth. With the current memory crunch, it's unlikely we'll see lots of advancements in terms of average memory available, or its bandwidth, on regular (not super high-end) devices in the coming years.
Alternatively, it's possible we get dedicated SMLs for e.g. phone specific use cases, that are optimised and run well.
Right now it feels like we have all the pieces but nobody integrating all that into an amazing experience.
Based on what I understand about how the former works, I would assume that the latter has the same properties and failure modes.
Damned if they do, damned if they don't.
You can also…turn it off.
Chrome silently opted people into it _and_ downloaded the model without asking, because they decided that's something they (Chrome) fancied doing.
The difference should be pretty obvious.
This comment is quite dishonest about the nature of the discussion.
Also, why doesn't their task manager show that it's actually the one downloading? Why does it go out of its way to hide this activity?
Since I have conky on my desktop I could catch this immediately, and take the action I preferred with my own computer, which was to _immediately_ disable it.
https://developer.chrome.com/blog/new-in-chrome-148#prompt-a...
https://www.google.com/chrome/ai-innovations/
They have absolutely not been shy about any of this.
Please show me where in either of those documents it explains it's going to download a 4GB model.
It's a totally separate tab that opens. It's got nothing to do with what you use as your homepage.
I'm on gentoo. I have to update chrome manually. I updated it. On update I _never_ get a "what's new" page. I've had this profile for more than a decade so I have no actual idea why, but, I can absolutely tell you, I do *not* get one. After update it started consuming all my bandwidth. This use did not show in its task manager. I have a metered connection. This is a problem for me. I worried it was a compromised plugin. I had to spend 10 minutes in Firefox discovering why chrome was doing this, then go to the configuration and disable it.
This was a disappointing experience. I'm sorry you feel differently; other than stating the obvious, I seriously have no idea what you and the other corporate defense squad members are trying to achieve with this gaslighting nonsense.
Not to mention that the LLM that I choose to run requires a monster machine and is infinitely more capable than whatever google chose to put on their browser?
I mean, none of this affects me because I don't use chrome, obviously, but you don't see the difference? Bewildering.
This is what makes me continuously doubt and rewrite the local-first approach to inline chat in my editor. Next edit/ code complete makes more sense due to latency advantage. But chat is hard.
It's fast and feels good to run locally, but output quality is just not ChatGPT et al.
I may personally be of modest intelligence, but to acquire the intelligence that I do have, I did not need to train on every book ever written, every Wikipedia article ever written, every blog post ever written, every reference manual ever written, every line of code ever written, and so on. In fact, I didn't train on even 1% of those materials, or even 0.00000000001% of those. The texts themselves were demonstrably not a prerequisite for intelligence.
At minimum, given that it only took me about 20 years of casual observation of my surroundings to approximate intelligence, this is proof positive that the only "dataset" you need is a bunch of sensors and the world around you.
And yes, of course, the human brain does not start from zero; it had a few million years of evolution to produce a fertile plot for intelligence to take root. But that fundamental architecture is fairly generic, and does not at all seem predicated on any sort of specific training set. You could feasibly evolve it artificially.
A universal translator with image and voice recognition and a decent breadth of encyclopedic knowledge in only a small fraction of an English Wikipedia dump(6GB/20+GB) is not "huge".
It is probably closer to the theoretical limit than anyone could have expected.
In the future, when regular home computers have the capabilities of modern servers, we'll be able to train the entire LLM at home.
You don't have any guarantees in terms of data, that's true, you rely on the provider. But this is similar to a database or other services where you don't have the knowledge or resources to run them yourself. Hardware cost is an additional factor here.
If on the other hand your idea works out and the model fits the use case, you can always decide to move to a dedicated infrastructure later.
All of this being said, it seems Claude gave up this "constitution" it used to train on? I remember trying to get it to help me code some video editing tools, and it was convinced I was pirating videos and so wouldn't help me anymore in that session.
I tried Cline and couldn't get it working well and part of this was that at the time it expected OpenAIs output format.
https://news.ycombinator.com/item?id=48050751
A specialist handrolls a cut-down framework to power a 1 or 2 bit quantised version of a cut-down sort-of-frontier model.
It can be yours if you have 128GB or 256GB of RAM.
This is why I believe OAI and Anthropic have been so aggressive at offering services outside of their pure models, like Claude Design. This is what will be competitive and keep people subscribed.
- and for the web / javascript / svelte applications?
- suggestions for local OCR for bulk images?
Small models are still in their infancy, and there's still much to sort out about and around them, as well
Is there a solution for this? I'm currently just making users download onnx models if they want a feature, but it's not smooth UX
* What is the answer to local AI for native apps on Windows?
* What is the answer to local AI for Linux?
This is a big opportunity for Linux, given the high quality of open-weight models. I hope some answer emerges before designs fracture and we get a dozen mutually incompatible answers.
run an ai api endpoint on a unix domain socket
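Sketching what the client side of that could look like (the socket path and model name are made up; it assumes some local runtime exposes an OpenAI-style API on a unix socket, so nothing ever touches the network stack):

```python
"""Hypothetical client for a local AI endpoint served on a unix domain socket."""
import httpx

# Connect over the socket; the hostname is only used for the Host header.
transport = httpx.HTTPTransport(uds="/run/local-ai/api.sock")
client = httpx.Client(transport=transport, base_url="http://localai")

resp = client.post("/v1/chat/completions", json={
    "model": "default",
    "messages": [{"role": "user", "content": "Classify: 'please refund my order'"}],
})
print(resp.json()["choices"][0]["message"]["content"])
```

Apps could then be sandboxed with no network access at all and still use the shared local model through the socket.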
I think the Quixotic accelerationists of AI are more or less a vocal minority of the people who make software, and the choice of online APIs over local systems is largely a choice made for users, rather than developer’s laziness.
You can do more and better with private AI today than with local models. There is no getting around that. Even if local AIs get better, being on the cutting edge of LLM performance is often a very worthy investment.
Most people won’t settle for a product if it’s not the very best and incredibly convenient. That’s a high bar, and local AI often doesn’t meet those standards.
HN’s insistence on treating all users like they are open-source, privacy-first, self-hosted Linux fanatics is painfully corny.
... uh?
The goal is that you would assign roles to models based on tasks, capabilities and observed performance. The router would then take care of model selection in the background.
It's tricky though. Probably have another two weeks before I can release the runtime.
I have a preview up at https://role-model.dev/
You can follow me on Twitter if you want updates (see profile)
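Not a claim about how role-model.dev actually works, but a minimal sketch of the role-routing idea (role names, models, and endpoints below are placeholders): callers address a role, and the mapping from role to model can change behind their backs.

```python
"""Toy role-based model router; roles and endpoints are illustrative only."""
import requests

ROLES = {
    "summarize": {"endpoint": "http://127.0.0.1:8080/v1/chat/completions", "model": "small-local"},
    "code":      {"endpoint": "http://127.0.0.1:8080/v1/chat/completions", "model": "coder-local"},
    "plan":      {"endpoint": "https://api.example.com/v1/chat/completions", "model": "frontier"},
}

def run(role: str, prompt: str) -> str:
    cfg = ROLES[role]
    r = requests.post(cfg["endpoint"], json={
        "model": cfg["model"],
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# Callers only ever see roles, never model names:
# run("summarize", open("article.txt").read())
```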
We are at least 5 years away from that. And DRAM needs a substantial breakthrough in cost reduction.
> “But Local Models Aren’t As Smart”
> Correct.
> But also so what?
> Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.
> And for those tasks, local models can be truly excellent.
I have tried quite a bunch of local models, and the reality is that it's not just a matter of "it's a small model that should be hostable easily". It's also a matter of what your acceptable prefill TTFT and decode t/s are.
All the local models I used, on a _consumer grade_ server (32GB DDR5, AMD Ryzen), have been mostly unusable interactively (no decent use as a coding agent possible), and even for things like classification, context size is immediately an issue.
I say that with 6 months' experience running various local models for classifying and summarizing my RSS feeds. Just offline summarizing and tagging the HN articles published on the front page barely keeps the queue sustainable and not growing continuously.
2) It's probably not the time/place to trouble-shoot your "consumer grade server" LLM experience, but if you're running on CPU (you don't mention a GPU) then yeah, your inference speed will be slow.
3) Counterpoint: my consumer-grade Macbook Pro (M1 Max, 64GB) runs Qwen3.6-35B-A3B fast enough to be very usable for regular interactive coding support. (And it would fly with smaller models performing simpler tasks.)
Used to take me maybe 10-20 minutes per sheet.
Then I got codex to whip up a script that sends each sheet to a fairly low-parameter, locally running LLM, and I have the YAML in a couple of seconds.
My dream is to bootstrap myself to local productivity with providers… I know I’ll never get there because hedonic treadmill etc, but I do feel there’s lots more juice to squeeze. I just need to invest more time into AI engineering…
```
harbor pull unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL

# Open WebUI -> llama.cpp + SearXNG for Web RAG + OpenTerminal as sandbox
harbor up searxng webui llamacpp openterminal
```
That's it, it's already better than Claude's or ChatGPT's app.
A useful framing over “local vs cloud AI” can be split along two axes: does the task touch private data, and does it need frontier intelligence? You can use frontier models for developing the software (doesn’t touch data), but open-source models running locally for ops: maintenance, debugging and monitoring (touches data). If you need to fall back to frontier intelligence at some point for a particularly hard to resolve problem, you can still rely on local models for pre-transforming and filtering input in a way that's privacy-preserving or satisfies some constraint before it’s sent off to the cloud for processing. OpenAI's privacy filter is a good example of a model that can be used to mask PII and secrets and that can run locally: https://openai.com/index/introducing-openai-privacy-filter/, before sending any data externally for processing.
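A rough sketch of that split (the regexes and endpoint are illustrative, not how OpenAI's filter works; a local model or a dedicated PII library would do a better job, but the pipeline shape is the same: local transform first, cloud inference second):

```python
"""Hedged sketch: mask obvious PII locally before any text leaves the machine."""
import re
import requests

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IBAN":  re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b"),
}

def mask(text: str) -> str:
    # Replace matches with bracketed labels so the cloud model never sees raw values.
    for label, pat in PATTERNS.items():
        text = pat.sub(f"[{label}]", text)
    return text

def ask_cloud(prompt: str, api_key: str) -> str:
    # Placeholder endpoint; only the masked prompt is sent externally.
    r = requests.post("https://api.example.com/v1/chat/completions",
                      headers={"Authorization": f"Bearer {api_key}"},
                      json={"model": "frontier",
                            "messages": [{"role": "user", "content": mask(prompt)}]},
                      timeout=120)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```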
Another framing for local vs frontier closed which the article mentions is whether the task saturates model capability. With certain tasks like PDF processing or voice or summarization, adding more intelligence isn't necessarily useful. Arguably we've approached that point for chat interfaces already with frontier open-source models. But for coding and ops through well structured tool use inside a coding capable harness, we're still a ways away.
Tangentially, a contrarian take here is that AI can actually enable more privacy preserving software if you’re so inclined. You can just build personalized software and it lowers the barrier to entry and the effort required to self host. SaaS complexity often comes from scaling and supporting features for all types of customers, and if you're building software for personal use, you don't need all that additional complexity. Additionally, foundational and infra software that is harder to vibecode with AI is often already open source.
Well there’s your problem, control needs to go the other way. If you want your app to be AI-enabled, you need to make it easy for AI to control your app. Have you used OpenClaw? It’s awesome!
Isn’t this true of any application that accesses anything not running on your computer? This is just describing what it means to add an API call to your app. Nothing to do with AI (?)
Not saying it’s _wrong_ either – maybe it doesn’t use a backend of its own (the client downloads content directly from some predefined set of sites), maybe there is functionality to adjust how the summaries work that benefit from doing it on device, etc. Just doesn’t convince me that ”local AI should be the norm”.
Don't quite think it's ready yet.
proceeds to brutalise the reader with an 88-point headline font.
Work? I don't want it local at all. I want it all cloud agent.
If you are simply measuring Watt Cost per Token, you are missing the mark drastically. You have to measure quality output per Watt.
It sounds reasonably difficult to benchmark this, maybe I'm wrong though.
If we could even get something like GPT 5.5 running locally that would be quite useful.
It would be nice if model makers could at minimum embrace test harnesses, and stretch goal if they’re going to change underlying formats then at least land compatible readers in the big engines (e.g. llama.cpp and vllm)
who can afford a house?
Welcome back to 2014. Let us now continue yelling at the cloud.
1. Local models are likely to be more power-expensive to run (per-"unit-of-intelligence") than remote models, due to datacenter economies of scale. People do not like to engage with this point, but if you have environmental concerns about AI, this is a pretty important one.
2. Using dumb models for simple tasks seems like a good idea, but it ends up being pretty clear pretty quick that you just want the smartest model you can afford for absolutely every task.
And you can't take comfort in knowing that you, personally, will remain in control of your own computing. The majority will let the range and direction of their thoughts and output be determined by the will of the tech giant whose AI they adopt. And that will shape society.
Streaming Services are getting worse and more expensive. I don't see a single report suggesting piracy is decreasing, it seemingly is only increasing now.
When costs increase and quality decreases, people look for alternatives. The advent of faster broadband enabled Napster and MP3 sharing. I think this could have a resurgence if the pieces align correctly (a new BitTorrent client, a new torrent site, something to break the status quo).
How this relates to AI, I don't know, although I wouldn't be set on the idea that we will never have local AI as the norm. There is a lot more movement in this space than there is for local streaming, imo.
When I say 'moat' I don't mean moat specific to a company vis-a-vis other companies, but 'moat' specific to the set of inference providers vis-a-vis self-hosted local inference.
The moat consists primarily of being able to batch inference requests.
If we pretend people weren't interested in long context lengths, there would be a moat for inference providers, who can batch many requests so that streaming the model weights (whether from system RAM to GPU RAM, or from GPU RAM to GPU cache SRAM) can be amortized over multiple requests.
However people do want longer memory than the native context length.
One approach is continual learning (basically continue training by using the past conversation as extra corpus material; interspersed with training on continuations from the frozen model, so it doesn't drift or catastrophically forget knowledge / politeness / ...).
However this is very expensive for inference providers, since they would have to multiply model weight storage by the number of users. For a single user the memory cost of continual learning is much smaller, since they only need to support themselves, and they get back some of the memory cost through the elimination of KV caches, plus higher-quality answers compared to subquadratic approximations of quadratic attention.
An advantage of continual learning is that the conversation / code base / context is continuously rebaked into model weights, and so doesn't need KV caches! It doesn't need imperfect approximations to quadratic attention, it attends through working knowledge being updated.
Nothing prevents local LLM users from implementing this and benefiting from the dropped requirements of KV caches and enjoying true quadratic attention implicitly over the whole codebase, or many overlapping projects indeed.
The only remaining moat of inference providers vis-a-vis continual learning local LLM's is the batching advantage, plus the gradient update costs for continual learning minus the KV storage and compute costs, minus the performance loss due to inexact approximations to quadratic attention.
This points towards a stronger incentive for local hosting than currently realized (none of the popular local LLM tools currently support continual learning, once this genie is out of the bottle it will be a permanent decrease of the inference provider moat, the cost of which can't be expressed merely in hardware or energy costs, since it is difficult to quantify the financial loss of inexact approximations to quadratic attention, the financial loss due to limited effective context length and the concomitant loss in quality of the result)
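For the curious, a very rough sketch of what the single-user version might look like (the model name, mixing ratio, and schedule are placeholders; the point is the shape: recent conversation text plus replayed continuations from the frozen base, a handful of adapter-only gradient steps, repeat):

```python
"""Rough sketch of single-user continual learning via LoRA adapters."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "some-local-7b"  # placeholder model name
tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
model = get_peft_model(model, LoraConfig(
    task_type="CAUSAL_LM", r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
))
opt = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)

def update(new_turns: list[str], replay: list[str], steps: int = 8) -> None:
    """Mix fresh conversation with frozen-model continuations to limit drift.

    `replay` would be continuations sampled from the frozen base model,
    as the comment above describes; here it's just passed in as text.
    """
    texts = new_turns + replay
    for _ in range(steps):
        for t in texts:
            batch = tok(t, return_tensors="pt", truncation=True, max_length=1024)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
```

Whether the quality/compute trade-off works out in practice is exactly the open question; the sketch only shows why the KV cache disappears from the picture.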
Does it really work?
And local inference requires fairly beefy hardware, that is FAR from ubiquitous across today's userbases. Local models are also still far dumber than what frontier labs can serve.
Weird that this is getting such a tidal wave of upvotes.
NVidia segments the market by limiting the amount of memory on GPUs. It currently tops out at 32GB (on a 5090), but it has excellent memory bandwidth (~1.8TB/s). If you want more than that you need to buy an RTX Pro (e.g. RTX 6000 Pro w/ 96GB for ~$10K) or you get into high-end solutions like H100, H200, etc. that have significantly more memory and even higher bandwidth on HBM memory (e.g. 3.2TB/s+).
NVidia has released the DGX Spark w/ 128GB of memory for ~$4k. The problem is the memory bandwidth. It's only 273GB/s, which is less than the M5 Pro (307GB/s) but more than the M5. You can buy a 16" Macbook Pro with an M5 Max and 128GB of memory for $6k and it has a bandwidth of 614GB/s. So the DGX Spark is a joke, really.
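As a back-of-the-envelope illustration of why that bandwidth gap matters (numbers are rough and ignore compute limits, KV cache traffic, batching, and MoE sparsity; the ~40GB weight size for a 4-bit 70B model is my own assumption):

```python
# Rough ceiling for dense-model decode: every token streams all weights once.
def max_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

q4_70b = 70 * 0.5 + 5  # ~40 GB for a 4-bit-quantized 70B model, give or take
for name, bw in [("DGX Spark", 273), ("M5 Pro", 307), ("M5 Max", 614)]:
    print(f"{name}: ~{max_tok_per_s(bw, q4_70b):.0f} tok/s upper bound")
```

Roughly 7 tok/s on the Spark vs roughly 15 tok/s on the M5 Max for the same dense model, before any other bottleneck kicks in.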
In case it wasn't clear, Apple is interesting in this space because it has a shared memory architecture so the GPU can use all the memory.
Many, myself include, expect there to be no refresh to the 5000 series consumer GPUs this year, which would otherwise happen based on product cycles. So no 5080 Super, for example. And I wouldn't expect a 6090 before 2028 realistically.
One thing Apple hasn't done yet is release the M5 Mac Studios, which are widely expected in Q3 this year. They are interesting because, for example, the M3 Ultra has a memory bandwidth of 819GB/s and previously had a max spec of 512GB but that got discontinued (and the 256GB version also got discontinued more recently).
So many expect an M5 Max Mac Studio with 1TB/s+ bandwidth and specs up to 256GB or 512GB, probably for ~$10k later this year.
You really have to use this hardware almost 24x7 for it to be economical, because otherwise H100 compute hours are probably cheaper.
But what happens to the trillions in AI DC investment when the next generation of GPUs comes out? It's going to halve in value. That's over $1 trillion in capex that will effectively disappear overnight.
I think Apple is the dark horse here because they have no interest in NVidia's pseudo-monopoly. I'm just waiting for them to realize it.
Now CUDA is an issue here still but I think as time goes on it's going to be less of an issue. Memory is still a huge constraint both in terms of price and just general supply because NVidia can justify paying way more for it than you can, probably.
It's still sad to see that 128GB (2x64GB) DDR5 kits are almost $2k now and were $400 a year ago. Expect that to continue until this bubble pops (which IMHO it will) and we're likely in a global recession.
So the other issue is models. OpenAI and Anthropic are built on proprietary models. Their entire valuation depends on this moat. I don't think this will last, so both companies are doomed, because open source models are going to be sufficiently good.
We can already do some reasonably cool stuff on local hardware that isn't that expensive and even more so once you get to $5-10k hardware. That's going to be so much better in 2 years that I'm hesitant to spend any amount of money now.
Plus the code for running these things is getting better. Just in the last month there have been huge speed ups in local LLMs with MTP.
Not at all sure about that. They have really good compute, and DeepSeek V4 (with antirez's 2-bit expert layer quant) may be able to leverage that compute via parallel inference - the jury is still out on that. Now if you had said Strix Halo/Strix Point or perhaps the Intel close equivalents, that would've been a slightly stronger case.
This is what I'm really waiting for. It will enable models comparable to current SOTA at the enthusiast price range.
I have to conclude that people would like to have powerful local AI but it should at the same time only be a tiny model. In which case it wouldn't be powerful.
Local models need to be resident in expensive RAM, the kind that has fat pipes to compute. And if you have a local app, how do you take a dependency on whatever random model is installed? Does it support your tool calling complexity? Does it have multimodal input? Does it support system messages in the middle of the conversation or not? Is it dumb enough to need reminders all the time?
Spend enough time building against local models and you'll see they're jagged in performance. You need to tune context size, trade off system message complexity with progressive disclosure. You simply can't rely on intelligence. A bunch of work goes into the harness.
Meanwhile, third party inference is getting the benefits of scale. You only need to rent a timeslice of memory and compute. It's consistent and everybody gets the same experience. And yes, it needs paying for, but the economics are just better.
Reading the tea leaves here, it will probably be common for OS’s to have built in models that can be accessed via API. Apple already does this.
Why not ship your own model? In the age of Electron apps, 10GB+ apps are not unheard of.
It seems easier to have industry specs that define a common interface for local models.
I also assume the OS can, or would need to, be involved in providing the models. That may not be a good thing depending on your views of OS vendors, but sharing a single local model does seem more like an OS concern.
Local models are absolutely going to be the future for things like simple automation and classification tasks that run occasionally and don't need to rely on internet access.
But for all of the serious stuff where you are doing knowledge work, the models will simply continue to be too big, and too slow to run locally.
The article says:
> Use cloud models only when they’re genuinely necessary.
But at least for me, they're genuinely necessary for 99+% of my LLM usage.
At the end of the day, the constraint here really is efficiency and cost.
Privacy can be ensured with the legal system, the same way that businesses that compete with Google still have no problem storing their data in Google Workspace and Google Cloud. The contractual guarantees of privacy are ironclad, and Google would lose its entire cloud business overnight as its customers fled if it ever violated those contractual agreements (on top of whatever penalties they allow for).
I don't think that many people have built apps against these models.
I mean, I use a heavily quantized version of qwen3 for image classification, caption generation, prompt expansion etc. for image generation, instruction-driven edits, and so on. You can go a long way when you don't need a lot.
A model that can do tool calls - any tool calls at all - can look reasonably cool once you put it in a harness where there's enough immediate context to take action. You can get carried away by anything happening at all. But golly gosh it's a long way short of intelligence available in the bigger models.
And the lighter you make your harness, giving the model more free rein and more autonomy, the bigger the jump in capability, combined with a big jump in failure modes when the model is dumb.