iPhone 17 Pro Demonstrated Running a 400B LLM (twitter.com)
causal 37 minutes ago [-]
Run an incredible 400B parameters on a handheld device.

0.6 t/s, wait 30 seconds to see what these billions of calculations get us:

"That is a profound observation, and you are absolutely right ..."

intrasight 10 minutes ago [-]
Better than waiting 7.5 million years to have it tell you the answer is 42.
Aurornis 28 seconds ago [-]
I thought you were being sarcastic until I watched the video and saw those words slowly appear
WarmWash 15 minutes ago [-]
I don't think we are ever going to win this. The general population loves being glazed way too much.
baal80spam 10 minutes ago [-]
> The general population loves being glazed way too much.

This is 100% correct!

cj00 58 minutes ago [-]
It’s 400B but it’s mixture of experts so how many are active at any time?
simonw 57 minutes ago [-]
Looks like it's Qwen3.5-397B-A17B so 17B active. https://github.com/Anemll/flash-moe/tree/iOS-App
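A quick back-of-envelope on those numbers. The 4-bit quantization level is an assumption for illustration; the thread doesn't state what precision the demo uses:

```python
# Rough memory math for a 397B-total / 17B-active MoE model.
# The 4-bit weight size is an assumption, not from the thread.
def gib(params: float, bits: int) -> float:
    """Size in GiB of `params` parameters stored at `bits` bits each."""
    return params * bits / 8 / 2**30

total_params = 397e9    # full weight set, streamed from SSD
active_params = 17e9    # weights actually read per token

print(f"full model at 4-bit:  {gib(total_params, 4):.0f} GiB on SSD")
print(f"active set at 4-bit:  {gib(active_params, 4):.1f} GiB per token")
```

At 4-bit the full model is roughly 185 GiB (hence SSD streaming), while the ~8 GiB active set is what has to fit in, or move through, the phone's 12 GB of RAM.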
firstbabylonian 1 hour ago [-]
> SSD streaming to GPU

Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?

1: https://arxiv.org/abs/2312.11514

simonw 56 minutes ago [-]
Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/
zozbot234 34 minutes ago [-]
A similar approach was recently featured here: https://news.ycombinator.com/item?id=47476422. The iPhone Pro has very limited RAM (12GB total), though, which you still need for the active part of the model. (Unless you want to use Intel Optane wearout-resistant storage, but that was power hungry and thus unsuitable for a mobile device.)
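The "stream from SSD, cache in limited RAM" idea can be sketched as a small expert cache. This is a toy illustration only, with made-up names; it is not the flash-moe implementation:

```python
# Toy sketch of SSD-streamed MoE experts with a small RAM cache (LRU).
# Illustrative only; names and the cache policy are assumptions.
from collections import OrderedDict

class ExpertCache:
    """Keep recently used experts in RAM; fetch the rest from disk."""
    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity      # how many experts fit in RAM
        self.load_fn = load_fn        # reads one expert's weights from SSD
        self.cache = OrderedDict()
        self.disk_reads = 0

    def get(self, expert_id):
        if expert_id in self.cache:           # RAM hit: no SSD traffic
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        self.disk_reads += 1                  # RAM miss: stream from SSD
        weights = self.load_fn(expert_id)
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:   # evict least recently used
            self.cache.popitem(last=False)
        return weights

cache = ExpertCache(capacity=4, load_fn=lambda eid: f"weights[{eid}]")
for eid in [0, 1, 0, 2, 3, 0, 5, 1]:          # router's expert picks
    cache.get(eid)
print(cache.disk_reads)  # 6 of 8 requests hit the SSD here
```

The point of the wearout remark above: every cache miss is an SSD read, and decoding hammers the same flash continuously, so read endurance and bandwidth both matter.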
Aurornis 48 seconds ago [-]
> Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.

This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.

simonw 24 minutes ago [-]
Yeah, this new post is a continuation of that work.
_air 19 minutes ago [-]
This is awesome! How far away are we from a model of this capability level running at 100 t/s? It's unclear to me if we'll see it from miniaturization first or from hardware gains
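If decode speed is limited by streaming the active weights from flash, the arithmetic is simple: tokens/s is roughly SSD read bandwidth divided by active bytes per token. The 4-bit weights, 17B active parameters, and the bandwidth figures below are all assumptions for illustration:

```python
# Assume decode is fully bound by reading the active experts from flash.
# tokens/s ~= SSD read bandwidth / bytes of active weights per token.
active_bytes = 17e9 * 4 / 8            # 17B active params at 4-bit ~ 8.5 GB

for bw_gbs in (5, 50, 850):            # hypothetical sustained read GB/s
    print(f"{bw_gbs:4d} GB/s -> {bw_gbs * 1e9 / active_bytes:.1f} t/s")
```

Under those assumptions, ~5 GB/s of sustained reads lands right around the demo's 0.6 t/s, and 100 t/s would need on the order of 850 GB/s, so either bandwidth, caching of repeated experts, or smaller active sets has to close the gap.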
Tade0 23 seconds ago [-]
The only way to have hardware reach this sort of efficiency is to embed the model in hardware.

This exists[0], but the chip in question is physically large and won't fit on a phone.

[0] https://www.anuragk.com/blog/posts/Taalas.html

ashwinnair99 1 hour ago [-]
A year ago this would have been considered impossible. The hardware is moving faster than anyone's software assumptions.
cogman10 1 hour ago [-]
This isn't a hardware feat, this is a software triumph.

They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).

pdpi 55 minutes ago [-]
It's both.

We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.

smallerize 36 minutes ago [-]
The iPhone 17 Pro launched 8 months ago with 50% more RAM and about double the inference performance of the previous iPhone Pro (also 10x prompt processing speed).
mannyv 21 minutes ago [-]
The software has real software engineers working on it instead of researchers.

Remember when people were arguing about whether to use mmap? What a ridiculous argument.

At some point someone will figure out how to tile the weights and the memory requirements will drop again.
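The mmap idea mentioned above can be sketched in a few lines: map the weight file read-only and let the OS page cache decide what stays resident, so a tile is only read from disk when it is touched. The file layout here is made up for illustration:

```python
# Sketch of lazily paging weight "tiles" from disk via mmap.
# Toy layout: 8 tiles of 1024 float32 values each; not a real format.
import mmap, os, struct, tempfile

path = os.path.join(tempfile.gettempdir(), "toy_weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<8192f", *range(8 * 1024)))

# Map read-only: pages are read from disk only when actually touched,
# and the OS page cache decides which tiles stay resident in RAM.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

TILE = 1024 * 4                            # one tile = 1024 float32 values
tile3 = struct.unpack("<1024f", mm[3 * TILE:4 * TILE])
print(tile3[0], tile3[-1])                 # 3072.0 4095.0
```

Tiling the weights this way means the working set is whatever tiles the current token actually needs, not the whole file.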

snovv_crash 10 minutes ago [-]
The real improvement will be when the software engineers get into the training loop. Then we can have MoE that use cache-friendly expert utilisation and maybe even learned prefetching for what the next experts will be.
rwaksmunski 47 minutes ago [-]
Apple might just win the AI race without even running in it. It's all about the distribution.
dzikimarian 15 minutes ago [-]
Because someone managed to run LLM on an iPhone at unusable speed Apple won AI race? Yeah, sure.
naikrovek 10 minutes ago [-]
whoa, save some disbelief for later, don't show it all at once.
raw_anon_1111 39 minutes ago [-]
Apple is already one of the winners of the AI race. It’s making much more profit (ie it ain’t losing money) on AI off of ChatGPT, Claude, Grok (you would be surprised at how many incels pay to make AI generated porn videos) subscriptions through the App Store.

It’s only paying Google $1 billion a year for access to Gemini for Siri

detourdog 33 minutes ago [-]
Apple’s entire yearly capex is a fraction of the AI spend of the presumed AI winners.
devmor 15 minutes ago [-]
Which is mostly insane amounts of debt leveraged entirely on the moonshot that they will find a way to turn a profit on it within the next couple years.

Apple’s bet is intelligent; the “presumed winners” are betting our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.

qingcharles 15 minutes ago [-]
Plus all those pricey 512GB Mac Studios they are selling to YouTubers.
lostmsu 1 hour ago [-]
This has nothing to do with Apple, and everything to do with MoE and that everyone forgot you can re-read the necessary bits of the model from disk for each token.

This is extremely inefficient though. For efficiency you need to batch many requests (like 32+, probably more like 128+), and when you do that with MoE you lose the advantage of only having to read a subset of the model during a single forward pass, so the trick does not work.

But this did remind me that with dense models you might be able to use disk to achieve high throughput at high latency on GPUs that don't have a lot of VRAM.
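The batching point above can be made quantitative. With E experts per layer and top-k routing, a batch of B tokens touches about E·(1 − (1 − k/E)^B) distinct experts on average, assuming independent uniform routing, which is an idealization. The expert counts below are hypothetical, not from the model card:

```python
# How batching erodes MoE sparsity: expected distinct experts touched
# per layer by a batch of B tokens, assuming independent uniform top-k
# routing (an idealization; real routers are far from uniform).
def expected_experts(E: int, k: int, B: int) -> float:
    return E * (1 - (1 - k / E) ** B)

E, k = 128, 8   # hypothetical expert count and top-k for illustration
for B in (1, 8, 32, 128):
    frac = expected_experts(E, k, B) / E
    print(f"batch {B:3d}: ~{frac:.0%} of experts must be read")
```

Under these assumptions a single token reads ~6% of the experts, but a batch of 128 reads essentially all of them, which is why the stream-from-disk trick only pays off at batch size 1.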

simopa 1 hour ago [-]
It's crazy to see a 400B model running on an iPhone. But moving forward, as the information density and architectural efficiency of smaller models continue to increase, getting high-quality, real-time inference on mobile is going to become trivial.