Interesting, but what exactly did it do, and what does it mean? Did it simply convert a 397B model into a 20B model, or is this still a 397B model that now only uses around 6GB while running?
simonw 3 hours ago [-]
Yeah the details on this look pretty thin. Best I could see was this snippet from the screenshot:
> Key technique: selective expert streaming via direct I/O. Only ~10 of 512 experts per layer are loaded from SSD per token (~1.8GB I/O per token at 1.4 GB/s effective bandwidth). Non-expert weights (~5GB) are pinned in DRAM. LRU expert cache provides 44%+ hit rate.
It's apparently using ideas from: https://arxiv.org/abs/2312.11514

> This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.
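The "LRU expert cache" in that snippet is a standard pattern. A minimal sketch of what it might look like, assuming hypothetical names (`ExpertCache`, `read_from_ssd`) rather than the project's actual API:

```python
from collections import OrderedDict

class ExpertCache:
    """Tiny LRU cache for MoE expert weights (illustrative sketch only;
    the class and method names here are hypothetical)."""

    def __init__(self, capacity, read_from_ssd):
        self.capacity = capacity
        self.read_from_ssd = read_from_ssd  # cold-path loader, e.g. direct I/O
        self.cache = OrderedDict()          # expert_id -> weights, LRU order
        self.hits = self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)   # mark as most recently used
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[expert_id] = self.read_from_ssd(expert_id)
        return self.cache[expert_id]
```

The point of the LRU policy is that routers tend to reuse a subset of experts across nearby tokens, which is where the quoted 44% hit rate comes from.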
quietbuilder 3 hours ago [-]
A 44% cache hit rate is low: over half the expert loads are cold reads off SSD. At ~1.8GB of I/O per token and 1.4 GB/s effective bandwidth, the raw SSD time alone is over a second per token, so hitting the reported 4.74 tok/s must lean heavily on the cache, and throughput will drop with longer context or heavier reasoning as the working set of experts grows.
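Back-of-envelope on how sensitive this is to the hit rate, assuming the ~1.8GB/token figure is the expert traffic before caching and that only misses touch the SSD (both assumptions, not stated in the post):

```python
# I/O-bound decode ceiling for SSD expert streaming, ignoring compute.
def io_bound_tok_per_s(io_per_token_gb, bandwidth_gb_s, hit_rate):
    """Tokens/s upper bound if every cache miss must be read from SSD.

    Assumes io_per_token_gb is expert traffic *before* caching,
    so only the miss fraction (1 - hit_rate) actually hits the disk.
    """
    cold_gb = io_per_token_gb * (1.0 - hit_rate)
    return bandwidth_gb_s / cold_gb

for hr in (0.0, 0.44, 0.80, 0.95):
    print(f"hit rate {hr:.0%}: <= {io_bound_tok_per_s(1.8, 1.4, hr):.2f} tok/s")
```

At a 44% hit rate this ceiling is only ~1.4 tok/s, and it roughly triples at 80%, which is why the hit rate, not the SSD, is the number to watch.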
Running 397B on consumer hardware is genuinely impressive for a proof of concept. A year ago this wasn't a thing. But I keep wondering whether a well-quantized 70B that fits entirely in RAM would just be faster in practice. No I/O bottleneck, consistent throughput, smaller model but actually usable.
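Quick arithmetic on that 70B comparison, assuming roughly 4-bit quantization and ignoring KV cache and runtime overhead:

```python
# Approximate weight footprint of a quantized model.
def quantized_size_gb(params_billion, bits_per_weight):
    """GB of weights only; KV cache and activations are extra."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(quantized_size_gb(70, 4))    # ~35 GB: fits in 48-64GB of RAM
print(quantized_size_gb(397, 4))   # ~198 GB: why the 397B needs the SSD
```

A 4-bit 70B is ~35GB of weights, which fits comfortably in a 64GB box with no streaming at all, versus ~198GB for the 397B model even at 4 bits.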
0x457 3 hours ago [-]
Interesting. Reminds me of how PLE caching works in Gemma 3N.