Apple: Embarrassingly Simple Self-Distillation Improves Code Generation (arxiv.org)
bensyverson 51 minutes ago [-]
Really fascinating how this works; it's basically context-aware decoding. From the paper:

> Code interleaves fork positions, where several continuations are genuinely plausible and may correspond to different solution approaches, with lock positions, where syntax and semantics leave little ambiguity but a low-probability distractor tail still remains… The best global decoding setting is therefore necessarily a compromise; we call this tension the precision-exploration conflict.

In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).

What this paper shows is that their simple technique (SSD) can improve the ranking of optimal tokens in both lock and fork positions, meaning the model is more likely to explore when it should be exploring, and more likely to be precise when it needs to be.

I love that we're still learning the emergent properties of LLMs!
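
A minimal sketch of the precision-exploration conflict quoted above, assuming made-up logits for one "lock" position and one "fork" position (the numbers and the helper `lock_tail_mass` are hypothetical, not from the paper). It only illustrates why one global temperature is a compromise: raising it makes the fork more exploratory but also leaks probability into the lock's distractor tail.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more exploration."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical toy logits (assumptions for illustration only).
lock_logits = [6.0, 1.0, 1.0, 1.0, 1.0]  # one correct token plus a distractor tail
fork_logits = [3.0, 2.8, 2.6, 0.0, 0.0]  # three plausible continuations, two distractors

def lock_tail_mass(temperature):
    """Probability of sampling anything but the correct token at the lock."""
    return 1.0 - softmax(lock_logits, temperature)[0]

for t in (0.5, 1.0, 1.5):
    fork_entropy = entropy(softmax(fork_logits, t))
    print(f"T={t}: lock tail mass={lock_tail_mass(t):.4f}, fork entropy={fork_entropy:.3f}")
```

Both numbers rise together as the temperature goes up, so no single setting is best for both kinds of position.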

stingraycharles 38 minutes ago [-]
Seems like this is true not just for code but for all generated content? It's more well-defined for code, but the fork/lock mechanism works for a lot more problem domains.
bensyverson 32 minutes ago [-]
That would seem intuitively true; it certainly applies to written language, where a clause could go off in another direction, but at other positions the correct grammar/syntax is unambiguous.
bryanrasmussen 31 minutes ago [-]
Thinking about it: if we treat a lock as something that happens in a narrative, then there can be points where "everything you know is wrong", which essentially lets you go back into a sort of fork mode and work towards another lock.

Completely artistic creation, creating something that does not exist and that cannot produce things out of itself, means that locking can be more diffuse, not as settled.

stingraycharles 28 minutes ago [-]
This seems similar to what Anthropic has been doing for the last few Opus releases, which is interleaved thinking: CoT reasoning in the middle of a message. But they operate at different layers.
wg0 42 minutes ago [-]
After TurboQuant and Gemma 4, I came across the following video[0] of Gemma running on a local machine at 50 tokens/second.

That already looks like Sonnet 3x and 4 level capabilities to me: the model in question (Gemma 4) sets up a whole Python project with a UI and installs Python libraries using uv, etc.

Add this Simple Self-Distillation to the picture, and by 2028 I see cheaper coding model providers with much more generous usage limits, while power users would mostly be running their own models anyway.

Anyone using these models as "non-deterministic transpilers" from natural language to code (experienced engineers who can write code themselves) would probably not be paying any AI providers.

[0] https://www.youtube.com/watch?v=-_hC-C_Drcw

spiderfarmer 4 minutes ago [-]
I always wonder how much smaller and faster models could be if they were only trained on the latest versions of the languages I use, so for me that is PHP, SQL, HTML, JS, CSS, Dutch, English, plus tool use for my OS of choice (MacOS).

Right now it feels like hammering a house onto a nail instead of the other way around.

khalic 1 hours ago [-]
Incredible, will translate to better coding models in the near future.

We really need to develop better tools to understand what's happening inside these NNs. Working with high-D spaces is not something we're good at, and we're basically throwing stuff at it and seeing if it sticks.

0x3f 1 hours ago [-]
Haven't read the paper yet, but it is interesting how seemingly simple many breakthroughs in ML are. Even transformers are like that. Maybe it's hindsight bias.

I suppose we just don't have a deeper underlying theory to lean on and help us 'design' anything.

christophilus 29 minutes ago [-]
A lot of discoveries are like that. In fact, simplicity is often the hallmark of correctness, and complexity is often a sign that our understanding is incomplete and we’re still stumbling towards the right model. Not always, but often. It’s been a good rule of thumb in my programming career.
heeton 17 minutes ago [-]
[dead]
vishnugupta 5 minutes ago [-]
Can someone please ELI5 this for a fellow web developer? I read the abstract but couldn't understand much.
l5870uoo9y 34 minutes ago [-]
> Our method, simple self-distillation (SSD), is embarrassingly simple: sample solutions from the base model with specified temperature and truncation, then fine-tune on those raw, unverified samples via standard cross-entropy loss.

So you prompt the base model for an answer and then rerun the prompt with the answer from the first run?

ACCount37 25 minutes ago [-]
No. There's no "answer" really.

They use self-distillation to shift the output distribution of the model towards that of the same model, but running with different temperature/truncation settings in sampling.

This effectively "folds" the logit tail truncation behavior into the model itself.

Not entirely unlike a few "model controlled sampling settings" things I've seen in what it does, but different in execution.
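
A toy sketch of that "folding" effect, under heavy assumptions: the "model" here is a single categorical distribution over five tokens (hypothetical logits), top-k stands in for whatever truncation the paper uses, and plain gradient descent on the logits stands in for fine-tuning. The paper fine-tunes a full LLM on sampled solutions; this only shows the direction of the shift.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def truncated_sampling_dist(logits, temperature, top_k):
    """Distribution actually sampled from: temperature, then top-k, then renormalize."""
    probs = softmax(logits, temperature)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def self_distill(logits, temperature, top_k, n_samples=2000, steps=200, lr=0.5, seed=0):
    """Sample from the truncated distribution, then fit the raw logits to the samples."""
    rng = random.Random(seed)
    target = truncated_sampling_dist(logits, temperature, top_k)
    samples = rng.choices(range(len(logits)), weights=target, k=n_samples)
    freqs = [samples.count(i) / n_samples for i in range(len(logits))]
    new_logits = list(logits)
    for _ in range(steps):
        probs = softmax(new_logits)
        # Gradient of the mean cross-entropy w.r.t. the logits is (probs - freqs).
        new_logits = [l - lr * (p - f) for l, p, f in zip(new_logits, probs, freqs)]
    return new_logits

# Hypothetical logits: two plausible tokens (a small "fork") plus a three-token tail.
toy_logits = [3.0, 2.8, 1.0, 0.5, 0.0]
before = softmax(toy_logits)
after = softmax(self_distill(toy_logits, temperature=0.8, top_k=2))
print("tail mass before:", round(sum(before[2:]), 4))
print("tail mass after: ", round(sum(after[2:]), 4))
```

After the update, the model sampled at plain temperature 1 with no truncation puts less mass on the tail, while both plausible tokens keep substantial probability.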

roger_ 55 minutes ago [-]
Skimmed this but don't have an intuitive understanding of why this works and how temperature and truncation factor in.
drooby 28 minutes ago [-]
Fascinating...

This feels eerily similar to sleep consolidation or synaptic pruning

smallerize 40 minutes ago [-]
I don't suppose they published the improved models?
dist-epoch 1 hours ago [-]
[flagged]
avaer 48 minutes ago [-]
I definitely pay more attention to papers affiliated with Chinese companies; the economics seem to be more conducive to doing good academic work and publishing it. I would say the same for companies like Apple (where TFA came from).

But to filter based on authors' names sounds pretty darn racist.

ptidhomme 53 minutes ago [-]
I used to have the opposite rule in my signal processing field: the more Chinese names, the less innovation there was.

They seemed like they had to be churning out papers and any little adaptation to existing research triggered a new publication.

But it may have changed now.

0x3f 1 hours ago [-]
That's... almost every AI paper.
amelius 55 minutes ago [-]
So

"Made in China, designed by Apple in California"

should be:

"Made in China, designed by Chinese people in California"?

ape4 55 minutes ago [-]
Shouldn't a scientific paper be using metric units (like 30T) rather than 30B?
jofzar 1 hours ago [-]
> simple self-distillation (SSD):

Sorry Apple, SSD is already taken, you can't use that acronym.

love2read 1 hours ago [-]
You're right, I offer these alternatives:

Consistency Preservation Update (CPU)

Guided Probability Update (GPU)

History-aware Distillation Driving (HDD)

Probability Smoothing Update (PSU)

drittich 24 minutes ago [-]
I used to invent TLAs on the spot for fun, and when someone asked what one was, I would respond, "It's a PUA", eventually revealing that meant "previously unknown acronym". It was even more annoying than it sounds.
ape4 1 hours ago [-]
ATT=All TLAs are Taken
politelemon 58 minutes ago [-]
It's cringeworthy to see that the original paper's title itself is editorialised.

Title should be: Simple Self-Distillation Improves Code Generation

StevenWaterman 50 minutes ago [-]
"Embarrassingly" has a history as a technically meaningful word roughly equivalent to "maximally", see "Embarrassingly parallel"

https://en.wikipedia.org/wiki/Embarrassingly_parallel

Aurornis 46 minutes ago [-]
The phrase embarrassingly parallel has a history in computer science.

Many computer science paper titles allude to past titles in other CS papers.

Calling it “cringe worthy” is unnecessarily mean. There is context and history you don’t understand.

gottheUIblues 28 minutes ago [-]
"Embarrassingly" considered harmful?
cbm-vic-20 1 minutes ago [-]
"Embarrassingly" considered harmful is all you need.