One thing very much worth noting that the article does not:
The reason "temperature" is called that is that softmax is mathematically identical to the Boltzmann distribution [1] from thermodynamics, which describes the energy states of an ensemble of particles in equilibrium. In terminology more familiar to ML folks: the particles' energies will be distributed as the softmax of their negative energies divided by the temperature (in kelvin), with units scaled by the Boltzmann constant (k_B).
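A quick sketch of the identity, with made-up toy energy levels and energies expressed in units of k_B so the temperature is a plain number:

```python
import math

def softmax(xs):
    # numerically stable softmax: subtract the max before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical energy levels (units of k_B, so T below is dimensionless)
energies = [1.0, 2.0, 4.0]
T = 2.0

# Boltzmann occupation probabilities: p_i proportional to exp(-E_i / T)
boltzmann = softmax([-e / T for e in energies])

# The same computation phrased as an LLM sampling step:
# logits are the negative energies, divided by the temperature
logits = [-e for e in energies]
llm_probs = softmax([l / T for l in logits])

# The two distributions are identical term by term
assert all(abs(a - b) < 1e-12 for a, b in zip(boltzmann, llm_probs))
```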
Setting an LLM's temperature to zero is mathematically the same thing as cooling an ensemble of particles to absolute zero: in physics, the particles are all forced into their lowest energy state; in LLMs, the model is forced to deterministically pick the single most likely token.
Corollary: the reason a heating element glows red when it is hot is that the expectation value (mean) of this softmax distribution grows with temperature, and when the energy gets high enough you get photons in the visible spectrum, with the mean landing at the red wavelengths first. Incandescent bulbs look white because the temperature is even higher: the distribution's mean moves higher and the distribution flattens out, covering the whole visible spectrum roughly uniformly. Likewise, if you set an LLM's temperature to an absurdly high number, it produces a very wide spectrum of mostly nonsense tokens (i.e. "white light").
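You can see both limits directly by dividing the logits by a temperature before the softmax (toy logits, made-up numbers):

```python
import math

def softmax_with_temperature(logits, T):
    # scale logits by 1/T, then apply a numerically stable softmax
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]

# T -> 0: essentially one-hot on the argmax (greedy / "absolute zero")
cold = softmax_with_temperature(logits, 0.01)

# T very large: essentially uniform over all tokens ("white light")
hot = softmax_with_temperature(logits, 1000.0)
```

With these numbers, `cold` puts nearly all probability on the first token, while `hot` is within a fraction of a percent of uniform.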
[1] https://en.wikipedia.org/wiki/Boltzmann_distribution
On a tangential note, I keep noticing phrases like "why x matters" and "it's crucial here" that remind me of Claude. Recently Claude has been gaslighting me on complex problems with exactly such statements, and seeing them in an article is low-key infuriating at this point. I can't trust Claude anymore on the most complex problems: it sometimes gets the answer right but completely misses the point, introducing huge complex blocks of code and logic garnished with precisely "why it matters" and "this is crucial here".