Honest feedback - I was really excited when I read the opening. However, I did not come away from this with a greater understanding than I already had.
For reference, my initial understanding was somewhat low: basically I know a) what an embedding is, b) that transformers work by matrix multiplication, and c) that it's something like a multi-threaded Markov chain generator with the benefit of pre-trained embeddings
I had a similar feeling. I think a little magic was lost by the author trying to be as concise as possible, which is no real fault of their own, as this topic can go down the rabbit hole very quickly.
Instead, I believe this might work better as a guided exercise that a person can work through over a few hours, rather than being spoon-fed it over the article's 10-minute reading time. Or the steps could be broken up into "interactive" sections that more clearly demarcate the stages.
Regardless, I'm very supportive of people making efforts to simplify this topic; each attempt gives me something that I had either forgotten or neglected.
onename 21 hours ago [-]
Have you checked out this video from 3Blue1Brown that talks a bit about transformers? https://youtu.be/wjZofJX0v4M
I've seen it but I don't believe I've watched it all the way through. I will now
imtringued 7 hours ago [-]
I personally would rather recommend that people just look at these architectural diagrams [0] and try to understand them. There is the caveat that they do not show how attention works. For that you need to understand softmax(QK^T)V, and that multi-head attention is a repetition of this multiple times. GQA, MHA, etc. just mess around with reusing Q or K or V in clever ways.
[0] https://huggingface.co/blog/vtabbott/mixtral
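For anyone who wants to see that formula in code, here is a rough NumPy sketch (my own illustration, not tied to the linked diagrams) of single-head scaled dot-product attention, with multi-head attention as the same operation repeated under different projections and concatenated; the sizes and random weights are made up for demonstration:

  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def attention(Q, K, V):
      """softmax(Q K^T / sqrt(d)) V for a single head; Q, K, V are (seq_len, d)."""
      d = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d)              # pairwise similarities
      return softmax(scores, axis=-1) @ V        # weighted sum of value vectors

  # Multi-head attention: repeat the same operation with different learned
  # projections (random here, for illustration) and concatenate the heads.
  rng = np.random.default_rng(0)
  seq_len, d_model, n_heads = 5, 16, 4
  d_head = d_model // n_heads
  x = rng.normal(size=(seq_len, d_model))        # token embeddings

  heads = []
  for _ in range(n_heads):
      Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
      heads.append(attention(x @ Wq, x @ Wk, x @ Wv))
  out = np.concatenate(heads, axis=-1)           # back to (seq_len, d_model)
  print(out.shape)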
It might be meant for the folks who are not well versed in transformers.
hunter2_ 13 hours ago [-]
Similarly, I was really excited when I read the headline here on HN and thought this would be about the electrical device. I wonder if the LLM meaning has eclipsed the electrical meaning at this point, as a default in the absence of other qualifiers, in communities like this.
zxexz 12 hours ago [-]
It does seem to. I’ve been working on some personal projects where I’ve needed to look up and research transformers quite a bit (the kind that often has a ferrite core), and it has been frustrating. Frustrating not just when trying to search for the wire datasheets, etc., but also because I often have to use the other kind of transformer, via a service, to find what I’m looking for, because search is so enshittified by the newer definition.
nikki93 16 hours ago [-]
Pasting a comment I posted elsewhere:
Resources I’ve liked:
Sebastian Raschka's book on building them from scratch
Deep Learning: A Visual Approach
These videos / playlists:
https://youtube.com/playlist?list=PLoROMvodv4rOY23Y0BoGoBGgQ...
https://youtube.com/playlist?list=PLoROMvodv4rOwvldxftJTmoR3...
https://youtube.com/playlist?list=PL7m7hLIqA0hoIUPhC26ASCVs_...
https://www.youtube.com/live/uIsej_SIIQU?si=RHBetDNa7JXKjziD
here’s a basic impl that i trained on tinystories to decent effect: https://gist.github.com/nikki93/f7eae83095f30374d7a3006fd5af... (i used claude code a lot to help with the above bc a new field for me) (i did this with C and mlx before but ultimately gave into the python lol)
but overall it boils down to (see the code sketch after these steps):
- tokenize the text
- embed tokens (map each to a vector) with a simple NN
- apply positional info so each token also encodes where it is
- do the attention. this bit is key and also very interesting to me. there are three neural networks – Q, K, V – that are applied to each token. you then generate a new sequence of embeddings where each position gets the Vs of all tokens added up, weighted by a softmax over that position’s Q dot’d with each other position’s K. the new embeddings are /added/ to the previous layer (adding like this is called ‘residual’)
- also do another NN pass without attention, again adding the output (residual)
there are actually multiple ‘heads’, each with a different Q, K, V – their outputs are combined (concatenated, in the standard setup) before that second NN pass
there’s some normalization at each stage to keep the numbers reasonable and stop them from blowing up
you repeat the attention + forward blocks many times, then the embedding at the last position of the final layer’s output is what you sample the next token from
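To make those steps concrete, here is a rough NumPy sketch of a single forward pass with random, untrained weights: embed plus positions, a few attention-plus-MLP blocks with residual adds and layer norm, then next-token probabilities from the last position. The character-level "tokenizer", the pre-norm placement of the normalization, and all the sizes are my own assumptions for a minimal illustration, not the gist's actual code:

  import numpy as np

  rng = np.random.default_rng(0)
  p = lambda *shape: rng.normal(scale=0.02, size=shape)   # random stand-in "weights"

  # toy character-level tokenizer (stand-in for a real BPE tokenizer)
  text = "hello world"
  vocab = sorted(set(text))
  ids = np.array([vocab.index(c) for c in text])

  V, T, D, H, L = len(vocab), len(ids), 32, 4, 2   # vocab, seq len, width, heads, layers
  Dh = D // H

  def softmax(x):
      x = x - x.max(axis=-1, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=-1, keepdims=True)

  def layer_norm(x, eps=1e-5):
      return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

  def causal_attention(x, Wq, Wk, Wv):
      q, k, v = x @ Wq, x @ Wk, x @ Wv
      scores = q @ k.T / np.sqrt(q.shape[-1])
      mask = np.triu(np.full((len(x), len(x)), -np.inf), k=1)  # block attention to future tokens
      return softmax(scores + mask) @ v

  embed = p(V, D)        # token embedding table
  pos = p(T, D)          # learned positional embeddings (one of several options)
  unembed = p(D, V)      # maps the final embedding back to vocabulary logits

  x = embed[ids] + pos   # embed tokens, add positional info
  for _ in range(L):     # repeat attention + feed-forward blocks
      h = layer_norm(x)
      heads = [causal_attention(h, p(D, Dh), p(D, Dh), p(D, Dh)) for _ in range(H)]
      x = x + np.concatenate(heads, axis=-1) @ p(D, D)         # residual add
      h = layer_norm(x)
      x = x + np.maximum(h @ p(D, 4 * D), 0) @ p(4 * D, D)     # MLP, another residual add
  probs = softmax(layer_norm(x)[-1] @ unembed)  # distribution over the next token
  print(vocab[int(probs.argmax())])             # greedy pick; real sampling would draw from probs

With trained weights in place of the p(...) calls, this is essentially the whole inference path the list above describes.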
i was surprised by how quickly this just starts to generate coherent grammar etc. having the training loop also do a generation step to show example output at each stage of training was helpful to see how the output qualitatively improves over time, and it’s kind of cool to “watch” it learn.
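That "watch it learn" trick is easy to replicate: just print a short generated sample every few hundred steps. Here is a minimal PyTorch sketch of the pattern (not the author's gist); the bigram stand-in model, the toy corpus, and the hyperparameters are invented for brevity, and you would swap in a real transformer:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  text = "once upon a time there was a tiny model . " * 200     # toy corpus
  chars = sorted(set(text))
  stoi = {c: i for i, c in enumerate(chars)}
  itos = {i: c for c, i in stoi.items()}
  data = torch.tensor([stoi[c] for c in text])

  class BigramLM(nn.Module):                 # stand-in for a real transformer
      def __init__(self, vocab_size):
          super().__init__()
          self.table = nn.Embedding(vocab_size, vocab_size)  # logits for the next token
      def forward(self, idx):
          return self.table(idx)

  model = BigramLM(len(chars))
  opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

  def sample(n=60):                          # draw a short continuation to eyeball progress
      idx = torch.zeros(1, dtype=torch.long)
      out = []
      with torch.no_grad():
          for _ in range(n):
              probs = F.softmax(model(idx[-1:]), dim=-1)
              idx = torch.multinomial(probs, 1).view(-1)
              out.append(itos[idx.item()])
      return "".join(out)

  for step in range(501):
      i = torch.randint(0, len(data) - 1, (64,))        # random positions in the corpus
      loss = F.cross_entropy(model(data[i]), data[i + 1])
      opt.zero_grad(); loss.backward(); opt.step()
      if step % 100 == 0:                               # the "generation step" during training
          print(f"step {step}  loss {loss.item():.3f}  sample: {sample()!r}")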
this doesn’t cover MoE, sparse vs dense attention and also the whole thing about RL on top of these (whether for human feedback or for doing “search with backtracking and sparse reward”) – i haven’t coded those up yet just kinda read about them…
now the thing is – this is a setup for it to learn some processes, spread among the weights, that do what it does – but what those processes actually are still seems very unknown. “mechanistic interpretability” is the space that’s meant to work on that; been looking into it lately.
meindnoch 8 hours ago [-]
I'd be surprised if anyone understood transformers from this.
runamuck 6 hours ago [-]
I love how you represent each token in the form of five stacked boxes, with height, width, etc. depicting different values. Where did you get this amazing idea? I will "steal" it for plotting high-dimensional data.
neuroelectron 10 hours ago [-]
For me, I feel like this could use a little more explanation. It's brief, and the grammar and cadence are very clunky.
busymom0 22 hours ago [-]
I'd also recommend another article on this topic of LLMs discussed a few days ago. I read it to the finish line and understood everything fully:
> How can AI ID a cat?
https://news.ycombinator.com/item?id=44964800
Also, the Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
Also, this HN comment has numerous resources: https://news.ycombinator.com/item?id=35712334