What Is This?
In June 2017, eight researchers at Google Brain submitted a paper to arXiv called "Attention Is All You Need." It proposed a new neural network architecture that dispensed entirely with the dominant approach to processing language — recurrent networks that read sequences word by word — and replaced it with a mechanism called self-attention that could process all tokens in a sequence simultaneously.
The paper has since accumulated over 75,000 citations. GPT-4, Claude, Gemini, Llama, Mistral, and every other large language model in production is built on this architecture. Every AI coding tool you use — Cursor, GitHub Copilot, Claude Code, Bolt, v0 — runs on a transformer. It is the most consequential machine learning paper of the century so far, and almost nobody who uses these tools daily understands what it actually did.
Understanding transformers doesn't require a mathematics degree. The core insight is conceptually simple. But it changes how you think about the AI tools you use: why they excel at some things, why they fail at others, and why your prompts work the way they do.
The problem transformers solved:
Before 2017, the dominant approach to language modelling was Recurrent Neural Networks (RNNs) and their variants (LSTMs, GRUs). These models processed sequences word by word, left to right. Each step fed information from the previous step forward. The problem: information from early in the sequence got progressively diluted as the model moved through later tokens. By the time an RNN reached the end of a long paragraph, it had largely forgotten the beginning. And because each step depended on the previous one, RNNs couldn't be parallelised — training was slow.
Transformers eliminated both problems.
How self-attention actually works:
Imagine you're reading the sentence: "The animal didn't cross the street because it was too tired." What does "it" refer to? The animal, not the street. Your brain figured that out by relating "it" to other words in the sentence — attending to "animal" more than "street" when interpreting the pronoun.
Self-attention is the machine equivalent of this. For every word (token) in the input, the model computes three things:
- Query (Q): What am I looking for? What context do I need to understand my role in this sentence?
- Key (K): What do I have to offer? What information can I contribute to understanding other tokens?
- Value (V): If I'm relevant, what actual information should I contribute?
For each token, its Query is compared (via dot product) against the Keys of every token in the sequence, including itself. This produces a set of attention scores: how much should this token attend to each of the others? Those scores are scaled, normalised with a softmax function, and used to weight the Value vectors. The result is a new representation for each token that incorporates information from every other token, weighted by relevance.
This happens in parallel for every token at once — no word-by-word processing. The whole sequence is processed simultaneously. This is why transformers can be trained on the massive datasets that make LLMs possible.
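A minimal sketch of that computation in NumPy, with toy dimensions and random weights standing in for the learned projection matrices (the scaling by √d_k follows the original paper):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q = X @ W_q                      # what each token is looking for
    K = X @ W_k                      # what each token has to offer
    V = X @ W_v                      # what each token contributes if relevant

    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len): every token vs every token

    # softmax over each row: how strongly each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V               # one updated, context-aware vector per token

# toy example: 4 tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```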
Multi-head attention: Rather than running one attention operation, transformers run several in parallel, from 8 "heads" in the original paper to 96 or more in the largest models. Each head can specialise: one might learn syntactic relationships, another semantic similarity, another positional patterns. The outputs are concatenated and projected into a final representation.
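The same idea with several heads, sketched with random placeholder weights; real implementations fuse the heads into one batched matrix multiply rather than looping in Python:

```python
import numpy as np

def multi_head_attention(X, n_heads, seed=0):
    """Split the model dimension across n_heads, run attention in each head,
    then concatenate the results and apply an output projection."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(n_heads):
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ V)                     # each head: (seq_len, d_head)
    concat = np.concatenate(heads, axis=-1)     # (seq_len, d_model)
    W_o = rng.normal(size=(d_model, d_model))   # final output projection
    return concat @ W_o

X = np.random.default_rng(1).normal(size=(4, 64))
print(multi_head_attention(X, n_heads=8).shape)  # (4, 64)
```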
The full architecture: Transformer blocks stack these components: self-attention → normalisation → feed-forward network → normalisation, with a residual connection around each sub-layer. The feed-forward layer (a small neural network applied independently to each token after attention) is where much of the model's factual knowledge is stored. Encoders (BERT) read the whole sequence and build rich contextual representations. Decoders (GPT) generate one token at a time, attending only to tokens that came before (masked attention). Encoder-decoders (T5) combine both.
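A minimal PyTorch sketch of one decoder-style block in the ordering described above (dimensions follow the original paper; a real GPT stacks dozens of these blocks and adds token embeddings, positional information, and an output head):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Masked self-attention -> norm -> feed-forward -> norm,
    each sub-layer wrapped in a residual connection."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(                 # where much factual knowledge is thought to live
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # causal mask: each position may attend only to itself and earlier tokens
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)             # residual + normalisation
        x = self.norm2(x + self.ff(x))           # residual + normalisation
        return x

block = TransformerBlock()
tokens = torch.randn(1, 10, 512)                 # (batch, seq_len, d_model)
print(block(tokens).shape)                       # torch.Size([1, 10, 512])
```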
Why Does It Matter?
- Context windows are not free. Attention operates over every pair of tokens in the context window, so its cost grows quadratically (O(n²)) with sequence length: double the context length and the attention computation roughly quadruples (see the first sketch after this list). This is why 200K-token context windows are impressive engineering achievements, not free features, and why models get slower and more expensive as prompts get longer. When you paste a 50-page document into Claude, you're paying for attention across every token pair.
- Temperature controls randomness, not creativity. Before each token is sampled, the model's output scores (logits) are turned into a probability distribution by a softmax, and a scaling parameter called temperature controls how "peaked" that distribution is. High temperature flattens it: the model samples more often from less likely tokens. Low temperature sharpens it: the model almost always picks the most probable token. Low temperature means deterministic, precise, conservative outputs; high temperature means diverse, surprising, occasionally unhinged outputs. This is why low temperature works for code (you want the right answer) and higher temperature works for brainstorming (you want variety). Most APIs default to around 1.0. The second sketch after this list shows the effect on a toy distribution.
- Prompt structure matters because attention is computed over your whole prompt. There is a documented primacy and recency bias in how models use long contexts: tokens at the very start and very end of a prompt receive more effective attention than tokens buried in the middle. This is called "lost in the middle." Instructions placed at the very beginning and repeated at the end are followed more reliably than instructions placed only in the middle of a long prompt. This isn't a stylistic quirk; it's a measured property of how trained transformers distribute attention over long inputs.
- Hallucination is a fundamental feature, not a bug to be patched. Transformers generate text by predicting the next most probable token given everything before it. They are not retrieving facts from a database — they are following statistical patterns learned from training data. If the training data contained confident-sounding sentences about X being Y, the model will produce confident-sounding sentences about X being Y regardless of whether Y is true. Hallucination is not a failure to look up the right answer; it's the system working as designed, just without a ground truth anchor. Retrieval-Augmented Generation (RAG) and tool use partially address this by giving the model access to verified information at query time.
- Scaling is why the race is about compute, not algorithms. Transformers scale remarkably predictably: add parameters, training data, and compute together, and performance improves along smooth power-law curves. This is the "scaling hypothesis" that drove GPT-3, GPT-4, and the current generation of frontier models. The architecture itself hasn't changed dramatically since 2017; the race has largely been won on scale, not architectural innovation. This is why Nvidia matters so much.
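Two quick illustrations of the points above. First, the quadratic growth behind the context-window bullet; the numbers are pure arithmetic, no model required:

```python
# Attention compares every token with every other token, so the score matrix
# (and the work to fill it) grows with the square of the context length.
for context_length in (1_000, 2_000, 4_000, 200_000):
    pairs = context_length ** 2
    print(f"{context_length:>7} tokens -> {pairs:>14,} token pairs per attention layer")

# Doubling 1,000 -> 2,000 tokens quadruples the pairs (1,000,000 -> 4,000,000);
# a 200K-token window needs 40 billion pairs per layer, per forward pass.
```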
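Second, a toy demonstration of temperature. The logits here are invented, but the scaling behaviour is what the temperature parameter in an API call controls:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn raw next-token logits into probabilities, scaled by temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [4.0, 3.5, 2.0, 0.5]   # hypothetical scores for four candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))

# Low temperature concentrates probability on the top token (near-deterministic);
# high temperature flattens the distribution, so unlikely tokens get sampled more often.
```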
Key People & Players
Ashish Vaswani, Noam Shazeer, et al. (Google Brain/Google Research, 2017) — The eight authors of "Attention Is All You Need." Most have since left Google to found or join AI startups. The paper was not initially recognised as a landmark: it solved a specific machine-translation problem, and the field took several years to appreciate its broader implications.^1
Andrej Karpathy — Former OpenAI/Tesla, now independent. His "Let's build GPT from scratch" video (2023) is the single best technical explanation of how transformers work for people who want to understand the mechanics without a formal ML background.^2
Ilya Sutskever — OpenAI co-founder and Chief Scientist through GPT-4. More than anyone, he championed the scaling hypothesis: that transformer architecture + more compute + more data = better models, essentially indefinitely. His conviction drove OpenAI's trajectory. Now running Safe Superintelligence.^3
George Hotz / Andrej Karpathy / Sebastian Raschka — Three educators who have produced the most accessible technical writing and code walkthroughs of transformer internals for practitioners who want to actually understand the engine.
The Current State
The transformer architecture has not fundamentally changed since 2017. What has changed is scale (100M parameters → 1T+), training data (books → the internet + code + multimodal data), inference improvements (KV caching, speculative decoding, quantisation), and extensions (long context, multimodality, tool use, function calling).
The current frontier: mixture of experts (MoE) architectures. Instead of activating the entire model for every token, a gating network routes each token through a small subset of specialist parameters ("experts"). Mixtral is openly MoE; GPT-4 and Gemini 1.5 are widely believed to use it. The result is more capacity per unit of inference compute than a dense transformer (a minimal routing sketch follows).
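No frontier lab publishes its routing details, but the core mechanism can be sketched generically: a small gating network scores a set of expert feed-forward layers for each token, and only the top-scoring few are actually run. Everything below (sizes, random weights, top-2 routing) is an illustrative assumption, not any production model's design:

```python
import numpy as np

def moe_layer(x, expert_weights, gate_W, top_k=2):
    """Route one token to its top-k experts and mix their outputs.

    x: (d_model,) a single token representation
    expert_weights: list of (d_model, d_model) matrices, one tiny "expert" each
    gate_W: (d_model, n_experts) routing weights
    """
    gate_logits = x @ gate_W
    chosen = np.argsort(gate_logits)[-top_k:]   # indices of the top-k experts
    mix = np.exp(gate_logits[chosen])
    mix /= mix.sum()                            # softmax over the chosen experts only
    # only the chosen experts run; the rest of the layer's parameters stay idle for this token
    return sum(w * np.maximum(x @ expert_weights[i], 0) for w, i in zip(mix, chosen))

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
expert_weights = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
gate_W = rng.normal(size=(d_model, n_experts))
print(moe_layer(rng.normal(size=d_model), expert_weights, gate_W).shape)  # (16,)
```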
The next transition: Test-time compute scaling (reasoning models — OpenAI o1, DeepSeek R1, Claude 3.7 Sonnet). Rather than making the model bigger, let it "think longer" at inference — generate chains of thought, check its own work, explore multiple paths before committing to an answer. This changes the economics: you trade inference compute for a smaller model with better reasoning, rather than training compute for a larger model with more memorised patterns.
The architecture of the tools you use is transformers all the way down. The context window is the field of view. Attention is what you're pointing it at. Your prompt is the query. Understanding this doesn't make you a researcher, but it makes you a meaningfully better practitioner.
Best Resources to Learn More
- Andrej Karpathy: "Let's build GPT from scratch" — 2 hours. Builds a working GPT from first principles in Python. The best technical explainer in existence.^4
- Sebastian Raschka: "Self-Attention from Scratch" — Code-first walkthrough of the attention mechanism. Excellent if you want the maths made legible.^5
- The Illustrated Transformer by Jay Alammar — The standard visual explainer. Best for understanding the architecture before touching code.^6
- "Attention Is All You Need" — original paper (arXiv) — Surprisingly readable for a technical paper. The introduction and conclusion are accessible to any curious non-specialist.^7
- The Coming Wave by Mustafa Suleyman — Less technical, more strategic: what the transformer revolution means at civilisational scale.^8