Transformer
Also known as: transformer architecture, transformer model
What is the Transformer Architecture?
The transformer is a neural network architecture introduced in the 2017 paper “Attention Is All You Need” by researchers at Google. It replaced recurrent and convolutional approaches as the dominant architecture for language processing and has since become the foundation of virtually every large language model, including GPT, Claude, Gemini, and Llama. Its core innovation is the self-attention mechanism, which allows the model to weigh the relevance of every token in a sequence against every other token simultaneously, rather than processing inputs one step at a time.
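The self-attention idea described above can be sketched in a few lines of NumPy. This is a simplified, single-head illustration under stated assumptions: the weight matrices `Wq`, `Wk`, `Wv` are random stand-ins for learned parameters, and positional encodings, masking, and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the row max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len): every token scored against every other
    weights = softmax(scores, axis=-1)   # each row is a probability distribution over tokens
    return weights @ V                   # weighted mix of value vectors

# Toy example: 5 tokens, model width 8, head width 4, random "learned" weights.
rng = np.random.default_rng(0)
d_model, d_k, seq_len = 8, 4, 5
X = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

Note that `scores` is computed for all token pairs in one matrix product, which is the "simultaneously, rather than one step at a time" property the paragraph describes.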
How It Works
The original transformer consists of an encoder and a decoder, though most modern LLMs, including GPT-style models, use only the decoder. The self-attention mechanism computes attention scores between all pairs of tokens in parallel, enabling the model to capture long-range dependencies in text far more effectively than earlier architectures. Multi-head attention runs several attention computations simultaneously, letting the model attend to different types of relationships (syntactic, semantic, positional) at once. This parallelism also makes transformers highly efficient on GPU and TPU hardware, which is why they scale so well with increased compute.
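Multi-head attention can be sketched as several independent attention computations whose outputs are concatenated and projected back to the model width. As above, this is a minimal NumPy illustration, not a full implementation: the weights are random placeholders, the function names are hypothetical, and layer normalization, residual connections, and masking are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_out):
    """X: (seq_len, d_model). heads: list of (Wq, Wk, Wv) weight triples.
    Each head runs scaled dot-product attention independently, so each can
    specialize in a different kind of relationship between tokens."""
    head_outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (seq_len, seq_len)
        head_outputs.append(softmax(scores) @ V)  # (seq_len, d_head)
    # concatenate the per-head outputs, then project back to model width
    return np.concatenate(head_outputs, axis=-1) @ W_out

# Toy example: 6 tokens, model width 16, 4 heads of width 4.
rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_out = rng.normal(size=(n_heads * d_head, d_model))
X = rng.normal(size=(seq_len, d_model))
out = multi_head_attention(X, heads, W_out)
print(out.shape)  # (6, 16)
```

The loop over heads is written sequentially for clarity; in practice all heads are computed as one batched matrix multiplication, which is exactly what makes the architecture so friendly to GPU and TPU hardware.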
Why Transformers Matter
The transformer unlocked the scaling era of AI. Because attention is computed in parallel rather than sequentially, transformers can be trained on massive datasets across thousands of accelerators. This property, combined with the scaling laws discovered by researchers at OpenAI and DeepMind, demonstrated that simply making transformers bigger and feeding them more data produced predictable improvements in capability. Every major frontier model today is a transformer variant, and the architecture’s dominance shows no signs of fading. Understanding transformers is essential for anyone working with or building on top of modern AI systems.
Related Reading
- Deep Learning - The broader field transformers belong to
- Pre-training - How transformers learn from data
- Scaling Laws - Why bigger transformers work better