Transformer
A transformer is a neural network architecture that uses self-attention mechanisms to process sequential data, forming the basis of virtually all modern large language models.
The transformer architecture, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., revolutionized natural language processing by replacing recurrent neural networks with a mechanism called self-attention. This allows the model to look at all parts of an input simultaneously rather than processing it sequentially, making training far more parallelizable and enabling models to capture long-range dependencies in text.
A transformer consists of an encoder that processes input text and a decoder that generates output text, though many modern LLMs use decoder-only architectures. The key innovation is the attention mechanism, which lets the model weigh the importance of different words in relation to each other. For example, in the sentence "The cat sat on the mat because it was tired," attention helps the model understand that "it" refers to "the cat."
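To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core computation introduced in the original paper. For brevity it passes the same vectors as queries, keys, and values, skipping the learned projection matrices and multi-head structure a real transformer layer would include:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attend over all positions at once.

    Q, K, V: (seq_len, d_k) query, key, and value matrices. In a real
    transformer these come from learned linear projections of the
    token embeddings; here they are passed in directly.
    """
    d_k = Q.shape[-1]
    # Similarity of every query to every key, scaled by sqrt(d_k)
    # to keep the softmax in a well-behaved range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys: each row becomes a weighting of all positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value vectors.
    return weights @ V

# Toy example: 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q, K, V share a source
print(out.shape)  # (4, 8)
```

Because every output row is computed from all input positions in a single matrix product, the whole sequence is processed in parallel, which is what makes transformer training so much faster than recurrent alternatives.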
Transformers have become the dominant architecture not just for language but also for computer vision (Vision Transformers), audio processing, and even protein structure prediction. Their ability to scale efficiently with more data and compute is what enabled the current generation of powerful AI systems.
Real-World Examples
- BERT, an encoder-only transformer used for text classification and search
- GPT series, decoder-only transformers used for text generation (see the masking sketch after this list)
- T5, an encoder-decoder transformer used for translation and summarization
- Vision Transformer (ViT), applying the architecture to image recognition
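The encoder/decoder distinction in the list above comes down largely to masking. A decoder-only model such as GPT adds a causal mask so each token can attend only to earlier positions, which is what makes left-to-right text generation possible. A minimal NumPy sketch (the zero scores are placeholders; a real model would compute them from queries and keys as above):

```python
import numpy as np

def causal_mask(seq_len):
    # -inf strictly above the diagonal: position i may attend
    # only to positions <= i.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Toy attention scores for a 4-token sequence (all zeros for illustration).
scores = np.zeros((4, 4)) + causal_mask(4)
# Softmax turns -inf into zero weight, so each token ignores its future.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```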