Transformer (AI)

Also known as: Attention Model, Self-Attention Network, Seq2Seq Transformer

The Transformer is a deep learning architecture introduced by Vaswani et al. in 2017 that relies entirely on self-attention rather than recurrence or convolutions, so relationships between all positions in a sequence are modelled in parallel. The original model is an encoder–decoder stack built from multi-head attention, positional encodings, and feed-forward layers. Transformers are the foundation of modern large language models including BERT, GPT, T5, and PaLM, and have also been applied to vision, audio, and multimodal tasks.

Key Formula

Attention(Q, K, V) = softmax(Q × Kᵀ / sqrt(d_k)) × V

LaTeX: \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

Symbol	Meaning	Unit
Q	Query matrix	embedding dimensions
K	Key matrix	embedding dimensions
V	Value matrix	embedding dimensions
d_k	Dimension of key vectors (scaling factor)	count
softmax(·)	Normalises each row of scores to a probability distribution	dimensionless
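
The formula can be traced step by step in a few lines of code. The following is a minimal sketch in plain Python (standard library only), assuming Q, K, and V are given as lists of row vectors; the function names `softmax` and `attention` and the toy matrices are illustrative choices, not taken from any particular library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for Q (n x d_k), K (m x d_k), V (m x d_v)."""
    d_k = len(K[0])
    d_v = len(V[0])
    scale = math.sqrt(d_k)
    output, weights = [], []
    for q in Q:
        # One row of Q Kᵀ / sqrt(d_k): similarity of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        # Turn the scores into weights that sum to 1.
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        output.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)])
    return output, weights

# Toy example (made-up numbers): two positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(Q, K, V)
```

Each row of `w` sums to 1, and each row of `out` is a weighted average of the rows of V. In this toy input the first query matches the first key most closely, so the first output row lies closer to the first value vector.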

Transformer Components and Their Functions

Component	Function	Key Property	Location
Self-Attention	Relate all token pairs	O(n²) in sequence length n, but parallelisable	Encoder & Decoder
Multi-Head Attention	Run h attention heads in parallel	Captures diverse relationships	Encoder & Decoder
Positional Encoding	Inject sequence order via sine/cosine	Order-aware without RNN	Input embedding
Feed-Forward Layer	Position-wise MLP (2 layers)	Non-linear transformation	Each block
Layer Normalisation	Normalise residual stream	Training stability	After each sublayer
Cross-Attention	Decoder attends to encoder output	Seq-to-seq conditioning	Decoder only
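
As an illustration of the Positional Encoding row, the sketch below builds a sine/cosine position table in plain Python. It assumes the sinusoidal scheme of the original paper, in which even dimensions use sine, odd dimensions use cosine, and wavelengths grow geometrically with a base of 10000. The function names and the `base` parameter are illustrative, not a library API.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table of sine/cosine position codes."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Each sine/cosine pair shares one frequency; i is the even index.
            angle = pos / (base ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(embeddings):
    """Add the position code to each token embedding element-wise."""
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, row)] for emb, row in zip(embeddings, pe)]

pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
```

Because the table depends only on position and dimension, it can be computed once and added to the input embeddings; attention itself is otherwise indifferent to token order.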

Interactive Tools

The Illustrated Transformer (Jay Alammar)

Hugging Face Transformers Library

Brilliant.org Attention Mechanisms

Original Transformer model architecture diagram from "Attention Is All You Need" (Vaswani et al., 2017)

Wikimedia Commons, CC BY-SA

Related Terms

Computer Science

Natural Language Processing

Natural language processing (NLP) is a field of artificial intelligence focused on enabling computers to understand, interpret, and generate human language in a useful way. It combines computational linguistics with machine learning and deep learning to process text and speech data. Core tasks include tokenisation, named entity recognition, sentiment analysis, machine translation, and question answering.

Convolutional Neural Network

A convolutional neural network (CNN) is a deep learning architecture designed for processing structured grid data such as images, using learnable convolutional filters that detect spatial features like edges, textures, and shapes. The network stacks convolutional layers (feature extraction) with pooling layers (spatial downsampling) and fully connected layers (classification). CNNs revolutionised computer vision after AlexNet won the ImageNet competition in 2012 with significantly lower error rates than prior methods.
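
The feature-extraction and pooling steps described above can be sketched in plain Python. This is a minimal illustration, assuming a single-channel image, no padding, and stride 1; the kernel here is hand-set to respond to a vertical edge, whereas a real CNN learns its filter values during training. The function names are illustrative.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; windows that do not fit are dropped."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]

# 5x5 toy image: dark columns (0) on the left, bright columns (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-set filter that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, kernel)   # 4x4 map, non-zero only at the edge
pooled = max_pool2d(feature_map)      # 2x2 map after spatial downsampling
```

The feature map is non-zero only where the filter straddles the edge, and pooling keeps that response while halving the resolution.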

Transfer Learning

Transfer learning is a machine learning technique where a model trained on one large task is adapted (fine-tuned) for a different but related task, leveraging previously learned representations instead of training from scratch. It dramatically reduces the data and computation required for new tasks by reusing features such as edges in vision models or syntactic patterns in language models. Transfer learning is foundational to modern AI, enabling pre-trained models like ResNet, BERT, and GPT to be fine-tuned for specialised applications with small datasets.

The name "Transformer" was chosen by Vaswani et al. (Google Brain, 2017) in the landmark paper "Attention Is All You Need." It alludes to transforming representations via attention rather than recurrent processing. The word derives from Latin transformare (to change shape).

transformer, attention-mechanism, deep-learning, nlp, large-language-models