Attention Mechanism (Bahdanau 2014)
eₜᵢ = score(sₜ, hᵢ) [alignment score: decoder state sₜ vs. encoder state hᵢ]
αₜᵢ = exp(eₜᵢ) / Σⱼ exp(eₜⱼ) [attention weights: softmax over encoder positions]
cₜ = Σᵢ αₜᵢ · hᵢ [context vector at step t]
score variants:
dot: sₜᵀhᵢ
general: sₜᵀWₐhᵢ
concat: vₐᵀ tanh(Wₐ[sₜ;hᵢ])
- Relieves the fixed-length encoder bottleneck: the decoder attends to ALL encoder states at every step instead of a single final vector.
- Self-attention: Q, K, V all come from the same sequence.
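A minimal sketch of one decoder step, using the "dot" score variant. It uses plain Python lists rather than a tensor library, and the function names (`softmax`, `dot_score`, `attend`) and toy values are illustrative assumptions, not part of the original notes.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot_score(s_t, h_i):
    """'dot' score variant: inner product of decoder and encoder states."""
    return sum(a * b for a, b in zip(s_t, h_i))

def attend(s_t, encoder_states, score=dot_score):
    """Return (attention weights, context vector) for one decoder step."""
    scores = [score(s_t, h_i) for h_i in encoder_states]   # alignment scores e_{t,i}
    weights = softmax(scores)                               # attention weights alpha_{t,i}
    dim = len(encoder_states[0])
    # Context vector c_t: weighted sum of encoder states, one dimension at a time
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy values, made up for illustration
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [2.0, 0.0]
weights, context = attend(s_t, encoder_states)
```

Swapping in the "general" or "concat" variant would only mean passing a different `score` function; the softmax and weighted sum stay the same.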
Transformer Self-Attention
Attention(Q,K,V) = softmax( Q·Kᵀ / √dₖ ) · V
Q = XWQ, K = XWK, V = XWV
Multi-Head: concat h heads, project: MH(Q,K,V) = Concat(head₁,...,headₕ)·Wᴼ
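- Dividing by √dₖ keeps dot products from growing with dₖ; otherwise softmax saturates and its gradients vanish.
- Each encoder layer: Self-Attention + Feed-Forward, each sub-layer wrapped in a residual connection + LayerNorm.
- Positional encodings: sinusoidal or learned; needed because attention alone has no inherent order awareness.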
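The scaled dot-product formula can be sketched for a single head with plain Python lists. This is an illustrative toy, assuming identity projections (Q = K = V = X) to keep it short; the helper names are hypothetical, and a real model would use learned projection matrices and a tensor library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply an (n x d) matrix by a (d x m) matrix, both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def scaled_dot_product_attention(q, k, v):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                              # (n x n) raw scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]                    # each row sums to 1
    return matmul(weights, v), weights

# Toy input: 3 tokens, model dimension 2 (values made up)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = scaled_dot_product_attention(x, x, x)   # assumes identity projections
```

Multi-head attention would run this routine h times on differently projected inputs and concatenate the outputs before the final projection.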
Transformer Attention Types
| Type | Where | Q / K / V source |
| Self-attention (encoder) | Encoder | Previous encoder sub-layer |
| Masked self-attention | Decoder | Previous decoder layer (masked = no future) |
| Cross-attention | Decoder | Q from decoder, K&V from encoder output |
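To make the "masked = no future" entry concrete, here is a small sketch of a causal mask applied before the softmax. The helper names and the uniform toy scores are assumptions for illustration; real implementations typically add a large negative value to masked scores instead of filtering them.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over allowed positions only; masked positions get weight 0."""
    allowed = [s for s, ok in zip(scores, mask_row) if ok]
    m = max(allowed)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
scores = [[0.0] * n for _ in range(n)]   # uniform raw scores, made up
mask = causal_mask(n)
weights = [masked_softmax(row, m) for row, m in zip(scores, mask)]
# weights[0] -> [1.0, 0.0, 0.0, 0.0]
# weights[1] -> [0.5, 0.5, 0.0, 0.0]
```

With uniform scores, each position spreads its weight evenly over itself and earlier positions, and gives none to later ones.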
BERT (Devlin et al. 2018)
Architecture: Bidirectional Transformer encoder
Pre-training tasks:
1. MLM (Masked LM): predict the ~15% of input tokens selected for masking
2. NSP (Next Sentence Prediction): is B the sentence that follows A?
Input: [CLS] sent_A [SEP] sent_B [SEP]
BERT-base: 12 layers, 768 hidden, 12 heads, 110M params
BERT-large: 24 layers, 1024 hidden, 16 heads, 340M params
- Fine-tune for downstream classification tasks by adding a classifier on the final [CLS] representation.
- Bidirectional = sees full context (unlike GPT, which is causal/left-to-right).
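The input layout and MLM corruption can be sketched as follows. The 80/10/10 replacement split ([MASK] / random token / unchanged) comes from the BERT paper rather than from these notes. Selecting a fixed count of positions is a simplification, and the function names, toy vocabulary, and whitespace tokenization are assumptions for illustration only.

```python
import random

def build_input(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); labels hold originals at selected positions."""
    special = {"[CLS]", "[SEP]"}
    candidates = [i for i, t in enumerate(tokens) if t not in special]
    n_select = max(1, round(len(candidates) * mask_prob))
    selected = rng.sample(candidates, n_select)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i in selected:
        labels[i] = tokens[i]                  # the model must predict this token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"            # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

rng = random.Random(0)                          # fixed seed for repeatability
vocab = ["the", "cat", "sat", "on", "mat", "it", "was", "very", "soft"]  # toy vocab
tokens, segment_ids = build_input("the cat sat on the mat".split(),
                                  "it was very soft".split())
corrupted, labels = mask_tokens(tokens, vocab, rng)
```

The loss would be computed only at positions where `labels` is not `None`; all other positions are context.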
GPT (Decoder-only Transformer)
- Original GPT (GPT-1): 12-layer decoder-only transformer, 768 dims, 12 attention heads.
- Masked self-attention (cannot see future tokens).
- Trained with a language-modeling objective (predict the next token from the preceding ones).
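A short sketch of how next-token training pairs are derived from a sequence. The function name and the `<bos>` start token are illustrative assumptions.

```python
def next_token_pairs(tokens):
    """Pair each left context tokens[:i] with the token it should predict, tokens[i]."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["<bos>", "the", "cat", "sat"]   # toy sequence; <bos> is an assumed start token
pairs = next_token_pairs(tokens)

# Equivalent "shifted" view used in batched training:
inputs = tokens[:-1]    # what the model reads
targets = tokens[1:]    # what it must predict at each position
```

In practice the causal mask lets a single forward pass score every position at once, so the explicit per-prefix pairs are only a conceptual view.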
BART (Denoising Seq2Seq)
- Full encoder-decoder transformer.
- Pre-training: corrupt text (mask, shuffle, delete) → reconstruct original.
- Best for: summarization, translation, generation.
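The three corruption types named above can be sketched as simple noising functions. The function names, corruption rates, and `<mask>` symbol are assumptions chosen for illustration and are not taken from the BART paper's exact recipe.

```python
import random

def mask_tokens(tokens, rng, p=0.3):
    """Replace each token with <mask> with probability p (length is preserved)."""
    return ["<mask>" if rng.random() < p else t for t in tokens]

def delete_tokens(tokens, rng, p=0.3):
    """Drop each token with probability p (the model must infer what is missing)."""
    return [t for t in tokens if rng.random() >= p]

def shuffle_sentences(sentences, rng):
    """Return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)   # fixed seed for repeatability
original = "the cat sat on the mat".split()
masked = mask_tokens(original, rng)
deleted = delete_tokens(original, rng)
sentences = ["First sentence.", "Second sentence.", "Third sentence."]
shuffled = shuffle_sentences(sentences, rng)

# Training pair: encoder reads the corrupted text, decoder reconstructs the original
training_pair = (masked, original)
```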
Evaluation Metrics
BLEU: modified (clipped) n-gram precision of the hypothesis against the reference(s)
BLEU = BP · exp(Σₙ wₙ log pₙ) [BP = brevity penalty; penalizes short outputs]
ROUGE: recall-oriented (for summarization)
ROUGE-N = n-gram recall
ROUGE-L = based on longest common subsequence
BERTScore: semantic similarity using BERT embeddings
Perplexity: PP(W) = P(w₁...wₙ)^(-1/n) [lower = better LM]
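A simplified sketch of three of these metrics on whitespace-tokenized text: single-reference BLEU with uniform weights wₙ = 1/N, ROUGE-N recall, and perplexity from per-token probabilities. The function names are illustrative, and there is no smoothing or multi-reference support, so scores will not match standard toolkits exactly.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, ref, n):
    """Clipped n-gram precision p_n: hypothesis counts are capped by reference counts."""
    hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

def bleu(hyp, ref, max_n=4):
    """BLEU = BP * exp(sum_n w_n log p_n) with uniform weights w_n = 1/max_n."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0                       # no smoothing in this sketch
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: 1 if the hypothesis is longer than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_avg)

def rouge_n_recall(hyp, ref, n):
    """Fraction of reference n-grams that also appear in the hypothesis."""
    hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
    overlap = sum(min(c, hyp_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

def perplexity(token_probs):
    """PP = P(w_1..w_n)^(-1/n), computed in log space for stability."""
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / len(token_probs))

ref = "the cat sat on the mat".split()
short_hyp = "the cat sat".split()
bleu_exact = bleu(ref, ref)                    # identical hypothesis
bleu_short = bleu(short_hyp, ref, max_n=2)     # perfect precision, but too short
rouge1 = rouge_n_recall(short_hyp, ref, 1)
pp = perplexity([0.25, 0.25, 0.25, 0.25])      # uniform over 4 choices
```

The short hypothesis shows why BLEU needs the brevity penalty: its 1-gram and 2-gram precisions are both perfect, so only BP lowers the score, while ROUGE-1 recall directly reflects the missing half of the reference.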