NLP & IR — Full Course Cheatsheet

📊 L1: TF-IDF & Text Classification Lecture 1

TF-IDF Fundamentals

TF(t,d) = count(t in d) / total_terms(d)
IDF(t) = log( N / df(t) )    [N = corpus size]
TF-IDF(t,d) = TF(t,d) × IDF(t)

BoW: each doc = vector of term counts. Ignores order & co-occurrence.
IDF: high for rare terms, low for common ones (e.g., "the").
TF-IDF: balances local term importance vs global rarity.
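
A minimal sketch of the three formulas above, assuming pre-tokenized documents and the natural logarithm. The function name `tf_idf` and the toy corpus are illustrative, not part of the course material.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df(t): number of documents that contain t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
w = tf_idf(docs)
# "the" occurs in every document, so its IDF (and weight) is zero;
# "mat" occurs in one document only, so it outweighs "cat" (two documents).
```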

BM25 (Okapi BM25)

BM25(t,d,q) = IDF(t) × [ tf(t,d)·(k1+1) / (tf(t,d) + k1·(1 - b + b·|d|/avgdl)) ]
Default: k1=1.2, b=0.75
k1 → term freq saturation
b → length normalization
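
The per-term BM25 score above can be sketched as follows. This assumes the plain IDF(t) = log(N/df(t)) from the TF-IDF section; practical implementations often use a smoothed IDF instead. The function and argument names are illustrative.

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document."""
    idf = math.log(n_docs / df)               # assumption: unsmoothed IDF
    norm = 1 - b + b * doc_len / avgdl         # length normalization
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Saturation: the score is bounded by idf * (k1 + 1) however large tf gets.
short_doc = bm25_term(tf=3, df=10, n_docs=1000, doc_len=100, avgdl=100)
long_doc = bm25_term(tf=3, df=10, n_docs=1000, doc_len=200, avgdl=100)
```

With b > 0, the same term frequency counts for less in the longer document, so `long_doc` is smaller than `short_doc`.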

Text Classification Pipeline

Textual Data → Feature Extraction → Term Weighting → Model Learning → Evaluation

Naive Bayes Classifier

ĉ = argmax_c P(c) · ∏ᵢ P(xᵢ|c)    [MAP estimate]
P(w|c) = (count(w,c)+1) / (count(c)+|V|)    [Laplace smoothing; count(c) = total word tokens in class c, |V| = vocabulary size]

Assumes: features (words) are independent given the class.
Strengths: fast, good baseline, robust to irrelevant features, handles missing values.
Won KDD-Cup 97 (1st & 2nd place among 16 algorithms).
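
A small sketch of multinomial Naive Bayes with the Laplace-smoothed estimate above, computed in log space to avoid underflow. The helper names and the four-document training set are made up for illustration, and words outside the training vocabulary are simply skipped (an assumption).

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: iterable of (tokens, label) pairs."""
    class_docs = Counter()               # documents per class, for P(c)
    word_counts = defaultdict(Counter)   # count(w, c)
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, tokens):
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total = sum(word_counts[label].values())      # count(c)
        score = math.log(class_docs[label] / n_docs)   # log P(c)
        for word in tokens:
            if word in vocab:                          # skip unseen words
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb([
    ("great fun film".split(), "pos"),
    ("boring slow film".split(), "neg"),
    ("fun great cast".split(), "pos"),
    ("slow boring plot".split(), "neg"),
])
```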

Feature Selection Methods

Method	Best for	Notes
TF-IDF	General weighting	Simple baseline
Chi-Square (χ²)	Text classification	Needs large dataset
Mutual Information	Small datasets	Better than χ² on small data
L1-Regularization	Sparse data	Auto eliminates irrelevant
PCA / LSA	Dense embeddings	Reduces dimensionality

Information Gain (Feature Selection)

I(w) = -∑ᵢ Pᵢ log(Pᵢ) + F(w) ∑ᵢ pᵢ(w) log(pᵢ(w)) + (1-F(w)) ∑ᵢ (1-pᵢ(w)) log(1-pᵢ(w))
where:
pᵢ(w) = P(class i | word w)
Pᵢ = P(class i)
F(w) = fraction of docs with w

SVM Key Points

Finds maximum margin hyperplane between classes.
Uses kernel trick for non-linear boundaries.
Linear SVM is the gold standard for text classification.

Evaluation Metrics

Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
F1 = 2·P·R/(P+R)
Micro-F1: aggregate globally
Macro-F1: avg per class (equal weight)
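
To make the micro/macro distinction concrete, here is a short sketch that computes both from per-class (TP, FP, FN) counts. The counts and class names are invented for illustration.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro_f1(per_class):
    """per_class: {label: (tp, fp, fn)}."""
    # Macro: average the per-class F1 scores, every class weighted equally.
    macro = sum(prf(*counts)[2] for counts in per_class.values()) / len(per_class)
    # Micro: pool the counts over all classes first, then compute one F1.
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    micro = prf(tp, fp, fn)[2]
    return micro, macro

counts = {"sports": (90, 10, 10), "politics": (1, 4, 9)}
micro, macro = micro_macro_f1(counts)
```

Here the large, easy class dominates micro-F1, while the weak minority class pulls macro-F1 down much further.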

📉 L2: Dim Red, PLSI, SVD & Embeddings Lecture 2

pLSI (Probabilistic LSI)

P(d,w) = ∑ₜ P(t) · P(d|t) · P(w|t)
t = latent topic, d = document, w = word
Learned via EM algorithm

Soft clustering: each word/doc can belong to multiple topics.
Problem: the number of parameters grows linearly with the number of training documents, and there is no proper generative model for unseen documents.

SVD / LSA (Latent Semantic Analysis)

A = U · Σ · Vᵀ
A: term-document matrix (m×n)
U: term-topic matrix
Σ: singular values (importance)
V: doc-topic matrix
Truncated SVD (k topics): Aₖ = Uₖ Σₖ Vₖᵀ → dimensionality reduction

Captures synonymy: similar words cluster in latent space.
Problem: expensive, O(mn²) for an m×n matrix with n ≤ m; the decomposition must be recomputed if the vocabulary changes; sensitive to word frequency imbalance.

Word2Vec (Mikolov et al. 2013)

CBOW: context words → predict target word
Skip-gram: target word → predict context words
No non-linear hidden layer (1000× speedup vs earlier models)
Context window: both history & future

Captures semantic analogy: king - man + woman ≈ queen.
Skip-gram > CBOW on all tasks (Levy et al. 2015).
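
The difference between the two training setups is easiest to see in the examples they generate. The sketch below only builds training examples from a token list with a symmetric window; it does not train embeddings, and the function names are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: each target predicts every word in its window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, target) examples: the whole window predicts the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)
```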

GloVe

Minimizes: J = ∑ᵢ,ⱼ f(Xᵢⱼ)(wᵢᵀw̃ⱼ + bᵢ + b̃ⱼ - log Xᵢⱼ)²
Xᵢⱼ = co-occurrence count of words i and j
f = weighting function that limits the influence of very frequent pairs

Global co-occurrence statistics + local context window.

FastText

Extends Word2Vec with character n-grams (subword units).
Handles out-of-vocabulary (OOV) words — important for morphologically rich languages (Greek!).

SVD vs Word2Vec vs GloVe

Model	Best at
SVD	Similarity tasks
Word2Vec	Analogy tasks
GloVe	Somewhere between

⚠️ No single algorithm consistently wins; hyperparameter tuning often matters more than algorithm choice.

Document Distance: WMD

dᵢ = cᵢ / ∑ⱼ cⱼ    (normalized word freq)
c(i,j) = ‖xᵢ - xⱼ‖₂    (embedding distance)
WMD = min ∑ᵢⱼ Tᵢⱼ·c(i,j)
s.t. ∑ⱼ Tᵢⱼ = dᵢ, ∑ᵢ Tᵢⱼ = d'ⱼ
(Transportation / linear programming problem, O(m³ log m))

🕸 L3: Graph-of-Words, IR & Classification Lecture 3

Graph-of-Words (GoW)

Gd = (Vd, Ed)
Nodes = unique terms in document
Edges = co-occurrence within sliding window w
Edge weight = number of co-occurrences
Replace TF with node in-degree (centrality)

Challenges the BoW term-independence assumption: captures word order, distance, and dependence.
Directed edges preserve text flow; undirected capture co-occurrence.
Larger window w → denser graph.
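
A minimal sketch of building a directed graph-of-words and reading off in-degrees. It assumes a window of w tokens that includes the current token (so each term links to the next w-1 terms) and counts in-degree as the number of distinct predecessors; both conventions are assumptions, as are the function names.

```python
from collections import defaultdict

def graph_of_words(tokens, window=3):
    """Directed GoW: edge (u, v) when v follows u inside the sliding window.
    The edge weight is the number of such co-occurrences."""
    edges = defaultdict(int)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if u != v:
                edges[(u, v)] += 1
    return dict(edges)

def in_degree(edges):
    """Number of distinct predecessors per node (unweighted in-degree)."""
    indeg = defaultdict(int)
    for (_, v) in edges:
        indeg[v] += 1
    return dict(indeg)

tokens = ["a", "b", "c", "a"]
narrow = graph_of_words(tokens, window=2)
wide = graph_of_words(tokens, window=3)
```

On this toy sequence the wider window produces more edges, which matches the "larger window, denser graph" note above.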

TW-IDF (Term Weight IDF)

TW-IDF(t,d) = tw(t,d) / (1-b + b·|d|/avgdl) × log((N+1)/df(t))
tw = in-degree of node t in GoW (vs tf in TF-IDF)
Default: b=0.75 (k1=1.2 is the BM25 parameter; it does not appear in this formula)

Centrality Measures

Measure	Formula / Idea
Degree	Cᵈ(i) = k(i) (# of connections)
Closeness	Cc(i) = (n-1)/∑d(i,j) — inverse mean distance
Betweenness	Cb(i) = ∑ σ(s,t|i)/σ(s,t), the fraction of shortest paths through i
PageRank/Eigenvector	Dominant eigenvector of the adjacency matrix (eigenvector centrality) or of the damped transition matrix (PageRank)

k-Core Decomposition

k-core(G): largest subgraph where every node has ≥ k neighbors within it.
Algorithm: while ∃ node x with deg(x) < k → remove x. O(m) time.
Main core = highest k-core (most cohesive subgraph)

Keyword extraction: keep main core → top cohesive terms.
Better than PageRank for capturing "spreading influence".

CoreRank

CRank(v) = ∑_{u∈N(v)} core(u)
Sum of core numbers of a node's neighbors.
Finer granularity than k-core; comparable to TextRank but with cohesiveness.
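
The k-core peeling and the CoreRank score can be sketched together on a small undirected graph. For brevity this version picks the minimum-degree node with a linear scan, so it is quadratic rather than the O(m) bucket-based algorithm; the graph and function names are illustrative.

```python
def core_numbers(adj):
    """adj: {node: set(neighbors)}, undirected. Returns {node: core number}."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

def main_core(adj):
    core = core_numbers(adj)
    k_max = max(core.values())
    return {v for v, c in core.items() if c == k_max}

def core_rank(adj):
    core = core_numbers(adj)
    return {v: sum(core[u] for u in adj[v]) for v in adj}

# Triangle a-b-c with a tail c-d-e hanging off it.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
```

In this graph a, b and c share core number 2, but CoreRank separates c from the other two because c also touches the tail.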

K-Truss Decomposition

Each edge in K-truss participates in ≥ K-2 triangles.
Higher coherence than k-core; triangle-based.

Graph Classification for Text

gSpan: frequent subgraph mining (long-distance n-grams as features).
MC + gSpan + SVM: mine frequent subgraphs in main cores → higher efficiency.
Subgraph of size n ≈ long-distance n-gram (word inversion + subset matching).

Sub-Event Detection (Twitter)

Graph of tweets: nodes=terms, edges=co-occurrence within tweet
k-core score per term → sum of core numbers = event signal
If Σ core numbers at time t > threshold → sub-event detected

Weight-Core approach: best F1 (micro: 0.68, macro: 0.72) on FIFA 2014 dataset.

Graph Kernels for Document Similarity

Shortest-path kernel: compare all-pairs shortest paths between two graph-of-words.
Polynomial-time alternatives to exact graph matching (subgraph isomorphism is NP-complete; graph edit distance is NP-hard).

🧠 L4: Deep Learning for NLP Lecture 4

CNN for NLP

Input: sentence matrix (n × d) where d = embedding dim
Convolution: filter W ∈ ℝʰˣᵈ over window of h words
Feature map: cᵢ = f(W · xᵢ:ᵢ₊ₕ₋₁ + b)
Max-over-time pooling: ĉ = max{c} (captures most important feature)

CNN-non-static achieves the best results on most benchmarks (Kim 2014).
Good for local feature detection (n-gram patterns).
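
One filter and max-over-time pooling can be sketched in plain Python. The sketch assumes ReLU for the non-linearity f and uses toy 2-dimensional "embeddings"; a real model would apply many filters of several widths.

```python
def conv_max_over_time(sentence, filt, bias=0.0):
    """sentence: n x d list of word vectors; filt: h x d filter weights.
    Returns (pooled feature, feature map)."""
    h = len(filt)
    feature_map = []
    for i in range(len(sentence) - h + 1):
        window = sentence[i:i + h]
        # Dot product between the filter and the window of h word vectors.
        s = sum(w * x
                for filt_row, word in zip(filt, window)
                for w, x in zip(filt_row, word)) + bias
        feature_map.append(max(0.0, s))   # assumption: f = ReLU
    return max(feature_map), feature_map

sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]   # n = 4 words, d = 2
filt = [[1, 0], [0, 1]]                       # h = 2
pooled, feature_map = conv_max_over_time(sentence, filt)
```

The pooled value is the strongest response of the filter anywhere in the sentence, which is why the output size does not depend on sentence length.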

RNN (Recurrent Neural Networks)

hₜ = f(Wₓₕxₜ + Wₕₕhₜ₋₁ + b)
yₜ = Wₕᵧhₜ

RNN type	Use case
One-to-one	Image classification
One-to-many	Image captioning
Many-to-one	Sentiment analysis
Many-to-many (async)	Machine translation (seq2seq)
Many-to-many (sync)	Video frame labeling

Problem: vanishing gradients over long sequences.

LSTM (Long Short-Term Memory)

fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf)    ← forget gate
iₜ = σ(Wᵢ·[hₜ₋₁, xₜ] + bᵢ)    ← input gate
C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc)    ← candidate cell
Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ    ← cell state
oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo)    ← output gate
hₜ = oₜ⊙tanh(Cₜ)    ← hidden state

Cell state = long-term memory; hidden state = short-term.
GRU: simplified LSTM (2 gates: reset, update). Fewer params.

Encoder-Decoder (Seq2Seq)

Encoder (bidirectional LSTM/CNN) → fixed context vector → Decoder (unidirectional RNN).
Bottleneck: single context vector loses info for long sequences → fixed by Attention.

ELMo

Bidirectional LSTM trained on large corpus.
Contextualized embeddings: same word gets different vector in different contexts.
Forward LM + Backward LM → combine all layers.

🔍 L5: Attention, BERT, BART & Metrics Lecture 5

Attention Mechanism (Bahdanau 2014)

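eₜₛ = score(sₜ, hₛ)    [alignment score; sₜ = decoder state, hₛ = encoder state]
αₜₛ = softmax(eₜₛ)    [attention weights, normalized over source positions s]
cₜ = ∑ₛ αₜₛ · hₛ    [context vector at step t]
score variants (dot and general are from Luong et al. 2015; Bahdanau's original score is additive, i.e. concat-style):
dot: sₜᵀhₛ
general: sₜᵀWₐhₛ
concat: vₐᵀ tanh(Wₐ[sₜ;hₛ])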

Solves the encoder bottleneck by allowing decoder to look at ALL encoder states.
Self-attention: Q, K, V all come from same sequence.

Transformer Self-Attention

Attention(Q,K,V) = softmax( Q·Kᵀ / √dₖ ) · V
Q = XWQ, K = XWK, V = XWV
Multi-Head: concat h heads, project: MH(Q,K,V) = Concat(head₁,...,headₕ)·Wᴼ

√dₖ scaling keeps dot products from growing with dₖ, which would saturate the softmax and shrink its gradients.
Each layer: Self-Attention + Feed-Forward + Residual + LayerNorm.
Positional encodings: sinusoidal or learned (no inherent order awareness).
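
A single-head sketch of scaled dot-product attention on plain lists, without masking, batching or the learned projections. The helper names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)                           # subtract the max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (lists of lists)."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Both keys match the query equally here, so the output is the plain average of the two value vectors.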

Transformer Attention Types

Type	Where	Q / K / V source
Self-attention (encoder)	Encoder	Previous encoder sub-layer
Masked self-attention	Decoder	Previous decoder layer (masked = no future)
Cross-attention	Decoder	Q from decoder, K&V from encoder output

BERT (Devlin et al. 2018)

Architecture: Bidirectional Transformer encoder
Pre-training tasks:
1. MLM (Masked LM): predict 15% masked tokens
2. NSP (Next Sentence Prediction): is B next after A?
Input: [CLS] sent_A [SEP] sent_B [SEP]
BERT-base: 12 layers, 768 hidden, 12 heads, 110M params
BERT-large: 24 layers, 1024 hidden, 16 heads, 340M params

Fine-tune for downstream tasks by adding a classifier on [CLS] token.
Bidirectional = sees full context (unlike GPT which is causal/left-to-right).

GPT (Decoder-only Transformer)

GPT-1: 12-layer decoder-only transformer, 768 dims, 12 attention heads.
Masked self-attention (cannot see future tokens).
Trained on language modeling (predict next token).

BART (Denoising Seq2Seq)

Full encoder-decoder transformer.
Pre-training: corrupt text (mask, shuffle, delete) → reconstruct original.
Best for: summarization, translation, generation.

Evaluation Metrics

BLEU: precision of n-gram overlap (hypothesis vs. reference)
BLEU = BP · exp(∑ wₙ log pₙ)    [BP = brevity penalty, penalizes short outputs]
ROUGE: recall-oriented (for summarization)
ROUGE-N = n-gram recall
ROUGE-L = longest common subsequence
BERTScore: semantic similarity using BERT embeddings
Perplexity: PP(W) = P(w₁...wₙ)^(-1/n)    [lower = better LM]
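
The building blocks of these metrics can be sketched briefly: the clipped n-gram precision pₙ used inside BLEU, ROUGE-N recall, and perplexity from per-token probabilities. This assumes a single reference and omits the brevity penalty and the geometric mean over n; the function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hyp, ref, n):
    """BLEU's p_n: hypothesis n-grams found in the reference, with clipping."""
    h, r = ngrams(hyp, n), ngrams(ref, n)
    overlap = sum(min(count, r[gram]) for gram, count in h.items())
    return overlap / max(1, sum(h.values()))

def rouge_n_recall(hyp, ref, n):
    """ROUGE-N: reference n-grams recovered by the hypothesis."""
    h, r = ngrams(hyp, n), ngrams(ref, n)
    overlap = sum(min(count, h[gram]) for gram, count in r.items())
    return overlap / max(1, sum(r.values()))

def perplexity(token_probs):
    """PP = P(w1..wn)^(-1/n), computed in log space."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

hyp = "the the the cat".split()
ref = "the cat sat".split()
```

Clipping is what stops the repeated "the" in the hypothesis from being rewarded more than once.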

🤖 L6: Large Language Models Lecture 6

LLM Training Pipeline

Pretraining → Supervised Fine-Tuning → RLHF → (GRPO/DAPO for reasoning)

Stage	Goal	Issue
Pretraining (CLM)	Predict next token	Misaligned with user intent
Instruction tuning (SFT)	Follow instructions	Still not truly aligned
RLHF (PPO/DPO)	Alignment, safety, behavior	Alignment alone does not add multi-step reasoning
GRPO/DAPO/RLOO	Reasoning (math, code)	Resource-intensive

InstructGPT key finding: 1.3B model preferred over 175B GPT-3 (100× fewer params) after alignment.
DPO (Direct Preference Optimization): not RL; it optimizes the model directly on preference pairs with a supervised, classification-style loss, without a separate reward model.

Positional Encodings

Sinusoidal: fixed periodic signals (original Transformer).
Learned: position vectors trained like token embeddings.
RoPE: Rotary PE — used in LLaMA/Mistral; rotates Q/K vectors by position.

Prompting Techniques

Technique	Description
Zero-shot	No examples; task described in instruction only
Few-shot	2-5 examples in the prompt before the question
Chain-of-Thought (CoT)	Model explains reasoning step-by-step
Self-consistency	Sample multiple CoT paths, take majority vote
ReAct	Reason + Act; interleave thinking with tool calls

LoRA (Low-Rank Adaptation)

W' = W + BA    (W frozen, B and A are trainable)
A ∈ ℝʳˣᵈ, B ∈ ℝᵈˣʳ, r << d
Params: 2dr << d²    (much fewer than full fine-tuning)
Merge at inference: W_final = W + BA    (no latency overhead)

Reduces GPU memory needed for training by ~3× (as reported in the LoRA paper for GPT-3). Often matches full fine-tuning quality.
Based on intuition that weight changes have low "intrinsic rank".
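
A small arithmetic sketch of the two LoRA claims above: the 2dr versus d² parameter count, and the fact that merging BA into W gives the same output as running the adapter path separately. Matrices are tiny integer lists so the comparison is exact; the helper names are illustrative.

```python
def lora_param_counts(d, r):
    """(full fine-tuning params, LoRA params) for one d x d weight matrix."""
    return d * d, 2 * d * r

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # frozen weight, d = 3
B = [[1], [2], [3]]                     # d x r, r = 1
A = [[1, 0, 1]]                         # r x d
x = [[1], [1], [1]]                     # input column vector

merged = add(W, matmul(B, A))                               # W_final = W + BA
y_merged = matmul(merged, x)                                # one matmul at inference
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))      # Wx + B(Ax)
full, lora = lora_param_counts(4096, 8)
```

For d = 4096 and r = 8, the adapter trains 256 times fewer parameters than the full matrix.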

Scaling Laws

L(x) = ax⁻ᵏ + ε    [power law: loss vs compute/data/params]
Chinchilla law: optimal training = 20 tokens per parameter
(e.g., 70B model → 1.4T tokens)

Scaling D (data), C (compute), N (params) → predictable improvement.
World's total usable text: ~4.6T – 17.2T tokens (scaling wall incoming).

Factors Beyond Scale

Data quality & diversity often beats raw size.
Mixture-of-Experts (MoE): only subset of params active per token.
Context window size (128k, 1M tokens) can matter more than params.

🇬🇷 L7: Greek LLMs & RAG Lecture 7

RAG — Retrieval-Augmented Generation

Query → Retriever → Retrieved chunks → LLM → Answer + Citations
Two-stage: Indexing (offline) + Retrieval/Generation (online)

RAG Pipeline

Phase	What happens
Indexing (offline)	Split docs → overlapping chunks → embed → store in vector DB
Retrieval (online)	Embed query → fetch top-K candidates → re-rank
Merge	Select top-N chunks within token budget; label as S1, S2…
Generation	LLM answers from context; refuse/flag if evidence weak
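
The four phases in the table can be sketched end to end with toy components. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a dense embedding model, there is no vector DB or re-ranker, and the prompt wording is made up.

```python
import math
from collections import Counter

def chunk_words(words, size, overlap):
    """Indexing: split a word list into overlapping chunks."""
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break
    return chunks

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Retrieval: rank chunks by similarity to the query, keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, top_chunks):
    """Merge: label sources S1, S2, ... so the answer can cite them."""
    context = "\n".join(f"[S{i}] {c}" for i, c in enumerate(top_chunks, 1))
    return ("Answer using only the sources below; say so if they are insufficient.\n"
            f"{context}\nQuestion: {query}")

chunks = [
    "bm25 is a sparse retrieval function",
    "lora adds low rank adapters",
    "dense retrieval uses embeddings",
]
prompt = build_prompt("what is bm25 retrieval", retrieve("what is bm25 retrieval", chunks))
```

The resulting `prompt` string is what would be sent to the LLM in the generation phase.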

RAG Variants

Naive RAG: basic retrieve + concatenate + generate.
Advanced RAG: query rewriting, hybrid search, re-ranking.
FLARE (Forward-Looking Active RAG): generate tentatively, retrieve when uncertain.
Self-RAG: model decides when to retrieve and reflects on retrieved docs.

RAG Pros & Cons

Pros:

Reduces hallucinations
Citable sources
Cheaper than fine-tuning
Up-to-date knowledge
Scalable (millions of docs)

Cons:

Complex chunking/search
Context window limits
Quality depends on retriever
Susceptible to counterfactual noise

Greek NLP Challenges

Rich morphology → many word forms per lemma → data sparsity.
OOV problem → FastText (subword) helps significantly.
Limited Greek training data → multilingual models + Greek-specific fine-tuning.
Greek BERT: pre-trained BERT on Greek corpus.
Applied to Diavgeia: Greek govt transparency platform (RAG use case).

RAG Explainability (RAG-EX)

Generic framework to explain which retrieved documents influenced the answer.
Helps users verify and trust RAG system outputs.

🤝 L8: LLM-based Agents Lecture 8

Agent Architecture

Agent = Augmented LLM + Tools + Memory + Planning
Loop: Perception → Planning → Action → Environment → (repeat)

Perception: receives observations from environment.
Planning: decides next action (often via CoT/ReAct).
Tools: APIs, code execution, web search, databases.
Memory: in-context (prompt), external (vector DB), episodic.

Agent Orchestration Patterns

Pattern	Planning	Agents	Best for
Flat ReAct	Implicit	1	Short tasks
Plan + Loops	Explicit plan object	1 (+ replan)	Structured, predictable
Hierarchical	Manager decomposes	Many	Large, parallelizable

ReAct Loop

repeat:
    thought ← LLM reasons about current state
    action ← LLM selects tool + arguments
    observation ← tool returns result
    append to context
until done or T_max
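
The loop above can be made concrete with a scripted stand-in for the model. Here `scripted_llm` and the `multiply` tool are hypothetical placeholders, and the (thought, action, argument) return convention is an assumption of this sketch rather than any framework's API.

```python
def react(llm, tools, question, t_max=5):
    """Run the ReAct loop until the model finishes or t_max steps pass."""
    context = [f"Question: {question}"]
    for _ in range(t_max):
        thought, action, argument = llm(context)
        context.append(f"Thought: {thought}")
        if action == "finish":
            return argument, context
        tool = tools.get(action)
        # Guard against hallucinated tool names instead of crashing.
        observation = tool(argument) if tool else f"error: unknown tool {action}"
        context.append(f"Action: {action}[{argument}]")
        context.append(f"Observation: {observation}")
    return None, context   # step budget exhausted without an answer

def multiply(argument):
    a, b = argument.split("*")
    return str(int(a) * int(b))

def scripted_llm(context):
    """Hypothetical model: call the tool once, then finish with its result."""
    observations = [line for line in context if line.startswith("Observation: ")]
    if not observations:
        return "I should compute the product.", "multiply", "6*7"
    return "I have the result.", "finish", observations[-1][len("Observation: "):]

answer, trace = react(scripted_llm, {"multiply": multiply}, "What is 6 times 7?")
```

The `t_max` bound is what stops a model that never emits `finish` from looping forever.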

Tool Calling

Tools described in the system message (name, description, input/output schema).
LLM trained via: (a) human annotations (SFT), (b) self-supervised (Toolformer).
MCP (Model Context Protocol): standard for connecting LLMs to external tools (~97M monthly SDK downloads, ~2000 servers).

Key Agent Frameworks & Tools

LangChain: connects LLMs to tools, APIs, DBs.
LangGraph: structured multi-step agent workflows with decision paths.
Coding agents: Claude Code, Cursor, Codex, Devin, Aider.
Computer-use agents: Claude Computer Use, OpenAI Operator, Manus.
MCP servers: Slack, GitHub, Google, Salesforce, Stripe, HubSpot, Notion…

Agent Safety & Challenges

Hallucinated tool calls: agent calls non-existent API.
Infinite loops: agent keeps retrying without progress → need T_max.
Prompt injection: malicious content in environment hijacks agent.
Context window: long trajectories exceed context → need memory management.

⚡ L9: LLMs at Scale Lecture 9

Token Generation: Prefill vs Decode

Prefill: process entire prompt in parallel → compute KV cache
Decode: autoregressive generation, one token at a time
TTFT = Time to First Token (dominated by prefill)
ITL = Inter-Token Latency (dominated by decode)
KV Cache: stores K, V matrices of previous tokens → no recomputation
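
A back-of-the-envelope sketch of why the KV cache dominates memory at long context: it stores one K and one V vector per layer, per KV head, per token. The model shape below is a hypothetical configuration chosen for round numbers, not a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Size of the KV cache; the factor 2 covers K and V.
    bytes_per_elem=2 corresponds to FP16/BF16 storage."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical model: 32 layers, 32 KV heads of dim 128, 4096-token context.
fp16_cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
int8_cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096,
                            bytes_per_elem=1)
gib = fp16_cache / 1024 ** 3
```

The cache grows linearly with both sequence length and batch size, which is the pressure that KV quantization and paged allocation are meant to relieve.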

Batching Strategies

Strategy	Description
Static batching	Wait until batch full; wastes GPU if sequences finish early
Continuous batching	Add new requests as slots free up (vLLM default)
Chunked prefill	Break long prefills into chunks; interleave with decodes

Quantization

Method	What's quantized	Effect
W8A16	Weights to INT8	2× memory savings
W4A16	Weights to INT4	4× memory, slight accuracy loss
KV8/KV4	KV cache	Scales long contexts

Use per-channel/per-group (32–128) granularity, not a single scale per tensor.
Handle outlier channels in FP16 (SmoothQuant, AWQ, GPTQ-style).
Benefits only realized with optimized INT4/INT8 kernels.
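
Why granularity matters can be shown with a toy symmetric quantizer: one outlier forces a large scale, and every weight sharing that scale loses precision. This is a plain round-to-nearest sketch, not SmoothQuant, AWQ or GPTQ, and the weights are invented.

```python
def quantize_group(values, bits=4):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0   # avoid a zero scale
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale

def quantize(weights, group_size, bits=4):
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]

def dequantize(groups):
    return [q * scale for ints, scale in groups for q in ints]

def mean_abs_error(weights, group_size, bits=4):
    restored = dequantize(quantize(weights, group_size, bits))
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

weights = [0.1, -0.2, 0.3, 0.05, 8.0, -0.1, 0.2, 0.1]   # 8.0 is the outlier
err_one_scale = mean_abs_error(weights, group_size=8)   # one scale for everything
err_grouped = mean_abs_error(weights, group_size=4)     # outlier confined to one group
```

With groups of four, the outlier only degrades its own group, so the average error drops.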

Key Inference Optimizations

Flash Attention: fused kernel, O(n) memory instead of O(n²); faster.
Speculative decoding: small draft model generates tokens, large model verifies in parallel → latency reduction.
Prefix caching: cache KV of common prefixes (e.g., system prompts).
PagedAttention (vLLM): GPU memory paging for KV cache → higher throughput.
Tensor parallelism: split model across GPUs (weights split by row/column).
Pipeline parallelism: different layers on different GPUs.

RAG at Scale (from L9 perspective)

Indexing: chunk → embed (dense) → vector DB (FAISS/Pinecone)
Retrieval: ANN (Approximate Nearest Neighbor) search
Hybrid: dense (semantic) + sparse (BM25) retrieval
Re-ranking: cross-encoder on top-K candidates

Test-Time Compute Scaling

Instead of just scaling training, scale inference compute.
CoT / self-consistency: sample multiple solutions, pick best.
Best-of-N sampling: generate N candidates, rank by reward model.
MCTS / tree search: explore reasoning trees at inference time.
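
Two of the strategies above reduce to a few lines once sampling is abstracted away. In this sketch `sample_answer` and `reward` are hypothetical callables standing in for an LLM sampler and a reward model.

```python
from collections import Counter

def self_consistency(sample_answer, n):
    """Sample n final answers (one per reasoning path) and take the majority vote."""
    answers = [sample_answer(i) for i in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, reward):
    """Keep the candidate that the reward function scores highest."""
    return max(candidates, key=reward)

# Stand-ins for real model calls.
sampled = ["42", "41", "42", "42", "40"]
voted = self_consistency(lambda i: sampled[i], n=len(sampled))
chosen = best_of_n(["a", "bbb", "cc"], reward=len)
```

Both spend more inference compute (n samples instead of one) in exchange for a more reliable final answer.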

vLLM Stack

Model → Runtime (vLLM) → PagedAttention + Continuous Batching → Tensor Parallel (multi-GPU) → OpenAI-compatible API

⚡ Quick Reference — All Key Formulas & Comparisons

Core IR Formulas

TF-IDF = TF(t,d) × log(N/df(t))
BM25: score = IDF × tf(k1+1)/(tf+k1(1-b+b·|d|/avgdl))
TW-IDF: BM25-style length normalization with tw = in-degree in place of tf (no k1 saturation)
MAP = mean of AP across queries
P@k = precision at k retrieved docs

Deep Learning Cheat

CNN: conv → max-pool → FC → softmax
RNN: hₜ = f(Wₓxₜ + Wₕhₜ₋₁)
LSTM gates: forget, input, cell, output
Attention: softmax(QKᵀ/√d)V
Transformer: Self-Attn + FFN + Residual + LN

BERT vs GPT vs BART

Model	Type	Training	Best for
BERT	Encoder	MLM + NSP	Classification, NER, QA
GPT	Decoder	CLM (next token)	Generation
BART	Enc-Dec	Denoising	Summarization, Translation
T5	Enc-Dec	Text-to-text	Multi-task NLP

Centrality Quick Ref

Measure	Formula
Degree	Cᵈ(i) = k(i)
Closeness	Cc(i) = (n-1)/Σd(i,j)
Betweenness	Cb(i) = Σ σ(s,t|i)/σ(s,t)
PageRank	PR(i) = (1-d)/N + d·Σ PR(j)/L(j)
k-core	largest subgraph with deg≥k
CoreRank	Σ core(u) for u∈neighbors(v)

LLM Fine-tuning Techniques

Method	Params Updated	Memory
Full fine-tuning	All	Very high
LoRA	A, B matrices (r≪d)	~3× less
Prompt tuning	Soft prompt tokens	Minimal
Adapter	Small FF modules	Low
Prefix tuning	Prefix key/values	Low

Classification Algorithms

Algorithm	Strengths	Weaknesses
Naive Bayes	Fast, robust, sparse	Independence assumption
SVM (linear)	Gold standard for text	Slow on huge data
kNN	No training	Slow at test time
CNN	Local features	Needs GPU
BERT+FT	SOTA	Expensive
