📚 NLP & IR: Complete Exam Cheatsheet

University of Ioannina · CSE · All 9 Lectures · Prof. Skianis · 2026

📊 L1: TF-IDF & Text Classification Lecture 1
TF-IDF Fundamentals
TF(t,d) = count(t in d) / total_terms(d)
IDF(t) = log( N / df(t) ) [N = corpus size]
TF-IDF(t,d) = TF(t,d) × IDF(t)
  • BoW: each doc = vector of term counts. Ignores order & co-occurrence.
  • IDF: high for rare terms, low for common ones (e.g., "the").
  • TF-IDF: balances local term importance vs global rarity.
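A minimal sketch of the three definitions above. The natural logarithm, the whitespace tokenization, and the helper name `tf_idf` are assumptions made for illustration; the notes do not fix any of them.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df(t): number of documents containing term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # TF(t,d) * IDF(t), with IDF(t) = log(N / df(t))
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = tf_idf(docs)
```

Here "the" occurs in every document, so its IDF is log(3/3) = 0 and its weight vanishes, while "mat" (one document) outweighs "cat" (two documents).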
BM25 (Okapi BM25)
BM25(t,d,q) = IDF(t) × [ tf(t,d)·(k1+1) / (tf(t,d) + k1·(1 - b + b·|d|/avgdl)) ]
Default: k1=1.2, b=0.75
k1 → term-frequency saturation
b → length normalization
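The saturation and length-normalization effects can be seen in a short sketch of the per-term BM25 score. It assumes the plain IDF(t) = log(N/df(t)) from the TF-IDF block (production systems often use a smoothed IDF); the function name `bm25_term` and the sample numbers are hypothetical.

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document."""
    idf = math.log(n_docs / df)             # assumption: unsmoothed IDF
    norm = 1 - b + b * doc_len / avgdl      # length normalization factor
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Saturation: the score grows with tf but never exceeds idf * (k1 + 1).
scores = [bm25_term(tf, df=10, n_docs=1000, doc_len=100, avgdl=100)
          for tf in (1, 2, 10, 100)]

# Length normalization: the same tf counts for less in a longer document.
short_doc = bm25_term(3, df=10, n_docs=1000, doc_len=50, avgdl=100)
long_doc = bm25_term(3, df=10, n_docs=1000, doc_len=400, avgdl=100)
```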
Text Classification Pipeline
Textual Data → Feature Extraction → Term Weighting → Model Learning → Evaluation
Naive Bayes Classifier
ĉ = argmax_c P(c) · ∏ P(xᵢ|c) [MAP estimate]
P(w|c) = (count(w,c)+1) / (count(c)+|V|) [Laplace smoothing]
  • Assumes: features (words) are independent given the class.
  • Strengths: fast, good baseline, robust to irrelevant features, handles missing values.
  • Won KDD-Cup 97 (1st & 2nd place among 16 algorithms).
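A sketch of the MAP rule with Laplace smoothing, computed in log space to avoid underflow. The toy training set, the function names, and the choice to skip words never seen in training are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: iterable of (tokens, label). Returns the count tables."""
    class_docs = Counter()               # documents per class, for P(c)
    word_counts = defaultdict(Counter)   # count(w, c)
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, tokens):
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label, n_c in class_docs.items():
        total = sum(word_counts[label].values())   # count(c): tokens in class c
        score = math.log(n_c / n_docs)             # log P(c)
        for w in tokens:
            if w in vocab:                         # assumption: ignore unseen words
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)             # argmax over classes

train = [
    ("win money now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train_nb(train)
```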
Feature Selection Methods
Method | Best for | Notes
TF-IDF | General weighting | Simple baseline
Chi-Square (χ²) | Text classification | Needs large dataset
Mutual Information | Small datasets | Better than χ² on small data
L1-Regularization | Sparse data | Automatically eliminates irrelevant features
PCA / LSA | Dense embeddings | Reduces dimensionality
Information Gain (Feature Selection)
I(w) = -∑ Pᵢ log(Pᵢ) + F(w) ∑ pᵢ(w) log(pᵢ(w)) + (1-F(w)) ∑ (1-pᵢ(w)) log(1-pᵢ(w))
where: pᵢ(w) = P(class i | word w), Pᵢ = P(class i), F(w) = fraction of docs containing w
SVM Key Points
  • Finds maximum margin hyperplane between classes.
  • Uses kernel trick for non-linear boundaries.
  • Linear SVM is the gold standard for text classification.
Evaluation Metrics
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
F1 = 2·P·R/(P+R)
Micro-F1: aggregate TP/FP/FN globally, then compute F1
Macro-F1: average of per-class F1 (equal weight per class)
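The difference between micro and macro averaging shows up on imbalanced data. The sketch below assumes single-label multi-class predictions; the function names and the toy labels are hypothetical.

```python
from collections import Counter

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(y_true, y_pred):
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gets a false positive
            fn[t] += 1   # true class gets a false negative
    labels = set(y_true) | set(y_pred)
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro

# The rare class "b" is always missed: micro stays high, macro drops.
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]
micro, macro = micro_macro_f1(y_true, y_pred)
```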
📉 L2: Dim Red, pLSI, SVD & Embeddings Lecture 2
pLSI (Probabilistic LSI)
P(d,w) = ∑ₜ P(t) · P(d|t) · P(w|t)
t = latent topic, d = document, w = word
Learned via EM algorithm
  • Soft clustering: each word/doc can belong to multiple topics.
  • Problem: the number of parameters grows with the corpus size, and pLSI is not a proper generative model for new documents.
SVD / LSA (Latent Semantic Analysis)
A = U · Σ · Vᵀ
A: term-document matrix (m×n)
U: term-topic matrix
Σ: singular values (importance)
V: doc-topic matrix
Truncated SVD (k topics): Aₖ = Uₖ Σₖ Vₖᵀ → dimensionality reduction
  • Captures synonymy: similar words cluster in latent space.
  • Problem: expensive (O(mn²) for an m×n matrix), must be recomputed if the vocabulary changes, sensitive to word frequency imbalance.
Word2Vec (Mikolov et al. 2013)
CBOW: context words → predict target word
Skip-gram: target word → predict context words
No non-linear hidden layer (1000× speedup vs earlier models)
Context window: both history & future
  • Captures semantic analogy: king - man + woman ≈ queen.
  • Skip-gram > CBOW on all tasks (Levy et al. 2015).
GloVe
Minimizes: ∑ f(Xᵢⱼ)(wᵢᵀw̃ⱼ + bᵢ + b̃ⱼ - log Xᵢⱼ)²
Xᵢⱼ = co-occurrence count of words i and j
  • Global co-occurrence statistics + local context window.
FastText
  • Extends Word2Vec with character n-grams (subword units).
  • Handles out-of-vocabulary (OOV) words, which matters for morphologically rich languages (Greek!).
SVD vs Word2Vec vs GloVe
Model | Best at
SVD | Similarity tasks
Word2Vec | Analogy tasks
GloVe | Somewhere between

⚠️ No single algorithm consistently wins; hyperparameter tuning often matters more than algorithm choice.

Document Distance: WMD
dᵢ = cᵢ / ∑ⱼ cⱼ (normalized word frequency)
c(i,j) = ‖xᵢ - xⱼ‖₂ (embedding distance)
WMD = min ∑ Tᵢⱼ·c(i,j) s.t. ∑ⱼ Tᵢⱼ = dᵢ, ∑ᵢ Tᵢⱼ = d'ⱼ
(Transportation / linear programming problem, O(m³ log m))
🕸 L3: Graph-of-Words, IR & Classification Lecture 3
Graph-of-Words (GoW)
Gd = (Vd, Ed)
Nodes = unique terms in document
Edges = co-occurrence within sliding window w
Edge weight = number of co-occurrences
Replace TF with node in-degree (centrality)
  • Challenges BoW: captures word order, distance, dependence.
  • Directed edges preserve text flow; undirected capture co-occurrence.
  • Larger window w → denser graph.
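A sketch of building a directed graph-of-words and reading off the in-degree that replaces TF. It assumes a window of w tokens means each term links to the next w-1 terms, and that in-degree counts distinct predecessors (unweighted); both conventions and the function names are assumptions, not taken from the notes.

```python
from collections import defaultdict

def graph_of_words(tokens, window=3):
    """Directed GoW: edge (u, v) if v follows u inside the sliding window.
    Returns {(u, v): number of co-occurrences}."""
    edges = defaultdict(int)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if u != v:                 # no self-loops
                edges[(u, v)] += 1
    return edges

def in_degree(edges):
    """Number of distinct predecessors per term (edge weights ignored)."""
    deg = defaultdict(int)
    for (_, v) in edges:
        deg[v] += 1
    return deg

tokens = "data mining and text mining".split()
edges = graph_of_words(tokens, window=3)
tw = in_degree(edges)
```

In this toy document "mining" has in-degree 3 (from "data", "and", "text"), while the opening word "data" has in-degree 0 even though it occurs once.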
TW-IDF (Term Weight IDF)
TW-IDF(t,d) = tw(t,d) / (1-b + b·|d|/avgdl) × log((N+1)/df(t))
tw = in-degree of node t in GoW (vs tf in TF-IDF)
Default: b=0.75, k1=1.2
Centrality Measures
Measure | Formula / Idea
Degree | Cᵈ(i) = k(i) (# of connections)
Closeness | Cc(i) = (n-1)/∑d(i,j) (inverse mean distance)
Betweenness | Cb(i) = ∑ σ(s,t|i)/σ(s,t) (fraction of shortest paths through i)
PageRank/Eigenvector | Dominant eigenvector of adjacency matrix
k-Core Decomposition
k-core(G): largest subgraph where every node has ≥ k neighbors within it.
Algorithm: while ∃ node x with deg(x) < k → remove x. O(m) time.
Main core = highest k-core (most cohesive subgraph)
  • Keyword extraction: keep main core → top cohesive terms.
  • Better than PageRank for capturing "spreading influence".
CoreRank
CRank(v) = ∑_{u∈N(v)} core(u)
Sum of core numbers of a node's neighbors.
Finer granularity than k-core; comparable to TextRank but with cohesiveness.
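Core numbers and CoreRank can be sketched together with a simple peeling loop. For brevity this version picks the minimum-degree node with a linear scan, so it is not the O(m) bucket-based algorithm; the adjacency-dict representation and function names are assumptions.

```python
def core_numbers(adj):
    """adj: {node: set(neighbors)} for an undirected graph.
    Peels the lowest-degree node repeatedly; returns {node: core number}."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=deg.get)   # linear scan, kept simple on purpose
        k = max(k, deg[v])                # core level never decreases
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return core

def core_rank(adj, core):
    """CRank(v) = sum of the core numbers of v's neighbors."""
    return {v: sum(core[u] for u in adj[v]) for v in adj}

# Triangle a-b-c with a pendant node d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
core = core_numbers(adj)
crank = core_rank(adj, core)
```

The triangle forms the main core (k = 2), and CoreRank separates "a" from "b" and "c" even though all three share the same core number.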
K-Truss Decomposition
  • Each edge in K-truss participates in ≥ K-2 triangles.
  • Higher coherence than k-core; triangle-based.
Graph Classification for Text
  • gSpan: frequent subgraph mining (long-distance n-grams as features).
  • MC + gSpan + SVM: mine frequent subgraphs in main cores → higher efficiency.
  • Subgraph of size n ≈ long-distance n-gram (word inversion + subset matching).
Sub-Event Detection (Twitter)
Graph of tweets: nodes = terms, edges = co-occurrence within tweet
k-core score per term → sum of core numbers = event signal
If Σ core numbers at time t > threshold → sub-event detected
  • Weight-Core approach: best F1 (micro: 0.68, macro: 0.72) on FIFA 2014 dataset.
Graph Kernels for Document Similarity
  • Shortest-path kernel: compare all-pairs shortest paths between two graph-of-words.
  • Polynomial-time alternatives to exact graph matching such as subgraph isomorphism or graph edit distance (both NP-hard).
🧠 L4: Deep Learning for NLP Lecture 4
CNN for NLP
Input: sentence matrix (n × d) where d = embedding dim
Convolution: filter W ∈ ℝʰˣᵈ over window of h words
Feature map: cᵢ = f(W · xᵢ:ᵢ₊ₕ₋₁ + b)
Max-over-time pooling: ĉ = max{c} (captures most important feature)
  • CNN-non-static achieves the best results on most benchmarks (Kim 2014).
  • Good for local feature detection (n-gram patterns).
RNN (Recurrent Neural Networks)
hₜ = f(Wₓₕxₜ + Wₕₕhₜ₋₁ + b)
yₜ = Wₕᵧhₜ
RNN type | Use case
One-to-one | Image classification
One-to-many | Image captioning
Many-to-one | Sentiment analysis
Many-to-many (async) | Machine translation (seq2seq)
Many-to-many (sync) | Video frame labeling
  • Problem: vanishing gradients over long sequences.
LSTM (Long Short-Term Memory)
fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) ← forget gate
iₜ = σ(Wᵢ·[hₜ₋₁, xₜ] + bᵢ) ← input gate
C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc) ← candidate cell
Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ ← cell state
oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) ← output gate
hₜ = oₜ⊙tanh(Cₜ) ← hidden state
  • Cell state = long-term memory; hidden state = short-term.
  • GRU: simplified LSTM (2 gates: reset, update). Fewer params.
Encoder-Decoder (Seq2Seq)
  • Encoder (bidirectional LSTM/CNN) → fixed context vector → Decoder (unidirectional RNN).
  • Bottleneck: single context vector loses info for long sequences → fixed by Attention.
ELMo
  • Bidirectional LSTM trained on large corpus.
  • Contextualized embeddings: same word gets different vector in different contexts.
  • Forward LM + Backward LM → combine all layers.
🔍 L5: Attention, BERT, BART & Metrics Lecture 5
Attention Mechanism (Bahdanau 2014)
eₜₛ = score(sₜ, hₛ) [alignment score; sₜ = decoder state, hₛ = encoder state]
αₜₛ = softmax(eₜₛ) [attention weights]
cₜ = ∑ₛ αₜₛ · hₛ [context vector at step t]
score variants:
dot: sₜᵀhₛ
general: sₜᵀWₐhₛ
concat: vₐᵀ tanh(Wₐ[sₜ;hₛ])
  • Solves the encoder bottleneck by allowing decoder to look at ALL encoder states.
  • Self-attention: Q, K, V all come from same sequence.
Transformer Self-Attention
Attention(Q,K,V) = softmax( Q·Kᵀ / √dₖ ) · V
Q = X·W_Q, K = X·W_K, V = X·W_V
Multi-Head: concat h heads, then project: MH(Q,K,V) = Concat(head₁,...,headₕ)·Wᴼ
  • √dₖ scaling keeps dot products from growing with dₖ, which would saturate the softmax and shrink its gradients.
  • Each layer: Self-Attention + Feed-Forward + Residual + LayerNorm.
  • Positional encodings: sinusoidal or learned (no inherent order awareness).
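A dependency-free sketch of single-head scaled dot-product attention, with matrices as lists of rows. It omits the learned projections W_Q, W_K, W_V, masking, and multiple heads; the function names and toy inputs are assumptions.

```python
import math

def softmax(xs):
    m = max(xs)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, one output row per query row."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)                # one weight per key position
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
context = attention(Q, K, V)
```

The query matches the first key more closely, so the output row leans toward the first value vector while remaining a convex combination of both.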
Transformer Attention Types
Type | Where | Q / K / V source
Self-attention (encoder) | Encoder | Previous encoder sub-layer
Masked self-attention | Decoder | Previous decoder layer (masked = no future)
Cross-attention | Decoder | Q from decoder, K&V from encoder output
BERT (Devlin et al. 2018)
Architecture: Bidirectional Transformer encoder
Pre-training tasks:
1. MLM (Masked LM): predict 15% masked tokens
2. NSP (Next Sentence Prediction): is B next after A?
Input: [CLS] sent_A [SEP] sent_B [SEP]
BERT-base: 12 layers, 768 hidden, 12 heads, 110M params
BERT-large: 24 layers, 1024 hidden, 16 heads, 340M params
  • Fine-tune for downstream tasks by adding a classifier on [CLS] token.
  • Bidirectional = sees full context (unlike GPT which is causal/left-to-right).
GPT (Decoder-only Transformer)
  • 12-layer decoder-only transformer, 768 dims, 12 attention heads.
  • Masked self-attention (cannot see future tokens).
  • Trained on language modeling (predict next token).
BART (Denoising Seq2Seq)
  • Full encoder-decoder transformer.
  • Pre-training: corrupt text (mask, shuffle, delete) → reconstruct original.
  • Best for: summarization, translation, generation.
Evaluation Metrics
BLEU: precision of n-gram overlap (hypothesis vs. reference)
BLEU = BP · exp(∑ wₙ log pₙ) [BP = brevity penalty, penalizes short outputs]
ROUGE: recall-oriented (for summarization)
ROUGE-N = n-gram recall
ROUGE-L = longest common subsequence
BERTScore: semantic similarity using BERT embeddings
Perplexity: PP(W) = P(w₁...wₙ)^(-1/n) [lower = better LM]
🤖 L6: Large Language Models Lecture 6
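Two of these metrics are short enough to sketch directly. Perplexity is computed from per-token probabilities in log space, and ROUGE-N as clipped n-gram recall against a single reference; the function names and the example sentences are assumptions.

```python
import math
from collections import Counter

def perplexity(token_probs):
    """PP(W) = P(w_1..w_n)^(-1/n), given the model's probability for each token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, hypothesis, n=1):
    """Fraction of reference n-grams that also appear in the hypothesis."""
    ref, hyp = ngrams(reference, n), ngrams(hypothesis, n)
    overlap = sum((ref & hyp).values())          # clipped counts
    return overlap / max(sum(ref.values()), 1)

reference = "the cat sat on the mat".split()
hypothesis = "the cat lay on the mat".split()
r1 = rouge_n(reference, hypothesis, n=1)   # 5 of 6 unigrams recovered
r2 = rouge_n(reference, hypothesis, n=2)   # 3 of 5 bigrams recovered
pp = perplexity([0.25, 0.25, 0.25, 0.25])  # uniform over 4 choices
```

A model that assigns probability 1/4 to every token has perplexity 4, which matches the reading of perplexity as an effective branching factor.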
LLM Training Pipeline
Pretraining → Supervised Fine-Tuning → RLHF → (GRPO/DAPO for reasoning)
Stage | Goal | Issue
Pretraining (CLM) | Predict next token | Misaligned with user intent
Instruction tuning (SFT) | Follow instructions | Still not truly aligned
RLHF (PPO/DPO) | Alignment, safety, behavior | Aligned model can't reason
GRPO/DAPO/RLOO | Reasoning (math, code) | Resource-intensive
  • InstructGPT key finding: outputs of the 1.3B model were preferred over those of 175B GPT-3 (100× fewer params) after alignment.
  • DPO: Direct Preference Optimization. Not RL; it treats alignment as a supervised objective on preference pairs.
Positional Encodings
  • Sinusoidal: fixed periodic signals (original Transformer).
  • Learned: position vectors trained like token embeddings.
  • RoPE: Rotary PE, used in LLaMA/Mistral; rotates Q/K vectors by position.
Prompting Techniques
Technique | Description
Zero-shot | No examples; task described in instruction only
Few-shot | 2-5 examples in the prompt before the question
Chain-of-Thought (CoT) | Model explains reasoning step-by-step
Self-consistency | Sample multiple CoT paths, take majority vote
ReAct | Reason + Act; interleave thinking with tool calls
LoRA (Low-Rank Adaptation)
W' = W + BA (W frozen, B and A are trainable)
A ∈ ℝʳˣᵈ, B ∈ ℝᵈˣʳ, r << d
Params: 2dr << d² (much fewer than full fine-tuning)
Merge at inference: W_final = W + BA (no latency overhead)
  • Reduces hardware requirement by ~3×. About as effective as full fine-tuning.
  • Based on intuition that weight changes have low "intrinsic rank".
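The parameter saving and the merge step can be checked with a tiny pure-Python sketch. Matrices are lists of rows; the helper names, the d = 4096 and r = 8 example, and the toy 4×4 weights are assumptions chosen for illustration.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_merge(W, B, A):
    """W_final = W + BA, computed once so inference uses a single matrix."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

def lora_param_counts(d, r):
    """(full fine-tuning params, LoRA params) for one d x d weight matrix."""
    return d * d, 2 * d * r

full, lora = lora_param_counts(d=4096, r=8)   # hypothetical sizes

# Rank-1 update on a 4x4 identity: B is 4x1, A is 1x4.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[1.0], [0.0], [0.0], [0.0]]
A = [[0.0, 2.0, 0.0, 0.0]]
W_final = lora_merge(W, B, A)
```

With these sizes the trainable parameter count for the matrix drops by a factor of d/(2r) = 256.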
Scaling Laws
L(x) = ax⁻ᵏ + ε [power law: loss vs compute/data/params]
Chinchilla law: optimal training ≈ 20 tokens per parameter (e.g., 70B model → 1.4T tokens)
  • Scaling D (data), C (compute), N (params) → predictable improvement.
  • World's total usable text: ~4.6T – 17.2T tokens (scaling wall incoming).
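The two rules of thumb above reduce to a few lines of arithmetic. The constants passed to `power_law_loss` are made up purely to show the shape of the curve, and both function names are hypothetical.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal training tokens under the ~20 tokens/parameter rule."""
    return n_params * tokens_per_param

def power_law_loss(x, a, k, eps):
    """L(x) = a * x^(-k) + eps; eps is the irreducible loss floor."""
    return a * x ** (-k) + eps

tokens_70b = chinchilla_tokens(70 * 10**9)   # 1.4T tokens

# Illustrative constants only (not fitted to any real model).
losses = [power_law_loss(x, a=10.0, k=0.1, eps=1.7) for x in (1e6, 1e9, 1e12)]
```

Comparing `tokens_70b` with the 4.6T to 17.2T estimate of usable text shows why data, rather than compute, may become the binding constraint for much larger models.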
Factors Beyond Scale
  • Data quality & diversity often beats raw size.
  • Mixture-of-Experts (MoE): only subset of params active per token.
  • Context window size (128k, 1M tokens) can matter more than params.
🇬🇷 L7: Greek LLMs & RAG Lecture 7
RAG: Retrieval-Augmented Generation
Query → Retriever → Retrieved chunks → LLM → Answer + Citations
Two-stage: Indexing (offline) + Retrieval/Generation (online)
RAG Pipeline
Phase | What happens
Indexing (offline) | Split docs → overlapping chunks → embed → store in vector DB
Retrieval (online) | Embed query → fetch top-K candidates → re-rank
Merge | Select top-N chunks within token budget; label as S1, S2…
Generation | LLM answers from context; refuse/flag if evidence weak
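The four phases can be traced in a toy sketch. Bag-of-words counts with cosine similarity stand in for a real embedding model and vector DB, there is no re-ranker, and the final LLM call is left out; every function name, the chunk sizes, and the sample text are assumptions.

```python
import math
from collections import Counter

def chunk(tokens, size=8, overlap=2):
    """Indexing: split into overlapping windows of `size` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=2):
    """Retrieval: rank chunks by similarity to the query (stand-in for ANN search)."""
    q = Counter(query)
    return sorted(chunks, key=lambda c: cosine(q, Counter(c)), reverse=True)[:top_k]

def build_prompt(query, retrieved):
    """Merge: label sources S1, S2, ... so the answer can cite them."""
    sources = "\n".join("[S%d] %s" % (i, " ".join(c)) for i, c in enumerate(retrieved, 1))
    return ("Answer using only the sources; say so if the evidence is weak.\n"
            + sources + "\nQuestion: " + " ".join(query))

text = ("diavgeia publishes government decisions online . "
        "the library opens at nine every day . "
        "budget decisions are signed by the minister")
chunks = chunk(text.split())
query = "who signs budget decisions".split()
prompt = build_prompt(query, retrieve(query, chunks, top_k=2))
```

The resulting `prompt` is what would be sent to the generator; a lexical stand-in like this misses paraphrases ("signs" vs "signed"), which is the gap dense embeddings are meant to close.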
RAG Variants
  • Naive RAG: basic retrieve + concatenate + generate.
  • Advanced RAG: query rewriting, hybrid search, re-ranking.
  • FLARE (Forward-Looking Active RAG): generate tentatively, retrieve when uncertain.
  • Self-RAG: model decides when to retrieve and reflects on retrieved docs.
RAG Pros & Cons
Pros:
  • Reduces hallucinations
  • Citable sources
  • Cheaper than fine-tuning
  • Up-to-date knowledge
  • Scalable (millions of docs)
Cons:
  • Complex chunking/search
  • Context window limits
  • Quality depends on retriever
  • Susceptible to counterfactual noise
Greek NLP Challenges
  • Rich morphology → many word forms per lemma → data sparsity.
  • OOV problem → FastText (subword) helps significantly.
  • Limited Greek training data → multilingual models + Greek-specific fine-tuning.
  • Greek BERT: pre-trained BERT on Greek corpus.
  • Applied to Diavgeia: Greek govt transparency platform (RAG use case).
RAG Explainability (RAG-EX)
  • Generic framework to explain which retrieved documents influenced the answer.
  • Helps users verify and trust RAG system outputs.
🀝 L8: LLM-based Agents Lecture 8
Agent Architecture
Agent = Augmented LLM + Tools + Memory + Planning
Loop: Perception → Planning → Action → Environment → (repeat)
  • Perception: receives observations from environment.
  • Planning: decides next action (often via CoT/ReAct).
  • Tools: APIs, code execution, web search, databases.
  • Memory: in-context (prompt), external (vector DB), episodic.
Agent Orchestration Patterns
Pattern | Planning | Agents | Best for
Flat ReAct | Implicit | 1 | Short tasks
Plan + Loops | Explicit plan object | 1 (+ replan) | Structured, predictable
Hierarchical | Manager decomposes | Many | Large, parallelizable
ReAct Loop
repeat:
  thought ← LLM reasons about current state
  action ← LLM selects tool + arguments
  observation ← tool returns result
  append to context
until done or T_max
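The loop above can be sketched with a scripted stand-in for the model. `scripted_llm` is a hypothetical placeholder for a real LLM call, and the (thought, action, argument) tuple, the "finish" action, and the `add` tool are conventions invented for this example.

```python
def react_loop(llm, tools, task, t_max=5):
    """Run thought/action/observation steps until 'finish' or t_max steps."""
    context = ["Task: " + task]
    for _ in range(t_max):
        thought, action, arg = llm(context)
        context.append("Thought: " + thought)
        if action == "finish":
            return arg, context
        if action in tools:
            observation = tools[action](arg)
        else:
            observation = "error: unknown tool " + repr(action)  # hallucinated tool call
        context.append("Action: %s(%s)" % (action, arg))
        context.append("Observation: " + observation)
    return None, context                         # step budget exhausted

def scripted_llm(context):
    """Hypothetical stand-in for a model: call a tool once, then finish."""
    if not any(line.startswith("Observation:") for line in context):
        return "I need to compute the sum.", "add", "2,3"
    last_observation = context[-1].split(": ", 1)[1]
    return "I have the result.", "finish", last_observation

tools = {"add": lambda s: str(sum(int(x) for x in s.split(",")))}
answer, trace = react_loop(scripted_llm, tools, "What is 2 + 3?")
```

The `t_max` bound is what stops an agent that keeps retrying without progress, and the unknown-tool branch feeds the error back as an observation rather than crashing.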
Tool Calling
  • Tools described in the system message (name, description, input/output schema).
  • LLM trained via: (a) human annotations (SFT), (b) self-supervised (Toolformer).
  • MCP (Model Context Protocol): standard for connecting LLMs to external tools (~97M monthly SDK downloads, ~2000 servers).
Key Agent Frameworks & Tools
  • LangChain: connects LLMs to tools, APIs, DBs.
  • LangGraph: structured multi-step agent workflows with decision paths.
  • Coding agents: Claude Code, Cursor, Codex, Devin, Aider.
  • Computer-use agents: Claude Computer Use, OpenAI Operator, Manus.
  • MCP servers: Slack, GitHub, Google, Salesforce, Stripe, HubSpot, Notion…
Agent Safety & Challenges
  • Hallucinated tool calls: agent calls non-existent API.
  • Infinite loops: agent keeps retrying without progress → need T_max.
  • Prompt injection: malicious content in environment hijacks agent.
  • Context window: long trajectories exceed context → need memory management.
⚑ L9: LLMs at Scale Lecture 9
Token Generation: Prefill vs Decode
Prefill: process entire prompt in parallel → compute KV cache
Decode: autoregressive generation, one token at a time
TTFT = Time to First Token (dominated by prefill)
ITL = Inter-Token Latency (dominated by decode)
KV Cache: stores K, V matrices of previous tokens → no recomputation
Batching Strategies
Strategy | Description
Static batching | Wait until batch full; wastes GPU if sequences finish early
Continuous batching | Add new requests as slots free up (vLLM default)
Chunked prefill | Break long prefills into chunks; interleave with decodes
Quantization
Method | What's quantized | Effect
W8A16 | Weights to INT8 | 2× memory savings
W4A16 | Weights to INT4 | 4× memory, slight accuracy loss
KV8/KV4 | KV cache | Scales long contexts
  • Use per-channel/per-group (32–128) granularity, not per-tensor.
  • Handle outlier channels in FP16 (SmoothQuant, AWQ, GPTQ-style).
  • Benefits only realized with optimized INT4/INT8 kernels.
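Why granularity matters can be seen with a sketch of symmetric round-to-nearest quantization, where each group of weights gets its own scale. This is the plain baseline, not SmoothQuant, AWQ, or GPTQ; the function names, group sizes, and sample weights are assumptions.

```python
def quantize_group(values, bits=8):
    """Symmetric quantization of one group: returns (ints, scale)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def quantize_per_group(weights, group_size=4, bits=8):
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, bits) for g in groups]

def dequantize(quantized):
    return [q * scale for qs, scale in quantized for q in qs]

# Four small weights followed by four outlier-sized weights.
weights = [0.1, -0.2, 0.05, 0.15, 8.0, -6.0, 7.5, 3.0]

def max_error_on_small(group_size):
    restored = dequantize(quantize_per_group(weights, group_size))
    return max(abs(w - r) for w, r in zip(weights[:4], restored[:4]))

err_grouped = max_error_on_small(4)   # small weights get their own scale
err_single = max_error_on_small(8)    # one scale shared with the outliers
```

With a single shared scale the outliers set the step size, so the small weights lose most of their precision; separate groups keep their error far lower.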
Key Inference Optimizations
  • Flash Attention: fused kernel, O(n) memory instead of O(n²); faster.
  • Speculative decoding: small draft model generates tokens, large model verifies in parallel → latency reduction.
  • Prefix caching: cache KV of common prefixes (e.g., system prompts).
  • PagedAttention (vLLM): GPU memory paging for KV cache → higher throughput.
  • Tensor parallelism: split model across GPUs (weights split by row/column).
  • Pipeline parallelism: different layers on different GPUs.
RAG at Scale (from L9 perspective)
Indexing: chunk → embed (dense) → vector DB (FAISS/Pinecone)
Retrieval: ANN (Approximate Nearest Neighbor) search
Hybrid: dense (semantic) + sparse (BM25) retrieval
Re-ranking: cross-encoder on top-K candidates
Test-Time Compute Scaling
  • Instead of just scaling training, scale inference compute.
  • CoT / self-consistency: sample multiple solutions, pick best.
  • Best-of-N sampling: generate N candidates, rank by reward model.
  • MCTS / tree search: explore reasoning trees at inference time.
vLLM Stack
Model → Runtime (vLLM) → PagedAttention + Continuous Batching → Tensor Parallel (multi-GPU) → OpenAI-compatible API
⚑ Quick Reference: All Key Formulas & Comparisons
Core IR Formulas
TF-IDF = TF(t,d) × log(N/df(t))
BM25: score = IDF × tf(k1+1)/(tf+k1(1-b+b·|d|/avgdl))
TW-IDF: tw/(1-b+b·|d|/avgdl) × log((N+1)/df(t)), with tw = in-degree in place of tf
MAP = mean of AP across queries
P@k = precision at k retrieved docs
Deep Learning Cheat
CNN: conv → max-pool → FC → softmax
RNN: hₜ = f(Wₓxₜ + Wₕhₜ₋₁)
LSTM: forget, input, output gates + candidate cell
Attention: softmax(QKᵀ/√d)V
Transformer: Self-Attn + FFN + Residual + LN
BERT vs GPT vs BART
Model | Type | Training | Best for
BERT | Encoder | MLM + NSP | Classification, NER, QA
GPT | Decoder | CLM (next token) | Generation
BART | Enc-Dec | Denoising | Summarization, Translation
T5 | Enc-Dec | Text-to-text | Multi-task NLP
Centrality Quick Ref
Measure | Formula
Degree | Cᵈ(i) = k(i)
Closeness | Cc(i) = (n-1)/Σd(i,j)
Betweenness | Cb(i) = Σ σ(s,t|i)/σ(s,t)
PageRank | PR(i) = (1-d)/N + d·Σ PR(j)/L(j)
k-core | largest subgraph with deg≥k
CoreRank | Σ core(u) for u∈neighbors(v)
LLM Fine-tuning Techniques
Method | Params Updated | Memory
Full fine-tuning | All | Very high
LoRA | A, B matrices (r≪d) | ~3× less
Prompt tuning | Soft prompt tokens | Minimal
Adapter | Small FF modules | Low
Prefix tuning | Prefix key/values | Low
Classification Algorithms
Algorithm | Strengths | Weaknesses
Naive Bayes | Fast, robust, sparse | Independence assumption
SVM (linear) | Gold standard for text | Slow on huge data
kNN | No training | Slow at test time
CNN | Local features | Needs GPU
BERT+FT | SOTA | Expensive
NLP & IR Cheatsheet · CSE UOI · 2026 · All lectures covered: L1 TF-IDF/Classif · L2 Dim Red/Embeddings · L3 GoW/IR/KE · L4 Deep NLP · L5 Attention/BERT · L6 LLMs · L7 Greek/RAG · L8 Agents · L9 Scale