Module 10: Evaluation & Benchmarking
Master LLM evaluations, MMLU benchmarks, HumanEval, LLM-as-a-Judge, Elo rating calculations, and SWE-bench code evaluations.
10.1 LLM Benchmarks
As LLMs scale, measuring their capabilities requires standardized academic benchmarks:
- MMLU (Massive Multitask Language Understanding): A large dataset of four-option multiple-choice questions spanning 57 subjects (from elementary math and history to law and professional ethics). Tests broad world knowledge and academic reasoning.
- HumanEval: Developed by OpenAI. Consists of 164 hand-written Python programming problems. The model generates code, and correctness is verified by running unit tests rather than by matching tokens against a reference solution. Results are reported as **Pass@k**: the probability that at least one of k sampled solutions passes all tests.
- GSM8K: A dataset of 8.5k high-quality, linguistically diverse grade-school math word problems. Solving them requires multi-step arithmetic reasoning.
- LMSYS Chatbot Arena: A crowd-sourced benchmark platform. Users prompt two anonymous models side by side (a blind test) and vote for the better response. The pairwise votes are aggregated into an **Elo rating system** (similar to chess), producing a human-preference leaderboard.
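Pass@k for HumanEval is usually estimated without literally drawing k samples: generate n ≥ k samples per problem, count the c samples that pass every unit test, and compute 1 − C(n−c, k)/C(n, k), then average over problems. The sketch below illustrates that estimator; the function names and the per-problem counts are hypothetical, chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: samples generated, c: samples that passed all unit tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the chance that a random size-k subset
    # contains no passing sample; comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems: (samples drawn, samples passing)
results = [(10, 0), (10, 3), (10, 10)]
print(round(mean_pass_at_k(results, k=1), 3))
```

With k = 1 the estimate reduces to the fraction of passing samples, c / n, which is a quick sanity check on any implementation.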
10.2 Automated Evaluation
Hiring human graders is slow and expensive. Modern pipelines use **LLM-as-a-Judge**: using an advanced frontier model (like GPT-4) to grade the outputs of other models.
G-Eval Framework: An LLM-as-a-judge framework that utilizes Chain-of-Thought prompts and explicit grading rubrics to evaluate qualitative properties (like coherence, conciseness, or tone) on a 1-5 scale.
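Pairwise Battle Elo Calculation: When evaluating two models ($A$ and $B$) with current ratings $R_A$ and $R_B$, the expected probability of model $A$ winning against $B$ is calculated as:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

If model $A$ wins, its rating is updated:

$$R_A' = R_A + K\,(1 - E_A)$$

where $K$ is the K-factor that controls how far a single result moves the rating.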
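A single pairwise battle can be sketched in a few lines, assuming the standard logistic Elo form (base 10, scale 400). The K-factor of 32 and the function names are illustrative assumptions, not values prescribed by any particular leaderboard.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected probability that A beats B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    k=32 is an assumed K-factor, not a prescribed value.
    """
    e_a = expected_score(r_a, r_b)
    e_b = 1.0 - e_a
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - e_b)
    return new_a, new_b


# Two equally rated models; A wins the vote.
print(update_elo(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

The update is zero-sum: whatever A gains, B loses. An upset win against a higher-rated model moves both ratings further than an expected win does.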
# Simulated LLM-as-a-judge prompt template
grading_rubric = """
Evaluate the response based on accuracy and clarity.
Score from 1 (poor) to 5 (excellent).
[Prompt]: {prompt}
[Response]: {response}
Output your evaluation in this format:
Reasoning: <your step-by-step thinking>
Score: <integer score>
"""
# The scoring pipeline parses the response and extracts the integer score...
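def parse_score(model_output):
    # Look for a line of the form "Score: X" in the judge's output
    for line in model_output.split("\n"):
        line = line.strip()
        if line.startswith("Score:"):
            try:
                return int(line.split(":")[-1].strip())
            except ValueError:
                return None  # malformed score, e.g. "Score: 4/5"
    return None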
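To show how the template and parser fit together, here is a minimal end-to-end sketch. It restates the rubric so the block runs on its own, uses a regex as one possible way to extract the score, and replaces the judge model with `fake_judge`, a hypothetical stub standing in for a real API call. All names are assumptions for illustration.

```python
import re
from statistics import mean

RUBRIC = (
    "Evaluate the response based on accuracy and clarity.\n"
    "Score from 1 (poor) to 5 (excellent).\n"
    "[Prompt]: {prompt}\n"
    "[Response]: {response}\n"
    "Output your evaluation in this format:\n"
    "Reasoning: <your step-by-step thinking>\n"
    "Score: <integer score>\n"
)

# Accept only a single digit 1-5 at the start of a "Score:" line.
SCORE_RE = re.compile(r"^Score:\s*([1-5])\b", re.MULTILINE)


def extract_score(judge_output):
    match = SCORE_RE.search(judge_output)
    return int(match.group(1)) if match else None


def fake_judge(filled_prompt):
    # Hypothetical stand-in for a call to a real judge model.
    if "Paris" in filled_prompt:
        return "Reasoning: The answer is correct and clear.\nScore: 5"
    return "Reasoning: The answer is wrong.\nScore: 1"


def grade(examples, judge=fake_judge):
    """Return the mean judge score over (prompt, response) pairs."""
    scores = []
    for prompt, response in examples:
        output = judge(RUBRIC.format(prompt=prompt, response=response))
        score = extract_score(output)
        if score is not None:  # skip judgments that cannot be parsed
            scores.append(score)
    return mean(scores) if scores else None


examples = [
    ("What is the capital of France?", "Paris."),
    ("What is the capital of France?", "Lyon."),
]
print(grade(examples))
```

Skipping unparseable judgments keeps the mean well defined, but in a real pipeline it is worth tracking how many outputs were dropped, since a high drop rate signals a prompt-format problem.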
10.3 Safety & Bias Evaluation
Beyond raw capability, models must be evaluated for safety and bias:
- TruthfulQA: Measures whether a model generates truthful responses or repeats common human myths, conspiracy theories, or rumors.
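- ToxiGen: A large-scale, machine-generated dataset of roughly 274k toxic and benign statements about 13 minority groups. Much of the toxicity is implicit (no slurs or profanity), so it is used to test whether classifiers and guardrails catch subtle hate speech and whether models avoid generating it.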
- BBQ (Bias Benchmark for QA): Evaluates social biases against demographic groups (gender, race, religion) across ambiguous and unambiguous context narratives.
- Red Teaming: Systematically attacking a model (for example, with simulated jailbreak prompts) to discover vulnerabilities before release.
10.4 Domain-Specific Evals
General benchmarks often fail to measure capabilities in specialized environments:
- SWE-bench: A benchmark that tests models on real software engineering issues gathered from GitHub. The model is given a full repository codebase and an issue description, and must generate a patch (the equivalent of a PR). The patch is applied and the repository's tests are run: tests tied to the issue must go from failing to passing, and previously passing tests must keep passing.
- AgentBench: A multi-dimensional benchmark designed to evaluate agents in interactive environments (operating system terminals, web browsers, databases).
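The SWE-bench pass criterion can be sketched as a small check over test outcomes. This assumes each task lists two groups of tests, here called `fail_to_pass` (tests that the fix should turn green) and `pass_to_pass` (tests that must stay green), mirroring the field names used in the SWE-bench dataset. The test identifiers and results below are invented for illustration, and the real harness also handles repository setup and test execution, which are omitted here.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """True if the patch resolves the task.

    test_results maps a test id to True when that test passed after the
    model's patch was applied. A test missing from the results counts
    as a failure.
    """
    def all_pass(tests):
        return all(test_results.get(t, False) for t in tests)

    return all_pass(fail_to_pass) and all_pass(pass_to_pass)


def resolve_rate(instances):
    """Fraction of task instances whose patch is judged resolved."""
    resolved = sum(is_resolved(**inst) for inst in instances)
    return resolved / len(instances)


# Hypothetical outcomes for two task instances.
instances = [
    {   # fix works and nothing else breaks
        "test_results": {"test_bug": True, "test_old": True},
        "fail_to_pass": ["test_bug"],
        "pass_to_pass": ["test_old"],
    },
    {   # fix works but introduces a regression
        "test_results": {"test_bug": True, "test_old": False},
        "fail_to_pass": ["test_bug"],
        "pass_to_pass": ["test_old"],
    },
]
print(resolve_rate(instances))  # 0.5
```

Requiring both groups to pass is what separates a genuine fix from a patch that silences the failing test while breaking other behavior.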
Next Steps
Proceed to Module 11: Production & MLOps for GenAI to learn how to deploy and monitor LLMs.
