Model training and evaluation

Notes on evaluating LLM reliability, calibration, evidence/grounding, and interpreting benchmark results for single-turn and multi-step outputs.

Core articles