Model training and evaluation
Notes on evaluating LLM reliability, calibration, evidence/grounding, and interpreting benchmark results for single-turn and multi-step outputs.
Core articles
Fluency Is Not Factuality: Why LLMs Can Sound Right and Be Wrong
Why fluent LLM outputs can still be wrong, and how to enforce evidence-locked answers (retrieval + provenance + fail-closed gates); a minimal gate sketch appears after this list.
Sycophancy in LLM Assistants: What It Is, How Training Creates It, and Why It Shows Up in Production
A technically grounded explanation of sycophancy (belief-agreement bias): what it is, what the evidence supports about prevalence, how preference optimization can produce it, and what changes in training and release practice reduce it.
Theory of Mind in LLMs: What Benchmarks Test (and What They Don't)
Evidence-anchored overview of how ToM is defined in psychology, how it is operationalized for LLM evaluation, and what current results do and do not justify.
Orders of Intentionality and Recursive Mindreading: Definitions and Use in LLM Evaluation
A precise reference for nested mental-state attribution (“orders of intentionality” / “recursive mindreading”) and how these constructs are operationalized in evaluations of humans and LLMs—without implying mechanism-level Theory of Mind.
Why “Almost Human, But Not Quite” Feels Wrong: From Clowns to AI-Generated Images and Text
Two separable mechanisms behind the “something feels off” reaction: cue-level perceptual mismatch (uncanny/cue conflict) vs AI-label effects on credibility and sharing.
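To make the "fail-closed gate" idea from the first entry concrete, here is a minimal, hypothetical sketch: an answer is released only when it is backed by retrieved evidence above a confidence threshold, and the system abstains otherwise rather than emitting a fluent guess. The `Evidence` type, `fail_closed_answer` function, and the 0.7 threshold are illustrative assumptions, not the article's implementation.

```python
# Minimal fail-closed evidence gate (illustrative sketch; all names are
# hypothetical, not taken from the article). The principle: release an answer
# only with sufficient provenance; on weak or missing evidence, abstain.
from dataclasses import dataclass


@dataclass
class Evidence:
    source_id: str  # provenance pointer, e.g. a document ID
    score: float    # retrieval/verification confidence in [0, 1]


def fail_closed_answer(draft: str, evidence: list[Evidence],
                       min_score: float = 0.7) -> str:
    """Release `draft` only if at least one piece of evidence clears
    `min_score`; otherwise fail closed and abstain."""
    supported = [e for e in evidence if e.score >= min_score]
    if not supported:
        return "I can't verify this from the available sources."
    citations = ", ".join(e.source_id for e in supported)
    return f"{draft} [sources: {citations}]"


# With strong evidence the draft passes through, with citations attached;
# with weak evidence the gate abstains instead of answering.
print(fail_closed_answer("Paris is the capital of France.",
                         [Evidence("doc-12", 0.91)]))
print(fail_closed_answer("The moon is made of cheese.",
                         [Evidence("doc-99", 0.12)]))
```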