I Built TruthGuard: A Benchmark That Exposes When AI Lies With Confidence
Most AI models don't just get things wrong — they get things wrong while sounding completely sure of themselves. I built TruthGuard to measure exactly that.
There's a problem with AI that nobody talks about enough.
It's not that AI gets things wrong. Every system gets things wrong. The real problem is that AI gets things wrong while sounding completely certain. No hesitation. No "I'm not sure about this." Just a confident, fluent, completely wrong answer.
I wanted to measure that. So I built TruthGuard.
What Is TruthGuard?
TruthGuard is a 120-question confidence calibration benchmark I designed to evaluate the metacognitive abilities of large language models.
In plain English: it tests whether an AI actually knows what it knows.
A well-calibrated model should say "I'm 90% confident" only when it's actually right 90% of the time. When a model says it's 95% confident on questions it gets wrong 40% of the time, its stated confidence (95%) is far above its actual accuracy (60%). That gap is metacognitive overconfidence, and it's what TruthGuard hunts for.
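To make that arithmetic concrete, here is a minimal Python sketch of the gap between stated confidence and observed accuracy. The function name `overconfidence_gap` and the sample data are illustrative assumptions, not code from the TruthGuard notebook.

```python
def overconfidence_gap(stated_confidence: float, outcomes: list[bool]) -> float:
    """Return stated confidence minus observed accuracy (positive = overconfident)."""
    if not outcomes:
        raise ValueError("need at least one graded answer")
    accuracy = sum(outcomes) / len(outcomes)
    return stated_confidence - accuracy


# Hypothetical data: ten answers all given at 95% confidence, four of them wrong.
graded = [True] * 6 + [False] * 4
gap = overconfidence_gap(0.95, graded)
print(f"{gap:.2f}")  # 0.35
```

A perfectly calibrated model would produce a gap of zero at every confidence level.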
Why I Built This
I was using several frontier models for a project and noticed something unsettling. The models that gave the most confident answers weren't always the most accurate ones. In some cases, the opposite was true.
That's dangerous. If you're building a product on top of an AI model, you need to know not just "can it answer correctly?" but also "does it know when it can't?"
I couldn't find a benchmark specifically designed to measure this at the metacognition level. So I built one.
How It Works
The benchmark has 120 questions split across three difficulty tiers:
- Easy — Straightforward questions where a capable model should score well and know it
- Tricky — Questions designed to expose surface-level pattern matching
- Hard/Trap — Adversarial questions that look simple but contain subtle catches
For each question, the model must provide both an answer AND a confidence score (0–100%). TruthGuard then measures:
- Calibration Error — How far off is the stated confidence from actual accuracy?
- Overconfidence Rate — How often does the model claim high confidence on wrong answers?
- Accuracy vs Confidence Correlation — Does confidence actually predict correctness?
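The three metrics above could be computed roughly as follows. This is a sketch under stated assumptions, not the notebook's actual implementation: it assumes calibration error is a binned expected calibration error, that "high confidence" means 80% or more, that the overconfidence rate is taken over wrong answers only, and that the correlation is a Pearson correlation between confidence and a 0/1 correctness flag. The `Graded` class and all function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Graded:
    confidence: float  # stated confidence, rescaled from 0-100% to [0, 1]
    correct: bool      # whether the answer was graded as right


def expected_calibration_error(results, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for r in results:
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = mean(r.confidence for r in b)
        accuracy = mean(1.0 if r.correct else 0.0 for r in b)
        ece += len(b) / len(results) * abs(avg_conf - accuracy)
    return ece


def overconfidence_rate(results, threshold=0.8):
    """Share of wrong answers given with confidence >= threshold (assumed cutoff)."""
    wrong = [r for r in results if not r.correct]
    if not wrong:
        return 0.0
    return sum(r.confidence >= threshold for r in wrong) / len(wrong)


def confidence_accuracy_correlation(results):
    """Pearson correlation between confidence and 0/1 correctness."""
    xs = [r.confidence for r in results]
    ys = [1.0 if r.correct else 0.0 for r in results]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either series is constant
    return cov / (var_x * var_y) ** 0.5


# Made-up answers for illustration only; these are NOT TruthGuard results.
results = [
    Graded(0.95, True),
    Graded(0.90, False),
    Graded(0.60, True),
    Graded(0.30, False),
]
ece = expected_calibration_error(results)
rate = overconfidence_rate(results)
corr = confidence_accuracy_correlation(results)
print(f"ECE={ece:.4f} overconfidence_rate={rate:.2f} correlation={corr:.3f}")
```

Lower calibration error and overconfidence rate are better, while a correlation near zero means stated confidence carries little information about correctness.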
What I Found
Models consistently overestimated their accuracy on the Tricky and Trap tiers, with stated confidence sometimes exceeding actual accuracy by 20–30 percentage points. Easy questions showed reasonable calibration, but the harder the question, the worse the calibration got.
The pattern: AI models are most overconfident exactly when they should be most uncertain.
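A per-tier comparison like this one can be reproduced by grouping graded answers by tier and subtracting accuracy from mean confidence. The sketch below assumes each row is a `(tier, confidence, correct)` tuple, which may not match the actual dataset schema. The function name `gap_by_tier` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def gap_by_tier(rows):
    """Map each tier to (mean confidence - accuracy); positive means overconfident.

    rows: iterable of (tier, confidence in [0, 1], correct) tuples.
    """
    conf_sum = defaultdict(float)
    hits = defaultdict(int)
    counts = defaultdict(int)
    for tier, confidence, correct in rows:
        conf_sum[tier] += confidence
        hits[tier] += int(correct)
        counts[tier] += 1
    return {t: (conf_sum[t] - hits[t]) / counts[t] for t in counts}


# Made-up rows for illustration only; these are NOT TruthGuard results.
rows = [
    ("easy", 0.9, True), ("easy", 0.9, True),
    ("trap", 0.9, True), ("trap", 0.9, False),
]
gaps = gap_by_tier(rows)
for tier, gap in gaps.items():
    print(f"{tier}: {gap:+.2f}")
```

A gap that grows from the Easy tier to the Trap tier is the signature of the pattern described above.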
Why This Matters
If you're building anything serious with AI — a medical tool, a legal assistant, a research platform — you need to know how the model behaves when it's wrong. Does it signal uncertainty? Does it hold back? Or does it barrel forward with the same confident tone regardless of whether it's right?
TruthGuard gives you a way to measure that.
Try It Yourself
The full benchmark and dataset are publicly available on Kaggle:
- Notebook: TruthGuard Benchmark — Metacognition Evaluation
- Dataset: TruthGuard Metacognition Dataset
Both are open and free to use under CC BY-SA 4.0. If you're an AI researcher or developer, run your model against it and see how it scores.
What's Next
TruthGuard v1 is just the start. I'm working on expanding the question set, adding domain-specific tiers (medicine, law, code), and building a standardized evaluation pipeline that any developer can run against any model.
If you find this useful or want to collaborate, reach out at contact@ememzyvisuals.com.
Emmanuel Ariyo is a Creative Software Developer and AI Systems Engineer building AI tools and benchmarks from Nigeria. Follow at @ememzyvisuals.