


I Built TruthGuard: A Benchmark That Exposes When AI Lies With Confidence

Most AI models don't just get things wrong — they get things wrong while sounding completely sure of themselves. I built TruthGuard to measure exactly that.

Emmanuel Ariyo

May 26, 2026 · 3 min read


There's a problem with AI that nobody talks about enough.

It's not that AI gets things wrong. Every system gets things wrong. The real problem is that AI gets things wrong while sounding completely certain. No hesitation. No "I'm not sure about this." Just a confident, fluent, completely wrong answer.

I wanted to measure that. So I built TruthGuard.

What Is TruthGuard?

TruthGuard is a 120-question confidence calibration benchmark I designed to evaluate the metacognitive abilities of large language models.

In plain English: it tests whether an AI actually knows what it knows.

A well-calibrated model should say "I'm 90% confident" only when it's actually right 90% of the time. When a model says it's 95% confident on answers it gets wrong 40% of the time, that's metacognitive overconfidence: it claims 95% but delivers 60%, a gap of 35 percentage points. That's what TruthGuard hunts for.
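To make that arithmetic concrete, here is a minimal sketch in Python. The function name `overconfidence_gap` and the sample numbers are my own illustrations, not code taken from the TruthGuard notebook.

```python
def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus observed accuracy, both on a 0-1 scale.

    Positive values mean the model claims more than it delivers.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return mean_confidence - accuracy


# Illustrative data: 10 answers, each stated at 95% confidence, 6 of them right.
stated = [0.95] * 10
graded = [True] * 6 + [False] * 4
gap = overconfidence_gap(stated, graded)
print(f"gap = {gap:.2f}")
```

A gap near zero is what good calibration looks like. A large positive gap is the overconfidence described above.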

Why I Built This

I was using several frontier models for a project and noticed something unsettling. The models that gave the most confident answers weren't always the most accurate ones. In some cases, the opposite was true.

That's dangerous. If you're building a product on top of an AI model, you need to know not just whether it can answer correctly, but whether it knows when it can't.

I couldn't find a benchmark specifically designed to measure this at the metacognition level. So I built one.

How It Works

The benchmark has 120 questions split across three difficulty tiers:

Easy — Straightforward questions where a capable model should score well and know it
Tricky — Questions designed to expose surface-level pattern matching
Hard/Trap — Adversarial questions that look simple but contain subtle catches

For each question, the model must provide both an answer AND a confidence score (0–100%). TruthGuard then measures:

Calibration Error — How far off is the stated confidence from actual accuracy?
Overconfidence Rate — How often does the model claim high confidence on wrong answers?
Accuracy vs Confidence Correlation — Does confidence actually predict correctness?
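Here is one way the three measurements above could be computed. Treat this as a sketch under assumptions rather than the exact formulas in the notebook: I use expected calibration error (a common binned definition) for calibration error, an assumed 0.8 cutoff for "high confidence", and a plain Pearson correlation between confidence and 0/1 correctness. All function names are hypothetical.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


def overconfidence_rate(confidences, correct, threshold=0.8):
    """Share of wrong answers stated at or above `threshold` (assumed cutoff)."""
    wrong = [conf for conf, ok in zip(confidences, correct) if not ok]
    if not wrong:
        return 0.0
    return sum(1 for conf in wrong if conf >= threshold) / len(wrong)


def confidence_accuracy_correlation(confidences, correct):
    """Pearson correlation between confidence and 0/1 correctness."""
    outcomes = [1.0 if ok else 0.0 for ok in correct]
    n = len(confidences)
    mean_c = sum(confidences) / n
    mean_o = sum(outcomes) / n
    cov = sum((c - mean_c) * (o - mean_o) for c, o in zip(confidences, outcomes))
    var_c = sum((c - mean_c) ** 2 for c in confidences)
    var_o = sum((o - mean_o) ** 2 for o in outcomes)
    if var_c == 0 or var_o == 0:
        return float("nan")  # undefined when either series is constant
    return cov / math.sqrt(var_c * var_o)


# Made-up sample: confidences on a 0-1 scale and graded outcomes.
confs = [0.9, 0.9, 0.6, 0.6]
right = [True, False, True, False]
print(expected_calibration_error(confs, right))
print(overconfidence_rate(confs, right))
print(confidence_accuracy_correlation(confs, right))
```

In this toy sample the model is right half the time at both confidence levels, so confidence tells you nothing about correctness. That is the failure mode the third metric is meant to surface.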

What I Found

The results were revealing. Models consistently overestimated their accuracy on the Tricky and Trap tiers — sometimes by 20–30 percentage points. Easy questions showed reasonable calibration. But the harder the question, the worse the calibration got.

The pattern: AI models are most overconfident exactly when they should be most uncertain.
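If you want to check for the same pattern in your own runs, a per-tier breakdown is the simplest view. The sketch below assumes each graded record carries `tier`, `confidence`, and `correct` fields. Those field names are hypothetical, and the records are placeholders I made up, not TruthGuard results.

```python
from collections import defaultdict


def gap_by_tier(records):
    """Return {tier: mean confidence - accuracy} for a list of graded records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["tier"]].append(record)
    gaps = {}
    for tier, items in grouped.items():
        mean_conf = sum(r["confidence"] for r in items) / len(items)
        accuracy = sum(1 for r in items if r["correct"]) / len(items)
        gaps[tier] = mean_conf - accuracy
    return gaps


# Placeholder records, invented for illustration only (not TruthGuard results).
records = [
    {"tier": "easy", "confidence": 0.9, "correct": True},
    {"tier": "easy", "confidence": 0.9, "correct": True},
    {"tier": "trap", "confidence": 0.9, "correct": True},
    {"tier": "trap", "confidence": 0.9, "correct": False},
]
for tier, gap in gap_by_tier(records).items():
    print(f"{tier}: {gap:+.2f}")
```

A positive gap that grows from the easy tier to the trap tier is the signature to look for.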

Why This Matters

If you're building anything serious with AI — a medical tool, a legal assistant, a research platform — you need to know how the model behaves when it's wrong. Does it signal uncertainty? Does it hold back? Or does it barrel forward with the same confident tone regardless of whether it's right?

TruthGuard gives you a way to measure that.

Try It Yourself

The full benchmark and dataset are publicly available on Kaggle:

Notebook: TruthGuard Benchmark — Metacognition Evaluation
Dataset: TruthGuard Metacognition Dataset

Both are open and free to use under CC BY-SA 4.0. If you're an AI researcher or developer, run your model against it and see how it scores.
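Running a model against the benchmark could look something like this. Everything in the sketch is an assumption on my part: the question fields (`question`, `answer`), the prompt wording, the `Confidence: NN%` reply format, the exact-match grading, and the `ask_model` callable, which stands in for whatever client you use. The Kaggle notebook may do all of this differently.

```python
import re

PROMPT_TEMPLATE = (
    "Answer the question, then state your confidence.\n"
    "Question: {question}\n"
    "Reply in exactly this format:\n"
    "Answer: <your answer>\n"
    "Confidence: <0-100>%"
)

ANSWER_RE = re.compile(r"Answer:\s*(.+)", re.IGNORECASE)
CONFIDENCE_RE = re.compile(r"Confidence:\s*(\d{1,3}(?:\.\d+)?)\s*%?", re.IGNORECASE)


def parse_reply(reply):
    """Extract (answer, confidence on a 0-1 scale); return None if malformed."""
    answer = ANSWER_RE.search(reply)
    confidence = CONFIDENCE_RE.search(reply)
    if not answer or not confidence:
        return None
    value = float(confidence.group(1))
    if not 0 <= value <= 100:
        return None
    return answer.group(1).strip(), value / 100


def evaluate(questions, ask_model):
    """Ask each question and grade by case-insensitive exact match."""
    results = []
    for item in questions:
        reply = ask_model(PROMPT_TEMPLATE.format(question=item["question"]))
        parsed = parse_reply(reply)
        if parsed is None:
            continue  # skip replies that ignore the requested format
        answer, confidence = parsed
        correct = answer.lower() == item["answer"].lower()
        results.append({"confidence": confidence, "correct": correct})
    return results


# Stub model so the sketch runs without any API; swap in a real client call.
def fake_model(prompt):
    return "Answer: Paris\nConfidence: 95%"


sample = [{"question": "What is the capital of France?", "answer": "Paris"}]
results = evaluate(sample, fake_model)
print(results)
```

Skipping malformed replies keeps the sketch short, but in a real run you would want to count them, since a model that often ignores the format is telling you something too.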

What's Next

TruthGuard v1 is just the start. I'm working on expanding the question set, adding domain-specific tiers (medicine, law, code), and building a standardized evaluation pipeline that any developer can run against any model.

If you find this useful or want to collaborate, reach out at contact@ememzyvisuals.com.

Emmanuel Ariyo is a Creative Software Developer and AI Systems Engineer building AI tools and benchmarks from Nigeria. Follow at @ememzyvisuals.

Tags: AI, LLM, Benchmark, TruthGuard, Python, Kaggle

