My AI-powered Spanish learning app HolaMundo generated this question: ‘What villages existed before the Romans arrived in Spain?’ The answer: ‘Íberos, Celtas…’
Here’s the problem. That information wasn’t in the article I fed it. Should I be worried?
What is hallucination, and why it matters
Hallucination is when an AI makes up information that’s not in its source material.
This creates a dilemma for a language learning app. Students practice by answering comprehension questions. If the answers aren’t in the article, how can they check their work?
This is concerning when the made-up information is wrong. The safe approach is strict grounding: use only what’s in the article. But when the added information is factual and correct, it offers an opportunity to provide additional context and enrichment.
| Strict Grounding | Educational Enrichment |
|---|---|
| Students can verify answers | Richer learning experience |
| Accurate to source | Connects to broader context |
| Safe, no misinformation | Fills knowledge gaps |
An experiment
I built a hallucination detector with four ways to measure whether content is grounded:
- Vocabulary grounding: Are the Spanish words actually in the article?
- Question answerability: Can questions be answered from the source?
- Entity consistency: Are dates/names/places from the article?
- LLM as judge: Does another AI think it’s grounded?
The grounding score is the average of the four dimension scores; the hallucination rate is the inverse, i.e. 100% minus the grounding score.
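For concreteness, here’s a minimal sketch of how the four scores combine. The dimension names and numbers are illustrative, not the app’s actual code or results.

```python
def grounding_report(scores: dict[str, float]) -> dict[str, float]:
    """Average the four dimension scores (each 0-100) and derive the hallucination rate.

    Expected keys: vocabulary_grounding, question_answerability,
    entity_consistency, llm_judge.
    """
    grounding = sum(scores.values()) / len(scores)  # simple average across dimensions
    return {
        "grounding_score": grounding,
        "hallucination_rate": 100.0 - grounding,    # hallucination is the inverse of grounding
    }

# Purely illustrative numbers:
print(grounding_report({
    "vocabulary_grounding": 100.0,
    "question_answerability": 80.0,
    "entity_consistency": 92.0,
    "llm_judge": 88.0,
}))  # {'grounding_score': 90.0, 'hallucination_rate': 10.0}
```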
I tested with two Wikipedia articles: Fernando Alonso’s biography and the History of Spain.
The initial results were alarming: both articles yielded a 40% hallucination rate, much higher than I expected. Time to dig deeper.
Lesson learned: measure the measurement tool first
The first problem I saw was that my detector was flagging normal words as hallucinations. For instance, it flagged words like “Christian” and “Muslims”, which turned out to be the English translations in the vocab flashcards. It also flagged common Spanish words like “ganó” (won) and “imagínate” (imagine) as suspicious names.
I put in fixes to check only the Spanish content, filter out common Spanish words, and skip creative content like dialogue, which is meant to be generative.
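Roughly, the entity check fix looks like this. The stoplist and function are an illustrative sketch, not the detector’s actual code, and the caller is assumed to pass only the Spanish, non-creative content.

```python
import re

# Illustrative stoplist: common Spanish words the entity check kept flagging.
COMMON_SPANISH_WORDS = {"ganó", "imagínate", "también", "después", "entonces"}

def suspicious_entities(spanish_content: str, source_article: str) -> list[str]:
    """Return capitalized tokens in the generated Spanish content that look like
    named entities (names, places, dates spelled out) but never appear in the source."""
    candidates = re.findall(r"\b[A-ZÁÉÍÓÚÑ][a-záéíóúñ]+\b", spanish_content)
    flagged = []
    for word in candidates:
        if word.lower() in COMMON_SPANISH_WORDS:
            continue  # sentence-initial common words are not entities
        if word.lower() in source_article.lower():
            continue  # grounded: the entity appears in the article
        flagged.append(word)
    return flagged
```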
After these small fixes, the entity consistency jumped from 75% to 92%.
The lesson learned here: you have to evaluate the eval tool itself.
| Metric | Before Fix | After Fix |
|---|---|---|
| Entity Consistency | 75% | 92% ⬆️ |
| Flagged Entities | 13 | 2 ⬇️ |
| False Positives | 11 | 0 ✓ |
Surprising discovery: different types of “hallucination”
After fixing the detector, I found that the flagged hallucinations fell into two categories.
1. Enrichment: accurate but unverifiable
This is the earlier example, where the app mentions the íberos even though the article doesn’t cover them. The information is historically accurate and it enriches the lesson, but students can’t verify it from the source.
From a product perspective, I’d adjust the prompt so that answers to comprehension questions stay grounded in the article. At the same time, this extra context is interesting to surface as a side comment that enriches a student’s understanding.
2. True hallucination: the real problem
This is when the app says “Fernando won his first championship in 2003” when he actually won in 2005, or makes up statistics not in the article. Thankfully, I found very little of this. The LLM was generally accurate.
Context matters: when enrichment is unacceptable
Here’s the important caveat: As a healthcare PM in my day job, I know this nuance doesn’t apply everywhere. When using patient documents as context for clinical decision-making, there’s no room for “helpful enrichment.” Every piece of information must be strictly grounded and verifiable. The stakes matter.
In healthcare, there’s zero tolerance for hallucination. In education, enrichment can enhance learning. The use case determines what’s acceptable.
Stories told by the metrics
The Fernando Alonso test showed something weird. Vocabulary grounding was 100% but question answerability was 0%. How can the two metrics vary so drastically?
It turns out my retrieval query was “vocabulary and grammar concepts”, but Wikipedia biographies don’t contain those phrases, so it retrieved reference sections, citation metadata, and footnotes instead.
The vocabulary words were there (pilot, race, team), but the biographical content wasn’t.
This is not hallucination but rather a retrieval problem.
An interesting lesson here: different content needs different queries. A biography needs an entirely different query, like “Fernando Alonso career achievements” instead of “vocabulary and grammar”.
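Here’s a minimal sketch of what content-aware queries could look like. The templates and content types are hypothetical, not the app’s actual retrieval code.

```python
# Hypothetical query templates keyed by content type.
QUERY_TEMPLATES = {
    "biography": "{subject} career achievements and life events",
    "history": "{subject} key periods, peoples, and dates",
    "generic": "{subject} main ideas and key vocabulary in context",
}

def build_retrieval_query(subject: str, content_type: str) -> str:
    """Pick a retrieval query suited to the article type instead of the
    one-size-fits-all 'vocabulary and grammar concepts' query."""
    template = QUERY_TEMPLATES.get(content_type, QUERY_TEMPLATES["generic"])
    return template.format(subject=subject)

# build_retrieval_query("Fernando Alonso", "biography")
# -> "Fernando Alonso career achievements and life events"
```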
Key takeaways
1. Not all “hallucination” is bad
From a product perspective, out-of-scope enrichment can be a feature. My plan is to label questions as “comprehension” vs. “extension”, so students know which answers they can verify from the article and which go beyond it.
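A sketch of how that labeling could work, reusing the detector’s answerability check; the names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LabeledQuestion:
    text: str
    answer: str
    label: str  # "comprehension" = answerable from the article, "extension" = enrichment

def label_question(text: str, answer: str, answerable_from_source: bool) -> LabeledQuestion:
    """Tag each generated question so students know whether the answer
    can be checked against the article or goes beyond it."""
    label = "comprehension" if answerable_from_source else "extension"
    return LabeledQuestion(text=text, answer=answer, label=label)
```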
2. The query matters as much as the model
Poor retrieval looks like hallucination, and different content types need different approaches. Before assuming the model is hallucinating, experiment with better prompting and query terms tailored to the content type.