When your RAG bot gets 'yes' wrong
We had a bug on Lingawa where our AI support chatbot would sometimes refuse to help customers with questions it absolutely should have been able to answer. The strange part was that the same question, asked slightly differently, would work fine. I'd test it ten times and get a mix of correct answers and apologetic handoffs to a human agent. Classic flaky behaviour, the kind that's easy to dismiss as 'the model being weird' until you sit down and trace what's actually happening.
This is the story of finding the cause and fixing it. I think it's a useful one because the root cause turned out to be conceptually simple, but the path to it involved chasing two wrong hypotheses first. That's pretty typical for these kinds of bugs and I want to talk about it honestly rather than skip to the answer.
What the bug looked like
Our chatbot is the first line of support on Lingawa. When a customer opens the chat widget and asks a question, the bot tries to answer using a knowledge base of articles we maintain. If it can't find relevant information, it hands off to a human support agent. Most of the time this works well. The bot resolves probably 60-70% of incoming questions without needing a person.
But Yvonne, our PM, sent me a screenshot one afternoon showing a conversation where the bot had clearly failed. A customer asked about rescheduling a lesson. The bot asked a clarifying question: "is it less than 24 hours before your next lesson?" The customer replied "no its more than 24 hours". And the bot responded with this:
I apologize, but I don't have the specific instructions for rescheduling lessons in my current knowledge base. I will connect you with a member of our team.
This was wrong on its face. We had two articles in the KB explicitly covering rescheduling with more than 24 hours notice. The bot should have given the customer a clear three-step answer. Instead it punted to a human and the customer had to wait for an agent.
What made it worse was the inconsistency. When I tested the same scenario with slightly different phrasing, it sometimes worked. "How do I reschedule a class?" followed by "No I have more than 24 hours before my lesson" produced the correct steps. But "How do I reschedule a class?" followed by "no its more than 24 hours" failed. Same intent, same KB content available, totally different outcomes.
First wrong guess: maybe the KB articles are missing
When you have a bug where retrieval fails, the first thing to check is whether the content you're trying to retrieve actually exists in good shape. KBs drift. Articles get deactivated. Cron jobs that sync content from a source of truth fail silently. So I started there.
A read-only query against the KB collection returned eight articles matching "reschedul" in title, tags, or body. Two of them were exactly what I expected: KB000003 ("How to Reschedule a Lesson") and KB000060 ("Rescheduling With Less Than 24 Hours Notice"). Both active, both well-formed, both with the relevant phrases like "24 hours" prominently in the body. KB000003 even contained the exact clarifying question the bot had asked the customer.
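The check itself was nothing special; something along these lines, where the `content` field name for the article body is my shorthand here rather than a guaranteed schema detail (the `articleId` and `title` fields match what the diagnostic later in this post prints).

```ts
import type { Collection } from 'mongodb';

// Read-only sanity check: are the rescheduling articles present, active,
// and well-formed? `title` and `tags` are as described above; `content`
// for the article body is an assumption about the schema.
async function checkRescheduleArticles(kbArticles: Collection) {
  const matches = await kbArticles
    .find({
      isActive: true,
      $or: [
        { title: /reschedul/i },
        { tags: /reschedul/i },
        { content: /reschedul/i },
      ],
    })
    .project({ articleId: 1, title: 1 })
    .toArray();

  console.log(`${matches.length} active articles match "reschedul"`);
  matches.forEach((a) => console.log(`  ${a.articleId}  ${a.title}`));
}
```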
So the content was fine. The KB wasn't the problem.
Second wrong guess: maybe the conversation context isn't being used
My next theory was that retrieval was only embedding the user's latest message, ignoring the rest of the conversation. This would explain the bug perfectly. The phrase "no its more than 24 hours" on its own has no rescheduling signal. There's no "reschedule" or "lesson" anywhere in it. If you embed that text and search a knowledge base full of articles about rescheduling lessons, you'll get noise back. The relevant articles can't match a query that has no relationship to them.
This hypothesis felt right and the code reading confirmed half of it: yes, retrieval embeds only the latest user message. The conversation history is included in the prompt sent to the LLM for generation, but not in the embedding query that selects which KB articles to pull in the first place.
But before I authorised a fix, I tested one more scenario: the customer replies with just "yes" to the bot's "is it less than 24 hours?" question. If retrieval were truly context-blind, "yes" alone should produce gibberish results. Instead, the bot replied with something that demonstrated a correct understanding of the question: "I understand that you need to change your lesson time within 24 hours."
So the bot was using context somewhere. Retrieval was supposedly context-blind, yet the bot still understood that "yes" meant "yes, less than 24 hours". That apparent contradiction broke my confidence in the hypothesis.
What was actually happening
The pipeline is more layered than I'd assumed. There are multiple stages and they don't all use the same input.
```
              user message
                   │
        ┌──────────┴──────────┐
        ▼                     ▼
[1] load history       [2] embed latest message ──→ vector
 (last 10 messages)           │
        │                     ▼
        │              [3] search KB ──→ top 3 articles
        │                     │
        └──────────┬──────────┘
                   ▼
           [4] build prompt
                   │
                   ▼
        [5] generate response
```

Here's what actually happens when a user sends a message:
1. Conversation history is loaded (the last 10 messages)
2. The latest user message, alone, is sent to the embedding model to get a vector
3. That vector is used to search the knowledge base for similar articles
4. The top three articles, the full conversation history, and the user's message are all bundled into a prompt for the language model
5. The language model generates a response
Stages 2 and 3 are the bottleneck. That's the only place retrieval happens, and the query it embeds is the latest message alone. Stages 4 and 5 have full context, but they're working with whatever articles stage 3 happened to surface. If retrieval surfaces irrelevant articles, the language model has nothing useful to work with and falls back to a default "I can't help, let me get someone" response defined in our system prompt.
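Here's a minimal sketch of that composition. The helper names (loadHistory, searchKb, buildPrompt, generateReply) are hypothetical stand-ins for our real modules; the shape is the point, not the names.

```ts
// Hypothetical helpers standing in for our real modules.
declare function loadHistory(conversationId: string, limit: number): Promise<any[]>;
declare function generateEmbedding(text: string): Promise<number[]>;
declare function searchKb(vector: number[], limit: number): Promise<any[]>;
declare function buildPrompt(input: { articles: any[]; history: any[]; userMessage: string }): string;
declare function generateReply(prompt: string): Promise<string>;

// Sketch of the five stages. The seam: only the embedding call at [2]
// sees the bare latest message; everything else gets full context.
async function handleMessage(conversationId: string, userMessage: string) {
  const history = await loadHistory(conversationId, 10);           // [1] full context loaded...
  const queryVector = await generateEmbedding(userMessage);        // [2] ...but only the bare message is embedded
  const articles = await searchKb(queryVector, 3);                 // [3] so retrieval is blind to the conversation
  const prompt = buildPrompt({ articles, history, userMessage });  // [4] generation gets everything...
  return generateReply(prompt);                                    // [5] ...but only the articles [3] surfaced
}
```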
So when the customer says "yes", here's the actual sequence: stage 2 embeds "yes", stage 3 searches with that vector and gets back random KB articles that have nothing to do with rescheduling. Stage 4 builds a prompt with those random articles plus the full conversation. Stage 5 sees the full conversation, understands the user is replying "yes" to a 24-hour question, but has no relevant article to ground its answer in. It generates something that demonstrates conversational understanding but apologises for not having instructions.
That's why the bug looked so confusing. The bot's response shows context understanding, which made me think context was being used. It was, just not at the layer that mattered.
Confirming with real data
Reading the code suggested a clear explanation, but I wanted to see it empirically before authorising a fix. We wrote a small read-only diagnostic that connected to our production database and ran the same retrieval pipeline against three queries. Query A was the actual failing case — what production currently embeds. Query B was an idealised version with all the right topic keywords. Query C was the conversation context concatenated with the user's reply, which is what the cheapest possible fix would produce.
```ts
// Read-only diagnostic: run the production retrieval path against three
// query variants. `kbArticles` is the production collection handle and
// `generateEmbedding` is the same helper the chatbot uses.
const QUERIES = [
  { label: 'A', text: 'no its more than 24 hours' },
  { label: 'B', text: 'rescheduling lesson more than 24 hours notice' },
  { label: 'C', text: 'is it less than 24 hours before your next lesson? no its more than 24 hours' },
];

async function searchSimilar(query: string) {
  const queryEmbedding = await generateEmbedding(query);
  return kbArticles
    .aggregate([
      {
        $vectorSearch: {
          index: 'kb_articles_vector_index',
          path: 'embedding',
          queryVector: queryEmbedding,
          numCandidates: 30,
          limit: 3,
          filter: { isActive: true },
        },
      },
      { $addFields: { score: { $meta: 'vectorSearchScore' } } },
    ])
    .toArray();
}

for (const q of QUERIES) {
  const results = await searchSimilar(q.text);
  console.log(`\n${q.label}: "${q.text}"`);
  results.forEach((article, i) => {
    console.log(`  [${i + 1}] ${article.articleId}  ${article.title}`);
    console.log(`      score: ${article.score.toFixed(4)}`);
  });
}
```

The results were unambiguous.
Query A returned three completely off-topic articles: one about a "Topic Locked" feature, one about opening hours, one about the "Proverb of the Day". The top similarity score was 0.79, well above our escalation threshold of 0.35. So the system thought it had found relevant content with high confidence. It just happened to be wildly wrong content.
Query B returned the correct article (KB000003) at rank one with a score of 0.90.
Query C, which is what the cheapest possible fix would actually produce, returned KB000003 at rank one with a score of 0.84.
So the fix was clear: include conversation context in the retrieval query. Not just the user's latest message, but the recent back-and-forth that establishes what the user is actually asking about.
The fix
We added a small helper function called buildRetrievalQuery that takes the user's current message and the conversation history, and produces a string that gets embedded for retrieval. The string is built from the last four text messages in the conversation, joined together.
A few details that mattered.
It only includes text content. Our conversations have other message types like category selection prompts and resolution feedback prompts. Those are pure UI scaffolding with no semantic value, so they get filtered out.
It excludes bot messages whose text matches certain phrases that indicate the bot was previously refusing to help, like "connect you with a member of our team." Without this filter, a customer who hits the bug, gets the apologetic refusal, and tries again would have the refusal language pollute their next retrieval query. It would basically train the embedding to look for "I'm sorry I can't help" articles, which is the opposite of what we want.
It degrades gracefully when there's no history. First-turn queries work the same as before.
The keyword-based escalation logic and the language model's generation prompt both still use the bare user message. We didn't want to change those because they have different requirements. The keyword check needs to be precise about the current message — we don't want stale "refund" keywords from earlier in the conversation triggering an escalation later. The generation prompt already had access to the full history through a separate code path.
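The helper looks roughly like this. buildRetrievalQuery is the real function name; the message shape (role, type, text) and the exact refusal-phrase list are simplified here for illustration.

```ts
// Simplified sketch of buildRetrievalQuery. Message fields and the
// refusal-phrase list are illustrative, not the exact production values.
const REFUSAL_PHRASES = ['connect you with a member of our team'];

interface ChatMessage {
  role: 'user' | 'bot';
  type: 'text' | 'category_select' | 'resolution_feedback';
  text?: string;
}

function buildRetrievalQuery(currentMessage: string, history: ChatMessage[]): string {
  const texts = history
    // Only real text turns carry semantic signal; UI scaffolding is dropped.
    .filter((m) => m.type === 'text' && m.text)
    // Drop bot refusals so a previous failure can't pollute the next query.
    .filter(
      (m) =>
        m.role !== 'bot' ||
        !REFUSAL_PHRASES.some((p) => m.text!.toLowerCase().includes(p))
    )
    .map((m) => m.text!);

  // The last four text messages (the current one included) establish the topic.
  // With no history this is just the bare message, same behaviour as before the fix.
  return [...texts, currentMessage].slice(-4).join(' ');
}
```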
The actual production code change is about fifty lines, plus tests. Small fix, big effect.
Verifying we didn't break anything
We pulled twenty real production conversations and replayed each one through both the old and new retrieval logic. Fifteen were known failure cases, conversations where the bot had escalated incorrectly. Five were known success cases where the bot had answered well.
Of the fifteen failures, six showed clear improvements where the correct article jumped from "not in top three" to "rank one with high confidence." None of the failures got worse. The remaining nine were either unchanged or only marginally affected, which is fine because some of those conversations had legitimately ambiguous queries where no amount of retrieval tuning will save you.
The five success cases: four were unchanged, and one turned out to be a false-positive resolution where the customer probably just gave up rather than getting a real answer. That's a separate problem.
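The replay harness itself was nothing fancy; something like the sketch below, reusing searchSimilar from the diagnostic and buildRetrievalQuery from the fix, with conversation fixtures and expectedArticleId annotations pulled by hand from production.

```ts
// Replay each saved conversation through old and new retrieval and compare
// where the expected article lands. Fixture shape is illustrative;
// searchSimilar and buildRetrievalQuery are defined in the sketches above.
interface ReplayCase {
  label: string;
  history: ChatMessage[];
  latestMessage: string;
  expectedArticleId: string; // e.g. 'KB000003', annotated by hand
}

async function replay(cases: ReplayCase[]) {
  for (const c of cases) {
    const oldResults = await searchSimilar(c.latestMessage);
    const newResults = await searchSimilar(
      buildRetrievalQuery(c.latestMessage, c.history)
    );

    const rankOf = (results: any[]) => {
      const i = results.findIndex((r) => r.articleId === c.expectedArticleId);
      return i === -1 ? 'not in top 3' : `rank ${i + 1}`;
    };

    console.log(`${c.label}: old=${rankOf(oldResults)}  new=${rankOf(newResults)}`);
  }
}
```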
So the verification said clearly: the fix lifts the cases we're targeting and doesn't regress the cases that were already working. Ship it.
What this fix means
For customers, the bot can now answer follow-up questions properly. When the bot asks "is it less than 24 hours?" and the customer replies "no", the bot will give them the actual rescheduling steps instead of apologising and putting them in the agent queue. That's a meaningful experience improvement.
For our agents, fewer escalations means more time on questions that genuinely require human judgment. For the business, lower agent load on routine questions lets us scale support coverage without proportionally scaling headcount, which matters when a small team is serving a global customer base across multiple timezones.
What we left for later
Two things we deliberately didn't fix in this PR.
First, our two rescheduling articles (the >24h one and the <24h one) are similar enough in embedding space that they often score within 0.01 of each other on relevant queries. Today the right one wins for the test queries we ran, but a slightly different phrasing could flip the rank order and the bot would feed the wrong article to the language model. The fix for this is structural: either merge the two articles, or split them so the >24h and <24h instructions embed separately, or add a reranking layer that does a more careful second pass on the top candidates. Worth doing eventually but not blocking the immediate fix.
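If we do go the reranking route, the shape would be something like this: a cheap second pass over the vector-search candidates, sketched here with a naive keyword-overlap scorer standing in for a real reranker model.

```ts
// Hypothetical second-pass rerank over the top vector-search candidates.
// The overlap scorer is a placeholder; a cross-encoder or LLM scorer would
// slot into the same place. The `content` field name is an assumption.
function rerank(
  query: string,
  candidates: { title: string; content: string; score: number }[]
) {
  const queryTerms = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));

  return candidates
    .map((c) => {
      const text = `${c.title} ${c.content}`.toLowerCase();
      const overlap = [...queryTerms].filter((t) => text.includes(t)).length;
      // Blend vector similarity with exact-term overlap so near-duplicate
      // articles (>24h vs <24h) can separate on the terms that differ.
      return { ...c, rerankScore: c.score + 0.05 * overlap };
    })
    .sort((a, b) => b.rerankScore - a.rerankScore);
}
```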
Second, I noticed during this investigation that we have no article covering "how to book a lesson." Customers who ask "how do I book a class?" get whatever happens to score highest among Lessons-category articles, which is usually something tangentially related but not actually answering their question. That's a content gap, not a retrieval problem. Filing it for the content team.
Three things I'd take away
The first is how easy it is to be confidently wrong about a hypothesis when the symptom is consistent with multiple causes. "Retrieval is context-blind" was a plausible theory that fit the initial evidence. Testing one more scenario before authorising a fix is what saved us from shipping the wrong solution. That extra ten minutes was the cheapest insurance possible.
The second is that production AI systems have layers, and bugs hide in the seams between layers. Each individual stage of our pipeline was working as designed. The bug emerged from how the stages composed. That's hard to see from inside any one stage, which is why you need to map the whole pipeline before you can find these.
The third is that empirical verification beats reasoning every time. I could have stopped after the code reading and shipped the fix, but running the actual diagnostic against production data turned a "this should work" into a "this definitely works" plus a clear list of cases that don't improve. Those non-improvements pointed at follow-up work I'd otherwise have missed.
Bugs like this are common in RAG systems and most teams will hit something like it eventually. If your bot is sometimes answering well and sometimes refusing, with no obvious pattern from the customer's perspective, check what your retrieval query actually looks like before anything else. The answer is often that you're asking the wrong question, literally.