If you've looked into adding an AI chatbot to your site in the last two years, you've encountered RAG. Retrieval-augmented generation. The idea: take a user's question, search your documents for relevant chunks, paste those chunks into a prompt alongside the question, and let a large language model generate an answer.
It's the default architecture for every AI chatbot builder, every "chat with your docs" tool, and most DIY setups. It's also fundamentally flawed for production support use cases.
How RAG actually works
The pipeline:
- Your documents get split into chunks (typically 500–1000 tokens each)
- Each chunk gets converted to a vector embedding and stored in a vector database
- When a user asks a question, their question also gets embedded
- The vector DB finds the top-K most similar chunks (usually 3–5)
- Those chunks get inserted into a prompt: "Based on the following context, answer the user's question: [chunks] Question: [user question]"
- A large model (GPT-4, Claude, etc.) generates the answer
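The whole pipeline above fits in a few dozen lines. Here's a toy sketch of it, using a bag-of-words count vector as a stand-in for a real embedding model (the names and the sample chunks are invented for illustration; a production system would swap in a learned dense embedding and a real vector DB, but the pipeline shape is the same):

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words count vector. Real pipelines use a
    # learned dense model, but the pipeline structure is identical.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=3):
    # Steps 3-4: embed the question, rank stored chunks by similarity.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, chunks, k=2):
    # Step 5: paste the top-k chunks into the prompt template verbatim.
    context = "\n".join(retrieve(question, chunks, k))
    return ("Based on the following context, answer the user's question: "
            f"{context} Question: {question}")

chunks = [
    "Refunds are issued within 30 days of purchase.",
    "Our cashback rewards program pays 2% on every order.",
    "Shipping takes 3-5 business days.",
]
prompt = build_prompt("How long do refunds take?", chunks)
```

The final step, the LLM call, is just this prompt sent to an API. Everything the model knows about your business arrives through that one string.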
Every step in this pipeline is a point of failure.
Problem 1: Retrieval misses
The entire system depends on step 4: finding the right chunks. If retrieval fails, the model gets the wrong context and generates a confident, wrong answer.
Retrieval fails more often than people admit:
Vocabulary mismatch. A customer asks "Can I get my money back?" Your return policy page uses "refund" and "return." The embedding similarity between the question and the relevant chunk might be lower than a chunk about "cashback rewards" from a different page. Wrong chunk retrieved. Wrong answer generated.
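You can see the failure shape with a deliberately crude lexical model (a word-count vector; real dense embeddings fail less often, but in the same direction when vocabulary diverges). The chunks here are invented for illustration:

```python
import math
from collections import Counter

def embed(text):
    # Toy word-count vector. Dense neural embeddings are better at synonymy,
    # but can still rank a surface-similar chunk above the right one.
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

question = "Can I get my money back?"
policy_chunk = "Refunds: items may be returned for a full refund within 30 days."
rewards_chunk = "Earn money back on every order with our cashback rewards program."

q = embed(question)
print(cosine(q, embed(policy_chunk)))   # 0.0 -- no shared vocabulary
print(cosine(q, embed(rewards_chunk)))  # positive -- "money" and "back" overlap
```

The chunk that actually answers the question scores zero; the irrelevant one scores higher. That's the retrieval miss, in miniature.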
Cross-page answers. "What size should I get if I'm between a Medium and Large in the oversized blazer?" The answer requires information from the size guide page and the product page for that specific blazer. RAG retrieves chunks independently. If the relevant chunks end up split across two pages and only one gets retrieved, the answer is incomplete.
Chunking artifacts. Your return policy says "Sale items can be returned within 14 days. All other items within 30 days." If the chunk boundary falls between those two sentences, a query about sale item returns might only retrieve the "30 days" chunk. The answer is technically generated from your content, and it's wrong.
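A naive fixed-size chunker makes this concrete. With an (arbitrarily chosen) chunk size, the boundary lands exactly between the two sentences, so the exception and the rule end up in different chunks:

```python
def chunk(text, size):
    # Naive fixed-size chunking: split every `size` characters,
    # ignoring sentence boundaries -- a common default in real pipelines.
    return [text[i:i + size] for i in range(0, len(text), size)]

policy = ("Sale items can be returned within 14 days. "
          "All other items within 30 days.")

chunks = chunk(policy, 43)
# chunks[0]: "Sale items can be returned within 14 days. "
# chunks[1]: "All other items within 30 days."
```

If only `chunks[1]` is retrieved for a question about returning a sale item, the bot confidently quotes "30 days" from your own policy page, and the 14-day exception never enters the prompt.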
Embedding model limitations. Most RAG systems use general-purpose embedding models. These models are good at semantic similarity in broad domains. They're not optimized for your specific terminology, your product names, your internal jargon. "The Aurora collection" means nothing to a general embedding model. It might rank it lower than a chunk containing the word "lights."
You can mitigate some of these with better chunking strategies, hybrid search (BM25 + vector), re-ranking, and chunk overlap. But each mitigation adds complexity, latency, and new failure modes. You're building compensating mechanisms for a fundamentally brittle foundation.
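Hybrid search, for example, typically blends a lexical score with the embedding score. A minimal sketch, with toy stand-ins for both BM25 and dense similarity (the `alpha` weight is one of the new knobs you inherit):

```python
import math
from collections import Counter

def vector_score(q, doc):
    # Stand-in for embedding cosine similarity (toy word-count vectors).
    qa, da = Counter(q.lower().split()), Counter(doc.lower().split())
    dot = sum(qa[w] * da[w] for w in qa)
    norm = (math.sqrt(sum(v * v for v in qa.values()))
            * math.sqrt(sum(v * v for v in da.values())))
    return dot / norm if norm else 0.0

def keyword_score(q, doc):
    # Stand-in for BM25: fraction of query terms appearing verbatim.
    q_terms = set(q.lower().split())
    return len(q_terms & set(doc.lower().split())) / len(q_terms)

def hybrid_score(q, doc, alpha=0.5):
    # Blend lexical and semantic signals. `alpha` is a tuning knob --
    # one more parameter to get wrong per the point above.
    return alpha * keyword_score(q, doc) + (1 - alpha) * vector_score(q, doc)
```

Each such layer (re-ranking is another model call; chunk overlap multiplies storage) patches one failure mode and introduces its own.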
Problem 2: The model doesn't know your content
This is the one that surprises people. In a RAG system, the large language model has never seen your content before. It's not trained on your docs. It receives a few chunks in the prompt and generates from that context, every single time.
This means:
No persistent understanding. The model doesn't know your product line, your policies, your brand voice, or your terminology. It's doing reading comprehension from scratch on every query. A human support agent who's worked at your company for a month has a mental model of your business. A RAG chatbot has the memory of a goldfish: it starts from zero every time.
Context window pressure. You can only fit so many chunks in the prompt. At 5 chunks of 500 tokens each, that's 2,500 tokens of context. Your entire site might be 500 pages. The model is seeing 0.5% of your content on any given query. This works when the retrieval is perfect. When it's not, the model is missing 99.5% of the information it might need.
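The arithmetic behind that 0.5% figure, as a quick sanity check (the ~1,000 tokens per page is an assumption for the estimate):

```python
chunks_per_query = 5
tokens_per_chunk = 500
pages = 500
tokens_per_page = 1000  # rough assumption for an average content page

context = chunks_per_query * tokens_per_chunk  # 2,500 tokens in the prompt
corpus = pages * tokens_per_page               # 500,000 tokens on the site
print(f"{context / corpus:.1%} of the corpus per query")  # 0.5%
```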
No reasoning across your content. Ask a RAG bot "Which of your products is best for sensitive skin?" and it'll answer based on whatever 3–5 chunks got retrieved. If your sensitive-skin product descriptions happen to use different terminology than the question, those chunks might not even be in the context. A fine-tuned model that has learned your entire catalog can reason across all products simultaneously.
Problem 3: Per-query cost, forever
Every RAG query involves:
- An embedding API call (to embed the question)
- A vector database query (to find similar chunks)
- An LLM API call (to generate the answer)
For a single query, this costs fractions of a cent. At scale, it's a real line item:
| Monthly queries | Embedding cost | LLM cost (GPT-4o-mini) | Total API cost |
|---|---|---|---|
| 1,000 | ~$0.02 | ~$1.50 | ~$1.52 |
| 10,000 | ~$0.20 | ~$15 | ~$15.20 |
| 100,000 | ~$2 | ~$150 | ~$152 |
| 500,000 | ~$10 | ~$750 | ~$760 |
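The table is just a linear function of query volume. Using the per-1,000-query rates implied above (illustrative figures; actual pricing varies by model and prompt size):

```python
# Per-query API cost implied by the table: ~$0.02 embedding + ~$1.50 LLM
# per 1,000 queries. Illustrative rates, not a quote from any provider.
EMBED_PER_QUERY = 0.02 / 1000
LLM_PER_QUERY = 1.50 / 1000

def monthly_api_cost(queries):
    # Linear in traffic: there is no flat-rate regime and no economy of scale.
    return queries * (EMBED_PER_QUERY + LLM_PER_QUERY)

for q in (1_000, 10_000, 100_000, 500_000):
    print(f"{q:>7,} queries -> ${monthly_api_cost(q):,.2f}")
```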
And that's just the API cost. Add hosting for your vector DB, your embedding pipeline, your application server, monitoring, and error handling. Realistic all-in costs are 2–3x the raw API numbers.
The problem isn't that it's expensive at any single point. The problem is that the cost scales linearly with your success. More customers, more questions, more spend. The tool that was supposed to reduce your support costs becomes a growing line item.
Problem 4: Latency floor
A RAG query has a minimum latency chain:
- Embed the question: 50–200ms
- Vector DB search: 20–100ms
- LLM generation: 500ms–3s (time to first token)
- Streaming the full response: 2–10s
Best case, your customer waits roughly 600ms before seeing anything. Typical case, 1–2 seconds. This is the floor; it can't get faster without fundamentally changing the architecture.
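Summing the best-case figures for the stages that must complete before the first token appears:

```python
# Lower-bound latency per stage, in milliseconds (best-case figures above).
# Response streaming isn't included: it happens after the first token.
STAGES_MS = {
    "embed_question": 50,
    "vector_search": 20,
    "llm_first_token": 500,
}

floor_ms = sum(STAGES_MS.values())
print(floor_ms)  # 570 -- the ~600ms best case before any text appears
```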
A fine-tuned model running on-device starts generating tokens in under 200ms. No network hop. No embedding step. No retrieval step. The model already knows the answer because it learned the content during training.
Problem 5: Hallucination from context, not ignorance
RAG hallucinations are worse than base model hallucinations because they're harder to detect.
When a base model makes something up, it often sounds vague or hedges. When a RAG model hallucinates, it does so using your content as the source. It'll cite your actual product names, your actual prices, your actual policies, and still get the answer wrong because it misunderstood the chunk or combined two unrelated chunks.
"The Aurora dress is available in sizes XS through XXL and is made from 100% silk." Maybe the Aurora dress only goes up to XL, and the 100% silk detail was from a different product whose chunk was also retrieved. The answer sounds authoritative. It references real products. And it's wrong.
Users trust answers more when they contain specific, real-sounding details. RAG hallucinations produce exactly that kind of false confidence.
The alternative: fine-tuning on your content
Instead of searching your content at query time, what if the model already knew it?
That's the fine-tuning approach. You take a small language model (0.6B–4B parameters), train it on QA pairs generated from your content, and deploy it. The model has learned your products, policies, and terminology. It doesn't retrieve chunks; it generates answers from internal knowledge.
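The training data is typically question–answer pairs in JSONL. A hypothetical sketch of the shape (both examples are invented; the exact schema depends on the training framework):

```python
import json

# Hypothetical QA pairs distilled from site content. Field names and the
# answers themselves are illustrative, not a real product's data.
examples = [
    {"question": "What is your return window?",
     "answer": "Most items can be returned within 30 days; "
               "sale items within 14 days."},
    {"question": "What sizes does the Aurora dress come in?",
     "answer": "The Aurora dress is available in sizes XS through XL."},
]

# One JSON object per line: the common JSONL convention for training sets.
jsonl = "\n".join(json.dumps(e) for e in examples)
```

During training, the model sees thousands of pairs like these and absorbs the facts into its weights, which is why no retrieval step is needed at inference time.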
Retrieval can't miss because there's no retrieval step. The knowledge is in the weights.
Cross-page reasoning works because the model was trained on content from all your pages. It can synthesize information across your entire site.
Per-query cost is zero because the model runs on the customer's device. No API calls, no server costs, no scaling tax.
Latency is sub-200ms because inference is local. No network, no embedding, no vector search.
Hallucinations are about absence, not confusion. A fine-tuned model either knows the answer or it doesn't. When it doesn't know, it says so. It doesn't construct false answers from mismatched context chunks.
The fine-tuning tradeoff
Fine-tuning isn't free. There are real costs:
Training time. You can't update the model instantly. Retraining takes 10–15 minutes. If your content changes hourly, you need to plan retrain cycles. RAG can pick up new content as soon as it's indexed.
Training cost. Each retrain uses GPU compute. Kanha includes retrains in your plan (3/month on Starter, 8/month on Pro), but it's a finite resource. RAG re-indexing is cheaper per update.
Model size limits. Small models can't memorize 10,000 pages of dense content. They work best for focused knowledge bases: a few hundred pages of product info, docs, or policies. RAG handles larger corpora better because the model doesn't need to memorize everything.
No citation. A RAG system can point to the source chunk. A fine-tuned model generates from learned knowledge; it can't easily say "this answer came from page X." For some use cases, citations matter.
These are real constraints. But for the 90% of businesses with sites under 2,000 pages, whose content changes weekly or monthly, and whose customers ask the same categories of questions, the fine-tuning approach is strictly better. Better answers, lower cost, faster responses, stronger privacy.
Where RAG still wins
RAG is the right choice when:
- Your content changes multiple times per day and freshness is critical
- Your knowledge base is 10,000+ pages and growing rapidly
- You need source citations on every answer
- You're building a general-purpose AI assistant, not a support bot
For these use cases, the retrieval architecture makes sense. The model is a general reasoning engine, and the documents are external memory. That's what RAG was designed for.
But most businesses deploying a support chatbot don't have these requirements. They have a product catalog, a help center, and a FAQ. The content changes weekly. Customers ask the same 100 types of questions. A fine-tuned small model handles this better than a billion-parameter model doing reading comprehension from scratch on every query.
Try the difference
The best way to understand this isn't reading about it; it's experiencing it. Check our showcase to chat with bots trained on real sites. Notice the speed. Notice the accuracy. Notice that these are 0.6B-parameter models outperforming GPT-4-based RAG on domain-specific questions.
Then sign up at kanha.ai, point it at your site, and have your own fine-tuned bot running in 20 minutes. Free tier, 50 pages, no credit card.
The RAG era was a stepping stone. Fine-tuned, on-device models are where this is going.