RAG in production challenges

Eduard Dolynskyi
Solution Architect at Radency
Sep 18, 2025

Retrieval-Augmented Generation (RAG) represents a significant advancement in natural language processing (NLP) and artificial intelligence (AI). By blending retrieval-based techniques with generative models, RAG systems can pull in external information on demand and use it to craft more accurate, context-aware responses.

However, unlocking the potential of the RAG framework can be tricky. This article highlights the pitfalls of working with RAG and explains why a naive approach is a dead end. More importantly, it offers specific options for overcoming those challenges, shifting your understanding from “it just works” to “here's how to make it work reliably.”

The workarounds described here are based on my experience integrating RAG into a real SaaS product for a client, as well as on publicly available third-party sources.

Getting Started With RAG

Launching a local LLM is as simple as a few terminal commands. Whether you're using Ollama, LM Studio, or running models directly through Hugging Face Transformers, getting a language model running locally feels like magic — suddenly you have a ChatGPT-like interface responding to your queries with impressive fluency.

Connecting a vector store seems equally trivial. Spin up ChromaDB, Pinecone, or FAISS with minimal configuration. The setup tutorials make it look effortless — a few API calls, some basic indexing parameters, and you're ready to store and retrieve embeddings.

Uploading files and getting responses completes the illusion of simplicity. Drag and drop your PDFs, markdown files, or text documents into the system. Watch as they get processed, chunked, and embedded into your vector database. Ask a question, and the system confidently retrieves relevant passages and generates coherent, well-formatted answers.

The illusion of functionality is perfect — and perfectly deceptive. Those first interactions are genuinely impressive. Ask about a specific policy in your employee handbook, and the system retrieves the exact section with apparent precision. Query technical documentation, and it pulls relevant details with contextual understanding that validates every promise made about RAG's potential.

What you're witnessing isn't a robust system handling complex information retrieval — it's a series of fortunate coincidences where simple queries align with well-structured documents under ideal conditions.

The underlying complexity remains hidden beneath a veneer of apparent success, creating dangerous overconfidence that leads teams to rush toward production deployment without understanding the fundamental challenges that lie ahead.

The Reality Check: What Can Go Wrong

Once you push a RAG system beyond simple demos, a familiar set of failure modes starts showing up. Token limits become a recurring headache: documents that seemed manageable in testing suddenly exceed the model's context window, and multi-document queries that should combine insights instead crash or get truncated.

Retrieval introduces its own problems — questions about vacation policy pull in office security details, or technical queries blend snippets from unrelated documents into convincing but misleading hybrid answers. Even when the right sources are retrieved, responses can vary wildly: the same question asked twice might yield two completely different answers, both delivered with confidence.

Worst of all are the fluent but factually wrong outputs. These responses are polished, authoritative, and often cite sources correctly — yet the core content is incorrect, making them especially dangerous without expert oversight.

| Issue Type | Symptoms | Impact |
|---|---|---|
| Token Limit Errors | Context length exceeded, processing failures | Complete system breakdown |
| Wrong-Document Answers | Information from unrelated sources mixed together | User trust erosion, dangerous misinformation |
| Inconsistent Responses | Same question produces different answers | Unreliable for critical applications |
| Fluent but Incorrect Outputs | Authoritative-sounding but factually wrong responses | Difficult to detect, highly dangerous |
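
To make the token-limit failure concrete, here is a minimal sketch of budgeting the prompt before it is sent to the model. The 4-characters-per-token estimate, the `reserved_for_answer` value, and the function names are assumptions for illustration; a real system would count tokens with the model's own tokenizer.

```python
from typing import List


def approx_token_count(text: str) -> int:
    """Rough estimate: about 1 token per 4 characters (an assumption,
    not a real tokenizer)."""
    return max(1, len(text) // 4)


def pack_context(chunks: List[str], question: str, context_window: int,
                 reserved_for_answer: int = 512) -> List[str]:
    """Keep chunks in ranked order until the prompt budget is used up."""
    budget = context_window - reserved_for_answer - approx_token_count(question)
    selected: List[str] = []
    used = 0
    for chunk in chunks:
        cost = approx_token_count(chunk)
        if used + cost > budget:
            break  # stop here instead of cutting a chunk mid-sentence
        selected.append(chunk)
        used += cost
    return selected
```

Checking the budget up front turns a hard "context length exceeded" failure into an explicit decision about which chunks to drop.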

The most frustrating discovery: a better model didn't fix the issues. I had assumed that upgrading the model would resolve them. Instead, more powerful models simply generated more convincing wrong answers, which made errors harder to spot and the system even less reliable for production use.

The Real Problem: It's the Data, Not the Model

The fundamental insight that eludes most RAG implementations is deceptively simple: LLMs just predict the next token — garbage in, garbage out.

If you feed poorly processed, irrelevant, or contradictory information into a system, even the most advanced model will confidently generate plausible but incorrect responses.

When my team and I were working on integrating RAG into a client's software environment, one of the challenges we had to deal with was handling diverse and unstructured document formats.

Problem: The documents came in various formats (PDF, DOCX, JSON) with inconsistent structures. Some contained tables; others were scanned images with OCR artifacts and strange characters, which made standard parsing unreliable.

Our solution: format-specific processing pipelines. We designed a separate processing pipeline for each document type and integrated metadata-based routing, specialized parsers for text and tables, and OCR cleanup for scanned documents to extract accurate, structured data.
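
The routing idea can be sketched as a small dispatch table keyed on file type. This is an illustrative outline, not the pipeline we shipped: the parser functions and the `route` helper are hypothetical names, and the PDF and DOCX branches are omitted because they would need third-party libraries.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def parse_json(raw: bytes) -> str:
    """Flatten a JSON document into 'path: value' lines."""
    data = json.loads(raw.decode("utf-8"))
    lines: List[str] = []

    def walk(node, prefix: str = "") -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{prefix}{key}.")
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, f"{prefix}{index}.")
        else:
            lines.append(f"{prefix.rstrip('.')}: {node}")

    walk(data)
    return "\n".join(lines)


def parse_text(raw: bytes) -> str:
    """Plain text and markdown: decode, replacing undecodable bytes."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical registry; PDF/DOCX parsers would be registered the same way.
PARSERS: Dict[str, Callable[[bytes], str]] = {
    ".json": parse_json,
    ".txt": parse_text,
    ".md": parse_text,
}


def route(filename: str, raw: bytes) -> dict:
    """Pick a parser from file metadata and keep that metadata with the text."""
    suffix = Path(filename).suffix.lower()
    parser = PARSERS.get(suffix)
    if parser is None:
        raise ValueError(f"No pipeline registered for {suffix!r}")
    return {"source": filename, "format": suffix, "text": parser(raw)}
```

Failing loudly on an unregistered format is deliberate: silently falling back to a generic parser is how garbled text ends up in the index.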

Before we calibrated the RAG pipelines, the system returned irrelevant or stitched-together answers. The retriever pulled passages from different documents, and the LLM blended them into a single response, even if they didn't belong together.

This led to “wrong-document” responses — where the output is technically well-phrased but not actually supported by the right source — undermining both accuracy and trust.

At that point, we rebuilt the pipelines and achieved the following:

  • Higher relevance
  • Reduced hallucinations
  • Consistency across answers
  • Better source attribution
  • Performance efficiency

Here are examples of outputs before and after calibrating the RAG pipelines:

Before RAG Calibration (Irrelevant / Misleading Outputs)

  1. Prompt: “What is the property tax for 123 Main Street?”
    Output: “The property has 3 bedrooms and a backyard.” (pulled from a listing page, mixing unrelated info instead of answering the tax question).
  2. Prompt: “Does the apartment at 45 Oak Avenue allow pets?”
    Output: “This property has a spacious kitchen and modern appliances.” (retrieved fragments from the wrong page, skipping the rental policy).
  3. Prompt: “What's the square footage of the office building on Market Street?”
    Output: “The building is located near public transport and shopping areas.” (location info stitched together, but no square footage).

After RAG Calibration (Correct / Context-Related Outputs)

  1. Prompt: “What is the property tax for 123 Main Street?”
    Output: “The annual property tax for 123 Main Street is $5,200, as stated in the municipal record.”
  2. Prompt: “Does the apartment at 45 Oak Avenue allow pets?”
    Output: “Yes, the lease agreement for 45 Oak Avenue specifies that cats and small dogs are permitted.”
  3. Prompt: “What's the square footage of the office building on Market Street?”
    Output: “The office building on Market Street has a total floor area of 42,000 sq. ft., according to the appraisal document.”
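
One way to avoid the cross-document blending shown in the "before" examples is to filter candidates by metadata before they reach the LLM. The sketch below is an assumption about how such a filter could look, not a description of the exact calibration we applied; the `Chunk` fields (`address`, `doc_type`) are hypothetical metadata attached at ingestion.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Chunk:
    text: str
    address: str    # hypothetical metadata: which property the chunk describes
    doc_type: str   # hypothetical metadata: "tax_record", "lease", "listing", ...
    score: float    # similarity score returned by the vector store


def filter_candidates(candidates: List[Chunk], address: str,
                      allowed_doc_types: Set[str]) -> List[Chunk]:
    """Keep only chunks about the requested property that come from
    document types able to answer the question, best score first."""
    wanted = address.strip().lower()
    kept = [
        c for c in candidates
        if c.address.strip().lower() == wanted and c.doc_type in allowed_doc_types
    ]
    return sorted(kept, key=lambda c: c.score, reverse=True)
```

With this filter, a tax question about 123 Main Street can no longer be answered from a listing page, even when the listing chunk has the higher similarity score.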

Now, based on our experience, let's take a look at actionable approaches to adjusting RAG pipelines.

Rebuilding the Pipelines: A New RAG Mental Model

Moving beyond naive implementations requires a fundamental shift in thinking about RAG architecture.

Instead of viewing RAG as a simple “model + vector database” system, successful implementations require understanding RAG as two interconnected but distinct pipelines: ingestion and retrieval. Each pipeline has its own complex workflow, failure modes, and optimization opportunities, all of which directly affect final system performance.

Ingestion Pipeline

The ingestion pipeline is where most RAG systems succeed or fail, yet it receives the least attention during development. This pipeline must handle the messy reality of converting human-readable documents into machine-processable knowledge.

Reading and parsing source documents involves far more than simple text extraction. PDFs contain complex layouts, tables, and embedded images that standard parsers handle poorly. Web content includes navigation elements, advertisements, and formatting artifacts that pollute the knowledge base. Even clean text documents contain implicit structure — headers, lists, and semantic relationships — that naive parsing destroys.
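
As a small illustration of removing navigation and other page chrome from web content, the sketch below uses Python's standard `html.parser`. The set of tags to skip is an assumption; real pages usually need site-specific rules, and unclosed tags would confuse this simple depth counter.

```python
from html.parser import HTMLParser
from typing import List

# Assumed list of elements that rarely carry knowledge-base content.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```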

Enterprise data isn't neatly packaged. It's scattered across cloud storage buckets, collaboration tools, databases, and SaaS applications. Each source brings its own API quirks, content formats, permission models, and metadata conventions. Successful ingestion requires connectivity across platforms while preserving critical context and metadata.

Cleaning and reshaping text requires domain-specific intelligence that no framework can provide out of the box. Technical documentation might need code snippets preserved exactly while boilerplate text is removed. Legal documents require careful handling of cross-references and citations. Marketing materials need facts separated from promotional language. Each document type demands its own preprocessing strategy.
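
For scanned documents, a cleaning step might look like the sketch below. It is a generic example built on the standard library, with rules chosen for illustration; which rules are safe depends on the document type (for instance, rejoining hyphenated line breaks can wrongly merge genuine compounds such as "well-known").

```python
import re
import unicodedata


def clean_ocr_text(text: str) -> str:
    """Apply a few conservative cleanup rules to OCR output."""
    # Normalize ligatures and width variants (e.g. the "fi" ligature).
    text = unicodedata.normalize("NFKC", text)
    # Drop control and format characters, but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces or tabs, and runs of three or more newlines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```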

Chunking based on semantic structure and user intent is perhaps the most critical and least understood aspect of RAG systems. Fixed-size chunking destroys context by splitting related concepts across boundaries. Topic-based chunking requires understanding document structure, which varies dramatically across sources. The optimal chunking strategy must consider not just document content, but also the types of questions users will ask.
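
A minimal sketch of structure-aware chunking, as opposed to fixed-size splitting, is shown below. It assumes markdown-style `#` headings separated from body text by blank lines, and the `max_chars` limit is an arbitrary illustrative value; other formats would need their own structure detection.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str, max_chars: int = 1200) -> List[Dict[str, str]]:
    """Split on headings, never across them, and tag each chunk with
    the heading it belongs to."""
    chunks: List[Dict[str, str]] = []
    heading = ""
    buffer: List[str] = []  # paragraphs collected under the current heading

    def flush() -> None:
        if buffer:
            chunks.append({"heading": heading, "text": "\n\n".join(buffer)})
            buffer.clear()

    for block in re.split(r"\n\s*\n", markdown.strip()):
        block = block.strip()
        if not block:
            continue
        match = HEADING.match(block)
        if match:
            flush()  # a new section starts: close the previous chunk
            heading = match.group(2).strip()
            continue
        # Start a new chunk if adding this paragraph would exceed the limit.
        if buffer and sum(len(p) for p in buffer) + len(block) > max_chars:
            flush()
        buffer.append(block)
    flush()
    return chunks
```

Keeping the heading as metadata lets the retriever tell a vacation-policy paragraph from an office-security paragraph even when their wording overlaps.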

Storing and caching for performance involves balancing retrieval speed against storage costs while maintaining the metadata necessary for effective filtering and ranking. Production systems require distributed architectures, queuing mechanisms, and automatic scaling to handle enterprise-scale document collections.
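
One common caching tactic, sketched here under assumptions, is to key embeddings by a hash of the chunk text so that unchanged content is never re-embedded on re-ingestion. `EmbeddingCache` and `fake_embed` are hypothetical names; `fake_embed` stands in for a real embedding model call, and a production cache would live in a persistent store rather than a dict.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """In-memory cache keyed by the SHA-256 of the chunk text."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the embedding function was called

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model (illustrative only)."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]
```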

Retrieval Pipeline

The retrieval pipeline must bridge the gap between ambiguous human queries and precise document references, handling the inherent uncertainty in both natural language understanding and relevance ranking.

| Stage | Function | Common Failures | Best Practices |
|---|---|---|---|
| Query Understanding | Process ambiguous user questions | Literal interpretation of complex queries | Expansion, clarification, intent matching |
| Relevance Filtering | Remove contextually inappropriate content | “Python debugging” returning snake handling | Domain context and intent understanding |
| Context Expansion | Include sufficient surrounding information | Insufficient context for comprehensive answers | Retrieve adjacent sections, preserve hierarchy |
| Candidate Reranking | Improve relevance beyond similarity scores | Over-reliance on vector similarity | Multiple signals: authority, recency, feedback |
| Result Truncation | Fit within model context windows | Information loss during truncation | Intelligent selection and summarization |
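
Two of these stages, candidate reranking and context expansion, can be sketched in a few lines. The weights, the recency half-life, and the `Candidate` fields are assumptions picked for illustration rather than tuned values; in practice they would be calibrated against labeled queries.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    position: int      # index of the chunk within its document
    similarity: float  # vector similarity, assumed to be in [0, 1]
    authority: float   # hypothetical source-trust score in [0, 1]
    age_days: int


def rerank(candidates: List[Candidate], w_sim: float = 0.6, w_auth: float = 0.25,
           w_recency: float = 0.15, half_life_days: float = 365.0) -> List[Candidate]:
    """Order candidates by a weighted mix of similarity, authority, and recency."""
    def score(c: Candidate) -> float:
        recency = 0.5 ** (c.age_days / half_life_days)  # exponential decay
        return w_sim * c.similarity + w_auth * c.authority + w_recency * recency

    return sorted(candidates, key=score, reverse=True)


def expand_with_neighbors(top: List[Candidate],
                          corpus: Dict[Tuple[str, int], str],
                          window: int = 1) -> List[str]:
    """Return each selected chunk together with its adjacent chunks
    from the same document, without duplicates."""
    seen = set()
    texts: List[str] = []
    for c in top:
        for pos in range(c.position - window, c.position + window + 1):
            key = (c.doc_id, pos)
            if key in corpus and key not in seen:
                seen.add(key)
                texts.append(corpus[key])
    return texts
```

Here a slightly less similar chunk from a recent, trusted policy document can outrank a stale blog post, which is the point of not relying on vector similarity alone.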

Lessons Learned & Best Practices

Hard-won insights from our team, which has navigated the transition from RAG prototype to production system:

  • Spend 80% of the time on data preparation
      ◦ Even the most sophisticated LLMs cannot overcome poor data processing
      ◦ Data quality determines system success more than model selection
      ◦ Engineering resources should prioritize data pipelines over generation optimization
  • Prioritize chunking strategy over fancy prompts
      ◦ Perfect prompts cannot rescue irrelevant context
      ◦ Well-processed information with mediocre prompts outperforms poor data with perfect prompts
      ◦ Understanding document structure and user patterns is critical
  • Don't trust framework defaults
      ◦ Default configurations optimize for demos, not production reliability
      ◦ Framework elegance often masks significant limitations beyond basic use cases
      ◦ Custom configuration is required for real-world applications
  • Design an architecture that anticipates scaling
  • Develop a user-centric design
  • Include feedback loops for continuous improvement

The Five Critical Questions

Before building any RAG system, answer these fundamental questions:

  1. What's the structure of documents?
  2. Can content be reshaped for clarity?
  3. What noise can be removed?
  4. What format aids LLM reasoning?
  5. Does chunking match user patterns?

The answers will help you to:

  • Determine parsing and chunking strategies
  • Develop a preprocessing pipeline design
  • Define filtering and cleaning approaches
  • Shape context preparation methods
  • Create an information architecture

Now, let's move to the conclusion section of the article.

The Bottom Line

The gap between hype and real-world readiness remains substantial. Despite polished demos and confident claims, RAG deployments still struggle with unreliable retrieval, inconsistent responses, and hidden failures — even on top frameworks and models. This isn't a flaw of RAG itself, but a reminder that we've underestimated the engineering effort it takes to make these systems production-ready.

The future of reliable RAG systems lies not in making the technology more magical, but in making it more engineered:

  • Success comes from thoughtful data processing, not sophisticated models
  • Comprehensive testing reveals problems demos never show
  • Reliability requires a deep understanding of how information flows from documents to users
  • Good AI is built on good data engineering fundamentals
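
As one example of testing that goes beyond a demo, the sketch below measures how often the expected source document appears among the top-k retrieved results for a set of labeled questions. The toy index, document IDs, and labeled queries are made up for illustration; `retrieve` would be your real search function.

```python
from typing import Callable, Dict, List


def hit_rate_at_k(retrieve: Callable[[str], List[str]],
                  labeled_queries: Dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose expected document ID is in the top-k results."""
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")
    hits = sum(
        1 for query, expected in labeled_queries.items()
        if expected in retrieve(query)[:k]
    )
    return hits / len(labeled_queries)


# Toy retriever standing in for a real vector search (illustrative only).
TOY_INDEX = {
    "property tax 123 main street": ["listing_123_main", "tax_record_123_main"],
    "pets 45 oak avenue": ["listing_45_oak", "listing_12_elm"],
}


def toy_retrieve(query: str) -> List[str]:
    return TOY_INDEX.get(query.lower(), [])


LABELED = {
    "property tax 123 main street": "tax_record_123_main",
    "pets 45 oak avenue": "lease_45_oak",
}
```

Tracking a number like this across pipeline changes shows whether a new chunking or parsing strategy actually improved retrieval, rather than relying on a handful of hand-picked questions.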

The sooner we abandon the illusion that RAG is a solved problem and embrace its true complexity, the sooner we can build systems that actually deliver on its remarkable potential. The challenge is less about the model than about architecture, methodology, and the understanding that reliable AI requires reliable data.