Architecting a RAG Data Pipeline with Proxies: Ingestion, Chunking, and Freshness
Retrieval-Augmented Generation (RAG) systems are only as good as the corpus behind them. A production RAG stack typically has four stages: document acquisition, preprocessing and chunking, embedding generation, and retrieval at query time. Each stage has distinct failure modes, and the acquisition stage -- pulling documents from public web sources -- is where most pipelines silently degrade.
This post walks through the architecture of a production RAG ingestion pipeline, the role proxies play in keeping it reliable, and the specific chunking and embedding tradeoffs that affect retrieval quality. For background on proxy fundamentals see our proxy rotation explainer.
The Four Stages of a RAG Pipeline
A canonical RAG pipeline, as described in Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (arXiv:2005.11401), separates a non-parametric memory (the vector store) from a parametric generator (the LLM). In practice most teams use a layered design:
- Ingestion: fetch documents from web sources, APIs, S3 buckets, or internal systems.
- Preprocessing: HTML cleaning, language detection, PII stripping, and chunking.
- Embedding: convert chunks to dense vectors via a model such as text-embedding-3-large (3072-dim) or voyage-3.
- Retrieval and reranking: nearest-neighbor search, optional BM25 hybrid, and a cross-encoder reranker.
LangChain and LlamaIndex both expose document loaders, text splitters, and vector store wrappers that map onto these stages. LlamaIndex's IngestionPipeline and LangChain's RecursiveCharacterTextSplitter are the two most common starting points.
Why Proxies Matter for RAG Ingestion
Public web sources rate-limit aggressively. A single IP pulling 50,000 product pages from a retailer, or re-crawling an SEC EDGAR mirror daily, will be throttled or blocked. Three properties of RAG ingestion make proxies effective:
- Geographic correctness: localized sites return different content per region. A US-based RAG system answering questions about French regulations needs FR-egress to see accurate source material.
- Freshness SLAs: if your RAG system promises day-old data, you cannot tolerate multi-hour back-offs. Distributing fetches across a pool keeps throughput stable.
- Session isolation: when sources vary responses by cookie or IP history, per-document sticky sessions prevent cross-contamination.
Hex Proxies customers running RAG workloads typically combine ISP proxies for high-throughput static content and residential proxies for sources with aggressive bot defenses. See our ISP vs residential decision guide for the tradeoffs.
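To make the geographic and session-isolation points concrete, here is a minimal client-side sketch. The pool contents, hostnames, and the pick_proxy helper are all hypothetical placeholders, not Hex Proxies configuration; if your provider has its own session controls, those would replace this.

```python
import hashlib

# Hypothetical pool: placeholder endpoints grouped by egress country.
PROXY_POOL = {
    "us": ["http://proxy-us-1.example:8000", "http://proxy-us-2.example:8000"],
    "fr": ["http://proxy-fr-1.example:8000", "http://proxy-fr-2.example:8000"],
}


def pick_proxy(url: str, country: str) -> str:
    """Return the same proxy for the same URL, so one document keeps one session."""
    candidates = PROXY_POOL[country]
    # Hash the URL to get a stable index; Python's built-in hash() is salted per process.
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]
```

Hashing the URL keeps the assignment stable across restarts, which matters when a source varies its responses by IP history.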
Chunking Strategy
Chunking is the single biggest lever on retrieval quality. The standard approaches:
- Fixed-size: e.g. 512 tokens with 64-token overlap. Simple, fast, loses semantic boundaries.
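- Recursive character: split on paragraph breaks, then line breaks, then spaces, falling back to the next separator only when a piece is still too large. This is how LangChain's RecursiveCharacterTextSplitter behaves with its default separators. Preserves structure on well-formatted HTML.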
- Semantic chunking: compute embedding distance between adjacent sentences, split at local maxima. Implemented in LlamaIndex's SemanticSplitterNodeParser. Better recall, 5-10x more expensive at ingest.
- Proposition-based: use an LLM to extract self-contained factual propositions (see Chen et al., Dense X Retrieval, arXiv:2312.06648). Highest retrieval quality, 20-50x cost.
Empirically, on technical documentation corpora, semantic chunking with 384-768 token targets improves nDCG@10 by 8-15% over fixed-size. On short-form news it makes little difference.
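The fixed-size baseline is simple enough to sketch in a few lines. This version assumes the input is already tokenized (whitespace tokens stand in for model tokens here), and chunk_fixed is an illustrative name rather than a library function.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, so no trailing chunk is pure overlap.
        if start + size >= len(tokens):
            break
    return chunks
```

With the defaults, each window starts 448 tokens after the previous one, so the last 64 tokens of one chunk reappear at the start of the next.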
A Minimal Ingestion Loop
    import httpx
    from llama_index.core import Document
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.vector_stores.qdrant import QdrantVectorStore

    PROXY = "http://user:pass@gate.hexproxies.com:7777"

    async def fetch(url: str) -> str:
        # Route the request through the proxy gateway; raise on 4xx/5xx responses.
        async with httpx.AsyncClient(proxy=PROXY, timeout=30) as client:
            r = await client.get(url, headers={"User-Agent": "ragbot/1.0"})
            r.raise_for_status()
            return r.text

    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
    embedder = OpenAIEmbedding(model="text-embedding-3-large")

    async def ingest(urls: list[str]):
        for url in urls:
            html = await fetch(url)
            doc = Document(text=html, metadata={"source": url})
            nodes = splitter.get_nodes_from_documents([doc])
            for node in nodes:
                # Async variant, so the embedding call does not block the event loop.
                node.embedding = await embedder.aget_text_embedding(node.get_content())
            # persist to Qdrant
This is the skeleton. Production versions add HTML-to-markdown conversion (e.g. trafilatura), language detection, dedupe by SHA-256 of normalized content, and a work queue so fetches do not block embedding calls.
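The dedupe step mentioned above can be sketched with the standard library alone. The normalization rules here (Unicode NFKC, lowercasing, whitespace collapsing) are assumptions for illustration; what counts as "the same content" is a choice you will want to tune for your corpus.

```python
import hashlib
import re
import unicodedata


def content_key(text: str) -> str:
    """SHA-256 hex digest of the text after a simple normalization pass."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(docs: list[str]) -> list[str]:
    """Keep the first document seen for each content key, preserving order."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        key = content_key(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Storing the key alongside each document also lets a later crawl skip re-embedding when the hash has not changed.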
Embedding Choices and Dimensionality
Modern embedding models are not drop-in equivalents. Relevant tradeoffs:
- text-embedding-3-small (1536-dim, OpenAI): cheapest, strong on English, weak on code.
- text-embedding-3-large (3072-dim): ~5% better on MTEB, but roughly 6x the price per token and 2x the storage per vector.
- voyage-3 and voyage-code-3: among the strongest options for English retrieval and code search respectively at the time of writing.
- bge-m3 (BAAI): open-weights, supports dense, sparse, and multi-vector in one model.
Matryoshka representation learning (Kusupati et al., arXiv:2205.13147) lets you truncate embeddings to a smaller dimension without retraining, provided the model was trained with it. That is useful when storage is the bottleneck. At 512 dims, text-embedding-3-large retains ~95% of full-dim nDCG on many benchmarks.
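Truncation itself is a slice followed by re-normalization, so that cosine and dot-product scores stay comparable. A minimal sketch, assuming the vectors come from a Matryoshka-trained model (truncate_embedding is an illustrative helper, not a library call):

```python
import math


def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]
```

OpenAI's embeddings endpoint also accepts a dimensions parameter for the text-embedding-3 models, which shortens the vector server-side; truncating locally is mainly useful when you already store full-size vectors.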
Freshness and Incremental Updates
Static-corpus RAG is the exception. Most real systems need delta updates. Patterns that work:
- ETag and Last-Modified: cheapest; only re-fetch when the source says it changed.
- Content hashing: normalize HTML, hash, skip if unchanged.
- Scheduled recrawl with TTLs: assign TTLs per document type -- news at 1h, docs at 24h, regulations at 7d.
- Change-data-capture: when the source has a feed (RSS, Atom, sitemap lastmod), poll the feed instead of crawling.
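The first and third patterns combine naturally: a TTL decides when a document is due, and validators from the previous response make the re-fetch conditional. The sketch below uses the TTLs from the list; CacheEntry and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# TTLs per document type, in seconds: news 1h, docs 24h, regulations 7d.
TTL_SECONDS = {"news": 3600, "docs": 24 * 3600, "regulations": 7 * 24 * 3600}


@dataclass
class CacheEntry:
    doc_type: str
    fetched_at: float  # Unix timestamp of the last successful fetch
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def is_due(entry: CacheEntry, now: float) -> bool:
    """True once the entry is at least as old as the TTL for its document type."""
    return now - entry.fetched_at >= TTL_SECONDS[entry.doc_type]


def conditional_headers(entry: CacheEntry) -> dict[str, str]:
    """Build validators so the server can answer 304 Not Modified."""
    headers = {}
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers
```

A 304 response means the stored copy is still current, so the document can skip chunking and embedding entirely and only its fetched_at timestamp needs updating.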
For freshness-sensitive RAG, splitting fetch across a proxy pool lets you parallelize without hitting per-IP rate limits. A 200-worker crawl with 1 req/sec per IP is far less conspicuous than a 1-IP crawl at 200 req/sec -- and both have the same aggregate throughput.
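The pool-sizing arithmetic behind that comparison is worth writing down. A rough sketch, assuming a fixed per-IP request budget and ignoring retries and latency (ips_needed is an illustrative helper):

```python
import math


def ips_needed(urls: int, deadline_s: float, per_ip_rps: float) -> int:
    """Smallest pool size that finishes `urls` fetches within `deadline_s` seconds."""
    required_rps = urls / deadline_s
    return max(1, math.ceil(required_rps / per_ip_rps))
```

For example, re-crawling 50,000 pages within one hour at 1 req/sec per IP needs about 13.9 req/sec in aggregate, which rounds up to 14 IPs. In practice you would add headroom for failures and slow responses.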
Retrieval Quality: Dense, Sparse, and Hybrid
Pure dense retrieval loses on keyword-heavy queries (product SKUs, API names, legal citations). The standard fix is hybrid retrieval with Reciprocal Rank Fusion (Cormack et al., 2009):
$\mathrm{score}(d) = \sum_{i} \frac{1}{k + \mathrm{rank}_i(d)}$, summed over all rankers $i$, with $k$ typically 60.
BM25 + dense + RRF consistently outperforms either alone by 5-12% nDCG@10. For high-stakes retrieval add a cross-encoder reranker (e.g. bge-reranker-v2-m3) over the top 50 candidates; latency cost is ~100-300ms but precision gains are substantial.
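The fusion formula translates almost directly into code. This sketch assumes each ranker returns an ordered list of document IDs with ranks starting at 1, and that a document missing from a ranker's list contributes nothing for that ranker.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs; returns (doc_id, score) pairs, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc ID for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Given a BM25 list ["a", "b", "c"] and a dense list ["b", "c", "d"], the documents that appear in both lists ("b" and "c") rise above "a", even though "a" was BM25's top hit.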
Another technique worth citing: HyDE (Gao et al., Precise Zero-Shot Dense Retrieval without Relevance Labels, arXiv:2212.10496). The LLM generates a hypothetical answer, you embed that, and retrieve against it. Works well when queries are short and underspecified.
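The control flow is small enough to sketch with the model calls injected as plain callables. The generate, embed, and search parameters and the prompt wording are placeholders for whatever LLM, embedding model, and vector store you use. This single-sample version is a simplification; the paper averages the embeddings of several generated documents.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],            # LLM call: prompt -> text
    embed: Callable[[str], Vector],            # embedding call: text -> vector
    search: Callable[[Vector, int], list[str]],  # vector store: (vector, k) -> doc IDs
    top_k: int = 5,
) -> list[str]:
    """Retrieve using the embedding of a generated passage instead of the raw query."""
    prompt = f"Write a short passage that answers the question.\nQuestion: {query}\nPassage:"
    hypothetical = generate(prompt)
    return search(embed(hypothetical), top_k)
```

Because the hypothetical passage is only used as a search key, factual errors in it matter less than whether it resembles the documents you want to find.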
Monitoring and Observability
RAG pipelines fail quietly. The chunks you retrieve look reasonable but are subtly stale, in the wrong language, or drawn from a corpus that is missing key documents. Instrument:
- Ingest health: fetched URLs per hour, HTTP status distribution per proxy region, dedupe rate.
- Embedding cost: tokens embedded per day, cost per document.
- Retrieval quality: offline eval with a held-out query set; track nDCG@10 and answer-faithfulness over time.
- End-to-end evals: RAGAS (arXiv:2309.15217) for faithfulness, answer relevance, and context precision.
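For the offline retrieval eval, nDCG@10 is straightforward to compute yourself. A minimal sketch, assuming graded relevance labels per query and the linear-gain form of DCG (some toolkits use an exponential gain instead, so absolute numbers may differ):

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query; `relevance` maps doc ID to its graded label."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Averaging this over the held-out query set after every ingest run gives a single number to chart, which is usually enough to catch a chunking or freshness regression early.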
Putting It Together
A production RAG pipeline is less about clever prompts and more about boring ingestion discipline: reliable fetches, deterministic chunking, cheap dedupe, and measurable retrieval quality. Proxies sit at the base of that stack. Without them, the acquisition layer becomes the silent bottleneck that caps everything above it.
If you are building a RAG system with web-sourced data, see our AI data collection use case page for proxy configurations tuned for corpus building.