Using Proxies with LangChain and LlamaIndex: Integration Guide
Last updated: April 2026 | Author: Hex Proxies Team
The rise of RAG (Retrieval-Augmented Generation) applications has created a new category of web data consumers: AI frameworks that need to ingest web content as context for language model responses. LangChain and LlamaIndex, two widely used RAG frameworks, both include web loading components that fetch, parse, and index web content. In production, these loaders encounter the same challenges as traditional web scrapers: rate limiting, IP blocking, geo-restrictions, and anti-bot detection.
This guide shows how to integrate proxy infrastructure into LangChain and LlamaIndex pipelines for reliable, scalable web data ingestion.
Why RAG Pipelines Need Proxies
Web Loader Limitations
Out of the box, LangChain's WebBaseLoader and LlamaIndex's SimpleWebPageReader make requests from your server's IP address. In production, this creates several problems:
- IP blocking: Target sites may block your server's IP after as few as a few hundred requests
- Rate limiting: Sites impose per-IP rate limits that throttle ingestion speed
- Geo-restrictions: Content varies by geography, and your server's location determines what you see
- Bot detection: Datacenter IPs are often flagged quickly by sophisticated anti-bot systems
- Single point of failure: If your IP is blocked, the entire ingestion pipeline stops
Scale Requirements
A production RAG application that indexes thousands of web pages needs to fetch concurrently, retry failed requests through different IPs, access geo-specific content, and maintain consistent ingestion rates. Proxy infrastructure addresses the IP-related requirements (rotation, retries from fresh IPs, and geo-targeting), while concurrency and pacing are handled in the loader code.
LangChain Proxy Integration
Method 1: Environment Variable Configuration
The simplest approach sets proxy environment variables that Python's requests library (used internally by LangChain) respects:
import os
# Set proxy for all HTTP requests in the process
os.environ["HTTP_PROXY"] = "http://USERNAME-country-us:PASSWORD@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://USERNAME-country-us:PASSWORD@gate.hexproxies.com:8080"
from langchain_community.document_loaders import WebBaseLoader
# All web requests now route through Hex Proxies
loader = WebBaseLoader(["https://example.com/page1", "https://example.com/page2"])
documents = loader.load()
print(f"Loaded {len(documents)} documents via proxy")
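Because these variables apply to the whole process, other HTTP clients that honor them (for example, calls to a local vector store) may also be routed through the proxy. The sketch below shows one way to set the variables in both upper and lower case and exempt local hosts via NO_PROXY. The helper name `set_proxy_env` and the exemption list are assumptions for illustration, not part of LangChain or Hex Proxies.

```python
import os
import urllib.request


def set_proxy_env(proxy_url, no_proxy=("localhost", "127.0.0.1")):
    """Set proxy variables in both cases, plus exemptions for local services.

    Hypothetical helper: the default exemption list is only an example.
    """
    for name in ("HTTP_PROXY", "HTTPS_PROXY"):
        os.environ[name] = proxy_url
        os.environ[name.lower()] = proxy_url
    exemptions = ",".join(no_proxy)
    os.environ["NO_PROXY"] = exemptions
    os.environ["no_proxy"] = exemptions


PROXY_URL = "http://USERNAME-country-us:PASSWORD@gate.hexproxies.com:8080"
set_proxy_env(PROXY_URL)

# Standard-library view of the proxy settings now in effect
print(urllib.request.getproxies_environment())
# Truthy when the host is exempt from proxying
print(urllib.request.proxy_bypass_environment("localhost"))
```

Add the hostnames of any internal services to the exemption list so that only outbound web fetches use proxy bandwidth.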
Method 2: Custom Session with Proxy
For more control, create a custom requests session with proxy configuration:
import requests
from langchain_community.document_loaders import WebBaseLoader

def create_proxy_session(country="us"):
    """Create a requests session configured with Hex Proxies."""
    session = requests.Session()
    proxy_url = f"http://USERNAME-country-{country}:PASSWORD@gate.hexproxies.com:8080"
    session.proxies = {
        "http": proxy_url,
        "https": proxy_url
    }
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/133.0.0.0"
    })
    return session

# Use custom session with WebBaseLoader
session = create_proxy_session(country="us")
loader = WebBaseLoader(
    web_paths=["https://example.com/data"],
    requests_kwargs={"verify": True},
    session=session
)
documents = loader.load()
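The proxy URLs in this guide embed credentials directly, which breaks if a password contains characters such as `@` or `:`. A minimal sketch of a URL builder that percent-encodes the credentials follows. The function name `build_proxy_url` is hypothetical, and the username format (`-country-` and `-sessid-` suffixes) is simply the one used in this guide's examples.

```python
from urllib.parse import quote

GATEWAY = "gate.hexproxies.com:8080"  # gateway host:port used throughout this guide


def build_proxy_url(username, password, country="us", session_id=None):
    """Assemble a proxy URL, percent-encoding the credentials.

    Hypothetical helper; the suffix format mirrors the examples in this guide.
    """
    user = f"{username}-country-{country}"
    if session_id:
        user += f"-sessid-{session_id}"
    # safe="" forces reserved characters such as "@" and ":" to be encoded
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{GATEWAY}"


proxy_url = build_proxy_url("USERNAME", "p@ss:word", country="de", session_id="rag-1234")
print(proxy_url)
```

The resulting string can be used wherever the examples build a proxy URL by hand, such as the `session.proxies` dictionary in Method 2.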
Method 3: Custom Web Loader with Rotation
For production RAG pipelines that ingest from many sources, build a custom loader with proxy rotation and error handling:
import requests
import time
import random
from typing import List
from langchain_core.documents import Document
from langchain_community.document_loaders.base import BaseLoader
from bs4 import BeautifulSoup

class ProxyWebLoader(BaseLoader):
    """LangChain document loader with Hex Proxies integration."""

    GATEWAY = "gate.hexproxies.com:8080"

    def __init__(
        self,
        urls: List[str],
        username: str,
        password: str,
        country: str = "us",
        delay_range: tuple = (1, 3),
        max_retries: int = 3
    ):
        self.urls = urls
        self.username = username
        self.password = password
        self.country = country
        self.delay_range = delay_range
        self.max_retries = max_retries

    def _get_proxy_url(self, session_id=None):
        auth = f"{self.username}-country-{self.country}"
        if session_id:
            auth += f"-sessid-{session_id}"
        return f"http://{auth}:{self.password}@{self.GATEWAY}"

    def _fetch_page(self, url: str) -> str:
        for attempt in range(self.max_retries):
            # A new random session ID per attempt, so each retry uses a different proxy session
            proxy_url = self._get_proxy_url(session_id=f"rag-{random.randint(1000, 9999)}")
            proxies = {"http": proxy_url, "https": proxy_url}
            try:
                response = requests.get(
                    url,
                    proxies=proxies,
                    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/133.0.0.0"},
                    timeout=30
                )
                if response.status_code == 200:
                    return response.text
            except requests.exceptions.RequestException:
                pass
            # Wait before the next attempt
            time.sleep(random.uniform(*self.delay_range))
        # All attempts failed; the caller skips empty results
        return ""

    def load(self) -> List[Document]:
        documents = []
        for url in self.urls:
            html = self._fetch_page(url)
            if html:
                soup = BeautifulSoup(html, "html.parser")
                text = soup.get_text(separator="\n", strip=True)
                documents.append(Document(
                    page_content=text,
                    metadata={"source": url}
                ))
            # Pause between URLs to pace requests
            time.sleep(random.uniform(*self.delay_range))
        return documents

# Usage
loader = ProxyWebLoader(
    urls=["https://example.com/page1", "https://example.com/page2"],
    username="YOUR_USERNAME",
    password="YOUR_PASSWORD",
    country="us"
)
docs = loader.load()
print(f"Loaded {len(docs)} documents")
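The loader above waits a uniformly random delay between attempts. One common alternative, not used in the original loader, is exponential backoff with jitter, where the upper bound on the wait doubles with each failed attempt. The sketch below assumes a hypothetical `backoff_delay` helper; its name and default values are illustrative only.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Return a wait time in seconds for a zero-based retry attempt.

    "Full jitter" variant: pick uniformly between 0 and
    min(cap, base * 2**attempt). Names and defaults are illustrative.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


# Print one sampled delay per attempt; values differ on every run
for attempt in range(5):
    print(attempt, round(backoff_delay(attempt), 2))
```

To try it in `_fetch_page`, the `time.sleep(random.uniform(*self.delay_range))` call inside the retry loop could be swapped for `time.sleep(backoff_delay(attempt))`.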
LlamaIndex Proxy Integration
SimpleWebPageReader with Proxy
import os
# Set proxy environment variables before any requests are made
os.environ["HTTP_PROXY"] = "http://USERNAME-country-us:PASSWORD@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://USERNAME-country-us:PASSWORD@gate.hexproxies.com:8080"
from llama_index.readers.web import SimpleWebPageReader
reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=["https://example.com/data"])
print(f"Ingested {len(documents)} documents via proxy")
Custom LlamaIndex Reader with Proxy Rotation
import requests
import random
from typing import List
from llama_index.core.readers.base import BaseReader
from llama_index.core import Document
from bs4 import BeautifulSoup

class HexProxyWebReader(BaseReader):
    """LlamaIndex reader with Hex Proxies rotation for reliable ingestion."""

    def __init__(self, username: str, password: str, country: str = "us"):
        self.username = username
        self.password = password
        self.country = country
        self.gateway = "gate.hexproxies.com:8080"

    def _make_proxy_url(self):
        # A new random session ID per call, so each URL uses a different proxy session
        session = f"llama-{random.randint(10000, 99999)}"
        auth = f"{self.username}-country-{self.country}-sessid-{session}"
        return f"http://{auth}:{self.password}@{self.gateway}"

    def load_data(self, urls: List[str]) -> List[Document]:
        documents = []
        for url in urls:
            proxy_url = self._make_proxy_url()
            proxies = {"http": proxy_url, "https": proxy_url}
            try:
                response = requests.get(
                    url,
                    proxies=proxies,
                    headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/133.0.0.0"},
                    timeout=30
                )
                if response.status_code == 200:
                    soup = BeautifulSoup(response.text, "html.parser")
                    text = soup.get_text(separator="\n", strip=True)
                    documents.append(Document(text=text, metadata={"url": url}))
            except requests.exceptions.RequestException as e:
                print(f"Failed to load {url}: {e}")
        return documents

# Usage with LlamaIndex pipeline
from llama_index.core import VectorStoreIndex

reader = HexProxyWebReader("YOUR_USERNAME", "YOUR_PASSWORD", country="us")
docs = reader.load_data([
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3"
])

# Build index from proxy-fetched documents
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key points")
print(response)
Production Architecture Patterns
Proxy-Enabled RAG Pipeline
┌─────────────────────────────────────────────────┐
│              RAG Application Layer              │
│       LangChain / LlamaIndex query engine       │
└────────────────────────┬────────────────────────┘
                         │ query
                         ▼
                ┌────────────────┐
                │  Vector Store  │
                │  (embeddings)  │
                └────────┬───────┘
                         │ indexed from
                         ▼
┌─────────────────────────────────────────────────┐
│        Proxy-Enabled Ingestion Pipeline         │
│   Custom loader → Hex Proxies → Target sites    │
│            gate.hexproxies.com:8080             │
│     Rotation | Geo-targeting | Retry logic      │
└─────────────────────────────────────────────────┘
Concurrent Ingestion with asyncio
import aiohttp
import asyncio
import random
from typing import List, Dict

async def fetch_with_proxy(
    session: aiohttp.ClientSession,
    url: str,
    username: str,
    password: str,
    country: str = "us"
) -> Dict:
    """Fetch a URL through Hex Proxies with async support."""
    sessid = f"async-{random.randint(10000, 99999)}"
    proxy_url = f"http://{username}-country-{country}-sessid-{sessid}:{password}@gate.hexproxies.com:8080"
    try:
        async with session.get(url, proxy=proxy_url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            if resp.status == 200:
                text = await resp.text()
                return {"url": url, "content": text, "status": "success"}
            # Non-200 response: report the status code while the response is still open
            return {"url": url, "content": "", "status": f"http_{resp.status}"}
    except Exception as e:
        return {"url": url, "content": "", "status": f"error: {e}"}

async def batch_ingest(urls: List[str], username: str, password: str, concurrency: int = 10):
    """Ingest multiple URLs concurrently through proxies."""
    # Cap the number of requests in flight at any one time
    semaphore = asyncio.Semaphore(concurrency)

    async def limited_fetch(session, url):
        async with semaphore:
            return await fetch_with_proxy(session, url, username, password)

    async with aiohttp.ClientSession() as session:
        tasks = [limited_fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    # Keep only successful fetches; failed URLs are dropped
    return [r for r in results if r["status"] == "success"]

# Usage
urls = [f"https://example.com/page/{i}" for i in range(100)]
results = asyncio.run(batch_ingest(urls, "YOUR_USERNAME", "YOUR_PASSWORD"))
print(f"Successfully ingested {len(results)} pages")
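The semaphore in `batch_ingest` caps total concurrency, but all in-flight requests can still target the same site. A possible refinement is one semaphore per hostname, sketched below with a simulated fetch so it runs without network access. `DomainLimiter` and `fake_fetch` are hypothetical names introduced for this illustration.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class DomainLimiter:
    """Cap the number of in-flight requests per hostname (hypothetical helper)."""

    def __init__(self, per_domain: int = 2):
        # One semaphore per hostname, created lazily on first use
        self._semaphores = defaultdict(lambda: asyncio.Semaphore(per_domain))

    def for_url(self, url: str) -> asyncio.Semaphore:
        return self._semaphores[urlsplit(url).hostname or ""]


async def demo():
    limiter = DomainLimiter(per_domain=2)
    in_flight = defaultdict(int)
    peak = defaultdict(int)

    async def fake_fetch(url):
        # Stand-in for a real proxied request; only sleeps briefly
        host = urlsplit(url).hostname
        async with limiter.for_url(url):
            in_flight[host] += 1
            peak[host] = max(peak[host], in_flight[host])
            await asyncio.sleep(0.01)
            in_flight[host] -= 1

    urls = [f"https://example.com/page/{i}" for i in range(6)]
    urls += [f"https://example.org/page/{i}" for i in range(6)]
    await asyncio.gather(*(fake_fetch(u) for u in urls))
    return dict(peak)


peak = asyncio.run(demo())
print(peak)  # highest number of simultaneous requests seen per hostname
```

In a real pipeline, the `async with limiter.for_url(url)` block would wrap the call to `fetch_with_proxy`, in addition to or instead of the global semaphore.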
Cost Estimation for RAG Pipelines
| Pipeline Scale | Pages Indexed | Avg. Page Size | Total Bandwidth | Proxy Cost |
|---|---|---|---|---|
| Small knowledge base | 500 pages | 200 KB | 100 MB | $0.17 |
| Medium RAG application | 5,000 pages | 300 KB | 1.5 GB | $2.55 |
| Large-scale ingestion | 50,000 pages | 300 KB | 15 GB | $25.50 |
| Enterprise continuous ingestion | 500,000 pages/month | 300 KB | 150 GB | $255 |
At $1.70/GB, proxy costs for RAG ingestion are typically small compared to LLM API costs, vector database hosting, and compute infrastructure. Even the enterprise-scale pipeline ingesting 500,000 pages monthly costs about $255 in proxy bandwidth through Hex Proxies.
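The figures in the table follow from pages × average page size × price per GB. The sketch below reproduces them, assuming decimal units (1 GB = 1,000,000 KB) and the $1.70/GB rate quoted above. The function name `estimate_proxy_cost` is illustrative.

```python
PRICE_PER_GB = 1.70  # residential rate quoted in this guide


def estimate_proxy_cost(pages, avg_page_kb, price_per_gb=PRICE_PER_GB):
    """Estimate proxy bandwidth cost in dollars (illustrative helper).

    Assumes decimal units: 1 GB = 1,000,000 KB.
    """
    gigabytes = pages * avg_page_kb / 1_000_000
    return round(gigabytes * price_per_gb, 2)


scenarios = [
    ("Small knowledge base", 500, 200),
    ("Medium RAG application", 5_000, 300),
    ("Large-scale ingestion", 50_000, 300),
    ("Enterprise continuous ingestion", 500_000, 300),
]
for label, pages, kb in scenarios:
    print(f"{label}: ${estimate_proxy_cost(pages, kb):.2f}")
```

Retries and re-ingestion of changed pages add to the bandwidth total, so treat these as lower bounds for a given page count.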
Best Practices for RAG + Proxy Integration
- Separate ingestion from serving: Run proxy-based ingestion as a background pipeline, not in the request path of your RAG application
- Cache aggressively: Once content is ingested and indexed, serve from your vector store; do not re-fetch through proxies for every query
- Use geo-targeting for locale-specific content: If your RAG application needs to answer questions about region-specific information, ingest content through proxies in the relevant geography
- Implement incremental updates: Track Last-Modified headers (and ETag, where available) and only re-ingest content that has changed
- Rate limit per domain: Even with proxy rotation, be respectful of target sites by limiting request rates per domain
- Monitor ingestion quality: Track success rates, content freshness, and data quality metrics for your proxy-based ingestion pipeline
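The incremental-update practice above can be sketched with HTTP conditional requests: send the validators stored from the previous fetch, and treat a 304 Not Modified response as "nothing to re-index". The helper names below are hypothetical, and the sketch assumes your pipeline stores each page's ETag and Last-Modified values alongside its documents.

```python
from typing import Dict, Optional


def conditional_headers(
    etag: Optional[str] = None,
    last_modified: Optional[str] = None,
) -> Dict[str, str]:
    """Build request headers from validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_reindex(status_code: int) -> bool:
    """304 Not Modified means the stored copy is still current."""
    return status_code != 304


# Example: validators recorded during an earlier ingestion run (made-up values)
headers = conditional_headers(
    etag='"abc123"',
    last_modified="Wed, 01 Apr 2026 10:00:00 GMT",
)
print(headers)
print(needs_reindex(304), needs_reindex(200))
```

Pass the resulting dictionary as extra request headers in your loader. Servers that do not support validators simply return 200 with the full body, so the fallback is a normal re-ingest.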
Frequently Asked Questions
Do I need proxies for LangChain web loading?
For development and small-scale prototypes, direct access works fine. In production, where you ingest from dozens or hundreds of sources continuously, proxies usually become necessary to avoid IP blocks, rate limits, and geo-restrictions. At $1.70/GB, residential proxy bandwidth is a small cost for the reliability it provides.
Which proxy type should I use for RAG ingestion?
Use residential proxies for sites with anti-bot protection (most commercial websites, social media, news sites). Use ISP proxies ($0.83/IP) for persistent monitoring of specific sources that you scrape repeatedly (documentation sites, APIs, government portals). The residential/ISP split depends on your target sources.
How do proxies affect ingestion speed?
Proxies add 100-500ms of latency per request due to the additional network hop. For batch ingestion, this is offset by the ability to make concurrent requests through different IPs without hitting rate limits. A pipeline that would be throttled to 1 request/second on a single IP can make 10-50 concurrent requests through proxies, dramatically improving total throughput.
Can I use Hex Proxies with other RAG frameworks like Haystack?
Yes. Any Python-based RAG framework that uses the requests library or respects HTTP_PROXY/HTTPS_PROXY environment variables works with Hex Proxies. The integration pattern is the same: configure gate.hexproxies.com:8080 as the proxy endpoint in your HTTP client configuration.
What about JavaScript-heavy sites that need rendering?
For sites that require JavaScript rendering, use a headless browser (Playwright, Selenium) with proxy configuration instead of simple HTTP-based loaders. Both LangChain and LlamaIndex support Playwright-based loaders that can be configured with proxy settings. See our guide on proxy configuration for client-specific setup details.
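As a rough illustration of the headless-browser route, the sketch below builds the `proxy` option that Playwright's browser launch accepts (`server`, `username`, `password`) and passes it to Chromium. The helper names are hypothetical, and it assumes the gateway accepts browser traffic with the same geo-targeted username format used elsewhere in this guide.

```python
def playwright_proxy(username, password, country="us",
                     gateway="gate.hexproxies.com:8080"):
    """Build the proxy settings dict for Playwright's launch() (hypothetical helper)."""
    return {
        "server": f"http://{gateway}",
        "username": f"{username}-country-{country}",
        "password": password,
    }


def fetch_rendered_html(url, proxy):
    """Render a page in headless Chromium through the proxy and return its HTML."""
    # Imported lazily so playwright_proxy() works without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(proxy=proxy)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


settings = playwright_proxy("YOUR_USERNAME", "YOUR_PASSWORD")
print(settings["server"])
# html = fetch_rendered_html("https://example.com/app", settings)  # needs Playwright + browsers
```

The returned HTML can then go through the same BeautifulSoup text extraction used by the custom loaders above.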