Using Proxies with LangChain and LlamaIndex: Integration Guide

Last updated: April 2026 | Author: Hex Proxies Team

TL;DR: LangChain and LlamaIndex use web loaders to ingest data from URLs, but these loaders often fail against sites with anti-bot protection or rate limiting. Integrating proxies into your RAG (Retrieval-Augmented Generation) pipeline makes data ingestion from such sources far more reliable. Hex Proxies' residential proxies ($4.25/GB) and ISP proxies ($2.08/IP) integrate with both frameworks through their HTTP client configuration. This guide covers setup, code examples, and production patterns.

The rise of RAG applications has created a new category of web data consumers: AI frameworks that ingest web content as context for language model responses. LangChain and LlamaIndex, two of the most widely used RAG frameworks, both include web loading components that fetch, parse, and index web content. In production, these loaders encounter the same challenges as traditional web scrapers: rate limiting, IP blocking, geo-restrictions, and anti-bot detection.

This guide shows how to integrate proxy infrastructure into LangChain and LlamaIndex pipelines for reliable, scalable web data ingestion.

Why RAG Pipelines Need Proxies

Web Loader Limitations

Out of the box, LangChain's WebBaseLoader and LlamaIndex's SimpleWebPageReader make requests from your server's IP address. In production, this creates several problems:

IP blocking: After a sustained burst of requests (often a few hundred), target sites may block your server's IP
Rate limiting: Sites impose per-IP rate limits that throttle ingestion speed
Geo-restrictions: Content varies by geography, and your server's location determines what you see
Bot detection: Datacenter IPs are often flagged quickly by sophisticated anti-bot systems
Single point of failure: If your IP is blocked, the entire ingestion pipeline stops

Scale Requirements

A production RAG application that indexes thousands of web pages needs to fetch concurrently, retry failed requests through different IPs, access geo-specific content, and maintain consistent ingestion rates. Proxy infrastructure addresses each of these requirements.

LangChain Proxy Integration

Method 1: Environment Variable Configuration

The simplest approach sets proxy environment variables, which Python's requests library (used by WebBaseLoader's synchronous load()) respects. The setting applies to every HTTP request the process makes:

import os

# Set proxy for all HTTP requests in the process
os.environ["HTTP_PROXY"] = "http://USERNAME-country-us:PASSWORD@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://USERNAME-country-us:PASSWORD@gate.hexproxies.com:8080"

from langchain_community.document_loaders import WebBaseLoader

# All web requests now route through Hex Proxies
loader = WebBaseLoader(["https://example.com/page1", "https://example.com/page2"])
documents = loader.load()

print(f"Loaded {len(documents)} documents via proxy")
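The proxy URLs in this guide embed credentials directly in the string. If a password contains characters such as @, : or /, the URL may not parse as intended, so percent-encoding the credentials is a sensible precaution. The following is a minimal sketch under stated assumptions: the helper name build_proxy_url and the HEX_PROXY_USER / HEX_PROXY_PASS variable names are illustrative and not part of any Hex Proxies API, while the USERNAME-country-XX format is taken from the examples in this guide.

```python
import os
from urllib.parse import quote


def build_proxy_url(username, password, country="us",
                    gateway="gate.hexproxies.com:8080"):
    """Return a proxy URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" encodes reserved characters such as "@", ":" and "/"
    user = quote(f"{username}-country-{country}", safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{gateway}"


# Assumed variable names: read credentials from the environment
# instead of hardcoding them in source files.
proxy_url = build_proxy_url(
    os.environ.get("HEX_PROXY_USER", "USERNAME"),
    os.environ.get("HEX_PROXY_PASS", "PASSWORD"),
)
print(proxy_url.split("@")[-1])  # print only the gateway, never the credentials
```

The resulting string can be assigned to HTTP_PROXY and HTTPS_PROXY exactly as in the example above.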

Method 2: Custom Session with Proxy

For more control, create a custom requests session with proxy configuration:

import requests
from langchain_community.document_loaders import WebBaseLoader

def create_proxy_session(country="us"):
    """Create a requests session configured with Hex Proxies."""
    session = requests.Session()
    proxy_url = f"http://USERNAME-country-{country}:PASSWORD@gate.hexproxies.com:8080"
    session.proxies = {
        "http": proxy_url,
        "https": proxy_url
    }
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/133.0.0.0"
    })
    return session

# Use custom session with WebBaseLoader
session = create_proxy_session(country="us")
loader = WebBaseLoader(
    web_paths=["https://example.com/data"],
    requests_kwargs={"verify": True},
    session=session
)
documents = loader.load()

Method 3: Custom Web Loader with Rotation

For production RAG pipelines that ingest from many sources, build a custom loader with proxy rotation and error handling:

import requests
import time
import random
from typing import List
from langchain_core.documents import Document
from langchain_core.document_loaders import BaseLoader
from bs4 import BeautifulSoup

class ProxyWebLoader(BaseLoader):
    """LangChain document loader with Hex Proxies integration."""

    GATEWAY = "gate.hexproxies.com:8080"

    def __init__(
        self,
        urls: List[str],
        username: str,
        password: str,
        country: str = "us",
        delay_range: tuple = (1, 3),
        max_retries: int = 3
    ):
        self.urls = urls
        self.username = username
        self.password = password
        self.country = country
        self.delay_range = delay_range
        self.max_retries = max_retries

    def _get_proxy_url(self, session_id=None):
        auth = f"{self.username}-country-{self.country}"
        if session_id:
            auth += f"-sessid-{session_id}"
        return f"http://{auth}:{self.password}@{self.GATEWAY}"

    def _fetch_page(self, url: str) -> str:
        for attempt in range(self.max_retries):
            proxy_url = self._get_proxy_url(session_id=f"rag-{random.randint(1000, 9999)}")
            proxies = {"http": proxy_url, "https": proxy_url}

            try:
                response = requests.get(
                    url,
                    proxies=proxies,
                    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/133.0.0.0"},
                    timeout=30
                )
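                if response.status_code == 200:
                    return response.text
            except requests.exceptions.RequestException:
                # Network or proxy error: fall through and retry with a fresh session ID
                pass

            # Pause before the next attempt, but not after the final one
            if attempt < self.max_retries - 1:
                time.sleep(random.uniform(*self.delay_range))

        # All attempts failed; load() skips URLs that return an empty string
        return ""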

    def load(self) -> List[Document]:
        documents = []
        for url in self.urls:
            html = self._fetch_page(url)
            if html:
                soup = BeautifulSoup(html, "html.parser")
                text = soup.get_text(separator="\n", strip=True)
                documents.append(Document(
                    page_content=text,
                    metadata={"source": url}
                ))
            time.sleep(random.uniform(*self.delay_range))
        return documents

# Usage
loader = ProxyWebLoader(
    urls=["https://example.com/page1", "https://example.com/page2"],
    username="YOUR_USERNAME",
    password="YOUR_PASSWORD",
    country="us"
)
docs = loader.load()
print(f"Loaded {len(docs)} documents")
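ProxyWebLoader waits a random interval drawn from delay_range between attempts. A common alternative is exponential backoff with jitter, where the maximum wait doubles after each failure. This is a swapped-in technique, not what the loader above does. The sketch below shows only the delay schedule; the function name backoff_delays and its parameters are illustrative assumptions.

```python
import random


def backoff_delays(max_retries=3, base=1.0, cap=30.0, rng=random):
    """Yield one delay per retry, drawn uniformly from a doubling window."""
    for attempt in range(max_retries):
        # Window grows as base, 2*base, 4*base, ... and is capped at `cap` seconds
        window = min(cap, base * (2 ** attempt))
        # "Full jitter": pick anywhere in [0, window] so clients do not retry in lockstep
        yield rng.uniform(0, window)


for attempt, delay in enumerate(backoff_delays(max_retries=4), start=1):
    print(f"attempt {attempt}: would sleep {delay:.2f}s before retrying")
```

Inside _fetch_page, the values from such a generator could replace the fixed random.uniform(*self.delay_range) call, which would leave the first retry fast and slow down only when a site keeps refusing requests.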

LlamaIndex Proxy Integration

SimpleWebPageReader with Proxy

import os

# Set proxy environment variables before any requests are made (requests reads them at call time)
os.environ["HTTP_PROXY"] = "http://USERNAME-country-us:PASSWORD@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://USERNAME-country-us:PASSWORD@gate.hexproxies.com:8080"

from llama_index.readers.web import SimpleWebPageReader

reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=["https://example.com/data"])

print(f"Ingested {len(documents)} documents via proxy")

Custom LlamaIndex Reader with Proxy Rotation

import requests
import random
from typing import List
from llama_index.core.readers.base import BaseReader
from llama_index.core import Document
from bs4 import BeautifulSoup

class HexProxyWebReader(BaseReader):
    """LlamaIndex reader with Hex Proxies rotation for reliable ingestion."""

    def __init__(self, username: str, password: str, country: str = "us"):
        self.username = username
        self.password = password
        self.country = country
        self.gateway = "gate.hexproxies.com:8080"

    def _make_proxy_url(self):
        session = f"llama-{random.randint(10000, 99999)}"
        auth = f"{self.username}-country-{self.country}-sessid-{session}"
        return f"http://{auth}:{self.password}@{self.gateway}"

    def load_data(self, urls: List[str]) -> List[Document]:
        documents = []
        for url in urls:
            proxy_url = self._make_proxy_url()
            proxies = {"http": proxy_url, "https": proxy_url}

            try:
                response = requests.get(
                    url,
                    proxies=proxies,
                    headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/133.0.0.0"},
                    timeout=30
                )
                if response.status_code == 200:
                    soup = BeautifulSoup(response.text, "html.parser")
                    text = soup.get_text(separator="\n", strip=True)
                    documents.append(Document(text=text, metadata={"url": url}))
            except requests.exceptions.RequestException as e:
                print(f"Failed to load {url}: {e}")

        return documents

# Usage with LlamaIndex pipeline
from llama_index.core import VectorStoreIndex

reader = HexProxyWebReader("YOUR_USERNAME", "YOUR_PASSWORD", country="us")
docs = reader.load_data([
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3"
])

# Build index from proxy-fetched documents
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key points")
print(response)

Production Architecture Patterns

Proxy-Enabled RAG Pipeline

┌─────────────────────────────────────────────────┐
│             RAG Application Layer                │
│  LangChain / LlamaIndex query engine            │
└──────────────────────┬──────────────────────────┘
                       │ query
                       ▼
              ┌────────────────┐
              │  Vector Store   │
              │  (embeddings)   │
              └────────┬───────┘
                       │ indexed from
                       ▼
┌─────────────────────────────────────────────────┐
│           Proxy-Enabled Ingestion Pipeline        │
│  Custom loader → Hex Proxies → Target sites     │
│  gate.hexproxies.com:8080                        │
│  Rotation | Geo-targeting | Retry logic          │
└─────────────────────────────────────────────────┘

Concurrent Ingestion with asyncio

import aiohttp
import asyncio
import random
from typing import List, Dict

async def fetch_with_proxy(
    session: aiohttp.ClientSession,
    url: str,
    username: str,
    password: str,
    country: str = "us"
) -> Dict:
    """Fetch a URL through Hex Proxies with async support."""
    sessid = f"async-{random.randint(10000, 99999)}"
    proxy_url = f"http://{username}-country-{country}-sessid-{sessid}:{password}@gate.hexproxies.com:8080"

    try:
        async with session.get(url, proxy=proxy_url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            if resp.status == 200:
                text = await resp.text()
                return {"url": url, "content": text, "status": "success"}
            # Non-200 response: report the status code while the response is still open
            return {"url": url, "content": "", "status": f"http_{resp.status}"}
    except Exception as e:
        # Connection, proxy, or timeout failure
        return {"url": url, "content": "", "status": f"error: {e}"}

async def batch_ingest(urls: List[str], username: str, password: str, concurrency: int = 10):
    """Ingest multiple URLs concurrently through proxies."""
    semaphore = asyncio.Semaphore(concurrency)

    async def limited_fetch(session, url):
        async with semaphore:
            return await fetch_with_proxy(session, url, username, password)

    async with aiohttp.ClientSession() as session:
        tasks = [limited_fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

    return [r for r in results if r["status"] == "success"]

# Usage
urls = [f"https://example.com/page/{i}" for i in range(100)]
results = asyncio.run(batch_ingest(urls, "YOUR_USERNAME", "YOUR_PASSWORD"))
print(f"Successfully ingested {len(results)} pages")

Cost Estimation for RAG Pipelines

Pipeline Scale	Pages Indexed	Avg. Page Size	Total Bandwidth	Monthly Cost
Small knowledge base	500 pages	200 KB	100 MB	$0.43
Medium RAG application	5,000 pages	300 KB	1.5 GB	$6.38
Large-scale ingestion	50,000 pages	300 KB	15 GB	$63.75
Enterprise continuous ingestion	500,000 pages/month	300 KB	150 GB	$638

At $4.25/GB, proxy costs for RAG ingestion are typically small compared to LLM API costs, vector database hosting, and compute infrastructure. Even the enterprise tier in the table, ingesting 500,000 pages monthly, costs about $638 in proxy bandwidth through Hex Proxies.
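The figures in the table follow from pages × average page size × price per GB. A minimal sketch of that arithmetic is below, assuming decimal units (1 GB = 1,000,000 KB) and the $4.25/GB residential rate quoted in this guide; the function name estimate_proxy_cost is illustrative. Retries and failed requests also consume bandwidth, so real usage will likely run somewhat higher.

```python
def estimate_proxy_cost(pages, avg_page_kb, price_per_gb=4.25):
    """Return (bandwidth_gb, cost_usd), assuming 1 GB = 1,000,000 KB."""
    bandwidth_gb = pages * avg_page_kb / 1_000_000
    return bandwidth_gb, bandwidth_gb * price_per_gb


# The four rows of the table above
tiers = [
    ("Small knowledge base", 500, 200),
    ("Medium RAG application", 5_000, 300),
    ("Large-scale ingestion", 50_000, 300),
    ("Enterprise continuous ingestion", 500_000, 300),
]
for label, pages, kb in tiers:
    gb, cost = estimate_proxy_cost(pages, kb)
    print(f"{label}: {gb:g} GB, about ${cost:,.2f}")
```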

Best Practices for RAG + Proxy Integration

Separate ingestion from serving: Run proxy-based ingestion as a background pipeline, not in the request path of your RAG application
Cache aggressively: Once content is ingested and indexed, serve from your vector store — do not re-fetch through proxies for every query
Use geo-targeting for locale-specific content: If your RAG application needs to answer questions about region-specific information, ingest content through proxies in the relevant geography
Implement incremental updates: Track last-modified headers and only re-ingest content that has changed
Rate limit per domain: Even with proxy rotation, be respectful of target sites by limiting request rates per domain
Monitor ingestion quality: Track success rates, content freshness, and data quality metrics for your proxy-based ingestion pipeline
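The "rate limit per domain" practice can be sketched as a small scheduler that spaces out requests to the same host regardless of which proxy IP carries them. This is one possible approach, not a Hex Proxies feature; the class name DomainRateLimiter, the one-second default interval, and the injectable clock (included so the logic can be checked without sleeping) are all assumptions for illustration.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Track, per host, the earliest time the next request may start."""

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self._next_allowed = {}

    def wait_time(self, url):
        """Reserve the next slot for this URL's host and return seconds to wait."""
        host = urlparse(url).hostname or ""
        now = self.clock()
        start = max(now, self._next_allowed.get(host, now))
        # Book the slot so later callers for the same host queue behind it
        self._next_allowed[host] = start + self.min_interval
        return start - now


limiter = DomainRateLimiter(min_interval=1.0)
for url in ["https://example.com/page1", "https://example.com/page2", "https://example.org/a"]:
    delay = limiter.wait_time(url)
    print(f"{url}: wait {delay:.2f}s")
    # time.sleep(delay)  # uncomment in a real ingestion loop, then fetch the URL
```

In a synchronous loader, calling time.sleep(limiter.wait_time(url)) before each fetch keeps per-host request rates bounded while still letting requests to different hosts proceed back to back.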

Frequently Asked Questions

Do I need proxies for LangChain web loading?

For development and small-scale prototypes, direct access works fine. In production, where you ingest from dozens or hundreds of sources continuously, proxies are usually needed to avoid IP blocks, rate limits, and geo-restrictions. The cost of residential proxies ($4.25/GB) is small relative to the reliability gained.

Which proxy type should I use for RAG ingestion?

Use residential proxies for sites with anti-bot protection (most commercial websites, social media, news sites). Use ISP proxies ($2.08/IP) for persistent monitoring of specific sources that you scrape repeatedly (documentation sites, APIs, government portals). The residential/ISP split depends on your target sources.

How do proxies affect ingestion speed?

Proxies add 100-500ms of latency per request due to the additional network hop. For batch ingestion, this is offset by the ability to make concurrent requests through different IPs without hitting rate limits. A pipeline that would be throttled to 1 request/second on a single IP can make 10-50 concurrent requests through proxies, dramatically improving total throughput.
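As a rough worked example of that trade-off, the sketch below compares an idealized single-IP run with a proxied run. Every number is an assumption chosen from the ranges above: 10,000 pages, one second per request on a single IP, 500 ms of added proxy latency, and 20 concurrent requests. It ignores retries and failures, so treat it as an order-of-magnitude estimate only.

```python
def ingestion_seconds(pages, seconds_per_request, concurrency=1):
    """Idealized wall-clock time: requests spread evenly across workers."""
    return pages * seconds_per_request / concurrency


pages = 10_000
# Single IP, throttled to roughly one request per second
direct = ingestion_seconds(pages, seconds_per_request=1.0)
# Through proxies: each request is slower (+0.5 s), but 20 run at once
proxied = ingestion_seconds(pages, seconds_per_request=1.5, concurrency=20)

print(f"direct:  {direct / 60:.1f} minutes")
print(f"proxied: {proxied / 60:.1f} minutes")
```

Under these assumptions the per-request latency penalty is outweighed by concurrency; with low concurrency or very slow proxies the advantage shrinks accordingly.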

Can I use Hex Proxies with other RAG frameworks like Haystack?

Yes. Any Python-based RAG framework that uses the requests library or respects HTTP_PROXY/HTTPS_PROXY environment variables works with Hex Proxies. The integration pattern is the same: configure gate.hexproxies.com:8080 as the proxy endpoint in your HTTP client configuration.
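One way to confirm what a Python process will pick up from the environment is the standard library's urllib.request.getproxies(), which reads the HTTP_PROXY and HTTPS_PROXY variables. The sketch below uses the placeholder credentials from this guide. It shows only what is configured; it does not prove that a particular framework's HTTP client honors the setting, so a test request through the framework is still worthwhile.

```python
import os
import urllib.request

proxy = "http://USERNAME-country-us:PASSWORD@gate.hexproxies.com:8080"
for name in ("HTTP_PROXY", "HTTPS_PROXY"):
    # A lowercase variant (http_proxy) takes precedence if present, so clear it first
    os.environ.pop(name.lower(), None)
    os.environ[name] = proxy

detected = urllib.request.getproxies()
# Print only the part after "@" so credentials stay out of logs
print({scheme: url.split("@")[-1] for scheme, url in detected.items()})
```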

What about JavaScript-heavy sites that need rendering?

For sites that require JavaScript rendering, use a headless browser (Playwright, Selenium) with proxy configuration instead of simple HTTP-based loaders. Both LangChain and LlamaIndex support Playwright-based loaders that can be configured with proxy settings. See our guide on proxy configuration for client-specific setup details.