# LlamaIndex Web Reader Proxy Integration
LlamaIndex is a data framework for building LLM applications with custom data. Its web readers fetch content from websites for indexing into knowledge bases that power RAG chatbots, search tools, and AI assistants. Like all web fetching operations, these readers benefit from proxy infrastructure that prevents blocking.
## Why LlamaIndex Web Readers Need Proxies
LlamaIndex ingestion pipelines face the same challenges as other web fetching tools:
- **Bulk ingestion**: Building a knowledge base requires loading dozens to thousands of pages, generating traffic patterns that trigger rate limiting.
- **Server deployment**: Production LlamaIndex applications run on cloud infrastructure with blocked datacenter IPs.
- **Periodic refresh**: Knowledge bases need regular updates, requiring sustained access to source websites.
- **Multi-source ingestion**: A single knowledge base may ingest from documentation sites, forums, blogs, and official sources.
## Configuring Proxies for LlamaIndex
#### SimpleWebPageReader with Proxy
```python
import os

from llama_index.readers.web import SimpleWebPageReader

# Set proxy via environment variables
os.environ["HTTP_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"

reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(
    urls=[
        "https://docs.example.com/guide",
        "https://blog.example.com/tutorial",
    ]
)
```
#### BeautifulSoupWebReader with Proxy
```python
import os

from llama_index.readers.web import BeautifulSoupWebReader

os.environ["HTTP_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"

reader = BeautifulSoupWebReader()
documents = reader.load_data(
    urls=["https://docs.example.com/api-reference"],
    custom_hostname="docs.example.com",
)
```
#### Custom Web Reader with Explicit Proxy
For more control over proxy configuration:
```python
import requests

from llama_index.core import Document


def load_urls_with_proxy(urls, proxy_user="user", proxy_pass="your-password"):
    """Custom loader with explicit proxy configuration."""
    session = requests.Session()
    session.proxies = {
        "http": f"http://{proxy_user}:{proxy_pass}@gate.hexproxies.com:8080",
        "https": f"http://{proxy_user}:{proxy_pass}@gate.hexproxies.com:8080",
    }

    documents = []
    for url in urls:
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            doc = Document(text=response.text, metadata={"source": url})
            documents.append(doc)
        except Exception as e:
            print(f"Failed to load {url}: {e}")
    return documents


# Use with geo-targeting
docs = load_urls_with_proxy(
    urls=["https://example.com/page1"],
    proxy_user="user-country-us",
)
```
## Building a Proxied Knowledge Base Pipeline
A complete pipeline from web ingestion to a queryable knowledge base:
```python
import os

from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.readers.web import SimpleWebPageReader

# Configure proxy
os.environ["HTTP_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"

# Load documents through proxy
reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=[
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/best-practices",
])

# Build index
llm = OpenAI(model="gpt-4o-mini")
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("How do I configure authentication?")
print(response)
```
## Handling Large-Scale Ingestion
For knowledge bases requiring thousands of source pages:
- **Batch loading**: Process URLs in batches of 10-50 with delays between batches.
- **Geo-targeting**: Use country-specific proxies when ingesting region-specific content.
- **Error resilience**: Skip failed URLs and continue loading, then retry failures in a separate pass.
- **Incremental updates**: Track which URLs have been loaded and only fetch new or changed content.
- **Bandwidth monitoring**: Use the Hex Proxies dashboard to track ingestion costs and optimize.
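The batching and error-resilience points above can be sketched as a small helper. This is a minimal illustration, not a LlamaIndex API: `load_in_batches`, its batch size, and its delay are hypothetical names and defaults, and the `fetch` callable stands in for whatever loader you use (for example, a proxied `requests` session that returns a `Document`).

```python
import time


def batch(items, size):
    """Yield successive fixed-size chunks of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def load_in_batches(urls, fetch, batch_size=25, delay=2.0):
    """Fetch URLs in batches with pauses; return (documents, failed_urls).

    Failed URLs are collected rather than aborting the run, so they can
    be retried in a separate pass once the first pass completes.
    """
    documents, failed = [], []
    for chunk in batch(urls, batch_size):
        for url in chunk:
            try:
                documents.append(fetch(url))
            except Exception:
                failed.append(url)  # skip and retry later
        time.sleep(delay)  # pause between batches to avoid rate limiting
    return documents, failed
```

A second pass over the returned `failed` list (e.g. `load_in_batches(failed, fetch, batch_size=10, delay=5.0)`) implements the retry step; persisting the set of successfully loaded URLs between runs gives you the incremental-update behavior described above.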