
LlamaIndex Web Reader Proxy Config

Last updated: April 2026

By Hex Proxies Engineering Team

Set up Hex Proxies with LlamaIndex web readers for reliable content ingestion in knowledge bases and RAG applications. Covers SimpleWebPageReader, BeautifulSoupWebReader, and custom proxy-aware loaders.

Level: intermediate · Time: 15 minutes · Topic: ai-agents

Prerequisites

  • Python 3.9 or later
  • LlamaIndex installed (pip install llama-index)
  • Hex Proxies account with residential proxy access
  • Basic understanding of LlamaIndex data connectors

Steps

1

Install LlamaIndex

Install llama-index with web reader dependencies: pip install llama-index llama-index-readers-web

2

Set proxy environment

Configure HTTP_PROXY and HTTPS_PROXY environment variables with Hex Proxies credentials.
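The variables can also be set from Python before any readers are created. A minimal sketch, with placeholder credentials and the gateway endpoint used throughout this guide:

```python
import os

# Placeholder credentials; substitute your Hex Proxies username and password.
PROXY_URL = "http://user:pass@gate.hexproxies.com:8080"

# Most HTTP clients used by LlamaIndex readers (requests, httpx, urllib)
# pick these variables up automatically.
os.environ["HTTP_PROXY"] = PROXY_URL
os.environ["HTTPS_PROXY"] = PROXY_URL
```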

3

Configure web readers

Initialize SimpleWebPageReader or BeautifulSoupWebReader; both pick up the proxy from the environment variables.

4

Build ingestion pipeline

Create a batch loading pipeline with error handling and rate limiting for large-scale ingestion.

5

Test and deploy

Verify documents load correctly through the proxy, then deploy the knowledge base pipeline.
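Before wiring the proxy into a full pipeline, it can help to confirm that the environment-based configuration is actually resolved. A small sanity check, assuming placeholder credentials, using the standard library's proxy resolution (which requests and the readers defer to):

```python
import os
import urllib.request

# Placeholder Hex Proxies credentials and gateway.
PROXY_URL = "http://user:pass@gate.hexproxies.com:8080"
os.environ["HTTP_PROXY"] = PROXY_URL
os.environ["HTTPS_PROXY"] = PROXY_URL

# urllib resolves proxies from the environment; this shows what any
# env-driven reader will use for its requests.
proxies = urllib.request.getproxies()
print(proxies["https"])
```

If the printed value matches your gateway URL, the readers in the next section will route through it.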

LlamaIndex Web Reader Proxy Integration

LlamaIndex is a data framework for building LLM applications with custom data. Its web readers fetch content from websites for indexing into knowledge bases that power RAG chatbots, search tools, and AI assistants. Like all web fetching operations, these readers benefit from proxy infrastructure that prevents blocking.

Why LlamaIndex Web Readers Need Proxies

LlamaIndex ingestion pipelines face the same challenges as other web fetching tools:

  • **Bulk ingestion**: Building a knowledge base requires loading dozens to thousands of pages, generating traffic patterns that trigger rate limiting.
  • **Server deployment**: Production LlamaIndex applications run on cloud infrastructure with blocked datacenter IPs.
  • **Periodic refresh**: Knowledge bases need regular updates, requiring sustained access to source websites.
  • **Multi-source ingestion**: A single knowledge base may ingest from documentation sites, forums, blogs, and official sources.

Configuring Proxies for LlamaIndex

#### SimpleWebPageReader with Proxy

```python
import os

from llama_index.readers.web import SimpleWebPageReader

# Set proxy via environment variables
os.environ["HTTP_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"

reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(
    urls=["https://docs.example.com/guide", "https://blog.example.com/tutorial"]
)
```

#### BeautifulSoupWebReader with Proxy

```python
import os

from llama_index.readers.web import BeautifulSoupWebReader

os.environ["HTTP_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"

reader = BeautifulSoupWebReader()
documents = reader.load_data(
    urls=["https://docs.example.com/api-reference"],
    custom_hostname="docs.example.com",
)
```

#### Custom Web Reader with Explicit Proxy

For more control over proxy configuration:

```python
import requests
from llama_index.core import Document

def load_urls_with_proxy(urls, proxy_user="user", proxy_pass="your-password"):
    """Custom loader with explicit proxy configuration."""
    session = requests.Session()
    session.proxies = {
        "http": f"http://{proxy_user}:{proxy_pass}@gate.hexproxies.com:8080",
        "https": f"http://{proxy_user}:{proxy_pass}@gate.hexproxies.com:8080",
    }

    documents = []
    for url in urls:
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            doc = Document(text=response.text, metadata={"source": url})
            documents.append(doc)
        except Exception as e:
            print(f"Failed to load {url}: {e}")
    return documents

# Use with geo-targeting
docs = load_urls_with_proxy(
    urls=["https://example.com/page1"],
    proxy_user="user-country-us",
)
```

Building a Proxied Knowledge Base Pipeline

Complete pipeline from web ingestion to queryable knowledge base:

```python
import os

from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.readers.web import SimpleWebPageReader

# Configure proxy
os.environ["HTTP_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"

# Load documents through proxy
reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=[
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/best-practices",
])

# Build index
index = VectorStoreIndex.from_documents(documents)

# Query with an OpenAI model
llm = OpenAI(model="gpt-4o-mini")
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("How do I configure authentication?")
print(response)
```

Handling Large-Scale Ingestion

For knowledge bases requiring thousands of source pages:

  1. **Batch loading**: Process URLs in batches of 10-50 with delays between batches.
  2. **Geo-targeting**: Use country-specific proxies when ingesting region-specific content.
  3. **Error resilience**: Skip failed URLs and continue loading, then retry failures in a separate pass.
  4. **Incremental updates**: Track which URLs have been loaded and only fetch new or changed content.
  5. **Bandwidth monitoring**: Use the Hex Proxies dashboard to track ingestion costs and optimize.
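Points 1 and 3 above can be sketched as a small driver that processes URLs in batches with a pause between them, then makes additional passes over failures. The `load_url` callable is a hypothetical stand-in for any single-page loader, such as the custom proxy loader shown earlier:

```python
import time

def batch_ingest(urls, load_url, batch_size=25, delay=2.0, retries=1):
    """Load URLs in batches with delays; retry failures in later passes."""
    loaded, failed = [], list(urls)
    for attempt in range(retries + 1):
        pending, failed = failed, []
        for i in range(0, len(pending), batch_size):
            for url in pending[i:i + batch_size]:
                try:
                    loaded.append(load_url(url))
                except Exception:
                    failed.append(url)  # retried on the next pass
            if i + batch_size < len(pending):
                time.sleep(delay)  # pause between batches
        if not failed:
            break
    return loaded, failed
```

Any URLs still in `failed` after the retry passes can be logged for manual inspection rather than blocking the rest of the ingestion run.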

Tips

  • Environment variables are the most reliable way to configure proxies for LlamaIndex web readers.
  • Use html_to_text=True in SimpleWebPageReader to reduce document size and improve embedding quality.
  • Batch URL loading in groups of 10-50 with 2-3 second delays between batches for sustainable ingestion.
  • Cache loaded documents locally to avoid re-fetching during development and testing.
  • For JavaScript-rendered content, use a custom Playwright-based reader with proxy support instead of HTTP-based readers.
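The local caching tip can be sketched as a thin wrapper that stores fetched page text on disk, keyed by a hash of the URL; the function and directory names here are illustrative:

```python
import hashlib
import json
from pathlib import Path

def cached_fetch(url, fetch, cache_dir=".webcache"):
    """Return cached page text for url, calling fetch(url) only on a miss."""
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    path = cache / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["text"]
    text = fetch(url)
    path.write_text(json.dumps({"url": url, "text": text}))
    return text
```

During development, repeated pipeline runs then hit the disk cache instead of re-fetching through the proxy, which also keeps bandwidth costs down.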

Ready to Get Started?

Put this guide into practice with Hex Proxies.
