
LlamaIndex Web Reader Proxy Config

Last updated: April 2026

By Hex Proxies Engineering Team

Set up Hex Proxies with LlamaIndex web readers for reliable content ingestion in knowledge bases and RAG applications. Covers SimpleWebPageReader and more.

Intermediate · 15 minutes · ai-agents

Prerequisites

  • Python 3.9 or later
  • LlamaIndex installed (pip install llama-index)
  • Hex Proxies account with residential proxy access
  • Basic understanding of LlamaIndex data connectors

Steps

1

Install LlamaIndex

Install llama-index with web reader dependencies: pip install llama-index llama-index-readers-web

2

Set proxy environment

Configure HTTP_PROXY and HTTPS_PROXY environment variables with Hex Proxies credentials.

3

Configure web readers

Initialize SimpleWebPageReader or BeautifulSoupWebReader. Both pick up the proxy from the environment variables set in the previous step.

4

Build ingestion pipeline

Create a batch loading pipeline with error handling and rate limiting for large-scale ingestion.

5

Test and deploy

Verify documents load correctly through the proxy, then deploy the knowledge base pipeline.
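
The verification in step 5 could be approached by checking each loaded text before it is indexed. The following is a minimal sketch under stated assumptions: `find_suspect_documents`, the `min_chars` threshold, and the `BLOCK_MARKERS` phrases are hypothetical names and values chosen for illustration, not part of LlamaIndex or Hex Proxies, and the check is only a heuristic.

```python
from typing import Dict, List

# Hypothetical phrases that often show up on block or challenge pages.
BLOCK_MARKERS = ("access denied", "captcha", "too many requests")


def find_suspect_documents(texts_by_url: Dict[str, str], min_chars: int = 200) -> List[str]:
    """Return URLs whose loaded text looks empty, truncated, or blocked."""
    suspects = []
    for url, text in texts_by_url.items():
        # Very short bodies usually mean an error page or an empty response.
        too_short = len(text.strip()) < min_chars
        # Only inspect the start of the page so long articles that merely
        # mention one of these phrases later on are not flagged.
        head = text[:500].lower()
        looks_blocked = any(marker in head for marker in BLOCK_MARKERS)
        if too_short or looks_blocked:
            suspects.append(url)
    return suspects
```

Passing a mapping of source URL to loaded text returns the URLs worth inspecting by hand before the pipeline is deployed.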

LlamaIndex Web Reader Proxy Integration

LlamaIndex is a data framework for building LLM applications with custom data. Its web readers fetch content from websites for indexing into knowledge bases that power RAG chatbots, search tools, and AI assistants. Like other web fetching operations, these readers benefit from proxy infrastructure that reduces the chance of being rate limited or blocked.

Why LlamaIndex Web Readers Need Proxies

LlamaIndex ingestion pipelines face the same challenges as other web fetching tools:

  • Bulk ingestion: Building a knowledge base requires loading dozens to thousands of pages, generating traffic patterns that trigger rate limiting.
  • Server deployment: Production LlamaIndex applications usually run on cloud infrastructure, and many sites block or throttle datacenter IP ranges.
  • Periodic refresh: Knowledge bases need regular updates, requiring sustained access to source websites.
  • Multi-source ingestion: A single knowledge base may ingest from documentation sites, forums, blogs, and official sources.

Configuring Proxies for LlamaIndex

SimpleWebPageReader with Proxy

```python
import os

from llama_index.readers.web import SimpleWebPageReader

# Set proxy via environment variables
os.environ["HTTP_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"

# html_to_text=True requires the html2text package to be installed
reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(
    urls=["https://docs.example.com/guide", "https://blog.example.com/tutorial"]
)
```

BeautifulSoupWebReader with Proxy

```python
import os

from llama_index.readers.web import BeautifulSoupWebReader

os.environ["HTTP_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"

reader = BeautifulSoupWebReader()
documents = reader.load_data(
    urls=["https://docs.example.com/api-reference"],
    custom_hostname="docs.example.com",
)
```

Custom Web Reader with Explicit Proxy

For more control over proxy configuration:

```python
import requests
from llama_index.core import Document


def load_urls_with_proxy(urls, proxy_user="user", proxy_pass="your-password"):
    """Custom loader with explicit proxy configuration."""
    session = requests.Session()
    # The proxy itself is reached over HTTP for both target schemes
    session.proxies = {
        "http": f"http://{proxy_user}:{proxy_pass}@gate.hexproxies.com:8080",
        "https": f"http://{proxy_user}:{proxy_pass}@gate.hexproxies.com:8080",
    }

    documents = []
    for url in urls:
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            doc = Document(text=response.text, metadata={"source": url})
            documents.append(doc)
        except requests.RequestException as e:
            # Skip the failed URL and keep loading the rest
            print(f"Failed to load {url}: {e}")
    return documents


# Use with geo-targeting
docs = load_urls_with_proxy(
    urls=["https://example.com/page1"],
    proxy_user="user-country-us",
)
```
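
Credentials that contain characters such as `@` or `:` break a proxy URL unless they are percent-encoded. The sketch below shows one way to assemble the URL safely. `build_proxy_url` is a hypothetical helper, and the `-country-<code>` username suffix is assumed from the geo-targeting example above rather than taken from a documented Hex Proxies format.

```python
from typing import Optional
from urllib.parse import quote


def build_proxy_url(
    user: str,
    password: str,
    country: Optional[str] = None,
    host: str = "gate.hexproxies.com",
    port: int = 8080,
) -> str:
    """Build a proxy URL with percent-encoded credentials."""
    if country:
        # Assumed username convention, mirroring "user-country-us" above.
        user = f"{user}-country-{country.lower()}"
    # safe="" forces encoding of every reserved character, including "@" and ":".
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
```

The result can be passed as the value of `session.proxies` entries or exported as `HTTP_PROXY` and `HTTPS_PROXY`.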

Building a Proxied Knowledge Base Pipeline

Complete pipeline from web ingestion to queryable knowledge base:

```python
import os

from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.readers.web import SimpleWebPageReader

# Configure proxy (HEX_PROXY_URL must already be set in the environment)
os.environ["HTTP_PROXY"] = os.environ["HEX_PROXY_URL"]
os.environ["HTTPS_PROXY"] = os.environ["HEX_PROXY_URL"]

# Load documents through proxy
reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=[
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/best-practices",
])

# Configure global Settings (LlamaIndex 0.10+ replaced ServiceContext with Settings)
Settings.llm = OpenAI(model="gpt-4.1-mini")
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("How do I configure authentication?")
print(response)
```

Handling Large-Scale Ingestion

For knowledge bases requiring thousands of source pages:

  1. Batch loading: Process URLs in batches of 10-50 with delays between batches.
  2. Geo-targeting: Use country-specific proxies when ingesting region-specific content.
  3. Error resilience: Skip failed URLs and continue loading, then retry failures in a separate pass.
  4. Incremental updates: Track which URLs have been loaded and only fetch new or changed content.
  5. Bandwidth monitoring: Use the Hex Proxies dashboard to track ingestion costs and optimize.
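
The batch loading and error resilience points above can be sketched as follows. This is an illustration under assumptions, not a Hex Proxies or LlamaIndex API: `load_in_batches` and `load_with_retry_pass` are hypothetical helpers, `fetch` stands in for any callable that returns the text for one URL (for example a wrapper around a proxied `requests` session), and the default batch size and delay are placeholders within the ranges suggested in this guide.

```python
import time
from typing import Callable, Dict, Iterator, List, Tuple


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load_in_batches(
    urls: List[str],
    fetch: Callable[[str], str],
    batch_size: int = 25,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, str], List[str]]:
    """Fetch URLs batch by batch, collecting failures instead of stopping."""
    loaded: Dict[str, str] = {}
    failed: List[str] = []
    batches = list(chunked(urls, batch_size))
    for index, batch in enumerate(batches):
        for url in batch:
            try:
                loaded[url] = fetch(url)
            except Exception:
                # Record the failure and continue with the next URL.
                failed.append(url)
        # Pause between batches, but not after the last one.
        if index < len(batches) - 1:
            sleep(delay_seconds)
    return loaded, failed


def load_with_retry_pass(
    urls: List[str], fetch: Callable[[str], str], **kwargs
) -> Tuple[Dict[str, str], List[str]]:
    """Run one normal pass, then retry the failed URLs in a second pass."""
    loaded, failed = load_in_batches(urls, fetch, **kwargs)
    if failed:
        retried, failed = load_in_batches(failed, fetch, **kwargs)
        loaded.update(retried)
    return loaded, failed
```

The returned `failed` list contains only the URLs that failed in both passes, which makes it a natural input for logging or a later scheduled retry.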

Tips

  • Environment variables are the most reliable way to configure proxies for LlamaIndex web readers.
  • Use html_to_text=True in SimpleWebPageReader to reduce document size and improve embedding quality.
  • Batch URL loading in groups of 10-50 with 2-3 second delays between batches for sustainable ingestion.
  • Cache loaded documents locally to avoid re-fetching during development and testing.
  • For JavaScript-rendered content, use a custom Playwright-based reader with proxy support instead of HTTP-based readers.
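
The caching tip above might look like the following during development. This is a minimal sketch, assuming a simple on-disk JSON cache keyed by a hash of the URL. `cached_fetch`, the `.web_cache` directory, and the `fetch` callable are hypothetical and not part of LlamaIndex. The cache never expires entries, so it suits development and testing rather than scheduled refreshes.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], str], cache_dir: str = ".web_cache") -> str:
    """Return cached text for `url`, calling `fetch` only on a cache miss."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = directory / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["text"]
    text = fetch(url)
    # Store the URL alongside the text so cache files stay inspectable.
    path.write_text(json.dumps({"url": url, "text": text}), encoding="utf-8")
    return text
```

Deleting the cache directory forces a fresh fetch through the proxy on the next run.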
