LangChain Web Scraping with Proxies

Last updated: April 2026

By Hex Proxies Engineering Team

Set up Hex Proxies with LangChain web loaders for reliable document ingestion, web scraping, and data collection in RAG and AI applications.

Intermediate · 15 minutes · ai-agents

Prerequisites

  • Python 3.9 or later
  • LangChain installed (pip install langchain langchain-community)
  • Hex Proxies account with residential proxy access
  • Basic understanding of LangChain document loaders
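Before starting, a quick check along these lines can confirm that the packages behind each loader are importable. This is a minimal sketch: the module list is an assumption based on the loaders used in this guide, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Assumed top-level modules behind the loaders in this guide;
# trim the list to the loaders you actually plan to use.
LOADER_MODULES = {
    "langchain_community": "all loaders",
    "requests": "WebBaseLoader",
    "bs4": "WebBaseLoader (HTML parsing)",
    "aiohttp": "AsyncHtmlLoader",
    "playwright": "PlaywrightURLLoader",
}


def missing_modules(modules):
    """Return the top-level module names that cannot be imported."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_modules(LOADER_MODULES):
        print(f"Missing {name} (needed for {LOADER_MODULES[name]})")
```

An empty result means every listed module was found; anything printed can be installed with pip before continuing.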

Steps

1. Install LangChain with loaders

Install langchain and langchain-community, plus the dependencies of the loaders you plan to use (for example, beautifulsoup4 for WebBaseLoader and playwright for PlaywrightURLLoader).

2. Configure proxy session

Create a requests.Session with your Hex Proxies credentials for WebBaseLoader, or set the HTTP_PROXY and HTTPS_PROXY environment variables for async loaders.

3. Integrate with document loaders

Pass the proxied session or proxy config to your chosen LangChain loader (WebBaseLoader, AsyncHtmlLoader, or PlaywrightURLLoader).

4. Implement rate limiting

Add delays between page loads to avoid overwhelming target sites, even with rotating proxies.

5. Test the pipeline

Run a small batch of URL loads and verify that the documents are fetched successfully through the proxy.
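For the final test step, one way to check a small batch is to compare the requested URLs against what came back. The sketch below assumes each document records its URL under `metadata["source"]` (as WebBaseLoader documents typically do); `summarize_load` and the 200-character threshold are hypothetical choices, not part of LangChain.

```python
def summarize_load(urls, documents, min_chars=200):
    """Split `urls` into (ok, failed) based on the loaded documents.

    A URL counts as ok when some document lists it as metadata["source"]
    and carries at least `min_chars` characters of stripped text.
    """
    loaded = {
        doc.metadata.get("source"): len(doc.page_content.strip())
        for doc in documents
    }
    ok = [url for url in urls if loaded.get(url, 0) >= min_chars]
    failed = [url for url in urls if url not in ok]
    return ok, failed
```

Calling `ok, failed = summarize_load(urls, documents)` after a load gives a quick list of pages to inspect. Very short pages are often block or CAPTCHA pages rather than real content, which is why a length threshold is used here instead of a simple presence check.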

LangChain Web Loader Proxy Integration

LangChain provides document loaders that fetch content from the web for use in RAG (Retrieval-Augmented Generation) pipelines, chatbots, and AI applications. These loaders -- WebBaseLoader, AsyncHtmlLoader, PlaywrightURLLoader, and others -- make HTTP requests to external websites that often block automated access.

Why LangChain Loaders Need Proxies

LangChain web loaders face blocking because:

  • **Bulk loading patterns**: RAG pipelines often load dozens or hundreds of pages during indexing, creating burst traffic that triggers rate limiting.
  • **Server-side execution**: Production LangChain applications run on cloud servers with datacenter IPs that are blocked by many websites.
  • **Repeated loading**: RAG systems periodically refresh their document index, requiring sustained access to source websites over time.
  • **Diverse sources**: A single RAG pipeline may ingest content from documentation sites, blogs, forums, and news outlets -- each with different protection levels.

Configuring Proxies for LangChain Loaders

#### WebBaseLoader with Proxy

```python
import requests
from langchain_community.document_loaders import WebBaseLoader

# Create a session with proxy configuration
session = requests.Session()
session.proxies = {
    "http": "http://user:pass@gate.hexproxies.com:8080",
    "https": "http://user:pass@gate.hexproxies.com:8080",
}

# Pass the proxied session to WebBaseLoader
loader = WebBaseLoader(
    web_paths=["https://example.com/docs/page1", "https://example.com/docs/page2"],
    session=session,
)

documents = loader.load()
```
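Hard-coding `user:pass` in the proxy URL breaks as soon as a password contains characters such as `@`, `:` or `/`. A minimal sketch of building the URL from environment variables with percent-encoded credentials follows; `HEX_PROXY_USER` and `HEX_PROXY_PASS` are hypothetical variable names, and the host and port are simply the ones used in this guide.

```python
import os
from urllib.parse import quote


def build_proxy_url(user, password, host="gate.hexproxies.com", port=8080):
    """Return an http:// proxy URL with percent-encoded credentials."""
    user_enc = quote(user, safe="")
    pass_enc = quote(password, safe="")
    return f"http://{user_enc}:{pass_enc}@{host}:{port}"


def proxies_from_env():
    """Build a requests-style proxies dict from environment variables.

    HEX_PROXY_USER and HEX_PROXY_PASS are hypothetical names chosen
    for this sketch; use whatever your deployment defines.
    """
    url = build_proxy_url(os.environ["HEX_PROXY_USER"], os.environ["HEX_PROXY_PASS"])
    return {"http": url, "https": url}
```

With this in place, `session.proxies = proxies_from_env()` can replace the literal dictionary, which also keeps credentials out of source control.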

#### AsyncHtmlLoader with Proxy

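```python
import os

from langchain_community.document_loaders import AsyncHtmlLoader

# Set proxy via environment variables for async loaders
os.environ["HTTP_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"
os.environ["HTTPS_PROXY"] = "http://user:pass@gate.hexproxies.com:8080"

# The URL list is the first positional argument (web_path).
# aiohttp only reads proxy environment variables when trust_env is enabled;
# this assumes your langchain-community version exposes the trust_env flag.
loader = AsyncHtmlLoader(
    ["https://example.com/page1", "https://example.com/page2"],
    trust_env=True,
)

# load() fetches the pages concurrently; inside a coroutine, use `await loader.aload()`
documents = loader.load()
```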
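Environment variables are process-wide, so any other HTTP client in the same process that honours `HTTP_PROXY` and `HTTPS_PROXY` will also be routed through the proxy. One way to limit that is to set the variables only while the loader runs. The `proxy_env` context manager below is a hypothetical helper sketched for this purpose, not part of LangChain.

```python
import os
from contextlib import contextmanager


@contextmanager
def proxy_env(proxy_url):
    """Temporarily set HTTP_PROXY/HTTPS_PROXY, restoring old values on exit."""
    names = ("HTTP_PROXY", "HTTPS_PROXY")
    saved = {name: os.environ.get(name) for name in names}
    try:
        for name in names:
            os.environ[name] = proxy_url
        yield
    finally:
        # Put back whatever was there before, or remove the variable entirely.
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old
```

The loader call then goes inside `with proxy_env("http://user:pass@gate.hexproxies.com:8080"):`, and requests made after the block exits are unaffected.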

#### PlaywrightURLLoader with Proxy (for JavaScript-rendered pages)

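```python
from langchain_community.document_loaders import PlaywrightURLLoader

# Playwright takes the proxy server and credentials as separate fields
loader = PlaywrightURLLoader(
    urls=["https://spa-site.com/content"],
    headless=True,
    proxy={
        "server": "http://gate.hexproxies.com:8080",
        "username": "user",
        "password": "your-password",
    },
)

documents = loader.load()
```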
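Because Playwright expects the server and credentials as separate fields while requests-style configuration uses a single URL, a small converter can keep both loaders on one setting. This is a sketch under the assumption that the proxy URL always includes an explicit port; `playwright_proxy` is a hypothetical helper name.

```python
from urllib.parse import unquote, urlsplit


def playwright_proxy(proxy_url):
    """Convert a proxy URL with credentials into a Playwright-style proxy dict.

    Assumes the URL carries an explicit port, e.g. http://user:pass@host:8080.
    """
    parts = urlsplit(proxy_url)
    config = {"server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    # Credentials in URLs are percent-encoded; Playwright wants them decoded.
    if parts.username:
        config["username"] = unquote(parts.username)
    if parts.password:
        config["password"] = unquote(parts.password)
    return config
```

The result can be passed as `proxy=playwright_proxy(url)` when constructing the loader.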

RAG Pipeline Integration

For a complete RAG pipeline with proxied web loading:

```python
import requests
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
# Older LangChain releases expose this as langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Proxied session for web loading
session = requests.Session()
session.proxies = {
    "http": "http://user:pass@gate.hexproxies.com:8080",
    "https": "http://user:pass@gate.hexproxies.com:8080",
}

# Load documents through proxy
urls = [
    "https://docs.example.com/guide",
    "https://blog.example.com/best-practices",
    "https://forum.example.com/faq",
]

loader = WebBaseLoader(web_paths=urls, session=session)
documents = loader.load()

# Split and embed
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
```

Batch Loading with Rate Limiting

For loading many pages, implement rate-limited batch loading:

```python
import time

import requests
from langchain_community.document_loaders import WebBaseLoader


def load_with_proxy(urls, delay=2):
    """Load URLs through proxy with rate limiting."""
    session = requests.Session()
    session.proxies = {
        "http": "http://user:pass@gate.hexproxies.com:8080",
        "https": "http://user:pass@gate.hexproxies.com:8080",
    }

    documents = []
    for url in urls:
        try:
            loader = WebBaseLoader(web_paths=[url], session=session)
            documents.extend(loader.load())
        except Exception as e:
            print(f"Failed to load {url}: {e}")
        finally:
            # Rate limit between requests, including after a failure
            time.sleep(delay)
    return documents
```
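A fixed delay skips a page permanently after a single failure. For transient errors, a retry with exponential backoff and jitter is a common complement. The sketch below is an assumption about how that could be layered on top, not part of the loader API: `load_with_retry` is a hypothetical helper, and `fetch` is any callable that returns documents or raises.

```python
import random
import time


def load_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff plus jitter.

    The last failure is re-raised so the caller can log or skip the URL.
    `sleep` is injectable to make the function easy to test.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == retries - 1:
                raise
            # Delay doubles each attempt: ~1s, ~2s, ~4s, plus up to 0.5s jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Attempt {attempt + 1} for {url} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

With WebBaseLoader this could be called as `load_with_retry(lambda u: WebBaseLoader(web_paths=[u], session=session).load(), url)`. Note that WebBaseLoader does not raise on HTTP error statuses unless it is constructed with `raise_for_status=True`, so without that flag a 403 response would not trigger a retry.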

Cost Estimation for RAG Pipelines

Web page content averages 50-200 KB of text per page. For a RAG pipeline ingesting 1,000 pages:

  • HTTP loading: ~100 MB = $0.43 at residential rates
  • Browser loading (with JS): ~2 GB = $8.50 at residential rates

Refresh cycles (daily/weekly) multiply these costs proportionally.
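The arithmetic behind these figures can be captured in a few lines. In this sketch the default rate of $4.25 per GB is an assumption inferred from the numbers above ($8.50 for 2 GB), not a quoted price, and sizes use decimal units (1 GB = 1,000,000 KB).

```python
def estimate_cost(pages, kb_per_page, usd_per_gb=4.25, refreshes=1):
    """Estimate proxy bandwidth cost in USD.

    usd_per_gb defaults to a rate inferred from this guide's examples;
    replace it with the rate on your actual plan.
    """
    gigabytes = pages * kb_per_page / 1_000_000
    return gigabytes * usd_per_gb * refreshes


# 1,000 pages at ~100 KB each over plain HTTP, loaded once
http_cost = estimate_cost(1000, 100)

# The same pages at ~2 MB each through a browser, refreshed weekly for a year
browser_yearly = estimate_cost(1000, 2000, refreshes=52)

print(f"HTTP, single load: ${http_cost:.3f}")
print(f"Browser, weekly for a year: ${browser_yearly:.2f}")
```

Passing the refresh count explicitly makes it easy to compare daily and weekly refresh schedules before committing to one.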

Tips

  • Use WebBaseLoader with a proxied requests.Session for the simplest integration -- the session sends the proxy credentials with every request, and any IP rotation happens on the proxy side.
  • Set environment variables (HTTP_PROXY and HTTPS_PROXY) for async loaders that do not accept session parameters directly.
  • Add 1-2 second delays between page loads even with rotating proxies to maintain long-term access to source sites.
  • Use PlaywrightURLLoader with proxy for JavaScript-rendered content that WebBaseLoader cannot parse.
  • Cache loaded documents locally to avoid re-fetching unchanged content during RAG index refreshes.
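The caching tip can be sketched with a small on-disk store keyed by a hash of the URL. `PageCache`, its file layout, and the one-week default age are all hypothetical choices for illustration; a production pipeline might instead rely on HTTP validators such as ETag.

```python
import hashlib
import json
import time
from pathlib import Path


class PageCache:
    """Tiny on-disk cache mapping URL -> page text, with a max age in seconds."""

    def __init__(self, directory, max_age=7 * 24 * 3600):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.max_age = max_age

    def _path(self, url):
        # Hash the URL so any URL maps to a safe, fixed-length file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get(self, url):
        """Return cached text for `url`, or None if missing or stale."""
        path = self._path(url)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["fetched_at"] > self.max_age:
            return None
        return entry["text"]

    def put(self, url, text):
        """Store `text` for `url` along with the fetch time."""
        entry = {"url": url, "fetched_at": time.time(), "text": text}
        self._path(url).write_text(json.dumps(entry), encoding="utf-8")
```

During a refresh, calling `cache.get(url)` first and only loading through the proxy on a `None` result keeps bandwidth costs tied to pages that are actually new or stale.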
