Web Scraping at Scale with Proxies
Scraping a few hundred pages requires no special infrastructure. Scraping millions of pages per day — reliably, without bans, at production quality — requires a purpose-built proxy architecture. This guide covers the patterns that separate hobby scrapers from production systems.
The Scale Problem
At scale, three forces work against you simultaneously:
- **Rate Limiting**: Websites enforce per-IP request limits. A single IP might handle 100 requests per hour before triggering a block.
- **Fingerprinting**: Anti-bot systems track request patterns — timing, headers, TLS fingerprints — to identify automated traffic.
- **Infrastructure Load**: Processing millions of responses requires distributed systems that can handle failures gracefully.
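To make the rate-limiting force concrete, here is a back-of-envelope calculation of how many distinct IPs a given daily volume requires. The function name and the 100-requests-per-hour figure are illustrative, taken from the example above; it assumes traffic is spread evenly over the day:

```python
import math


def required_pool_size(pages_per_day: int, per_ip_hourly_limit: int) -> int:
    """Minimum number of distinct IPs needed to stay under a per-IP
    rate limit, assuming requests are spread evenly across 24 hours."""
    pages_per_hour = pages_per_day / 24
    return math.ceil(pages_per_hour / per_ip_hourly_limit)


# 5 million pages/day against a 100 req/hour per-IP limit:
print(required_pool_size(5_000_000, 100))  # 2084
```

Real traffic is bursty, so a production pool needs meaningful headroom beyond this floor; the point is that even modest volumes quickly exceed what a handful of static IPs can absorb.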
Proxy Tier Selection
Different targets require different proxy types:
| Target Type | Recommended Proxy | Why |
|---|---|---|
| Low-protection sites | ISP proxies | Speed + unlimited bandwidth |
| High-protection sites | Residential rotating | IP diversity defeats fingerprinting |
| JS-rendered pages | Residential + headless browser | Mimics real user behavior |
| API endpoints | ISP sticky sessions | Consistent identity for auth flows |
Architecture Overview
A production scraping system has five layers:
```
URL Queue → Scheduler → Proxy Manager → Fetcher Pool → Pipeline
                              ↓
                    Hex Proxies Gateway
                  (gate.hexproxies.com)
```

Proxy Manager Implementation
The proxy manager selects the right proxy configuration per request based on the target domain and recent success rates:
```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyConfig:
    host: str = "gate.hexproxies.com"
    port: int = 8080
    username: str = ""
    password: str = ""
    session_id: str = ""
    country: str = ""

    @property
    def url(self) -> str:
        # Session stickiness and geo-targeting are encoded in the username.
        user_part = self.username
        if self.session_id:
            user_part = f"{self.username}-session-{self.session_id}"
        if self.country:
            user_part = f"{user_part}-country-{self.country}"
        return f"http://{user_part}:{self.password}@{self.host}:{self.port}"


class ProxyManager:
    def __init__(self, username: str, password: str):
        self._username = username
        self._password = password
        self._domain_stats: dict[str, dict[str, int]] = defaultdict(
            lambda: {"success": 0, "failure": 0}
        )

    def get_rotating_proxy(self, country: str = "") -> ProxyConfig:
        """New exit IP on every request."""
        return ProxyConfig(
            username=self._username,
            password=self._password,
            country=country,
        )

    def get_sticky_proxy(self, session_label: str, country: str = "") -> ProxyConfig:
        """Same exit IP for the lifetime of the session label."""
        return ProxyConfig(
            username=self._username,
            password=self._password,
            session_id=session_label,
            country=country,
        )

    def record_result(self, domain: str, success: bool) -> None:
        key = "success" if success else "failure"
        self._domain_stats[domain][key] += 1

    def success_rate(self, domain: str) -> float:
        stats = self._domain_stats[domain]
        total = stats["success"] + stats["failure"]
        return stats["success"] / max(total, 1)
```
Concurrent Fetcher Pool
Use asyncio to maintain high throughput while respecting per-domain limits:
```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

import aiohttp


class FetcherPool:
    def __init__(self, proxy_manager: ProxyManager, concurrency: int = 100):
        self._proxy_manager = proxy_manager
        # Global cap on in-flight requests.
        self._semaphore = asyncio.Semaphore(concurrency)
        # One request at a time per domain, to avoid hammering a single target.
        self._domain_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def fetch(self, url: str, session: aiohttp.ClientSession) -> dict:
        domain = urlparse(url).netloc
        async with self._semaphore:
            async with self._domain_locks[domain]:
                proxy = self._proxy_manager.get_rotating_proxy()
                try:
                    async with session.get(
                        url,
                        proxy=proxy.url,
                        timeout=aiohttp.ClientTimeout(total=30),
                    ) as resp:
                        text = await resp.text()
                        self._proxy_manager.record_result(domain, resp.status == 200)
                        return {"url": url, "status": resp.status, "body": text}
                except Exception as e:
                    self._proxy_manager.record_result(domain, False)
                    return {"url": url, "status": 0, "error": str(e)}
```
Anti-Detection Headers
Rotate realistic browser headers to avoid fingerprint-based detection:
```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
]


def build_headers() -> dict[str, str]:
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }
```
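One caveat when combining this with sticky sessions: a User-Agent that changes between requests on the same IP is itself a fingerprint. A minimal sketch of pinning one browser identity per session (the `SessionIdentity` helper is illustrative, not part of any library):

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
]


class SessionIdentity:
    """Pin one User-Agent per sticky session so the headers stay as
    consistent as the exit IP does."""

    def __init__(self) -> None:
        self._by_session: dict[str, str] = {}

    def user_agent(self, session_id: str) -> str:
        if session_id not in self._by_session:
            self._by_session[session_id] = random.choice(USER_AGENTS)
        return self._by_session[session_id]


identity = SessionIdentity()
# The same session label always presents the same browser identity:
assert identity.user_agent("checkout-42") == identity.user_agent("checkout-42")
```

Rotate freely across sessions, but keep each session's fingerprint stable for its lifetime.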
Failure Recovery and Retry Strategy
Implement exponential backoff with proxy rotation on failure:
```python
async def fetch_with_retry(
    url: str,
    session: aiohttp.ClientSession,
    proxy_manager: ProxyManager,
    max_retries: int = 3,
) -> dict:
    for attempt in range(max_retries):
        # Fresh exit IP on every attempt.
        proxy = proxy_manager.get_rotating_proxy()
        try:
            headers = build_headers()
            async with session.get(
                url,
                proxy=proxy.url,
                headers=headers,
                timeout=aiohttp.ClientTimeout(total=30),
            ) as resp:
                if resp.status == 200:
                    return {"url": url, "status": 200, "body": await resp.text()}
                if resp.status == 429:
                    # Back off harder on explicit rate limiting.
                    await asyncio.sleep(2 ** attempt * 5)
                    continue
        except Exception:
            await asyncio.sleep(2 ** attempt)
    return {"url": url, "status": 0, "error": "max retries exceeded"}
```

Monitoring and Observability
Track success rates, response times, and bandwidth consumption per domain and proxy type. Alert when success rates drop below 90% for critical domains — this usually indicates the target has updated its anti-bot rules and your scraping profile needs adjustment.
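The per-domain counters kept by the `ProxyManager` above are enough to drive that alert. A minimal sketch of the threshold check (the function name and sample stats are illustrative; a production system would use a rolling window rather than all-time counts):

```python
CRITICAL_THRESHOLD = 0.90  # alert when success rate drops below 90%


def domains_needing_attention(stats: dict[str, dict[str, int]],
                              threshold: float = CRITICAL_THRESHOLD) -> list[str]:
    """Return domains whose success rate has dropped below the threshold."""
    flagged = []
    for domain, s in stats.items():
        total = s["success"] + s["failure"]
        if total and s["success"] / total < threshold:
            flagged.append(domain)
    return flagged


# Example: 90% passes the check, 85% gets flagged.
stats = {
    "shop.example": {"success": 180, "failure": 20},
    "news.example": {"success": 170, "failure": 30},
}
print(domains_needing_attention(stats))  # ['news.example']
```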
Hex Proxies processes 50 billion requests per week and moves 800TB of traffic daily. Our infrastructure is built for exactly this kind of production workload.