Web Scraping at Scale with Proxies
Scraping a few hundred pages requires no special infrastructure. Scraping millions of pages per day — reliably, without bans, at production quality — requires a purpose-built proxy architecture. This guide covers the patterns that separate hobby scrapers from production systems.
The Scale Problem
At scale, three forces work against you simultaneously:
- **Rate Limiting**: Websites enforce per-IP request limits. A single IP might handle 100 requests per hour before triggering a block.
- **Fingerprinting**: Anti-bot systems track request patterns — timing, headers, TLS fingerprints — to identify automated traffic.
- **Infrastructure Load**: Processing millions of responses requires distributed systems that can handle failures gracefully.
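To make the rate-limiting force concrete, here is a back-of-envelope calculation of how many distinct IPs a given daily volume requires. The function name and the 100-requests-per-hour figure are illustrative, taken from the example above; it assumes traffic is spread evenly over the day:

```python
import math


def required_pool_size(pages_per_day: int, per_ip_hourly_limit: int) -> int:
    """Minimum number of distinct IPs needed to stay under a per-IP
    rate limit, assuming requests are spread evenly across 24 hours."""
    pages_per_hour = pages_per_day / 24
    return math.ceil(pages_per_hour / per_ip_hourly_limit)


# 5 million pages/day against a 100 req/hour per-IP limit:
print(required_pool_size(5_000_000, 100))  # 2084
```

Real traffic is bursty, so a production pool needs meaningful headroom beyond this floor; the point is that even modest volumes quickly exceed what a handful of static IPs can absorb.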
Proxy Tier Selection
Different targets require different proxy types:
| Target Type | Recommended Proxy | Why |
|---|---|---|
| Low-protection sites | ISP proxies | Speed + unlimited bandwidth |
| High-protection sites | Residential rotating | IP diversity defeats fingerprinting |
| JS-rendered pages | Residential + headless browser | Mimics real user behavior |
| API endpoints | ISP sticky sessions | Consistent identity for auth flows |
Architecture Overview
A production scraping system has five layers:
```
URL Queue → Scheduler → Proxy Manager → Fetcher Pool → Pipeline
                              ↓
                    Hex Proxies Gateway
                  (gate.hexproxies.com)
```

Proxy Manager Implementation
The proxy manager selects the right proxy configuration per request based on the target domain and recent success rates:
```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyConfig:
    host: str = "gate.hexproxies.com"
    port: int = 8080
    username: str = ""
    password: str = ""
    session_id: str = ""
    country: str = ""

    @property
    def url(self) -> str:
        # Session stickiness and geo-targeting are encoded in the username.
        user_part = self.username
        if self.session_id:
            user_part = f"{self.username}-session-{self.session_id}"
        if self.country:
            user_part = f"{user_part}-country-{self.country}"
        return f"http://{user_part}:{self.password}@{self.host}:{self.port}"


class ProxyManager:
    def __init__(self, username: str, password: str):
        self._username = username
        self._password = password
        self._domain_stats: dict[str, dict[str, int]] = defaultdict(
            lambda: {"success": 0, "failure": 0}
        )

    def get_rotating_proxy(self, country: str = "") -> ProxyConfig:
        """New exit IP on every request."""
        return ProxyConfig(
            username=self._username,
            password=self._password,
            country=country,
        )

    def get_sticky_proxy(self, session_label: str, country: str = "") -> ProxyConfig:
        """Same exit IP for the lifetime of the session label."""
        return ProxyConfig(
            username=self._username,
            password=self._password,
            session_id=session_label,
            country=country,
        )

    def record_result(self, domain: str, success: bool) -> None:
        key = "success" if success else "failure"
        self._domain_stats[domain][key] += 1

    def success_rate(self, domain: str) -> float:
        stats = self._domain_stats[domain]
        total = stats["success"] + stats["failure"]
        return stats["success"] / max(total, 1)
```
Concurrent Fetcher Pool
Use asyncio to maintain high throughput while respecting per-domain limits:
```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

import aiohttp


class FetcherPool:
    def __init__(self, proxy_manager: ProxyManager, concurrency: int = 100):
        self._proxy_manager = proxy_manager
        # Global cap on in-flight requests.
        self._semaphore = asyncio.Semaphore(concurrency)
        # One request at a time per domain, to avoid hammering a single target.
        self._domain_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def fetch(self, url: str, session: aiohttp.ClientSession) -> dict:
        domain = urlparse(url).netloc
        async with self._semaphore:
            async with self._domain_locks[domain]:
                proxy = self._proxy_manager.get_rotating_proxy()
                try:
                    async with session.get(
                        url,
                        proxy=proxy.url,
                        timeout=aiohttp.ClientTimeout(total=30),
                    ) as resp:
                        text = await resp.text()
                        self._proxy_manager.record_result(domain, resp.status == 200)
                        return {"url": url, "status": resp.status, "body": text}
                except Exception as e:
                    self._proxy_manager.record_result(domain, False)
                    return {"url": url, "status": 0, "error": str(e)}
```
Anti-Detection Headers
Rotate realistic browser headers to avoid fingerprint-based detection:
```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
]


def build_headers() -> dict[str, str]:
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }
```
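One caveat when combining this with sticky sessions: a User-Agent that changes between requests on the same IP is itself a fingerprint. A minimal sketch of pinning one browser identity per session (the `SessionIdentity` helper is illustrative, not part of any library):

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
]


class SessionIdentity:
    """Pin one User-Agent per sticky session so the headers stay as
    consistent as the exit IP does."""

    def __init__(self) -> None:
        self._by_session: dict[str, str] = {}

    def user_agent(self, session_id: str) -> str:
        if session_id not in self._by_session:
            self._by_session[session_id] = random.choice(USER_AGENTS)
        return self._by_session[session_id]


identity = SessionIdentity()
# The same session label always presents the same browser identity:
assert identity.user_agent("checkout-42") == identity.user_agent("checkout-42")
```

Rotate freely across sessions, but keep each session's fingerprint stable for its lifetime.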
Failure Recovery and Retry Strategy
Implement exponential backoff with proxy rotation on failure:
```python
async def fetch_with_retry(
    url: str,
    session: aiohttp.ClientSession,
    proxy_manager: ProxyManager,
    max_retries: int = 3,
) -> dict:
    for attempt in range(max_retries):
        # Fresh exit IP on every attempt.
        proxy = proxy_manager.get_rotating_proxy()
        try:
            headers = build_headers()
            async with session.get(
                url,
                proxy=proxy.url,
                headers=headers,
                timeout=aiohttp.ClientTimeout(total=30),
            ) as resp:
                if resp.status == 200:
                    return {"url": url, "status": 200, "body": await resp.text()}
                if resp.status == 429:
                    # Back off harder on explicit rate limiting.
                    await asyncio.sleep(2 ** attempt * 5)
                    continue
        except Exception:
            await asyncio.sleep(2 ** attempt)
    return {"url": url, "status": 0, "error": "max retries exceeded"}
```

Monitoring and Observability
Track success rates, response times, and bandwidth consumption per domain and proxy type. Alert when success rates drop below 90% for critical domains — this usually indicates the target has updated its anti-bot rules and your scraping profile needs adjustment.
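The per-domain counters kept by the `ProxyManager` above are enough to drive that alert. A minimal sketch of the threshold check (the function name and sample stats are illustrative; a production system would use a rolling window rather than all-time counts):

```python
CRITICAL_THRESHOLD = 0.90  # alert when success rate drops below 90%


def domains_needing_attention(stats: dict[str, dict[str, int]],
                              threshold: float = CRITICAL_THRESHOLD) -> list[str]:
    """Return domains whose success rate has dropped below the threshold."""
    flagged = []
    for domain, s in stats.items():
        total = s["success"] + s["failure"]
        if total and s["success"] / total < threshold:
            flagged.append(domain)
    return flagged


# Example: 90% passes the check, 85% gets flagged.
stats = {
    "shop.example": {"success": 180, "failure": 20},
    "news.example": {"success": 170, "failure": 30},
}
print(domains_needing_attention(stats))  # ['news.example']
```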
Hex Proxies processes 50 billion requests per week and moves 800TB of traffic daily. Our infrastructure is built for exactly this kind of production workload.