
Web Scraping at Scale with Proxies

Last updated: April 2026

By Hex Proxies Engineering Team

A production-grade guide to scaling web scraping operations from hundreds to millions of pages per day using intelligent proxy management and distributed architecture.

advanced · 30 minutes · ai-data-science

Prerequisites

  • Python or Node.js proficiency
  • Understanding of HTTP protocol
  • Hex Proxies residential or ISP plan

Steps

1. Select proxy tier per target

Classify your target sites by protection level and choose ISP proxies for speed-sensitive targets or residential proxies for heavily protected sites.

2. Implement the proxy manager

Build a proxy manager that tracks per-domain success rates and automatically selects rotating or sticky sessions based on the target.

3. Configure the concurrent fetcher pool

Set up an asyncio-based fetcher with domain-level locking, connection pooling, and configurable concurrency limits.

4. Add anti-detection headers

Rotate User-Agent strings and realistic browser headers to avoid fingerprint-based blocking.

5. Implement retry and monitoring

Add exponential backoff with proxy rotation on failures and track success rates per domain for observability.


Scraping a few hundred pages requires no special infrastructure. Scraping millions of pages per day — reliably, without bans, at production quality — requires a purpose-built proxy architecture. This guide covers the patterns that separate hobby scrapers from production systems.

The Scale Problem

At scale, three forces work against you simultaneously:

  1. Rate Limiting: Websites enforce per-IP request limits. A single IP might handle 100 requests per hour before triggering a block.
  2. Fingerprinting: Anti-bot systems track request patterns — timing, headers, TLS fingerprints — to identify automated traffic.
  3. Infrastructure Load: Processing millions of responses requires distributed systems that can handle failures gracefully.
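To make the rate-limiting pressure concrete, here is a rough sizing sketch. It takes the illustrative figure of 100 requests per hour per IP from the list above and assumes traffic is spread evenly across IPs and across the day, which real workloads rarely manage. The function name `min_ips_needed` and the `safety_factor` parameter are hypothetical.

```python
import math


def min_ips_needed(
    pages_per_day: int,
    per_ip_hourly_limit: int,
    safety_factor: float = 1.0,
) -> int:
    """Smallest IP count that keeps every IP under its hourly limit.

    Assumes requests are spread evenly across IPs and across all 24 hours.
    safety_factor > 1.0 adds headroom for bursts and retries.
    """
    if pages_per_day < 0 or per_ip_hourly_limit <= 0 or safety_factor <= 0:
        raise ValueError("pages_per_day must be >= 0; limit and factor must be > 0")
    per_ip_daily_capacity = per_ip_hourly_limit * 24
    return math.ceil(pages_per_day * safety_factor / per_ip_daily_capacity)


# One million pages per day at 100 requests per hour per IP:
# each IP can serve 2,400 requests per day, so 1,000,000 / 2,400 rounds up to 417.
print(min_ips_needed(1_000_000, 100))
```

Under these assumptions, a million pages per day needs several hundred distinct IPs before retries are counted, which is why a rotating pool matters at this scale.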

Proxy Tier Selection

Different targets require different proxy types:

| Target Type | Recommended Proxy | Why |
| --- | --- | --- |
| Low-protection sites | ISP proxies | Speed + unlimited bandwidth |
| High-protection sites | Residential rotating | IP diversity defeats fingerprinting |
| JS-rendered pages | Residential + headless browser | Mimics real user behavior |
| API endpoints | ISP sticky sessions | Consistent identity for auth flows |
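The table can be encoded as a small lookup so that tier selection lives in one place. This is a minimal sketch: the `ProxyTier` enum, the `choose_tier` function, and the precedence order when a target matches several rows are all assumptions, not part of any Hex Proxies API.

```python
from enum import Enum


class ProxyTier(Enum):
    ISP = "isp"
    ISP_STICKY = "isp-sticky"
    RESIDENTIAL_ROTATING = "residential-rotating"
    RESIDENTIAL_HEADLESS = "residential-headless"


def choose_tier(*, high_protection: bool, needs_js: bool, is_api: bool) -> ProxyTier:
    """Map target traits to a proxy tier, following the table's four rows.

    Precedence (API, then JS rendering, then protection level) is an
    assumption made for this sketch.
    """
    if is_api:
        return ProxyTier.ISP_STICKY
    if needs_js:
        return ProxyTier.RESIDENTIAL_HEADLESS
    if high_protection:
        return ProxyTier.RESIDENTIAL_ROTATING
    return ProxyTier.ISP


# A plain HTML site with no notable protection maps to the ISP tier.
print(choose_tier(high_protection=False, needs_js=False, is_api=False))
```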

Architecture Overview

A production scraping system has five layers:

URL Queue → Scheduler → Proxy Manager → Fetcher Pool → Pipeline
                            ↓
                    Hex Proxies Gateway
                  (gate.hexproxies.com)
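As a rough illustration of how the layers connect, the sketch below wires a URL queue to a pool of workers that call a fetch coroutine and pass each result to a pipeline callback. The names `run_pipeline`, `fetch`, and `handle` are hypothetical, and scheduling and proxy selection are assumed to happen inside `fetch`.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def run_pipeline(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[dict]],
    handle: Callable[[dict], None],
    workers: int = 4,
) -> None:
    """Drain a URL queue with a fixed number of worker tasks."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                try:
                    result = await fetch(url)
                except Exception as exc:
                    # Turn a failed fetch into an error record so the worker survives.
                    result = {"url": url, "status": 0, "error": str(exc)}
                handle(result)
            finally:
                # Always mark the item done so queue.join() can return.
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()  # wait until every queued URL has been processed
    for task in tasks:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
```

The sketch assumes `handle` does not raise. In a real system the in-memory queue would usually be replaced by a durable one so that a crash does not lose pending URLs.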

Proxy Manager Implementation

The proxy manager selects the right proxy configuration per request based on the target domain and recent success rates:

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class ProxyConfig:
    host: str = "gate.hexproxies.com"
    port: int = 8080
    username: str = ""
    password: str = ""
    session_id: str = ""
    country: str = ""

    @property
    def url(self) -> str:
        # Session and country options are appended to the username.
        user_part = self.username
        if self.session_id:
            user_part = f"{user_part}-session-{self.session_id}"
        if self.country:
            user_part = f"{user_part}-country-{self.country}"
        # Percent-encode credentials so characters such as "@" or ":" cannot break the URL.
        user = quote(user_part, safe="")
        password = quote(self.password, safe="")
        return f"http://{user}:{password}@{self.host}:{self.port}"


class ProxyManager:
    def __init__(self, username: str, password: str):
        self._username = username
        self._password = password
        # Per-domain success and failure counters.
        self._domain_stats: dict[str, dict[str, int]] = defaultdict(
            lambda: {"success": 0, "failure": 0}
        )
        self._session_counter = 0

    def get_rotating_proxy(self, country: str = "") -> ProxyConfig:
        # No session ID: the gateway is free to pick a new IP per request.
        return ProxyConfig(
            username=self._username,
            password=self._password,
            country=country,
        )

    def get_sticky_proxy(self, session_label: str, country: str = "") -> ProxyConfig:
        # Reusing the same session label keeps the same identity across requests.
        return ProxyConfig(
            username=self._username,
            password=self._password,
            session_id=session_label,
            country=country,
        )

    def record_result(self, domain: str, success: bool) -> None:
        key = "success" if success else "failure"
        self._domain_stats[domain][key] += 1

    def success_rate(self, domain: str) -> float:
        # Returns 0.0 for a domain with no recorded results.
        stats = self._domain_stats[domain]
        total = stats["success"] + stats["failure"]
        return stats["success"] / max(total, 1)
```
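The counters above are cumulative, so a domain that worked well for a week can hide a block that started ten minutes ago. If "recent" success rate is what drives decisions, one option is a fixed-size rolling window per domain. The following is a minimal sketch; the class name `RollingSuccessRate` and the default window of 200 results are assumptions.

```python
from collections import defaultdict, deque


class RollingSuccessRate:
    """Success rate over the last `window` results for each domain."""

    def __init__(self, window: int = 200):
        if window <= 0:
            raise ValueError("window must be positive")
        # deque(maxlen=...) silently drops the oldest result once full.
        self._results = defaultdict(lambda: deque(maxlen=window))

    def record(self, domain: str, success: bool) -> None:
        self._results[domain].append(success)

    def rate(self, domain: str) -> float:
        results = self._results.get(domain)  # .get() does not create an entry
        if not results:
            # Assumption: a domain with no data yet is treated as healthy.
            # The cumulative success_rate() above returns 0.0 in that case.
            return 1.0
        return sum(results) / len(results)
```

A smaller window reacts faster to new blocks but is noisier; the right size depends on request volume per domain.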

Concurrent Fetcher Pool

Use asyncio to maintain high throughput while respecting per-domain limits:

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

import aiohttp


class FetcherPool:
    # ProxyManager is the class defined in the previous section.
    def __init__(self, proxy_manager: ProxyManager, concurrency: int = 100):
        self._proxy_manager = proxy_manager
        # Caps total in-flight requests across all domains.
        self._semaphore = asyncio.Semaphore(concurrency)
        # One lock per domain: at most one in-flight request per domain.
        self._domain_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def fetch(self, url: str, session: aiohttp.ClientSession) -> dict:
        domain = urlparse(url).netloc
        async with self._semaphore:
            async with self._domain_locks[domain]:
                proxy = self._proxy_manager.get_rotating_proxy()
                try:
                    async with session.get(
                        url,
                        proxy=proxy.url,
                        timeout=aiohttp.ClientTimeout(total=30),
                    ) as resp:
                        text = await resp.text()
                        # Only HTTP 200 counts as a success for the domain stats.
                        self._proxy_manager.record_result(domain, resp.status == 200)
                        return {"url": url, "status": resp.status, "body": text}
                except Exception as e:
                    # Timeouts, connection errors, and decode failures all count as failures.
                    self._proxy_manager.record_result(domain, False)
                    return {"url": url, "status": 0, "error": str(e)}
```
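A per-domain lock allows only one request per domain at a time, which is safe but can be slower than a target tolerates. A possible variation is a small semaphore per domain, so that each domain gets a configurable number of concurrent requests. This is a sketch only: `DomainLimiter` and its default limits are hypothetical, and the `fetch` argument stands in for whatever coroutine performs the request.

```python
import asyncio
from typing import Awaitable, Callable
from urllib.parse import urlparse


class DomainLimiter:
    """Global and per-domain concurrency caps.

    Create instances inside a running event loop so the semaphores bind to
    the right loop on older Python versions.
    """

    def __init__(self, global_limit: int = 100, per_domain_limit: int = 4):
        self._global = asyncio.Semaphore(global_limit)
        self._per_domain_limit = per_domain_limit
        self._per_domain: dict = {}

    def _domain_semaphore(self, domain: str) -> asyncio.Semaphore:
        if domain not in self._per_domain:
            self._per_domain[domain] = asyncio.Semaphore(self._per_domain_limit)
        return self._per_domain[domain]

    async def run(self, url: str, fetch: Callable[[str], Awaitable[dict]]) -> dict:
        domain = urlparse(url).netloc
        # Take the domain slot first so requests queued behind a busy domain
        # do not occupy global slots while they wait.
        async with self._domain_semaphore(domain):
            async with self._global:
                return await fetch(url)
```

Acquiring the domain slot before the global one is a design choice in this sketch: it keeps one slow domain from starving the rest of the crawl.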

Anti-Detection Headers

Rotate realistic browser headers to avoid fingerprint-based detection:

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
]


def build_headers() -> dict[str, str]:
    return {
        # A new User-Agent is picked on every call.
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        # aiohttp can only decode "br" responses if a Brotli package is installed.
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }
```
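Random rotation suits rotating proxies, but a sticky session that changes its User-Agent between requests looks inconsistent. One way to keep the two aligned is to derive the User-Agent from the session label. The sketch below is an assumption-laden illustration: `user_agent_for_session` and `USER_AGENT_POOL` are hypothetical names, and the pool reuses two of the strings listed above.

```python
import hashlib
from typing import Sequence

USER_AGENT_POOL = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
)


def user_agent_for_session(
    session_label: str,
    pool: Sequence[str] = USER_AGENT_POOL,
) -> str:
    """Pick the same User-Agent every time for a given session label."""
    if not pool:
        raise ValueError("pool must not be empty")
    # hashlib is used instead of the built-in hash(), which is salted per
    # process and would give a different answer after a restart.
    digest = hashlib.sha256(session_label.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]
```

The mapping changes if the pool is reordered or resized, so long-lived sessions are safer with the chosen User-Agent stored alongside the session label.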

Failure Recovery and Retry Strategy

Implement exponential backoff with proxy rotation on failure:

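# Assumes the imports, ProxyManager, and build_headers() from the sections above.
async def fetch_with_retry(
    url: str,
    session: aiohttp.ClientSession,
    proxy_manager: ProxyManager,
    max_retries: int = 3,
) -> dict:
    domain = urlparse(url).netloc
    timeout = aiohttp.ClientTimeout(total=30)
    for attempt in range(max_retries):
        # Each attempt asks for a rotating proxy, so a retry can leave from a new IP.
        proxy = proxy_manager.get_rotating_proxy()
        delay = 2 ** attempt  # 1s, 2s, 4s, ...
        try:
            async with session.get(url, proxy=proxy.url, headers=build_headers(), timeout=timeout) as resp:
                if resp.status == 200:
                    body = await resp.text()
                    proxy_manager.record_result(domain, True)
                    return {"url": url, "status": 200, "body": body}
                if resp.status == 429:
                    delay *= 5  # rate limited: back off harder
        except Exception:
            pass  # network error or timeout: fall through to the backoff below
        proxy_manager.record_result(domain, False)
        # Back off after any failed attempt, but do not sleep after the last one.
        if attempt < max_retries - 1:
            await asyncio.sleep(delay)
    return {"url": url, "status": 0, "error": "max retries exceeded"}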
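When many workers fail at the same moment, identical backoff delays make them all retry together. A common refinement, not part of the function above, is to randomize each delay. The sketch below uses the "full jitter" variant, where the delay is drawn uniformly between zero and a capped exponential ceiling. The name `backoff_delay`, the 60-second cap, and the injectable `rng` parameter are assumptions; the 5x multiplier for rate-limited responses mirrors the 429 handling above.

```python
import random
from typing import Callable


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rate_limited: bool = False,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full-jitter exponential backoff delay in seconds.

    attempt is zero-based. rng must return a float in [0, 1]; it is a
    parameter only so the function can be tested deterministically.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    multiplier = 5 if rate_limited else 1
    ceiling = min(cap, base * (2 ** attempt) * multiplier)
    return rng() * ceiling
```

In the retry loop, `await asyncio.sleep(backoff_delay(attempt, rate_limited=resp.status == 429))` would replace the fixed delays.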

Monitoring and Observability

Track success rates, response times, and bandwidth consumption per domain and proxy type. Alert when success rates drop below 90% for critical domains — this usually indicates the target has updated its anti-bot rules and your scraping profile needs adjustment.
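The thresholds can be turned into a small classification step that a monitoring loop calls per domain. This sketch uses the 90% alert level mentioned above and the 80% auto-pause level from the tips in this guide; the function name `domain_health`, the status strings, and the `min_samples` guard are assumptions.

```python
def domain_health(
    success: int,
    failure: int,
    alert_below: float = 0.90,
    pause_below: float = 0.80,
    min_samples: int = 50,
) -> str:
    """Classify a domain as 'warming_up', 'ok', 'alert', or 'pause'."""
    total = success + failure
    if total < min_samples:
        # Too few requests to judge; avoids pausing a domain after one bad response.
        return "warming_up"
    rate = success / total
    if rate < pause_below:
        return "pause"
    if rate < alert_below:
        return "alert"
    return "ok"


# 85 successes out of 100 is below the 90% alert level but above the 80% pause level.
print(domain_health(85, 15))
```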

Hex Proxies operates a Cloudflare-fronted global edge with primary US POPs and multi-Gbps capacity. Our infrastructure is built for exactly this kind of production workload.

Tips

  • Start with ISP proxies for simple targets — they offer unlimited bandwidth and sub-50ms latency.
  • Switch to residential rotating proxies when you need IP diversity across thousands of unique addresses.
  • Implement per-domain concurrency limits. Sending 100 concurrent requests to a single domain is very likely to trigger blocks.
  • Rotate User-Agent strings per request, but keep them consistent within a sticky session.
  • Monitor success rates in real time and auto-pause domains that drop below 80% to preserve IP reputation.
  • Use Hex Proxies country targeting to match the geographic location expected by each target site.
