
Web Scraping at Scale with Proxies

Last updated: April 2026

By Hex Proxies Engineering Team

A production-grade guide to scaling web scraping operations from hundreds to millions of pages per day using intelligent proxy management and distributed architecture.

Advanced · 30 minutes · ai-data-science

Prerequisites

  • Python or Node.js proficiency
  • Understanding of HTTP protocol
  • Hex Proxies residential or ISP plan

Steps

1

Select proxy tier per target

Classify your target sites by protection level and choose ISP proxies for speed-sensitive targets or residential proxies for heavily protected sites.

2

Implement the proxy manager

Build a proxy manager that tracks per-domain success rates and automatically selects rotating or sticky sessions based on the target.

3

Configure the concurrent fetcher pool

Set up an asyncio-based fetcher with domain-level locking, connection pooling, and configurable concurrency limits.

4

Add anti-detection headers

Rotate User-Agent strings and realistic browser headers to avoid fingerprint-based blocking.

5

Implement retry and monitoring

Add exponential backoff with proxy rotation on failures and track success rates per domain for observability.


Scraping a few hundred pages requires no special infrastructure. Scraping millions of pages per day — reliably, without bans, at production quality — requires a purpose-built proxy architecture. This guide covers the patterns that separate hobby scrapers from production systems.

The Scale Problem

At scale, three forces work against you simultaneously:

  1. **Rate Limiting**: Websites enforce per-IP request limits. A single IP might handle 100 requests per hour before triggering a block.
  2. **Fingerprinting**: Anti-bot systems track request patterns — timing, headers, TLS fingerprints — to identify automated traffic.
  3. **Infrastructure Load**: Processing millions of responses requires distributed systems that can handle failures gracefully.
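To make the first constraint concrete, here is a back-of-the-envelope sizing sketch. The 100 requests/hour figure is the illustrative limit from above; the function name is ours, not part of any API:

```python
def required_ips(pages_per_day: int, per_ip_per_hour: int = 100) -> int:
    """Estimate how many distinct IPs are needed to sustain a daily
    page volume without any single IP exceeding its rate limit."""
    per_ip_per_day = per_ip_per_hour * 24   # 2,400 pages/day per IP
    # Ceiling division: a fractional IP still needs a whole address
    return -(-pages_per_day // per_ip_per_day)

print(required_ips(1_000_000))  # → 417
```

At a million pages per day you already need hundreds of distinct IPs, which is why rotating pools, not static IP lists, are the starting point at this scale.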

Proxy Tier Selection

Different targets require different proxy types:

| Target Type | Recommended Proxy | Why |
|---|---|---|
| Low-protection sites | ISP proxies | Speed + unlimited bandwidth |
| High-protection sites | Residential rotating | IP diversity defeats fingerprinting |
| JS-rendered pages | Residential + headless browser | Mimics real user behavior |
| API endpoints | ISP sticky sessions | Consistent identity for auth flows |
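One way to encode the table is a lookup the scheduler consults per target. The labels below are illustrative only, not Hex Proxies API values; defaulting to residential rotating is our suggested conservative fallback:

```python
# Illustrative encoding of the tier-selection table; keys and values
# are example labels, not product identifiers.
TIER_BY_TARGET = {
    "low_protection": "isp",
    "high_protection": "residential_rotating",
    "js_rendered": "residential_browser",
    "api_endpoint": "isp_sticky",
}

def select_tier(target_type: str) -> str:
    # Unknown targets fall back to the safest (if slowest) option
    return TIER_BY_TARGET.get(target_type, "residential_rotating")

print(select_tier("low_protection"))  # → isp
```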

Architecture Overview

A production scraping system has five layers:

URL Queue → Scheduler → Proxy Manager → Fetcher Pool → Pipeline
                            ↓
                    Hex Proxies Gateway
                  (gate.hexproxies.com)
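The queue-to-fetcher flow above can be sketched as an `asyncio.Queue` feeding a worker pool. This toy version does no real fetching; it only shows the control flow between the scheduler (which enqueues and waits for the queue to drain) and the fetcher workers:

```python
import asyncio

async def worker(queue: asyncio.Queue, results: list) -> None:
    # Fetcher-pool worker: pull a URL, "fetch" it, push the result downstream.
    while True:
        url = await queue.get()
        results.append(f"fetched:{url}")   # stand-in for the real fetch
        queue.task_done()

async def run_pipeline(urls: list[str], workers: int = 3) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    results: list[str] = []
    tasks = [asyncio.create_task(worker(queue, results)) for _ in range(workers)]
    for url in urls:
        queue.put_nowait(url)
    await queue.join()          # scheduler blocks until the queue drains
    for t in tasks:
        t.cancel()
    return results

print(asyncio.run(run_pipeline(["https://a.example", "https://b.example"])))
```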

Proxy Manager Implementation

The proxy manager selects the right proxy configuration per request based on the target domain and recent success rates:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyConfig:
    host: str = "gate.hexproxies.com"
    port: int = 8080
    username: str = ""
    password: str = ""
    session_id: str = ""
    country: str = ""

    @property
    def url(self) -> str:
        # Session and country targeting are encoded into the username
        user_part = self.username
        if self.session_id:
            user_part = f"{self.username}-session-{self.session_id}"
        if self.country:
            user_part = f"{user_part}-country-{self.country}"
        return f"http://{user_part}:{self.password}@{self.host}:{self.port}"


class ProxyManager:
    def __init__(self, username: str, password: str):
        self._username = username
        self._password = password
        self._domain_stats: dict[str, dict[str, int]] = defaultdict(
            lambda: {"success": 0, "failure": 0}
        )
        self._session_counter = 0

    def get_rotating_proxy(self, country: str = "") -> ProxyConfig:
        return ProxyConfig(
            username=self._username,
            password=self._password,
            country=country,
        )

    def get_sticky_proxy(self, session_label: str, country: str = "") -> ProxyConfig:
        return ProxyConfig(
            username=self._username,
            password=self._password,
            session_id=session_label,
            country=country,
        )

    def record_result(self, domain: str, success: bool) -> None:
        key = "success" if success else "failure"
        stats = self._domain_stats[domain]
        self._domain_stats[domain] = {**stats, key: stats[key] + 1}

    def success_rate(self, domain: str) -> float:
        stats = self._domain_stats[domain]
        total = stats["success"] + stats["failure"]
        return stats["success"] / max(total, 1)
```
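To see what the manager actually hands to the HTTP client, here is a standalone mirror of `ProxyConfig.url`. The username-encoding convention comes from the class above; the account name and session label are made up for illustration:

```python
def build_proxy_url(
    username: str,
    password: str,
    session_id: str = "",
    country: str = "",
    host: str = "gate.hexproxies.com",
    port: int = 8080,
) -> str:
    """Standalone mirror of ProxyConfig.url: sticky-session and
    country targeting are encoded into the gateway username."""
    user = username
    if session_id:
        user = f"{username}-session-{session_id}"
    if country:
        user = f"{user}-country-{country}"
    return f"http://{user}:{password}@{host}:{port}"

print(build_proxy_url("acct", "secret", session_id="job42", country="us"))
# → http://acct-session-job42-country-us:secret@gate.hexproxies.com:8080
```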

Concurrent Fetcher Pool

Use asyncio to maintain high throughput while respecting per-domain limits:

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

import aiohttp


class FetcherPool:
    def __init__(self, proxy_manager: ProxyManager, concurrency: int = 100):
        self._proxy_manager = proxy_manager
        self._semaphore = asyncio.Semaphore(concurrency)
        self._domain_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def fetch(self, url: str, session: aiohttp.ClientSession) -> dict:
        domain = urlparse(url).netloc
        async with self._semaphore:
            # Per-domain lock serializes requests to any single domain
            async with self._domain_locks[domain]:
                proxy = self._proxy_manager.get_rotating_proxy()
                try:
                    async with session.get(
                        url,
                        proxy=proxy.url,
                        timeout=aiohttp.ClientTimeout(total=30),
                    ) as resp:
                        text = await resp.text()
                        self._proxy_manager.record_result(domain, resp.status == 200)
                        return {"url": url, "status": resp.status, "body": text}
                except Exception as e:
                    self._proxy_manager.record_result(domain, False)
                    return {"url": url, "status": 0, "error": str(e)}
```

Anti-Detection Headers

Rotate realistic browser headers to avoid fingerprint-based detection:

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
]


def build_headers() -> dict[str, str]:
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }
```
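Within a sticky session, though, the fingerprint should not change between requests (see the Tips below). One way to keep the identity stable is to seed the User-Agent choice from the session label. A minimal sketch, using a trimmed copy of the list above; `build_sticky_headers` is our own helper name:

```python
import random

USER_AGENTS = [  # trimmed copy of the full list above
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
]

def build_sticky_headers(session_id: str) -> dict[str, str]:
    """Pick a User-Agent deterministically from the session id, so every
    request in a sticky session presents the same browser identity."""
    rng = random.Random(session_id)   # seeded: same session, same UA
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
```

Rotating sessions can keep calling `build_headers()` per request; sticky sessions call this variant with their session label instead.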

Failure Recovery and Retry Strategy

Implement exponential backoff with proxy rotation on failure:

```python
async def fetch_with_retry(
    url: str,
    session: aiohttp.ClientSession,
    proxy_manager: ProxyManager,
    max_retries: int = 3,
) -> dict:
    for attempt in range(max_retries):
        proxy = proxy_manager.get_rotating_proxy()  # fresh IP each attempt
        try:
            headers = build_headers()
            async with session.get(
                url,
                proxy=proxy.url,
                headers=headers,
                timeout=aiohttp.ClientTimeout(total=30),
            ) as resp:
                if resp.status == 200:
                    return {"url": url, "status": 200, "body": await resp.text()}
                if resp.status == 429:
                    # Rate-limited: back off harder before retrying
                    await asyncio.sleep(2 ** attempt * 5)
                    continue
        except Exception:
            await asyncio.sleep(2 ** attempt)
    return {"url": url, "status": 0, "error": "max retries exceeded"}
```

Monitoring and Observability

Track success rates, response times, and bandwidth consumption per domain and proxy type. Alert when success rates drop below 90% for critical domains — this usually indicates the target has updated its anti-bot rules and your scraping profile needs adjustment.
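The auto-pause logic is simple to sketch. The 90% threshold comes from the text above; the minimum-sample guard is our own addition, added so that one early failure does not pause a domain:

```python
def should_pause(
    success: int,
    failure: int,
    threshold: float = 0.9,
    min_samples: int = 50,  # assumption: don't judge a domain on few requests
) -> bool:
    """Flag a domain whose success rate has dropped below the alert
    threshold, once enough requests have been observed."""
    total = success + failure
    if total < min_samples:
        return False
    return success / total < threshold

print(should_pause(80, 20))  # 80% over 100 requests → True
print(should_pause(1, 3))    # only 4 samples → False
```

Wired to `ProxyManager.success_rate`, this gives the scheduler a cheap per-domain circuit breaker.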

Hex Proxies processes 50 billion requests per week, moving roughly 800TB of traffic daily. Our infrastructure is built for exactly this kind of production workload.

Tips

  • Start with ISP proxies for simple targets — they offer unlimited bandwidth and sub-50ms latency.
  • Switch to residential rotating proxies when you need IP diversity across thousands of unique addresses.
  • Implement per-domain concurrency limits — blasting 100 concurrent requests at one domain guarantees blocks.
  • Rotate User-Agent strings per request, but keep them consistent within a sticky session.
  • Monitor success rates in real-time and auto-pause domains that drop below 80% to preserve IP reputation.
  • Use Hex Proxies country targeting to match the geographic location expected by each target site.

Ready to Get Started?

Put this guide into practice with Hex Proxies.
