
How to Avoid IP Bans When Web Scraping

By Hex Proxies Engineering Team

Getting banned while scraping is one of the most frustrating experiences in data collection. You've built your scraper, configured your pipeline, and started collecting data — only to find your requests returning 403 errors, CAPTCHAs, or empty pages after a few hundred requests. IP bans are the most common obstacle in web scraping, but they're also the most preventable. This guide covers every practical strategy for keeping your scrapers running smoothly.

Why Websites Ban IP Addresses

Understanding why bans happen is the first step to avoiding them. Websites detect and block scrapers through several mechanisms:

Request Volume

The most basic detection method. If an IP address sends 500 requests per minute to the same website, it's clearly not a human browsing casually. Most web servers log request frequency by IP, and crossing a threshold triggers an automatic block.

Request Patterns

Human browsing is unpredictable. We click links, read content, go back, click something else. Scrapers typically follow predictable patterns: sequential URLs (page/1, page/2, page/3), uniform timing between requests, and systematic crawling of entire site sections. These patterns are easy to detect statistically.

IP Reputation

Anti-bot services maintain databases of IP addresses categorized by type and risk. Datacenter IPs, known VPN exit nodes, and IPs with a history of bot activity are flagged before they even make a request. Services like Cloudflare, DataDome, and PerimeterX check every incoming request against these databases.

Missing or Suspicious Headers

A normal web browser sends a rich set of HTTP headers: User-Agent, Accept, Accept-Language, Accept-Encoding, Connection, and often Referer. A bare-bones scraper sending only a User-Agent (or no headers at all) is immediately suspicious.

Browser Fingerprinting

Advanced detection goes beyond IP and headers. JavaScript challenges test for browser characteristics like canvas rendering, WebGL capabilities, screen resolution, installed plugins, and timing behaviors. Headless browsers have tells that sophisticated detection can identify.

Behavioral Analysis

The newest generation of anti-bot systems analyzes entire sessions, not just individual requests. They track mouse movement patterns, scroll behavior, click locations, and navigation sequences. A session with no mouse activity that hits 50 pages in 2 minutes is obviously automated.

Strategy 1: Use the Right Proxies

Proxies are the foundation of ban avoidance. Without them, you're limited to the requests your single IP can make before getting blocked.

Rotating Residential Proxies

For scraping well-protected sites, rotating residential proxies are the most effective option. Each request exits through a different IP from a pool of millions, and because those IPs are assigned by real consumer ISPs to household connections, they carry high trust scores.

import requests

# Gateway-based rotation — new IP per request automatically
proxy = {
    "http": "http://YOUR_USERNAME-country-us:pass@gate.hexproxies.com:8080",
    "https": "http://YOUR_USERNAME-country-us:pass@gate.hexproxies.com:8080"
}

response = requests.get("https://target-site.com/data", proxies=proxy)

For detailed proxy selection advice, see our guide on the best proxies for web scraping.

ISP Proxies for Session Scraping

When you need to maintain a login session while scraping, ISP proxies provide static IPs with residential-level trust. Use one IP per account session to avoid detection. See our ISP vs. datacenter comparison for details.
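A minimal sketch of that one-IP-per-account pattern (the account names and proxy endpoints below are hypothetical placeholders):

```python
# Hypothetical mapping: each account is pinned to one static ISP proxy
ACCOUNT_PROXIES = {
    "account_a": "http://user:pass@198.51.100.10:8080",
    "account_b": "http://user:pass@198.51.100.11:8080",
}

def proxies_for_account(account):
    """Return a requests-style proxies dict pinned to this account's static IP."""
    url = ACCOUNT_PROXIES[account]
    return {"http": url, "https": url}
```

Attach the returned dict to a requests.Session so every request for that account leaves through the same address.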

Proxy Pool Management

Don't burn through your entire proxy pool on one site. Assign subsets of your pool to different targets:

import time

class ProxyPoolManager:
    def __init__(self, proxies):
        self.pools = {}
        self.all_proxies = proxies
        self.failed_proxies = {}

    def get_pool(self, domain):
        if domain not in self.pools:
            # Assign a slice of the proxy pool to this domain,
            # wrapping around the list so the slice is always full-sized
            start = hash(domain) % len(self.all_proxies)
            pool_size = min(50, max(1, len(self.all_proxies) // 4))
            doubled = self.all_proxies + self.all_proxies
            self.pools[domain] = doubled[start:start + pool_size]
        return self.pools[domain]

    def mark_failed(self, proxy, domain):
        """Track failed proxies per domain to avoid reusing banned IPs."""
        key = f"{proxy}:{domain}"
        self.failed_proxies[key] = time.time()

    def is_available(self, proxy, domain, cooldown=3600):
        """Check if a proxy is available for a domain (not recently failed)."""
        key = f"{proxy}:{domain}"
        if key in self.failed_proxies:
            return (time.time() - self.failed_proxies[key]) > cooldown
        return True

Strategy 2: Implement Intelligent Rate Limiting

Respect the Site's Limits

Many sites publish rate limits in their API documentation or robots.txt file. Always check these first:

from urllib.robotparser import RobotFileParser

def get_crawl_delay(domain):
    """Check robots.txt for crawl-delay guidance."""
    rp = RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    try:
        rp.read()
    except OSError:
        return 2.0  # robots.txt unreachable; fall back to a safe default
    delay = rp.crawl_delay("*")
    return delay if delay else 2.0  # Default to 2 seconds

Add Random Delays

Uniform delays are a bot signal. Add randomness to your request timing:

import time
import random

def human_delay(base=2.0, variance=1.5):
    """Generate a human-like random delay."""
    delay = base + random.uniform(-variance, variance)
    # Occasionally add a longer pause (simulating reading a page)
    if random.random() < 0.1:
        delay += random.uniform(3.0, 8.0)
    time.sleep(max(0.5, delay))

Adaptive Rate Limiting

Adjust your request rate based on response signals:

import time
import random

class AdaptiveRateLimiter:
    def __init__(self, initial_delay=2.0):
        self.delay = initial_delay
        self.min_delay = 0.5
        self.max_delay = 30.0
        self.consecutive_success = 0
        self.consecutive_fail = 0

    def record_response(self, status_code):
        if status_code == 200:
            self.consecutive_success += 1
            self.consecutive_fail = 0
            # Speed up gradually after consistent success
            if self.consecutive_success > 10:
                self.delay = max(self.min_delay, self.delay * 0.9)
        elif status_code in (429, 403, 503):
            self.consecutive_fail += 1
            self.consecutive_success = 0
            # Slow down aggressively on blocks
            self.delay = min(self.max_delay, self.delay * 2.0)

    def wait(self):
        jitter = random.uniform(0.5, 1.5)
        time.sleep(self.delay * jitter)

Strategy 3: Rotate and Randomize Headers

User-Agent Rotation

A single User-Agent string across thousands of requests is a dead giveaway. Maintain a list of current, realistic User-Agent strings:

import random

USER_AGENTS = [
    # Chrome on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36",
    # Chrome on Mac
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36",
    # Firefox on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:134.0) Gecko/20100101 Firefox/134.0",
    # Firefox on Mac
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:134.0) Gecko/20100101 Firefox/134.0",
    # Safari on Mac
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
    # Edge on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36 Edg/133.0.0.0",
]

def get_random_headers():
    ua = random.choice(USER_AGENTS)
    headers = {
        "User-Agent": ua,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }
    return headers

Consistent Header-UA Pairing

Different browsers send slightly different header sets. Make sure your headers are consistent with your User-Agent. A Chrome User-Agent sending Firefox-specific headers is a red flag.
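One way to keep the pairing consistent, sketched below: choose the User-Agent first, then derive the extra headers from its browser family. The UA strings and header values here are representative, not exhaustive, since what each browser sends changes between releases; the key fact is that Chromium-based browsers send sec-ch-ua client-hint headers while Firefox does not.

```python
# Representative UA strings for two browser families
CHROME_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36")
FIREFOX_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:134.0) "
              "Gecko/20100101 Firefox/134.0")

def matched_headers(ua):
    """Build a header set consistent with the browser the UA claims to be."""
    headers = {
        "User-Agent": ua,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }
    if "Chrome" in ua:
        # Chromium-based browsers also send client-hint headers
        headers["sec-ch-ua"] = '"Chromium";v="133", "Not(A:Brand";v="99"'
        headers["sec-ch-ua-mobile"] = "?0"
        headers["sec-ch-ua-platform"] = '"Windows"'
    return headers
```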

Referer Headers

Include realistic Referer headers to simulate navigation. When scraping product pages, the Referer should be the category or search results page:

def scrape_product(product_url, category_url):
    headers = get_random_headers()
    headers["Referer"] = category_url
    response = requests.get(product_url, headers=headers, proxies=proxy)
    return response

Strategy 4: Manage Sessions and Cookies

Accept and Use Cookies

Real browsers accept and return cookies. Your scraper should too:

session = requests.Session()
session.proxies = proxy

# First visit — get cookies
session.get("https://target-site.com", headers=get_random_headers())

# Subsequent visits — cookies are automatically included
for page in range(1, 100):
    response = session.get(
        f"https://target-site.com/products?page={page}",
        headers=get_random_headers()
    )
    human_delay()

Session Lifecycle

Don't use a single session for thousands of requests. Create new sessions periodically:

def create_scraping_session(proxy_config):
    """Create a new session with fresh cookies and a consistent identity."""
    session = requests.Session()
    session.proxies = proxy_config
    session.headers.update(get_random_headers())

    # Warm up the session by visiting the homepage
    session.get("https://target-site.com")
    time.sleep(random.uniform(1, 3))

    return session

# Rotate sessions every 50-100 requests; pick the threshold once per session
# rather than re-rolling it on every iteration
requests_made = 0
rotate_after = random.randint(50, 100)
session = create_scraping_session(proxy)

for url in urls_to_scrape:
    if requests_made >= rotate_after:
        session.close()
        session = create_scraping_session(get_new_proxy())
        requests_made = 0
        rotate_after = random.randint(50, 100)

    response = session.get(url)
    requests_made += 1
    human_delay()

Strategy 5: Handle Anti-Bot Challenges

Detect Soft Blocks

Not all blocks are obvious 403 errors. Watch for:

  • CAPTCHAs embedded in otherwise 200 responses
  • Redirects to challenge pages
  • Different content than expected (honeypot pages)
  • Empty or truncated responses
  • Soft bans that return outdated or dummy data

def is_blocked(response):
    """Check for signs of blocking beyond just HTTP status codes."""
    if response.status_code in (403, 429, 503):
        return True

    content = response.text.lower()
    block_indicators = [
        "captcha",
        "access denied",
        "please verify",
        "unusual traffic",
        "rate limit",
        "bot detection",
        "challenge-platform",
    ]

    return any(indicator in content for indicator in block_indicators)

Implement Backoff on Detection

When you detect a block, don't immediately retry. Back off, switch proxies, and wait:

def scrape_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        proxy = get_fresh_proxy()
        headers = get_random_headers()

        try:
            response = requests.get(url, proxies=proxy, headers=headers, timeout=30)

            if not is_blocked(response):
                return response

            # Blocked — exponential backoff
            wait_time = (2 ** attempt) * random.uniform(1, 2)
            print(f"Blocked on attempt {attempt + 1}. Waiting {wait_time:.1f}s...")
            time.sleep(wait_time)

        except requests.exceptions.RequestException as e:
            print(f"Request error on attempt {attempt + 1}: {e}")
            time.sleep(2 ** attempt)

    return None

Strategy 6: Use Headless Browsers When Needed

Some sites require JavaScript rendering to serve content. In these cases, use a headless browser with proxy support:

from playwright.sync_api import sync_playwright

def scrape_js_site(url, proxy_config):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://gate.hexproxies.com:8080",
                "username": "YOUR_USERNAME",
                "password": "YOUR_PASSWORD"
            }
        )

        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36",
            viewport={"width": 1920, "height": 1080},
            locale="en-US"
        )

        page = context.new_page()
        page.goto(url, wait_until="networkidle")

        # Wait for content to render
        page.wait_for_selector(".product-list", timeout=10000)

        content = page.content()
        browser.close()

        return content

Headless browsers are slower and more resource-intensive than HTTP requests, so use them only for sites that require JavaScript rendering. For static content, stick with raw HTTP requests through proxies.

Strategy 7: Respect robots.txt and Terms of Service

This isn't just an ethical consideration — it's practical. Sites that see you respecting their stated rules are less likely to invest effort in blocking you.

  • Check robots.txt before scraping
  • Honor crawl-delay directives
  • Avoid scraping sensitive areas (login pages, user profiles, admin sections)
  • Don't overload small sites that can't handle high traffic
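The first two items can be automated with the standard library's robotparser. This sketch parses an inline robots.txt for illustration; target-site.com and the rules are placeholders:

```python
from urllib.robotparser import RobotFileParser

def is_path_allowed(robots_txt, url, agent="*"):
    """Parse robots.txt content and check whether this agent may fetch the URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# Illustrative robots.txt that walls off account pages
robots = """User-agent: *
Disallow: /account/
Crawl-delay: 5
"""
print(is_path_allowed(robots, "https://target-site.com/products"))    # True
print(is_path_allowed(robots, "https://target-site.com/account/me"))  # False
```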

Quick Reference: Anti-Ban Checklist

Before running any scraping job, verify these items:

  1. Proxies configured with rotation appropriate for the target
  2. Rate limiting set to a conservative starting point
  3. Random delays between requests (not uniform)
  4. User-Agent rotation with realistic, current browser strings
  5. Full header sets matching the User-Agent browser
  6. Cookie handling enabled in your session
  7. Error detection for soft blocks (CAPTCHAs, redirects)
  8. Retry logic with exponential backoff
  9. robots.txt checked and respected
  10. Monitoring for success rate drops during the run

Common Mistakes That Lead to Bans

Scraping too fast from the start. Begin slowly and increase speed gradually while monitoring for blocks.

Using the same proxy for different sites. If a proxy gets banned on one site, that proxy's reputation may affect other sites too. Use separate proxy pools per target.

Ignoring response content. Checking only the HTTP status code misses soft blocks. Always validate the actual response content.

Not rotating anything besides IP. If 1000 different IPs all send the same User-Agent, headers, and access patterns, the site can cluster them as a single bot operation.

Running scrapers 24/7 without monitoring. Automated scraping should include automated monitoring. Alert yourself when success rates drop so you can adjust before a full ban.
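A rolling success-rate tracker is enough to catch most of these drops early. This is a minimal sketch; the window size and alert threshold are illustrative choices, not recommendations:

```python
import collections

class SuccessRateMonitor:
    """Rolling success-rate tracker; window and threshold are illustrative."""
    def __init__(self, window=100, alert_below=0.85):
        self.results = collections.deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, success):
        self.results.append(1 if success else 0)

    def success_rate(self):
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def should_alert(self):
        # Wait for enough samples before trusting the rate
        return len(self.results) >= 20 and self.success_rate() < self.alert_below
```

Record each response as it arrives, for example with the is_blocked() check above, and pause or rotate proxies when should_alert() fires.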

Conclusion

Avoiding IP bans when web scraping is about combining multiple strategies rather than relying on any single technique. The right proxies provide the foundation, but rate limiting, header rotation, session management, and behavioral mimicry all play critical roles.

Start with quality rotating residential proxies for the best baseline success rate, implement the rate limiting and header strategies described above, and monitor your results continuously. When you see success rates dropping, adjust your approach before escalating to a full ban.

For proxy setup instructions, see our rotating proxy setup guide. For choosing the right proxy type for your scraping needs, check out our web scraping proxy comparison.
