Why Scrapy for Proxy-Powered Crawling
Scrapy is the most battle-tested crawling framework in the Python ecosystem. Its middleware architecture was designed for exactly the kind of request manipulation that proxy integration requires: intercepting outgoing requests to inject proxy settings, inspecting responses for ban signals, and rerouting retries through fresh IPs. Unlike simpler HTTP clients, Scrapy manages request queues, concurrency limits, politeness delays, and retry scheduling out of the box, so your proxy integration layer can focus purely on IP management.
Scrapy's downloader middleware pipeline processes every request and response in order, which makes it the ideal place to implement proxy rotation logic. You can write a single middleware class that assigns a proxy to each request, detects blocks in responses, and feeds blocked URLs back into the scheduler with a new IP assignment, all without modifying your spider code.
Complete Middleware and Configuration Example
```python
# middlewares.py
import os
import random
import string


class HexProxyMiddleware:
    def __init__(self):
        self.proxy_user = os.environ["PROXY_USER"]
        self.proxy_pass = os.environ["PROXY_PASS"]
        self.proxy_host = "gate.hexproxies.com:8080"

    def _session_id(self) -> str:
        return "".join(random.choices(string.ascii_lowercase, k=8))

    def process_request(self, request, spider):
        # Reuse an existing session if one is pinned to this request,
        # otherwise generate a fresh one and persist it in meta.
        session = request.meta.get("proxy_session", self._session_id())
        request.meta["proxy_session"] = session
        user_with_session = f"{self.proxy_user}-session-{session}"
        request.meta["proxy"] = (
            f"http://{user_with_session}:{self.proxy_pass}@{self.proxy_host}"
        )

    def process_response(self, request, response, spider):
        if response.status in (403, 429, 503):
            # Drop the pinned session so the retry gets a fresh IP,
            # and bypass the duplicate filter so the retry is scheduled.
            request.meta.pop("proxy_session", None)
            request.dont_filter = True
            return request
        return response
```

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.HexProxyMiddleware": 350,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 400,
}
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_TIMEOUT = 30
RETRY_TIMES = 3
```
Scrapy-Specific Proxy Architecture
Scrapy's HttpProxyMiddleware reads the `proxy` key from `request.meta` and applies it to the underlying Twisted transport. By placing your custom middleware at priority 350 (before the built-in HttpProxyMiddleware at 400), you ensure that every request has a proxy assigned before Scrapy attempts the connection. Getting this ordering wrong is the most common reason requests silently go direct instead of through the proxy.
Common Pitfalls with Scrapy Proxies
The biggest mistake is setting the proxy globally in `settings.py` via environment variables. While this works for simple cases, it means every request uses the same proxy connection, which defeats the purpose of residential IP rotation. Always assign proxies per-request in middleware so each request can get a fresh IP.
Another pitfall is Scrapy's duplicate filter. When a blocked request is retried with a new proxy, Scrapy may filter it as a duplicate because the URL hasn't changed. Set `request.dont_filter = True` on retried requests to bypass this filter. Without this, your retry logic silently drops blocked URLs.
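The flip side of bypassing the duplicate filter is that a persistently blocked URL can loop forever. A minimal sketch of capping retries, tracking the count in `request.meta` (the `proxy_retry_count` key and the limit of 3 are names and values chosen here for illustration, not Scrapy built-ins):

```python
# Sketch: cap per-request proxy retries so blocked URLs are not
# retried indefinitely. "proxy_retry_count" is an illustrative meta
# key, not a Scrapy built-in.

MAX_PROXY_RETRIES = 3


def retry_with_new_proxy(request, max_retries: int = MAX_PROXY_RETRIES):
    """Return a retry copy of the request, or None if retries are exhausted."""
    retries = request.meta.get("proxy_retry_count", 0)
    if retries >= max_retries:
        return None  # give up; let the caller pass the response through or log it
    new_meta = dict(request.meta, proxy_retry_count=retries + 1)
    new_meta.pop("proxy_session", None)  # force a fresh IP on the retry
    # request.replace() returns a copy; dont_filter bypasses the dupe filter.
    return request.replace(meta=new_meta, dont_filter=True)
```

In `process_response`, call this helper instead of mutating the request directly, and return the original response when it yields `None` so exhausted URLs surface in your logs rather than vanishing.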
Ban Detection Beyond Status Codes
Status codes alone are insufficient for ban detection. Many sites return 200 with a CAPTCHA page or empty content when they detect automated traffic. Implement content-based checks in your middleware's `process_response` method. Check for known CAPTCHA indicators, abnormally short response bodies, or redirect chains to challenge pages. When detected, rotate the IP and retry just as you would for a 403.
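A sketch of such a content check, factored into a helper the middleware can call from `process_response`. The marker strings and the size threshold are illustrative assumptions; tune them against the sites you actually crawl:

```python
# Sketch: heuristic ban detection for 200-status responses.
# CAPTCHA_MARKERS and MIN_BODY_BYTES are assumed values for
# illustration -- adjust them per target site.

CAPTCHA_MARKERS = (b"captcha", b"are you a robot", b"unusual traffic")
MIN_BODY_BYTES = 512  # bodies shorter than this are treated as suspect


def looks_banned(response) -> bool:
    """Return True if a 200 response body looks like a block page."""
    body = response.body.lower()
    if any(marker in body for marker in CAPTCHA_MARKERS):
        return True
    if len(body) < MIN_BODY_BYTES:
        return True
    return False
```

In `process_response`, a check like `if response.status == 200 and looks_banned(response):` would then trigger the same session-drop-and-retry path used for a 403.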
Geo-Targeted Crawling
For localized content crawling, embed the target country in your proxy username using Hex Proxies' geo targeting syntax. Your middleware can read a `target_country` meta key from the request and construct the appropriate username dynamically, allowing a single spider to crawl region-specific content by simply yielding requests with different meta values.
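A minimal sketch of that username construction, assuming a `-country-XX` suffix syntax (confirm the exact format against the Hex Proxies documentation):

```python
# Sketch: compose a geo-targeted proxy username from request meta.
# The "-country-XX" suffix is an assumed syntax -- verify it against
# your provider's docs before relying on it.
from typing import Optional


def build_proxy_user(base_user: str, session: str,
                     country: Optional[str] = None) -> str:
    """Append session and optional country targeting to the proxy username."""
    user = f"{base_user}-session-{session}"
    if country:
        user += f"-country-{country.lower()}"
    return user
```

In `process_request`, the middleware would read `country = request.meta.get("target_country")` and pass it to this helper, so a spider targets German results simply by yielding `Request(url, meta={"target_country": "de"})`.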