Why Scrapy for Proxy-Powered Crawling
Scrapy is the most battle-tested crawling framework in the Python ecosystem. Its middleware architecture was designed for exactly the kind of request manipulation that proxy integration requires: intercepting outgoing requests to inject proxy settings, inspecting responses for ban signals, and rerouting retries through fresh IPs. Unlike simpler HTTP clients, Scrapy manages request queues, concurrency limits, politeness delays, and retry scheduling out of the box, so your proxy integration layer can focus purely on IP management.
Scrapy's downloader middleware pipeline processes every request and response in order, which makes it the ideal place to implement proxy rotation logic. You can write a single middleware class that assigns a proxy to each request, detects blocks in responses, and feeds blocked URLs back into the scheduler with a new IP assignment, all without modifying your spider code.
Complete Middleware and Configuration Example
```python
# middlewares.py
import os
import random
import string


class HexProxyMiddleware:
    def __init__(self):
        self.proxy_user = os.environ["PROXY_USER"]
        self.proxy_pass = os.environ["PROXY_PASS"]
        self.proxy_host = "gate.hexproxies.com:8080"

    def _session_id(self) -> str:
        return "".join(random.choices(string.ascii_lowercase, k=8))

    def process_request(self, request, spider):
        # Reuse an existing session if one is pinned to this request,
        # otherwise generate a fresh one and persist it in meta.
        session = request.meta.get("proxy_session", self._session_id())
        request.meta["proxy_session"] = session
        user_with_session = f"{self.proxy_user}-session-{session}"
        request.meta["proxy"] = (
            f"http://{user_with_session}:{self.proxy_pass}@{self.proxy_host}"
        )

    def process_response(self, request, response, spider):
        if response.status in (403, 429, 503):
            # Drop the pinned session so the retry gets a fresh IP,
            # and bypass the duplicate filter so the retry is scheduled.
            request.meta.pop("proxy_session", None)
            request.dont_filter = True
            return request
        return response
```

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.HexProxyMiddleware": 350,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 400,
}
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_TIMEOUT = 30
RETRY_TIMES = 3
```
Scrapy-Specific Proxy Architecture
Scrapy's HttpProxyMiddleware reads the `proxy` key from `request.meta` and applies it to the underlying Twisted transport. By placing your custom middleware at priority 350 (before the built-in HttpProxyMiddleware at 400), you ensure that every request has a proxy assigned before Scrapy attempts the connection. Getting this ordering wrong is the most common reason requests silently go direct instead of through the proxy.
Common Pitfalls with Scrapy Proxies
The biggest mistake is setting the proxy globally in `settings.py` via environment variables. While this works for simple cases, it means every request uses the same proxy connection, which defeats the purpose of residential IP rotation. Always assign proxies per-request in middleware so each request can get a fresh IP.
Another pitfall is Scrapy's duplicate filter. When a blocked request is retried with a new proxy, Scrapy may filter it as a duplicate because the URL hasn't changed. Set `request.dont_filter = True` on retried requests to bypass this filter. Without this, your retry logic silently drops blocked URLs.
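The flip side of bypassing the duplicate filter is that a persistently blocked URL can loop forever. A minimal sketch of capping retries, tracking the count in `request.meta` (the `proxy_retry_count` key and the limit of 3 are names and values chosen here for illustration, not Scrapy built-ins):

```python
# Sketch: cap per-request proxy retries so blocked URLs are not
# retried indefinitely. "proxy_retry_count" is an illustrative meta
# key, not a Scrapy built-in.

MAX_PROXY_RETRIES = 3


def retry_with_new_proxy(request, max_retries: int = MAX_PROXY_RETRIES):
    """Return a retry copy of the request, or None if retries are exhausted."""
    retries = request.meta.get("proxy_retry_count", 0)
    if retries >= max_retries:
        return None  # give up; let the caller pass the response through or log it
    new_meta = dict(request.meta, proxy_retry_count=retries + 1)
    new_meta.pop("proxy_session", None)  # force a fresh IP on the retry
    # request.replace() returns a copy; dont_filter bypasses the dupe filter.
    return request.replace(meta=new_meta, dont_filter=True)
```

In `process_response`, call this helper instead of mutating the request directly, and return the original response when it yields `None` so exhausted URLs surface in your logs rather than vanishing.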
Ban Detection Beyond Status Codes
Status codes alone are insufficient for ban detection. Many sites return 200 with a CAPTCHA page or empty content when they detect automated traffic. Implement content-based checks in your middleware's `process_response` method. Check for known CAPTCHA indicators, abnormally short response bodies, or redirect chains to challenge pages. When detected, rotate the IP and retry just as you would for a 403.
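A sketch of such a content check, factored into a helper the middleware can call from `process_response`. The marker strings and the size threshold are illustrative assumptions; tune them against the sites you actually crawl:

```python
# Sketch: heuristic ban detection for 200-status responses.
# CAPTCHA_MARKERS and MIN_BODY_BYTES are assumed values for
# illustration -- adjust them per target site.

CAPTCHA_MARKERS = (b"captcha", b"are you a robot", b"unusual traffic")
MIN_BODY_BYTES = 512  # bodies shorter than this are treated as suspect


def looks_banned(response) -> bool:
    """Return True if a 200 response body looks like a block page."""
    body = response.body.lower()
    if any(marker in body for marker in CAPTCHA_MARKERS):
        return True
    if len(body) < MIN_BODY_BYTES:
        return True
    return False
```

In `process_response`, a check like `if response.status == 200 and looks_banned(response):` would then trigger the same session-drop-and-retry path used for a 403.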
Geo-Targeted Crawling
For localized content crawling, embed the target country in your proxy username using Hex Proxies' geo targeting syntax. Your middleware can read a `target_country` meta key from the request and construct the appropriate username dynamically, allowing a single spider to crawl region-specific content by simply yielding requests with different meta values.
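A minimal sketch of that username construction, assuming a `-country-XX` suffix syntax (confirm the exact format against the Hex Proxies documentation):

```python
# Sketch: compose a geo-targeted proxy username from request meta.
# The "-country-XX" suffix is an assumed syntax -- verify it against
# your provider's docs before relying on it.
from typing import Optional


def build_proxy_user(base_user: str, session: str,
                     country: Optional[str] = None) -> str:
    """Append session and optional country targeting to the proxy username."""
    user = f"{base_user}-session-{session}"
    if country:
        user += f"-country-{country.lower()}"
    return user
```

In `process_request`, the middleware would read `country = request.meta.get("target_country")` and pass it to this helper, so a spider targets German results simply by yielding `Request(url, meta={"target_country": "de"})`.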