
Python Scrapy Proxy Integration

Configure Hex Proxies in Scrapy with downloader middleware for automatic proxy rotation, retry handling, and geo-targeted crawling at scale.

Why Scrapy for Proxy-Powered Crawling

Scrapy is the most battle-tested crawling framework in the Python ecosystem. Its middleware architecture was designed for exactly the kind of request manipulation that proxy integration requires: intercepting outgoing requests to inject proxy settings, inspecting responses for ban signals, and rerouting retries through fresh IPs. Unlike simpler HTTP clients, Scrapy manages request queues, concurrency limits, politeness delays, and retry scheduling out of the box, so your proxy integration layer can focus purely on IP management.

Scrapy's downloader middleware pipeline processes every request and response in order, which makes it the ideal place to implement proxy rotation logic. You can write a single middleware class that assigns a proxy to each request, detects blocks in responses, and feeds blocked URLs back into the scheduler with a new IP assignment, all without modifying your spider code.

Complete Middleware and Configuration Example

```python
# middlewares.py
import os
import random
import string


class HexProxyMiddleware:
    def __init__(self):
        self.proxy_user = os.environ["PROXY_USER"]
        self.proxy_pass = os.environ["PROXY_PASS"]
        self.proxy_host = "gate.hexproxies.com:8080"

    def _session_id(self) -> str:
        return "".join(random.choices(string.ascii_lowercase, k=8))

    def process_request(self, request, spider):
        # Reuse an existing session if the spider pinned one, otherwise mint
        # a new one and persist it so retries can clear it deliberately.
        session = request.meta.get("proxy_session", self._session_id())
        request.meta["proxy_session"] = session
        user_with_session = f"{self.proxy_user}-session-{session}"
        request.meta["proxy"] = f"http://{user_with_session}:{self.proxy_pass}@{self.proxy_host}"

    def process_response(self, request, response, spider):
        if response.status in (403, 429, 503):
            # Drop the session so the next process_request assigns a fresh IP,
            # and bypass the duplicate filter so the retry isn't dropped.
            request.meta.pop("proxy_session", None)
            request.dont_filter = True
            return request
        return response
```

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.HexProxyMiddleware": 350,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 400,
}
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_TIMEOUT = 30
RETRY_TIMES = 3
```

Scrapy-Specific Proxy Architecture

Scrapy's HttpProxyMiddleware reads the `proxy` key from `request.meta` and applies it to the underlying Twisted transport. By placing your custom middleware at priority 350 (before the built-in HttpProxyMiddleware at 400), you ensure that every request has a proxy assigned before Scrapy attempts the connection. This ordering is critical: getting it wrong is the most common reason requests silently bypass the proxy and go direct.

Common Pitfalls with Scrapy Proxies

The biggest mistake is setting the proxy globally in `settings.py` via environment variables. While this works for simple cases, it means every request uses the same proxy connection, which defeats the purpose of residential IP rotation. Always assign proxies per-request in middleware so each request can get a fresh IP.

Another pitfall is Scrapy's duplicate filter. When a blocked request is retried with a new proxy, Scrapy may filter it as a duplicate because the URL hasn't changed. Set `request.dont_filter = True` on retried requests to bypass this filter. Without this, your retry logic silently drops blocked URLs.

Ban Detection Beyond Status Codes

Status codes alone are insufficient for ban detection. Many sites return 200 with a CAPTCHA page or empty content when they detect automated traffic. Implement content-based checks in your middleware's `process_response` method. Check for known CAPTCHA indicators, abnormally short response bodies, or redirect chains to challenge pages. When detected, rotate the IP and retry just as you would for a 403.
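As a sketch, a content-level check might look like the following. The CAPTCHA marker strings and the 512-byte body threshold are illustrative assumptions to tune per target site, not values prescribed by Scrapy or Hex Proxies:

```python
# Illustrative content-based ban check; marker strings and the minimum
# body length are assumptions to adjust for each target site.
CAPTCHA_MARKERS = (b"g-recaptcha", b"cf-challenge", b"are you a robot")
MIN_BODY_BYTES = 512  # bodies shorter than this are treated as suspicious


def looks_blocked(status: int, body: bytes) -> bool:
    """Return True if the response should be treated as a soft ban."""
    if status in (403, 429, 503):
        return True
    lowered = body.lower()
    if any(marker in lowered for marker in CAPTCHA_MARKERS):
        return True
    return len(body) < MIN_BODY_BYTES


# In process_response: if looks_blocked(response.status, response.body),
# clear the proxy session and re-queue the request exactly as for a 403.
```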

Geo-Targeted Crawling

For localized content crawling, embed the target country in your proxy username using Hex Proxies' geo targeting syntax. Your middleware can read a `target_country` meta key from the request and construct the appropriate username dynamically, allowing a single spider to crawl region-specific content by simply yielding requests with different meta values.
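A minimal sketch of that username construction follows. The `-country-<code>` and `-session-<id>` suffixes follow the pattern described in this guide; confirm the exact syntax for your plan in the Hex Proxies dashboard:

```python
# Sketch: build a per-request proxy URL with optional geo targeting.
def build_proxy_url(user, password, host, session, country=None):
    if country:
        user = f"{user}-country-{country.lower()}"
    user = f"{user}-session-{session}"
    return f"http://{user}:{password}@{host}"


# In process_request, read the per-request target from meta:
#   country = request.meta.get("target_country")
#   request.meta["proxy"] = build_proxy_url(self.proxy_user, self.proxy_pass,
#                                           self.proxy_host, session, country)
```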

Integration Steps

1. Create a custom downloader middleware

Write a middleware class with process_request and process_response methods. Assign a unique proxy session per request in process_request and handle ban detection in process_response.

2. Configure middleware ordering in settings.py

Register your custom middleware at priority 350 so it runs before Scrapy's built-in HttpProxyMiddleware at 400. Set CONCURRENT_REQUESTS and DOWNLOAD_TIMEOUT to match your proxy plan limits.

3. Implement ban detection and automatic retry

Check response status codes and body content for block signals. On detection, strip the proxy session from request.meta, set dont_filter=True, and return the request to the scheduler for retry with a fresh IP.

4. Test with a small crawl and monitor proxy metrics

Run your spider on 50-100 URLs and check the proxy dashboard for success rate, latency distribution, and IP diversity. Adjust concurrency and session duration based on observed block rates.
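A test run can be capped with Scrapy's built-in CloseSpider settings rather than a modified spider; `myspider` here is a placeholder name:

```shell
# Stop after 100 pages and run at reduced concurrency for the test crawl.
scrapy crawl myspider -s CLOSESPIDER_PAGECOUNT=100 -s CONCURRENT_REQUESTS=8
```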

Operational Tips

Keep sessions stable for workflows that depend on consistent identity. For high-volume collection, rotate IPs and reduce concurrency if you see timeouts or 403 responses.

  • Prefer sticky sessions for multi-step flows (auth, checkout, forms).
  • Rotate per request for scale and broad coverage.
  • Use timeouts and retries to handle transient failures.
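With a session-aware middleware like the one above, the sticky-versus-rotating choice reduces to whether the spider pins a `proxy_session` value in `request.meta`. A minimal illustration, where plain dicts stand in for `request.meta`:

```python
import random
import string


def new_session_id(k=8):
    return "".join(random.choices(string.ascii_lowercase, k=k))


# Sticky: reuse one session ID across a multi-step flow so every request
# in the flow exits through the same residential IP.
login_session = new_session_id()
login_meta = {"proxy_session": login_session}
checkout_meta = {"proxy_session": login_session}  # same IP as the login step

# Rotating: omit proxy_session and let the middleware mint a fresh ID
# (and therefore a fresh IP) for each request.
crawl_meta = {}
```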

Frequently Asked Questions

How do I rotate proxies per-request in Scrapy instead of using one proxy for all requests?

Assign the proxy in a downloader middleware process_request method rather than in settings.py. Generate a unique session ID per request and embed it in the proxy username. This tells the Hex Proxies gateway to assign a different residential IP for each session ID.

Why are my Scrapy retries not using a new proxy IP?

Scrapy reuses request.meta across retries by default. Your middleware must clear the proxy_session key from request.meta in process_response when a block is detected, forcing the next process_request call to generate a fresh session ID and therefore a new IP.

Can I target specific countries with Scrapy and Hex Proxies?

Yes. Append the country code to your proxy username (e.g., user-country-us) in your middleware. You can read a target_country key from request.meta to make this dynamic per-request, allowing one spider to crawl localized content across multiple regions.

Ready to Integrate?

Start using residential proxies with Python Scrapy today.
