
Python Scrapy Proxy

Complete Scrapy proxy middleware integration example for Python with Hex Proxies. Includes authentication, rotation, and error handling.

Install: pip install scrapy
Python / Scrapy
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.HexProxyMiddleware": 350,
}

# middlewares.py
class HexProxyMiddleware:
    PROXY_URL = "http://user:pass@gate.hexproxies.com:8080"

    def process_request(self, request, spider):
        request.meta["proxy"] = self.PROXY_URL
        spider.logger.debug(f"Using proxy: {self.PROXY_URL}")

# spider.py
import scrapy

class IPSpider(scrapy.Spider):
    name = "ip_check"
    start_urls = ["https://httpbin.org/ip"]

    def parse(self, response):
        data = response.json()
        self.logger.info(f"Origin IP: {data['origin']}")
        yield {"ip": data["origin"]}

Why Scrapy for Proxy Work

Scrapy is not just an HTTP library with proxy support; it is a complete web crawling framework built from the ground up for large-scale data extraction. Its architecture of spiders, middlewares, pipelines, and item processors provides the structure that ad-hoc scripts lack when projects grow beyond a few hundred pages. The downloader middleware system is specifically designed for proxy integration, giving you a clean hook to inject proxy configuration, implement rotation logic, and handle proxy-specific errors without polluting your spider's parsing code.

Scrapy's Twisted-based async engine processes hundreds of concurrent requests through a single Python process. This event-driven architecture meshes well with proxy gateways like gate.hexproxies.com:8080 because it maintains persistent connections and efficiently manages the I/O wait inherent in proxied requests. Scrapy also provides built-in support for request deduplication, crawl depth limiting, robots.txt compliance, and export to JSON, CSV, or databases, covering the full lifecycle from proxied fetch to structured data output.

Configuration Patterns

Scrapy's middleware system offers three levels of proxy integration sophistication. The simplest is a middleware that stamps every request with `request.meta["proxy"]`, as shown above. The intermediate approach adds rotation logic by maintaining a pool of proxy endpoints and cycling through them. The advanced pattern implements a stateful middleware that tracks success rates per proxy, removes failing proxies from rotation, and re-tests them periodically.
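A minimal sketch of the intermediate pattern, assuming a small pool of gateway endpoints (the second port below is a placeholder for whatever additional endpoints your plan provides):

```python
from itertools import cycle


class RotatingHexProxyMiddleware:
    """Cycle through a pool of proxy endpoints, one per outgoing request."""

    # Placeholder endpoints -- substitute your own credentials and ports.
    PROXY_POOL = [
        "http://user:pass@gate.hexproxies.com:8080",
        "http://user:pass@gate.hexproxies.com:8081",
    ]

    def __init__(self):
        self._proxies = cycle(self.PROXY_POOL)

    def process_request(self, request, spider):
        # Stamp each request with the next proxy in round-robin order.
        request.meta["proxy"] = next(self._proxies)
```

Because `cycle` never exhausts, the pool wraps around automatically; the stateful advanced pattern replaces `cycle` with a weighted or health-checked selection.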

Scrapy settings control concurrency behavior that directly affects proxy performance. Set `CONCURRENT_REQUESTS=16` to limit total parallelism, `CONCURRENT_REQUESTS_PER_DOMAIN=8` to avoid hammering individual targets, and `DOWNLOAD_DELAY=0.5` to add breathing room between requests. The `RETRY_TIMES=3` and `RETRY_HTTP_CODES=[500, 502, 503, 504, 408, 429]` settings handle transient proxy and target failures automatically.
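In settings.py these are plain assignments; the values below mirror the figures above and are starting points for proxied crawls, not Scrapy's defaults:

```python
# settings.py -- concurrency and retry tuning for proxied crawls
CONCURRENT_REQUESTS = 16             # total parallel requests across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap per target site
DOWNLOAD_DELAY = 0.5                 # seconds between requests to the same site
RETRY_ENABLED = True
RETRY_TIMES = 3                      # extra attempts per failed request
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
```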

Common Pitfalls

Scrapy newcomers often register their proxy middleware at the wrong priority number, causing it to run before or after critical built-in middlewares. For `process_request`, the downloader middleware stack runs in ascending priority order: lower numbers run first. Priority 350 puts your middleware ahead of the built-in UserAgentMiddleware (500) and, crucially, ahead of HttpProxyMiddleware (750), which reads `request.meta["proxy"]` and converts any credentials embedded in the proxy URL into a Proxy-Authorization header. If your middleware's priority is above 750, the proxy setting arrives too late and authenticated requests fail.
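For orientation, the registration in settings.py sits alongside these built-in priorities (the commented values come from Scrapy's default DOWNLOADER_MIDDLEWARES_BASE):

```python
# settings.py -- custom proxy middleware registered at priority 350
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.HexProxyMiddleware": 350,
    # Built-ins already enabled by default, listed only for reference:
    #   UserAgentMiddleware: 500, RetryMiddleware: 550,
    #   HttpCompressionMiddleware: 590, HttpProxyMiddleware: 750
}
```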

Another common issue is not handling the `process_exception` method in your middleware. When a proxy connection fails entirely (not an HTTP error, but a TCP-level failure), Scrapy calls `process_exception` rather than `process_response`. If your middleware does not implement this method, the exception propagates to the spider's `errback` or is merely logged by Scrapy's default handling. Implement it to log the failure, optionally retry with a different proxy, or yield a meaningful error item.
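One way to implement it, extending the middleware from the snippet above and retrying a bounded number of times (the retry cap and the `proxy_retries` meta key are illustrative choices, not Scrapy conventions):

```python
class HexProxyMiddleware:
    PROXY_URL = "http://user:pass@gate.hexproxies.com:8080"
    MAX_PROXY_RETRIES = 2  # illustrative cap, not a built-in Scrapy setting

    def process_request(self, request, spider):
        request.meta["proxy"] = self.PROXY_URL

    def process_exception(self, request, exception, spider):
        # Called on TCP-level failures (connection refused, timeout),
        # not on HTTP error responses.
        retries = request.meta.get("proxy_retries", 0)
        if retries < self.MAX_PROXY_RETRIES:
            spider.logger.warning(
                f"Proxy failure ({exception!r}), retry {retries + 1}"
            )
            retry = request.copy()
            retry.meta["proxy_retries"] = retries + 1
            retry.dont_filter = True  # bypass the duplicate filter on retry
            return retry  # returning a Request re-schedules it
        # Returning None lets default handling (errback/logging) take over.
```

Returning a `Request` from `process_exception` is the documented way to re-schedule a download; returning `None` passes the exception along the chain.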

Performance Optimization

Scrapy's built-in AutoThrottle extension is a powerful tool for proxy workloads. Enable it with `AUTOTHROTTLE_ENABLED=True` and it automatically adjusts download delay based on proxy and target server latency. This prevents overwhelming your proxy allocation during traffic spikes while maximizing throughput during quiet periods. Set `AUTOTHROTTLE_TARGET_CONCURRENCY=8.0` as a starting point and monitor the adjustment behavior in your logs.
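The corresponding settings.py fragment (the target concurrency is the suggested starting point above, not Scrapy's default of 1.0):

```python
# settings.py -- adaptive throttling for proxied crawls
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0  # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = True              # log every throttle adjustment decision
```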

For very large crawls spanning millions of pages, enable Scrapy's JOBDIR setting for pause/resume capability. Combined with a proxy middleware that logs per-request proxy performance, you can analyze which proxy configurations yield the highest success rates and lowest latency after each crawl segment. Feed these metrics back into your middleware to dynamically weight proxy selection toward your best-performing configurations.
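Pause/resume only needs JOBDIR pointed at a persistent directory (the path below is an arbitrary example); interrupt the crawl with a single Ctrl-C and rerun the same command to resume from the saved state:

```python
# settings.py -- persist scheduler state so the job can be paused and resumed
JOBDIR = "crawls/ip_check-run1"  # example path; use one directory per job
```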

Tips

  1. Implement a custom middleware for full control over proxy rotation logic.
  2. Use CONCURRENT_REQUESTS and DOWNLOAD_DELAY settings to respect proxy rate limits.
  3. Log proxy responses to detect blocks early and switch sessions automatically.
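The last tip can be sketched as a `process_response` hook that watches for block signals (the status codes treated as blocks here are assumptions about a typical target, not universal rules):

```python
class BlockDetectionMiddleware:
    """Log responses that look like blocks so sessions can be rotated early."""

    BLOCK_CODES = {403, 429}  # assumed block signals; tune per target site

    def process_response(self, request, response, spider):
        if response.status in self.BLOCK_CODES:
            spider.logger.warning(
                f"Possible block ({response.status}) "
                f"via {request.meta.get('proxy')}"
            )
        return response  # always pass the response down the chain
```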

Ready to Integrate?

Get proxy credentials and start coding in minutes.