
Scrapy Proxy Setup

Configure proxy middleware in Scrapy for large-scale web scraping. Rotating proxy setup with Hex Proxies.

Requirements

  • Python 3.8+
  • Scrapy 2.8+
  • Twisted (installed with Scrapy)

Installation

pip install scrapy

Code Example

# settings.py - Enable proxy middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

# middlewares.py - Custom proxy middleware
class ProxyMiddleware:
    def process_request(self, request, spider):
        request.meta['proxy'] = (
            'http://user:pass@gate.hexproxies.com:8080'
        )

# spider.py - Using proxy in spider
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://httpbin.org/ip']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={'proxy': (
                    'http://user:pass@'
                    'gate.hexproxies.com:8080'
                )},
            )

    def parse(self, response):
        self.logger.info(response.text)

Setup Steps

1. Create proxy middleware
   Write a ProxyMiddleware class with process_request that sets request.meta["proxy"] to your Hex Proxies URL.

2. Register in settings
   Add your middleware to DOWNLOADER_MIDDLEWARES with priority 350, before HttpProxyMiddleware at 400.

3. Tune concurrency
   Set CONCURRENT_REQUESTS=50 and DOWNLOAD_TIMEOUT=60 in settings.py for proxied traffic.

4. Test with a simple spider
   Create a spider that hits httpbin.org/ip and verify the response shows a Hex Proxies IP.

5. Add retry handling
   Implement process_exception in your middleware to handle connection failures with fresh proxy sessions.

6. Enable monitoring
   Use the Scrapy stats collector to track request counts, success rates, and error types per crawl.
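The monitoring step above can be sketched as a stats-aware version of the proxy middleware. The stat key names (proxy/requests, proxy/status/...) are illustrative choices, not a Scrapy convention:

```python
# middlewares.py - proxy middleware that also records per-crawl stats
# (stat keys like 'proxy/requests' are illustrative names, not Scrapy defaults)
class ProxyMiddleware:
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy hands us the crawler, whose stats collector we keep
        return cls(crawler.stats)

    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://user:pass@gate.hexproxies.com:8080'
        self.stats.inc_value('proxy/requests')

    def process_response(self, request, response, spider):
        # Count responses per status code so 403/429 spikes show up in crawl stats
        self.stats.inc_value('proxy/status/%d' % response.status)
        return response
```

The counters appear in the crawl-finished stats dump alongside Scrapy's built-in values, so success rates can be read off per crawl without extra tooling.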

Configuration Options

  • Proxy URL -- http://user:pass@gate.hexproxies.com:8080, set in middleware process_request.
  • Middleware Priority -- 350 for custom ProxyMiddleware, before built-in HttpProxyMiddleware at 400.
  • Concurrent Requests -- CONCURRENT_REQUESTS controls parallelism. Start at 20, scale to 50-100.
  • Download Timeout -- DOWNLOAD_TIMEOUT=60 to fail stalled proxied requests quickly (Scrapy's default is 180 seconds).
  • AutoThrottle -- AUTOTHROTTLE_ENABLED=True for adaptive pacing that respects destination rate limits.

Best Practices

  • Use a downloader middleware for proxy routing rather than setting meta on every request.
  • Set CONCURRENT_REQUESTS between 20-100 based on your proxy plan and destination tolerance.
  • Enable AUTOTHROTTLE to adaptively pace requests and avoid triggering destination rate limits.
  • Store proxy credentials in Scrapy settings loaded from environment variables.
  • Implement process_exception in middleware for custom retry logic on proxy connection failures.
  • Use Scrapy stats collector to monitor proxy success rates and error distributions per crawl.
  • Set DOWNLOAD_TIMEOUT=60 for proxied traffic; Scrapy's 180-second default lets a stalled proxy connection hold a concurrency slot for too long.
  • Use HTTPCACHE_ENABLED=True during development to avoid wasting proxy bandwidth on repeated requests.
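The credentials practice above might look like this. The environment variable names PROXY_USER and PROXY_PASS are placeholders, and the fallback values only mirror the doc's example URL:

```python
# settings.py - build the proxy URL from environment variables
# (PROXY_USER / PROXY_PASS are illustrative names; defaults are placeholders)
import os

PROXY_USER = os.environ.get('PROXY_USER', 'user')
PROXY_PASS = os.environ.get('PROXY_PASS', 'pass')
PROXY_URL = 'http://%s:%s@gate.hexproxies.com:8080' % (PROXY_USER, PROXY_PASS)

# middlewares.py - read the URL from settings instead of hardcoding it
class ProxyMiddleware:
    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py
        return cls(crawler.settings.get('PROXY_URL'))

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url
```

This keeps credentials out of version control while the middleware stays identical across projects.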

Scrapy Proxy Setup

Scrapy's middleware architecture is purpose-built for proxy integration at scale. Unlike requests or aiohttp where you configure proxies per-request, Scrapy's downloader middleware pipeline intercepts every outgoing request automatically. This means you write the proxy logic once and every spider in your project inherits it without per-spider configuration.

Scrapy uses Twisted's asynchronous reactor under the hood, making it naturally concurrent without the complexity of asyncio. Combined with Hex Proxies, a single Scrapy process with CONCURRENT_REQUESTS=50 can handle tens of thousands of pages per hour with automatic proxy rotation.

Prerequisites

Before you begin, make sure you have:

  • An active Hex Proxies account with proxy credentials
  • Python 3.8+
  • Scrapy 2.8+
  • Twisted (installed automatically with Scrapy)

Installation

pip install scrapy

Basic Proxy Configuration

Scrapy supports proxies via the request.meta['proxy'] field or through a custom downloader middleware. The middleware approach is recommended for consistent proxy application across all spiders.

# settings.py - Enable proxy middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

# middlewares.py - Custom proxy middleware
class ProxyMiddleware:
    def process_request(self, request, spider):
        request.meta['proxy'] = (
            'http://user:pass@gate.hexproxies.com:8080'
        )

# spider.py - Using proxy in spider
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://httpbin.org/ip']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={'proxy': 'http://user:pass@gate.hexproxies.com:8080'},
            )

    def parse(self, response):
        self.logger.info(response.text)

Middleware Priority and Ordering

Set your ProxyMiddleware priority to 350, which runs before Scrapy's built-in HttpProxyMiddleware at 400. This ensures your proxy URL is set before Scrapy processes the request. If you use retry middleware, set RetryMiddleware priority higher (numerically lower) so retries also go through the proxy.
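Written out in settings.py, that ordering could look like the sketch below. The RetryMiddleware priority of 300 follows the advice above and is illustrative; Scrapy's default priority for it is 550:

```python
# settings.py - explicit middleware ordering
# (RetryMiddleware at 300 is illustrative; Scrapy's default for it is 550)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 300,
    'myproject.middlewares.ProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
```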

Automatic Retry with Proxy Rotation

Implement process_exception in your middleware to catch connection failures and retry with a different proxy session. Combined with Hex Proxies per-request rotation, each retry gets a fresh IP automatically without maintaining a proxy list.
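A minimal sketch of that pattern follows. To keep it dependency-free, it matches exception class names instead of importing from twisted.internet.error (in a real project you would import the classes and use isinstance); the proxy_retries meta key and MAX_PROXY_RETRIES cap are illustrative:

```python
# middlewares.py - retry proxy connection failures; each rescheduled request
# opens a new connection through the gateway and therefore gets a fresh IP
class ProxyRetryMiddleware:
    # Names of twisted.internet.error exceptions treated as proxy failures
    RETRYABLE = ('TCPTimedOutError', 'ConnectionRefusedError', 'ConnectionLost')
    MAX_PROXY_RETRIES = 3  # illustrative cap, separate from Scrapy's RETRY_TIMES

    def process_exception(self, request, exception, spider):
        if type(exception).__name__ not in self.RETRYABLE:
            return None  # not a proxy failure; let other middleware handle it
        retries = request.meta.get('proxy_retries', 0)
        if retries >= self.MAX_PROXY_RETRIES:
            return None  # give up and let the failure propagate
        spider.logger.warning('Proxy error %r, retrying %s', exception, request.url)
        retry_req = request.copy()
        retry_req.meta['proxy_retries'] = retries + 1
        retry_req.dont_filter = True  # bypass the duplicate filter for the retry
        return retry_req  # returning a Request reschedules it through the pipeline
```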

Scrapy Settings for Proxy Performance

Key settings for proxied scraping: CONCURRENT_REQUESTS=50 (match your proxy plan), DOWNLOAD_TIMEOUT=60 (higher for proxied traffic), DOWNLOAD_DELAY=0.5 (respect destinations), and RETRY_TIMES=3 (automatic retries on failure).
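Put together, those settings might look like this in settings.py; each value mirrors the guidance above and should be tuned to your plan and targets:

```python
# settings.py - performance settings for proxied crawls
CONCURRENT_REQUESTS = 50     # parallel requests through the gateway
DOWNLOAD_TIMEOUT = 60        # seconds before a request is treated as failed
DOWNLOAD_DELAY = 0.5         # per-domain politeness delay
RETRY_ENABLED = True
RETRY_TIMES = 3              # automatic retries on failed requests
AUTOTHROTTLE_ENABLED = True  # adapt pacing to destination response times
```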

Configuration Options

  • **Proxy URL** -- http://user:pass@gate.hexproxies.com:8080 set in middleware or request.meta.
  • **Middleware Priority** -- 350 for ProxyMiddleware, before HttpProxyMiddleware at 400.
  • **Concurrent Requests** -- CONCURRENT_REQUESTS controls parallel connections through the proxy.
  • **Download Timeout** -- DOWNLOAD_TIMEOUT=60 so stalled proxied requests fail fast (Scrapy's default is 180 seconds).
  • **AutoThrottle** -- Enable AUTOTHROTTLE_ENABLED=True for adaptive request pacing.

Error Handling

Scrapy handles proxy errors through the middleware pipeline and retry system.

1. twisted.internet.error.TCPTimedOutError
   - The proxy connection timed out during the TCP handshake
   - Increase DOWNLOAD_TIMEOUT from the default 180 to 300 seconds
   - Check if gate.hexproxies.com:8080 is reachable from your scraping server

2. twisted.internet.error.ConnectionRefusedError
   - The proxy gateway refused the connection
   - Verify your Hex Proxies subscription is active
   - Check CONCURRENT_REQUESTS is not exceeding your plan limits

3. HTTP 407 in response.status
   - Proxy authentication failed
   - Verify credentials in the middleware are correct
   - Ensure the proxy URL includes both username and password

4. scrapy.spidermiddlewares.httperror.HttpError (403/429)
   - The destination site blocked or rate-limited the request
   - Reduce CONCURRENT_REQUESTS and increase DOWNLOAD_DELAY
   - Switch to residential proxies for better success rates

5. twisted.internet.error.ConnectionLost
   - The proxy or destination dropped the connection mid-transfer
   - Enable RETRY_ENABLED=True with RETRY_TIMES=3
   - Implement process_exception in your middleware for custom retry logic

Log all errors with spider.logger and use Scrapy's stats collector to track proxy error rates per crawl.


Frequently Asked Questions

Should I use middleware or meta for proxy configuration?

Use middleware for project-wide proxy routing. Use meta for per-request proxy overrides, such as different proxies for different domains.
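For a per-request meta override to take precedence, the project-wide middleware should only supply a default rather than overwrite every request; a minimal sketch using setdefault:

```python
# middlewares.py - apply the project-wide proxy only when the request has not
# already chosen one via meta, so per-request overrides win
class ProxyMiddleware:
    DEFAULT_PROXY = 'http://user:pass@gate.hexproxies.com:8080'

    def process_request(self, request, spider):
        # setdefault leaves an existing meta['proxy'] untouched
        request.meta.setdefault('proxy', self.DEFAULT_PROXY)
```

A spider can then pass meta={'proxy': ...} on individual requests (for example, a different gateway for one domain) while everything else flows through the default.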

How many concurrent requests should I run with proxies?

Start with CONCURRENT_REQUESTS=20 and increase to 50-100 based on your Hex Proxies plan and destination tolerance.

Does Scrapy support SOCKS5 proxies?

Not natively. Install scrapy-socks for SOCKS5 support, or use the HTTP proxy on gate.hexproxies.com:8080 which covers most web scraping needs.

How do I rotate proxies in Scrapy?

Hex Proxies rotates IPs automatically per-request. Each new Scrapy request through the gateway gets a fresh IP without any additional configuration.

Ready to Get Started?

Set up Scrapy with Hex Proxies in minutes.