Scrapy Proxy Setup
Scrapy's middleware architecture is purpose-built for proxy integration at scale. Unlike requests or aiohttp where you configure proxies per-request, Scrapy's downloader middleware pipeline intercepts every outgoing request automatically. This means you write the proxy logic once and every spider in your project inherits it without per-spider configuration.
Scrapy uses Twisted's asynchronous reactor under the hood, making it naturally concurrent without the complexity of asyncio. Combined with Hex Proxies, a single Scrapy process with CONCURRENT_REQUESTS=50 can handle tens of thousands of pages per hour with automatic proxy rotation.
Prerequisites
Before you begin, make sure you have:
- An active Hex Proxies account with proxy credentials
- Python 3.8+
- Scrapy 2.8+
- Twisted (installed automatically with Scrapy)
Installation
```
pip install scrapy
```

Basic Proxy Configuration
Scrapy supports proxies via the request.meta['proxy'] field or through a custom downloader middleware. The middleware approach is recommended for consistent proxy application across all spiders.
```python
# settings.py - Enable proxy middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
```

```python
# middlewares.py - Custom proxy middleware
class ProxyMiddleware:
    def process_request(self, request, spider):
        request.meta['proxy'] = (
            'http://user:pass@gate.hexproxies.com:8080'
        )
```
```python
# spider.py - Using proxy in spider
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://httpbin.org/ip']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={'proxy': 'http://user:pass@gate.hexproxies.com:8080'},
            )

    def parse(self, response):
        self.logger.info(response.text)
```
Middleware Priority and Ordering
Set your ProxyMiddleware priority to 350 so it runs before Scrapy's built-in HttpProxyMiddleware at 400 (lower numbers run first for process_request). This ensures your proxy URL is set before Scrapy processes the request. If you use RetryMiddleware (default priority 550), retried requests re-enter the downloader middleware chain from the top, so they pass through your proxy middleware again automatically.
Automatic Retry with Proxy Rotation
Implement process_exception in your middleware to catch connection failures and retry with a different proxy session. Combined with Hex Proxies per-request rotation, each retry gets a fresh IP automatically without maintaining a proxy list.
Scrapy Settings for Proxy Performance
Key settings for proxied scraping: CONCURRENT_REQUESTS=50 (match your proxy plan), DOWNLOAD_TIMEOUT=60 (raise it if proxied requests time out), DOWNLOAD_DELAY=0.5 (respect destinations), and RETRY_TIMES=3 (automatic retries on failure).
Configuration Options
- **Proxy URL** -- http://user:pass@gate.hexproxies.com:8080, set in middleware or request.meta.
- **Middleware Priority** -- 350 for ProxyMiddleware, before HttpProxyMiddleware at 400.
- **Concurrent Requests** -- CONCURRENT_REQUESTS controls parallel connections through the proxy.
- **Download Timeout** -- DOWNLOAD_TIMEOUT=60 to accommodate proxy overhead.
- **AutoThrottle** -- Enable AUTOTHROTTLE_ENABLED=True for adaptive request pacing.
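Pulled together, the options above might look like this in settings.py. The values are the starting points suggested in this guide (the AutoThrottle delay values are illustrative additions; tune everything to your plan and targets):

```python
# settings.py - proxy-oriented performance settings (suggested starting points)
CONCURRENT_REQUESTS = 50      # match your proxy plan's connection limit
DOWNLOAD_TIMEOUT = 60         # raise if proxied requests time out
DOWNLOAD_DELAY = 0.5          # respect destination sites
RETRY_ENABLED = True
RETRY_TIMES = 3               # automatic retries on failure

# AutoThrottle adapts request pacing to observed response latencies
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 10.0
```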
Error Handling
Scrapy handles proxy errors through the middleware pipeline and retry system.
1. twisted.internet.error.TCPTimedOutError
   - The proxy connection timed out during the TCP handshake
   - Increase DOWNLOAD_TIMEOUT from the default 180 to 300 seconds
   - Check if gate.hexproxies.com:8080 is reachable from your scraping server
2. twisted.internet.error.ConnectionRefusedError
   - The proxy gateway refused the connection
   - Verify your Hex Proxies subscription is active
   - Check that CONCURRENT_REQUESTS does not exceed your plan limits
3. HTTP 407 in response.status
   - Proxy authentication failed
   - Verify the credentials in the middleware are correct
   - Ensure the proxy URL includes both username and password
4. scrapy.spidermiddlewares.httperror.HttpError (403/429)
   - The destination site blocked or rate-limited the request
   - Reduce CONCURRENT_REQUESTS and increase DOWNLOAD_DELAY
   - Switch to residential proxies for better success rates
5. twisted.internet.error.ConnectionLost
   - The proxy or destination dropped the connection mid-transfer
   - Enable RETRY_ENABLED=True with RETRY_TIMES=3
   - Implement process_exception in your middleware for custom retry logic
Log all errors with spider.logger and use Scrapy's stats collector to track proxy error rates per crawl.
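One way to feed those stats is a small middleware that counts proxy-related outcomes via `crawler.stats.inc_value` (the `proxy/...` stat keys and the class name are made up for illustration; only the middleware hooks and the stats API are Scrapy's):

```python
# middlewares.py - count proxy-related outcomes in Scrapy's stats collector
class ProxyStatsMiddleware:
    def process_response(self, request, response, spider):
        # 'proxy/...' stat keys are illustrative, not built into Scrapy
        if response.status == 407:
            spider.crawler.stats.inc_value('proxy/auth_failures')
        elif response.status in (403, 429):
            spider.crawler.stats.inc_value('proxy/blocked')
        else:
            spider.crawler.stats.inc_value('proxy/ok')
        return response

    def process_exception(self, request, exception, spider):
        spider.crawler.stats.inc_value(
            'proxy/errors/%s' % exception.__class__.__name__)
        return None  # leave retry decisions to other middlewares
```

The counters show up in the stats summary Scrapy logs at the end of each crawl, so per-crawl proxy error rates are visible without extra tooling.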