Why Colly for Proxy Work
Colly is the most popular web scraping framework in the Go ecosystem, combining Go's raw performance with a developer-friendly, callback-based API inspired by jQuery-style selectors and Scrapy. For proxy-driven scraping, Colly provides built-in proxy rotation, request queuing, rate limiting, and robots.txt compliance, features that would otherwise require significant custom code with raw `net/http`. It is also fast: a single Colly process can crawl thousands of pages per minute through gate.hexproxies.com:8080 while keeping navigation logic, data extraction, and error handling cleanly separated.
Colly's architecture is event-driven through a collector pattern. You register callbacks for HTML elements, response events, and errors, and Colly manages the request lifecycle. This design is ideal for proxy scraping because proxy failures, retries, and rotation can be handled in dedicated callbacks without cluttering your parsing logic. The collector also supports recursive crawling with automatic deduplication, so you can point it at a website root and let it discover and proxy-route all linked pages automatically.
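The pattern looks like this in practice; a minimal sketch, where the target domain, the CSS selector, and the link-following logic are all placeholders:

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"), // placeholder target
	)

	// Extraction and navigation logic live in element callbacks.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Request.AbsoluteURL(e.Attr("href"))
		fmt.Println("found:", link)
		e.Request.Visit(link) // recursive crawl; visited URLs are deduplicated
	})

	// Proxy failures and HTTP errors surface here, away from parsing code.
	c.OnError(func(r *colly.Response, err error) {
		log.Printf("request to %s failed: %v", r.Request.URL, err)
	})

	if err := c.Visit("https://example.com/"); err != nil {
		log.Fatal(err)
	}
}
```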
Configuration Patterns
Colly offers two proxy configuration approaches. The simple method uses `c.SetProxy()` for a single static proxy, which is sufficient when Hex Proxies handles rotation at the gateway level. The advanced method uses `c.SetProxyFunc()` with a custom function that returns a different proxy URL for each request, enabling client-side rotation logic or geographic proxy selection based on the target URL.
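A sketch of both approaches, assuming placeholder credentials (`user:pass`) and a hypothetical second gateway hostname; `SetProxy`, `SetProxyFunc`, and the `proxy.RoundRobinProxySwitcher` helper are part of Colly's API:

```go
package main

import (
	"log"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/proxy"
)

func main() {
	c := colly.NewCollector()

	// Simple: one static gateway; rotation happens server-side.
	if err := c.SetProxy("http://user:pass@gate.hexproxies.com:8080"); err != nil {
		log.Fatal(err)
	}

	// Advanced: client-side round-robin across several endpoints.
	// This replaces the static proxy set above; use one approach or the other.
	rp, err := proxy.RoundRobinProxySwitcher(
		"http://user:pass@gate.hexproxies.com:8080",
		"http://user:pass@gate2.hexproxies.com:8080", // hypothetical second gateway
	)
	if err != nil {
		log.Fatal(err)
	}
	c.SetProxyFunc(rp)
}
```

For fully custom selection, such as picking a geo-specific proxy per target, you can pass your own `colly.ProxyFunc` to `SetProxyFunc` instead of the round-robin helper.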
Colly's `LimitRule` system controls per-domain request rates through its `DomainGlob`, `Parallelism`, `Delay`, and `RandomDelay` fields. Set `Parallelism` to the maximum number of concurrent requests per domain and `Delay` to the minimum gap between requests. These rules are keyed to the target domain, not the proxy: they govern how aggressively your scraper hits each site through the proxy, not how many proxy connections you open overall.
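For example, a rule like the following (the domain pattern and timings are illustrative):

```go
package main

import (
	"log"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Parallelism only takes effect when the collector runs async.
	c := colly.NewCollector(colly.Async(true))

	err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*.example.com",        // illustrative target pattern
		Parallelism: 2,                      // at most 2 in-flight requests per matching domain
		Delay:       1 * time.Second,        // minimum gap between requests
		RandomDelay: 500 * time.Millisecond, // extra jitter added on top of Delay
	})
	if err != nil {
		log.Fatal(err)
	}
}
```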
Common Pitfalls
Colly's collector cloning (`c.Clone()`) creates a new collector that inherits the parent's configuration, including proxy settings, but not its registered callbacks, so every clone needs its own `OnHTML`, `OnResponse`, and `OnError` handlers. Visited-URL state is also easy to get wrong across clones: depending on the storage backing your collectors, the same URL can end up scraped more than once. When using cloned collectors for parallel domain scraping through proxies, implement external deduplication with a shared set or database so proxy requests are not wasted on duplicate pages.
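One way to do this is a small mutex-guarded set checked in an `OnRequest` callback; `visitOnce` below is a hypothetical helper, not part of Colly:

```go
package main

import (
	"sync"

	"github.com/gocolly/colly/v2"
)

// visitOnce is a hypothetical guard: a mutex-protected set shared by all
// collectors so the same URL is never fetched twice through the proxy.
type visitOnce struct {
	mu   sync.Mutex
	seen map[string]bool
}

// claim returns true the first time a URL is seen, false on duplicates.
func (v *visitOnce) claim(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[url] {
		return false
	}
	v.seen[url] = true
	return true
}

func main() {
	guard := &visitOnce{seen: make(map[string]bool)}

	parent := colly.NewCollector()
	detail := parent.Clone() // inherits configuration, but not callbacks

	for _, c := range []*colly.Collector{parent, detail} {
		c.OnRequest(func(r *colly.Request) {
			if !guard.claim(r.URL.String()) {
				r.Abort() // duplicate: skip without spending a proxy request
			}
		})
	}
}
```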
A subtle issue arises when Colly's built-in async mode interacts with proxy rate limits. Enabling `c.Async = true` allows Visit calls to return immediately and execute in parallel, but without a `LimitRule`, this can fire hundreds of simultaneous requests through the proxy. Always define limit rules before enabling async mode, and use `c.Wait()` to block until all queued requests complete before exiting the program.
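A minimal sketch of the safe ordering, with placeholder URLs and illustrative limits:

```go
package main

import (
	"log"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(colly.Async(true))

	// Define the limit BEFORE queuing visits, or the async collector
	// will fire every request through the proxy at once.
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 4,
		Delay:       200 * time.Millisecond,
	}); err != nil {
		log.Fatal(err)
	}

	for _, u := range []string{
		"https://example.com/a", // placeholder URLs
		"https://example.com/b",
	} {
		c.Visit(u) // returns immediately in async mode
	}

	c.Wait() // block until every queued request has completed
}
```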
Performance Optimization
Colly's `queue` package manages the request queue through a pluggable `Storage` interface. The stock implementation is in-memory, but for large-scale proxy crawls spanning millions of URLs you can switch to a persistent backend such as a Redis-backed or SQLite-backed store so your crawl survives process restarts without re-fetching already-visited pages. This saves proxy bandwidth and prevents redundant requests when a crawl takes hours or days.
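A sketch using the stock in-memory storage; swapping in a persistent implementation of `queue.Storage` (such as the `gocolly/redisstorage` package) is a drop-in change:

```go
package main

import (
	"log"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/queue"
)

func main() {
	c := colly.NewCollector()

	// Two consumer threads pulling from an in-memory queue. Replacing the
	// storage argument with a persistent queue.Storage implementation
	// lets the queue survive process restarts.
	q, err := queue.New(2, &queue.InMemoryQueueStorage{MaxSize: 10000})
	if err != nil {
		log.Fatal(err)
	}

	q.AddURL("https://example.com/") // placeholder seed URL
	if err := q.Run(c); err != nil {
		log.Fatal(err)
	}
}
```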
Enable Colly's caching layer with `colly.CacheDir()` to store raw HTML responses on disk. When developing parsing logic, you can iterate on your extraction code using cached responses without routing new requests through the proxy. This development workflow saves significant proxy usage during the extraction tuning phase. Once your parser is stable, disable caching for production crawls. Profile your Colly scraper with Go's `pprof` tool to identify whether bottlenecks are in network I/O (proxy latency), HTML parsing (CPU-bound), or pipeline processing (data handling).
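One convenient arrangement is to toggle the cache with a flag; `newCollector` here is a hypothetical helper and the cache path is a placeholder:

```go
package main

import (
	"github.com/gocolly/colly/v2"
)

// newCollector is a hypothetical constructor that toggles caching: in
// development, responses replay from disk instead of going through the proxy.
func newCollector(dev bool) *colly.Collector {
	if dev {
		return colly.NewCollector(colly.CacheDir("./colly-cache"))
	}
	return colly.NewCollector() // production: no cache, fresh responses
}
```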