Why wget for Proxy Work
wget occupies a distinct niche from cURL in the proxy toolbox: it is purpose-built for downloading files and recursively mirroring websites, making it the superior choice for proxy workflows centered on content retrieval rather than API interaction. While cURL excels at single-request operations and protocol flexibility, wget's built-in retry logic, recursive download capabilities, and bandwidth throttling make it the right tool when you need to download entire directory structures, mirror static sites, or batch-retrieve files through gate.hexproxies.com:8080 with automatic resume on failure.
wget's non-interactive design makes it inherently suited for automated proxy workflows. It requires no user input, handles interrupted downloads by resuming from where it left off (with `-c`), and follows redirects by default. These characteristics make wget scripts robust against the intermittent failures that are common in proxy-routed downloads: temporary proxy congestion, network hiccups, and target server rate limiting are all handled gracefully by wget's retry and resume mechanisms without custom error handling code.
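As a sketch of these mechanics, the following invocation resumes a partial file and retries on transient failures, assuming the gate.hexproxies.com:8080 gateway mentioned above and a placeholder download URL:

```bash
# Route this shell's wget calls through the gateway (credentials omitted here)
export http_proxy="http://gate.hexproxies.com:8080"
export https_proxy="http://gate.hexproxies.com:8080"

# -c resumes a partially downloaded file; --tries, --waitretry, and
# --retry-connrefused absorb transient proxy congestion and connection
# resets without any custom error-handling code.
wget -c --tries=5 --waitretry=10 --retry-connrefused \
     https://example.com/files/dataset.tar.gz
```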
Configuration Patterns
wget supports proxy configuration through three mechanisms with a clear priority order. Command-line options take highest priority, followed by the `~/.wgetrc` configuration file, followed by the environment variables (`http_proxy`, `https_proxy`); a proxy URL set in `.wgetrc` replaces the one coming from the environment. The `.wgetrc` approach is ideal for persistent proxy setups on dedicated scraping servers where every wget call should route through the proxy. Set `use_proxy = on`, `http_proxy`, and `https_proxy` in `.wgetrc` to make proxied downloads the default behavior.
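A minimal `~/.wgetrc` along these lines (gateway address as above) makes every wget invocation on the host proxy-aware by default:

```
# ~/.wgetrc — persistent proxy defaults for a dedicated scraping server
use_proxy = on
http_proxy = http://gate.hexproxies.com:8080/
https_proxy = http://gate.hexproxies.com:8080/
```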
For authenticated proxies, embed credentials in the proxy URL within the environment variable or `.wgetrc`. wget also accepts `--proxy-user` and `--proxy-password` command-line flags for scripts that source credentials from a secrets manager. The `no_proxy` environment variable (or `no_proxy` directive in `.wgetrc`) lets you bypass the proxy for specific domains, useful for internal services that should not be routed through external proxy infrastructure.
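Both approaches might look like the following; the username, password, and internal domain are placeholders:

```bash
# Credentials embedded in the proxy URL; no_proxy bypasses internal hosts
export https_proxy="http://USER:PASS@gate.hexproxies.com:8080"
export no_proxy="localhost,.corp.internal"

# Or supply credentials per invocation, e.g. sourced from a secrets manager
wget --proxy-user="$PROXY_USER" --proxy-password="$PROXY_PASS" \
     https://example.com/reports/latest.csv
```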
Common Pitfalls
wget's recursive download mode (`-r`) through a proxy can generate enormous traffic volumes if not properly constrained. Without depth limiting (`-l`), domain restriction (`-D`), and accept/reject patterns (`-A`/`-R`), a recursive wget crawl follows every link it discovers, potentially downloading entire websites and consuming your proxy bandwidth allocation in minutes. Always set `-l 2` for shallow crawls, `-D example.com` to stay on the target domain, and `-R "*.pdf,*.zip"` to skip large binary files unless they are your target.
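Put together, a constrained recursive crawl might look like this (the target URL is a placeholder; `--no-parent` additionally stops wget from ascending above the starting directory):

```bash
# Depth-limited, domain-restricted crawl that skips large binaries
wget -r -l 2 -D example.com -R "*.pdf,*.zip" --no-parent \
     https://example.com/docs/
```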
When mirroring with `--convert-links` (`-k`), wget rewrites absolute URLs to relative paths, which can cause issues if the downloaded content references resources that were not included in the mirror. More importantly for proxy work, wget's timestamp-based conditional downloads (`-N`) rely on the server's Last-Modified header, which may be inaccurate when fetched through a caching proxy. Use `--no-cache` to ensure wget requests fresh content through the proxy rather than receiving cached versions.
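For example, a timestamp-checked fetch that still bypasses intermediate caches (placeholder URL):

```bash
# -N only re-downloads if the server copy is newer; --no-cache asks the
# proxy and other intermediate caches to revalidate rather than serve stale content
wget -N --no-cache https://example.com/data/latest.json
```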
Performance Optimization
wget's `--limit-rate` flag is essential for long-running proxy downloads that should not saturate your bandwidth allocation. Set it to a value that leaves headroom for other proxy consumers: `--limit-rate=500k` limits each wget process to 500KB/s. Combine with `--wait=1` and `--random-wait` to add pauses between requests during recursive downloads, reducing the chance of triggering rate limits on target servers.
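A throttled, politely paced recursive download combining these flags might look like this (values and URL are illustrative):

```bash
# Cap throughput at ~500 KB/s and insert a randomized pause around 1 second
# between requests during the recursive crawl
wget -r -l 2 --limit-rate=500k --wait=1 --random-wait \
     https://example.com/archive/
```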
For parallel downloads through the proxy, use GNU Parallel or xargs to run multiple wget instances simultaneously. The command `cat urls.txt | xargs -P 5 -I {} wget -q --timeout=30 -O /dev/null {}` runs 5 concurrent wget processes, each routing through the proxy configured in the environment; note that `-O /dev/null` discards the response bodies, which makes this form useful for measuring proxy throughput but not for actual retrieval, so drop it or set an output directory (as shown below) when you need to keep the files. Monitor aggregate throughput by piping wget's progress output through a summary script, and adjust parallelism based on the proxy gateway's response times and error rates.
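A file-preserving variant of the same pattern, assuming the proxy variables are already exported and `urls.txt` holds one URL per line:

```bash
# 5 concurrent wget processes; --directory-prefix collects downloads in one place
# and -x recreates each URL's directory structure instead of discarding output
xargs -P 5 -I {} wget -q --timeout=30 -x --directory-prefix=downloads {} < urls.txt
```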