The Kubernetes Proxy Sidecar Pattern for Scraping at Scale
When you run scrapers in Kubernetes, the obvious question is where the proxy client lives. Putting it inside the scraper binary works for small fleets but couples rotation logic to application code. The sidecar pattern solves this: each Pod runs a tiny proxy process (HAProxy or Envoy) alongside the scraper container, and the scraper talks to localhost:3128 without knowing anything about upstream gateways.
The payoff is operational: you can swap providers, rotate credentials, change routing rules, and run chaos experiments without rebuilding the scraper image. Credentials live in a Secret exposed only to the sidecar. A NetworkPolicy restricts what the Pod can reach, so even a compromised scraper container cannot connect to target sites directly and expose your cluster's egress IPs.
Deployment with init container gate
The init container gate ensures the scraper never starts before the sidecar is reachable. Without it, the scraper races the sidecar on Pod startup and its first requests fail. One detail matters: init containers run to completion before any regular container starts, so a gate that curls localhost:3128 can only pass if the proxy is already running. The manifest therefore declares the proxy as a native sidecar (an init container with restartPolicy: Always, enabled by default since Kubernetes 1.29) and lists it before the gate.
# deployment.yaml
# Requires native sidecar support (Kubernetes 1.29+).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
  namespace: scraping
  labels:
    app: scraper-worker
spec:
  replicas: 6
  selector:
    matchLabels:
      app: scraper-worker
  template:
    metadata:
      labels:
        app: scraper-worker
    spec:
      initContainers:
        # Native sidecar: restartPolicy Always keeps it running for the
        # life of the Pod. The next init container starts only after the
        # startupProbe passes.
        - name: proxy-sidecar
          image: haproxy:3.0.5-alpine
          restartPolicy: Always
          ports:
            - containerPort: 3128
              name: proxy
          volumeMounts:
            - name: haproxy-config
              mountPath: /usr/local/etc/haproxy
              readOnly: true
          resources:
            requests: { cpu: "100m", memory: "64Mi" }
            limits: { cpu: "500m", memory: "256Mi" }
          startupProbe:
            tcpSocket:
              port: 3128
            periodSeconds: 1
            failureThreshold: 30
          livenessProbe:
            tcpSocket:
              port: 3128
            periodSeconds: 5
        # Gate: blocks the scraper until a request succeeds end to end
        # through the sidecar and the upstream gateway.
        - name: proxy-healthcheck
          image: curlimages/curl:8.10.1
          command:
            - /bin/sh
            - -c
            - |
              set -e
              echo "waiting for proxy sidecar..."
              for i in $(seq 1 30); do
                if curl -sf --max-time 2 \
                  -x http://localhost:3128 \
                  https://httpbin.org/ip; then
                  echo "proxy ready"
                  exit 0
                fi
                sleep 1
              done
              echo "proxy never became ready"
              exit 1
      containers:
        - name: scraper
          image: ghcr.io/example/scraper:1.8.3
          env:
            - name: HTTP_PROXY
              value: "http://localhost:3128"
            - name: HTTPS_PROXY
              value: "http://localhost:3128"
            - name: NO_PROXY
              value: "localhost,127.0.0.1,.svc.cluster.local"
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
            limits: { cpu: "1000m", memory: "1Gi" }
          livenessProbe:
            exec:
              # Run pgrep directly. Wrapped in "sh -c", the shell's own
              # command line contains "scraper" and can match itself.
              command: ["pgrep", "-f", "scraper"]
            initialDelaySeconds: 20
            periodSeconds: 10
      volumes:
        - name: haproxy-config
          configMap:
            name: haproxy-proxy-config
HAProxy ConfigMap
HAProxy is the right choice here over Envoy for two reasons. First, the config is 30 lines instead of 300. Second, HAProxy has a smaller memory footprint at low concurrency (40MB vs 80MB idle), which matters when you run 30 replicas. Envoy wins if you need gRPC, WASM filters, or xDS integration. For plain HTTP forwarding, HAProxy is cleaner.
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: haproxy-proxy-config
  namespace: scraping
data:
  haproxy.cfg: |
    global
      log stdout format raw local0
      maxconn 4096
      tune.ssl.default-dh-param 2048

    defaults
      mode http
      log global
      option httplog
      timeout connect 5s
      timeout client 30s
      timeout server 30s
      retries 3

    frontend local_proxy
      bind *:3128
      default_backend hex_upstream

    backend hex_upstream
      balance roundrobin
      # Assumes the gateway answers GET /health. If it does not, remove
      # this line so "check" falls back to a plain TCP connect check.
      option httpchk GET /health
      # Forward everything through the Hex Proxies gateway.
      # Credentials come from the Secret exposed as env vars.
      server hex-us-1 gate.hexproxies.com:7777 check inter 10s fall 3 rise 2
      server hex-eu-1 gate-eu.hexproxies.com:7777 check inter 10s fall 3 rise 2 backup
The backup keyword on the EU server means traffic only goes there when all hex-us-* servers fail health checks. For active-active load balancing, drop the backup keyword.
Service and autoscaling
The HPA scales on CPU. For scrapers this is usually the right signal because the workload is CPU-bound on HTML parsing and TLS. If your scrapers are I/O-bound you should scale on a custom metric like queue depth (via Prometheus adapter) instead.
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: scraper-worker-metrics
  namespace: scraping
spec:
  selector:
    app: scraper-worker
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
  clusterIP: None
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker
  namespace: scraping
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper-worker
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
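For the I/O-bound case mentioned above, a queue-depth HPA might look roughly like the sketch below. It assumes a metrics adapter (for example the Prometheus adapter) already serves an external metric; the metric name scraper_queue_depth and the target of 100 queued items per Pod are hypothetical placeholders, not values from this setup.

```yaml
# hpa-queue.yaml (sketch; use instead of the CPU-based HPA, not alongside it)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker
  namespace: scraping
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper-worker
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          # Hypothetical metric name; must match what your adapter exposes.
          name: scraper_queue_depth
        target:
          # AverageValue divides the metric by the current replica count,
          # so this aims for about 100 queued items per Pod.
          type: AverageValue
          averageValue: "100"
```

Two HPAs targeting the same Deployment will fight each other, so pick one signal or list both metrics in a single HPA, in which case the larger of the computed replica counts is used.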
Locking down egress
The NetworkPolicy is the critical piece. Without it, a compromised scraper container can make direct outbound connections and bypass the sidecar entirely, which leaks your real IP and wastes your proxy spend. Keep in mind that a NetworkPolicy applies to the whole Pod, not to individual containers: the scraper and the sidecar share one network namespace, and traffic to localhost never leaves it, so it is always allowed. The policy below allows only (a) traffic between scraper-worker Pods, (b) DNS to kube-system, and (c) TCP port 7777, the gateway port the sidecar forwards to. Everything else is blocked, including direct connections to target sites on ports 80 and 443.
# networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: scraper-egress
  namespace: scraping
spec:
  podSelector:
    matchLabels:
      app: scraper-worker
  policyTypes:
    - Egress
  egress:
    # Traffic between scraper-worker Pods. Loopback traffic to the
    # sidecar is never subject to NetworkPolicy and needs no rule.
    - to:
        - podSelector:
            matchLabels:
              app: scraper-worker
    # DNS lookups against kube-system.
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
      to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
    # The sidecar must reach the upstream gateway on port 7777.
    # Add an ipBlock here to narrow this further if your provider
    # publishes its address ranges.
    - ports:
        - protocol: TCP
          port: 7777
Test this by exec'ing into a scraper pod and running curl https://httpbin.org/ip. It should fail. Then run curl -x http://localhost:3128 https://httpbin.org/ip, which should succeed and return a Hex Proxies exit IP.
Credentials handling
Put the proxy credentials in a Kubernetes Secret and expose them to the HAProxy container only, not the scraper. HAProxy expands environment variables inside double-quoted strings in its config, so you can build the http-request set-header Proxy-Authorization ... line when the process starts without baking secrets into ConfigMaps.
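A minimal sketch of that wiring follows. The Secret name hex-proxy-credentials and the key PROXY_AUTH_B64 are hypothetical, and the sketch assumes the gateway accepts HTTP Basic credentials in a Proxy-Authorization header; check your provider's documentation for the scheme it actually expects.

```yaml
# secret.yaml (sketch; name and key are hypothetical)
apiVersion: v1
kind: Secret
metadata:
  name: hex-proxy-credentials
  namespace: scraping
type: Opaque
stringData:
  # Base64 of "username:password", as HTTP Basic auth expects.
  PROXY_AUTH_B64: "<base64 of username:password>"
---
# Fragment of deployment.yaml: add this env block to the proxy-sidecar
# container only, never to the scraper container.
env:
  - name: PROXY_AUTH_B64
    valueFrom:
      secretKeyRef:
        name: hex-proxy-credentials
        key: PROXY_AUTH_B64
---
# Fragment of configmap.yaml: the backend gains one header rule.
# HAProxy substitutes ${PROXY_AUTH_B64} when it parses the config.
data:
  haproxy.cfg: |
    backend hex_upstream
      http-request set-header Proxy-Authorization "Basic ${PROXY_AUTH_B64}"
```

Because the variable is read when HAProxy parses its config, rotating the Secret requires restarting the Pods (for example with kubectl rollout restart) before the new value takes effect.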
When the sidecar pattern is wrong
Sidecars have a fixed overhead: ~100m CPU and 64MB RAM per Pod. If you run thousands of tiny Pods, that overhead dominates. For very-high-replica workloads, switch to a node-local proxy DaemonSet instead — one HAProxy per node, shared by all Pods on that node. The trade-off is weaker isolation: a noisy Pod can saturate the node proxy.
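The node-local variant could be sketched as below. The DaemonSet name node-proxy is hypothetical, the sketch reuses the haproxy-proxy-config ConfigMap from earlier, and it assumes your CNI plugin supports hostPort. Scrapers find the proxy through the node IP, which the downward API exposes as status.hostIP.

```yaml
# proxy-daemonset.yaml (sketch; one HAProxy per node)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-proxy
  namespace: scraping
spec:
  selector:
    matchLabels:
      app: node-proxy
  template:
    metadata:
      labels:
        app: node-proxy
    spec:
      containers:
        - name: haproxy
          image: haproxy:3.0.5-alpine
          ports:
            - containerPort: 3128
              hostPort: 3128   # reachable at <node IP>:3128
              name: proxy
          volumeMounts:
            - name: haproxy-config
              mountPath: /usr/local/etc/haproxy
              readOnly: true
      volumes:
        - name: haproxy-config
          configMap:
            name: haproxy-proxy-config
---
# Fragment of the scraper container spec: point the proxy variables
# at the node instead of localhost. NODE_IP must be defined first so
# that $(NODE_IP) expands in the entries below it.
env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: HTTP_PROXY
    value: "http://$(NODE_IP):3128"
  - name: HTTPS_PROXY
    value: "http://$(NODE_IP):3128"
```

With this layout the scraper's traffic to the proxy leaves the Pod, so the egress NetworkPolicy would also need a rule allowing TCP 3128 to the node.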
For most teams, sidecars are the right default. See our distributed scraping pipeline guide for how this fits into a full Kafka + worker architecture, and pricing for Hex Proxies plans.