The Kubernetes Proxy Sidecar Pattern for Scraping at Scale

By Hex Proxies Engineering Team

When you run scrapers in Kubernetes, the obvious question is where the proxy client lives. Putting it inside the scraper binary works for small fleets but couples rotation logic to application code. The sidecar pattern solves this: each Pod runs a tiny proxy process (HAProxy or Envoy) alongside the scraper container, and the scraper talks to localhost:3128 without knowing anything about upstream gateways.

The payoff is operational: you can swap providers, rotate credentials, change routing rules, and run chaos experiments without rebuilding the scraper image. Credentials live in a Secret mounted only into the sidecar. A NetworkPolicy blocks direct egress to target sites, so a compromised scraper container cannot bypass the proxy and expose the cluster's own IPs.

Deployment with init container gate

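The init container gate ensures the scraper never starts before the proxy is reachable. Ordering matters here: regular containers start only after every init container has finished, so an init container cannot wait for a sidecar declared under containers. The check would time out and the Pod would never start. Declaring the proxy as a native sidecar (an init container with restartPolicy: Always, enabled by default since Kubernetes 1.29) fixes this. Kubernetes starts the sidecar first, waits for its startupProbe, runs the proxy-healthcheck gate, and only then starts the scraper. Without the gate, the scraper races the proxy on Pod startup and its first requests fail with connection errors.

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
  namespace: scraping
  labels:
    app: scraper-worker
spec:
  replicas: 6
  selector:
    matchLabels:
      app: scraper-worker
  template:
    metadata:
      labels:
        app: scraper-worker
    spec:
      initContainers:
        # Native sidecar: starts first and keeps running for the life of the Pod.
        - name: proxy-sidecar
          image: haproxy:3.0.5-alpine
          restartPolicy: Always
          ports:
            - containerPort: 3128
              name: proxy
          volumeMounts:
            - name: haproxy-config
              mountPath: /usr/local/etc/haproxy
              readOnly: true
          resources:
            requests: { cpu: "100m", memory: "64Mi" }
            limits:   { cpu: "500m", memory: "256Mi" }
          # Later init containers wait until this probe succeeds.
          startupProbe:
            tcpSocket:
              port: 3128
            periodSeconds: 1
            failureThreshold: 30
          livenessProbe:
            tcpSocket:
              port: 3128
            periodSeconds: 5
        # End-to-end gate: one request through the sidecar and the upstream gateway.
        - name: proxy-healthcheck
          image: curlimages/curl:8.10.1
          command:
            - /bin/sh
            - -c
            - |
              set -e
              echo "waiting for proxy sidecar..."
              for i in $(seq 1 30); do
                if curl -sf --max-time 2 \
                   -x http://localhost:3128 \
                   https://httpbin.org/ip; then
                  echo "proxy ready"
                  exit 0
                fi
                sleep 1
              done
              echo "proxy never became ready"
              exit 1
      containers:
        - name: scraper
          image: ghcr.io/example/scraper:1.8.3
          env:
            - name: HTTP_PROXY
              value: "http://localhost:3128"
            - name: HTTPS_PROXY
              value: "http://localhost:3128"
            - name: NO_PROXY
              value: "localhost,127.0.0.1,.svc.cluster.local"
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
            limits:   { cpu: "1000m", memory: "1Gi" }
          livenessProbe:
            exec:
              # [s]craper keeps pgrep -f from matching this probe's own shell.
              command: ["/bin/sh", "-c", "pgrep -f '[s]craper' >/dev/null"]
            initialDelaySeconds: 20
            periodSeconds: 10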
      volumes:
        - name: haproxy-config
          configMap:
            name: haproxy-proxy-config

HAProxy ConfigMap

HAProxy is the right choice here over Envoy for two reasons. First, the config is 30 lines instead of 300. Second, HAProxy has a smaller memory footprint at low concurrency (40MB vs 80MB idle), which adds up when you run 30 replicas. Envoy wins if you need gRPC, WASM filters, or xDS integration. For plain HTTP forwarding, HAProxy is cleaner.

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: haproxy-proxy-config
  namespace: scraping
data:
  haproxy.cfg: |
    global
        log stdout format raw local0
        maxconn 4096
        tune.ssl.default-dh-param 2048

    defaults
        mode http
        log global
        option httplog
        timeout connect 5s
        timeout client  30s
        timeout server  30s
        retries 3

    frontend local_proxy
        bind *:3128
        default_backend hex_upstream

    backend hex_upstream
        balance roundrobin
        option httpchk GET /health
        # Forward everything through the Hex Proxies gateway.
        # Credentials come from the Secret mounted as env vars.
        server hex-us-1 gate.hexproxies.com:7777 check inter 10s fall 3 rise 2
        server hex-eu-1 gate-eu.hexproxies.com:7777 check inter 10s fall 3 rise 2 backup

The backup keyword on the EU server means traffic only goes there when all hex-us-* servers fail health checks. For active-active load balancing, drop the backup keyword.

Service and autoscaling

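The headless Service gives a metrics collector such as Prometheus a way to discover every Pod on port 9100. The HPA scales on CPU, with utilization measured against the containers' CPU requests. For scrapers this is usually the right signal because the workload is CPU-bound on HTML parsing and TLS. If your scrapers are I/O-bound, scale on a custom metric such as queue depth (via the Prometheus adapter) instead.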

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: scraper-worker-metrics
  namespace: scraping
spec:
  selector:
    app: scraper-worker
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
  clusterIP: None
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker
  namespace: scraping
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper-worker
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
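
For the I/O-bound case mentioned earlier, a queue-depth HPA might look like the sketch below. It assumes a metrics adapter already serves the external metrics API; the metric name scraper_queue_depth and the target of 100 queued items per replica are hypothetical placeholders, not values from this setup.

```yaml
# hpa-queue.yaml (sketch; metric name and target value are hypothetical)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker
  namespace: scraping
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper-worker
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: scraper_queue_depth   # hypothetical metric served by the adapter
        target:
          type: AverageValue
          averageValue: "100"         # assumed: about 100 queued items per replica
```

With AverageValue, the HPA divides the metric by the current replica count, so the fleet grows until each replica owns roughly the target share of the queue. Apply this in place of the CPU-based HPA rather than alongside it, since two autoscalers targeting the same Deployment will fight each other.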

Locking down egress

The NetworkPolicy is the critical piece. Without it, a compromised scraper container can make direct outbound connections and bypass the sidecar entirely, which leaks your real IP and wastes your proxy spend. NetworkPolicy applies to the whole Pod, not to individual containers, so the sidecar's upstream connection needs its own allow rule. The policy below allows only (a) pod-to-pod traffic within the app, (b) DNS to kube-system, and (c) TCP port 7777 to the proxy gateway. Traffic from the scraper to its sidecar on localhost is loopback and is never subject to NetworkPolicy. Everything else is blocked.

# networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: scraper-egress
  namespace: scraping
spec:
  podSelector:
    matchLabels:
      app: scraper-worker
  policyTypes:
    - Egress
  egress:
    # Pod-to-pod traffic within the app.
    - to:
        - podSelector:
            matchLabels:
              app: scraper-worker
    # DNS lookups via kube-system.
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
      to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
    # Sidecar upstream: the proxy gateway on port 7777.
    # The CIDR is a placeholder; replace it with the gateway's actual range.
    - ports:
        - protocol: TCP
          port: 7777
      to:
        - ipBlock:
            cidr: 203.0.113.0/24

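Test this by exec'ing into a scraper pod and running curl --noproxy '*' --max-time 5 https://httpbin.org/ip. The --noproxy flag overrides the HTTPS_PROXY variable set in the Deployment, so the request goes direct and should fail. Then run curl -x http://localhost:3128 https://httpbin.org/ip. It should succeed and return a Hex Proxies exit IP.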

Credentials handling

Put the proxy credentials in a Kubernetes Secret and expose them to the HAProxy container only, not the scraper. HAProxy expands environment variables written as "${VAR}" inside double-quoted strings when it parses the config, so an http-request set-header Proxy-Authorization ... line can pick up the credential at startup without baking secrets into the ConfigMap.
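
One way to wire this up is sketched below. The Secret name hex-proxy-credentials, the key basic-auth, and the variable PROXY_BASIC_AUTH are hypothetical, and the sketch assumes the gateway accepts HTTP Basic proxy authentication; check your provider's documentation for the scheme it actually expects.

```yaml
# secret.yaml (sketch; Secret name, key, and variable name are hypothetical)
apiVersion: v1
kind: Secret
metadata:
  name: hex-proxy-credentials
  namespace: scraping
type: Opaque
stringData:
  # Placeholder: base64 of "username:password", computed ahead of time.
  basic-auth: "REPLACE_WITH_BASE64_OF_USER_COLON_PASSWORD"
---
# Fragment of deployment.yaml: add to the proxy-sidecar container only.
env:
  - name: PROXY_BASIC_AUTH
    valueFrom:
      secretKeyRef:
        name: hex-proxy-credentials
        key: basic-auth
---
# Fragment of configmap.yaml: add the http-request line to the existing backend.
haproxy.cfg: |
    backend hex_upstream
        # Expanded from the environment when HAProxy parses the config.
        http-request set-header Proxy-Authorization "Basic ${PROXY_BASIC_AUTH}"
```

Because the variable is read at config parse time, rotating the Secret requires restarting the sidecar (for example with a rollout restart) before the new credential takes effect.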

When the sidecar pattern is wrong

Sidecars have a fixed overhead: ~100m CPU and 64MB RAM per Pod. If you run thousands of tiny Pods, that overhead dominates. For very-high-replica workloads, switch to a node-local proxy DaemonSet instead — one HAProxy per node, shared by all Pods on that node. The trade-off is weaker isolation: a noisy Pod can saturate the node proxy.
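
A node-local variant could be sketched as follows. The DaemonSet name node-proxy and the NODE_IP variable are hypothetical; the sketch reuses the haproxy-proxy-config ConfigMap and assumes your CNI plugin supports hostPort.

```yaml
# daemonset.yaml (sketch; names are hypothetical)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-proxy
  namespace: scraping
  labels:
    app: node-proxy
spec:
  selector:
    matchLabels:
      app: node-proxy
  template:
    metadata:
      labels:
        app: node-proxy
    spec:
      containers:
        - name: haproxy
          image: haproxy:3.0.5-alpine
          ports:
            - containerPort: 3128
              hostPort: 3128        # exposes the proxy on the node's own IP
              name: proxy
          volumeMounts:
            - name: haproxy-config
              mountPath: /usr/local/etc/haproxy
              readOnly: true
      volumes:
        - name: haproxy-config
          configMap:
            name: haproxy-proxy-config
---
# Fragment of the scraper container spec: point the proxy variables at the node.
env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP    # IP of the node running this Pod
  - name: HTTP_PROXY
    value: "http://$(NODE_IP):3128"
  - name: HTTPS_PROXY
    value: "http://$(NODE_IP):3128"
```

NODE_IP must be declared before the variables that reference it, because Kubernetes only expands $(VAR) references to entries defined earlier in the list. The egress NetworkPolicy for the scraper Pods would also need a rule allowing TCP 3128 to the node's address.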

For most teams, sidecars are the right default. See our distributed scraping pipeline guide for how this fits into a full Kafka + worker architecture, and pricing for Hex Proxies plans.