A Docker Compose Stack for Proxy-Based Scraping

By Hex Proxies Engineering Team

For teams that need more than a script but less than Kubernetes, Docker Compose is the sweet spot. You get isolated services, health checks, declarative networking, and restart policies — all in a single YAML file you can version control and spin up with docker compose up.

This guide assembles a four-service scraping stack: Postgres for persistent storage, Redis for a work queue, a proxy router service that forwards requests to Hex Proxies, and a pool of scraper workers. Network isolation ensures the scraper workers cannot reach the internet except through the proxy router.

The full stack

# docker-compose.yml
name: scraper-stack

services:
  postgres:
    image: postgres:17-alpine
    restart: unless-stopped
    environment:
      POSTGRES_USER: scraper
      POSTGRES_PASSWORD_FILE: /run/secrets/pg_password
      POSTGRES_DB: scraping
    secrets: [pg_password]
    volumes:
      - pg_data:/var/lib/postgresql/data
      - ./sql/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U scraper -d scraping"]
      interval: 5s
      timeout: 5s
      retries: 5
    networks: [backend]
    deploy:
      resources:
        limits: { cpus: "1.0", memory: 1G }

  redis:
    image: redis:7.4-alpine
    restart: unless-stopped
    command: ["redis-server", "--save", "60", "1", "--appendonly", "yes"]
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks: [backend]

  proxy-router:
    build:
      context: ./proxy-router
      dockerfile: Dockerfile
    restart: unless-stopped
    environment:
      - HEX_USER_FILE=/run/secrets/hex_user
      - HEX_PASS_FILE=/run/secrets/hex_pass
      - HEX_GATEWAY=gate.hexproxies.com:7777
    secrets: [hex_user, hex_pass]
    ports:
      - "127.0.0.1:3128:3128"   # only bound to loopback on host
    healthcheck:
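      # python:3.12-slim ships neither wget nor curl, so probe with the interpreter
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:3128/health', timeout=3)"]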
      interval: 10s
      timeout: 5s
      retries: 3
    networks: [backend, egress]

  scraper-worker:
    build:
      context: ./worker
      dockerfile: Dockerfile
    restart: unless-stopped
    depends_on:
      postgres:    { condition: service_healthy }
      redis:       { condition: service_healthy }
      proxy-router: { condition: service_healthy }
    environment:
      - DATABASE_URL=postgresql://scraper@postgres:5432/scraping
      - DATABASE_PASSWORD_FILE=/run/secrets/pg_password
      - REDIS_URL=redis://redis:6379/0
      - HTTP_PROXY=http://proxy-router:3128
      - HTTPS_PROXY=http://proxy-router:3128
      - NO_PROXY=postgres,redis,proxy-router,localhost
    secrets: [pg_password]
    deploy:
      replicas: 4
      resources:
        limits: { cpus: "0.5", memory: 512M }
    networks: [backend]

volumes:
  pg_data:
  redis_data:

secrets:
  pg_password:
    file: ./secrets/pg_password.txt
  hex_user:
    file: ./secrets/hex_user.txt
  hex_pass:
    file: ./secrets/hex_pass.txt

networks:
  backend:
    driver: bridge
    internal: true   # no direct internet
  egress:
    driver: bridge   # only proxy-router uses this

The network design is the interesting part. backend is marked internal: true, which in Compose means it has no default gateway — containers on this network cannot initiate outbound internet connections. Only proxy-router is attached to both backend and egress, so it's the only path out. If a scraper gets compromised, it cannot phone home directly.
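
One way to confirm the isolation is to attempt a direct TCP connection from inside a running worker and expect it to fail. The following is a minimal sketch, not part of the stack above: the can_connect helper is a hypothetical name, and any public address you test against is your own choice.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a direct TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks,
        # DNS lookup failures, and timeouts.
        return False
```

Under the network layout described above, calling can_connect("1.1.1.1", 443) inside a scraper-worker container should return False, while can_connect("proxy-router", 3128) should return True. If the first call succeeds, the worker has been attached to a network that is not internal.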

The secrets mechanism mounts credential files under /run/secrets/ inside each service that lists them. This is cleaner than passing the values as environment variables: the values don't appear in docker inspect output, and child processes don't inherit them by accident.
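
On the application side, the worker has to combine DATABASE_URL with the password stored in the file named by DATABASE_PASSWORD_FILE. The sketch below shows one way to do that with the standard library. The helper names are hypothetical, and it assumes the URL already contains a user name, as it does in the Compose file above.

```python
import os
from pathlib import Path
from urllib.parse import quote, urlsplit, urlunsplit


def read_secret(path: str) -> str:
    """Read a secret file, dropping the trailing newline most editors add."""
    return Path(path).read_text(encoding="utf-8").rstrip("\n")


def database_url_with_password(url: str, password: str) -> str:
    """Insert a password into a URL shaped like scheme://user@host:port/db."""
    parts = urlsplit(url)
    user, _, hostport = parts.netloc.rpartition("@")
    # Percent-encode so characters such as '@' or '/' keep the URL valid.
    netloc = f"{user}:{quote(password, safe='')}@{hostport}"
    return urlunsplit(parts._replace(netloc=netloc))


def load_database_url(env=os.environ) -> str:
    """Build the full connection URL from the two environment variables."""
    password = read_secret(env["DATABASE_PASSWORD_FILE"])
    return database_url_with_password(env["DATABASE_URL"], password)
```

Building the URL in memory at startup keeps the password out of the container's environment, which is the point of using a secret file in the first place.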

Proxy router Dockerfile

Multi-stage build with a non-root user. The router itself is a tiny aiohttp application that accepts HTTP CONNECT and proxies to the upstream Hex gateway — 80 lines of Python, not shown here, but you can drop in any forwarding proxy (tinyproxy, HAProxy) as long as it can read credentials from /run/secrets/.

# proxy-router/Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/install -r requirements.txt

FROM python:3.12-slim
RUN useradd --system --no-create-home router
WORKDIR /app
COPY --from=builder /install /usr/local/lib/python3.12/site-packages
COPY router.py .
USER router
EXPOSE 3128
CMD ["python", "router.py"]

# requirements.txt
# aiohttp==3.10.10

Worker Dockerfile

Running dumb-init as PID 1 is important. The kernel does not apply default signal actions to PID 1, so a Python process running as PID 1 without its own SIGTERM handler simply ignores the signal, and docker compose stop waits out the 10-second default timeout on each container before Docker sends SIGKILL. dumb-init forwards signals to its child and reaps zombies. It's 20KB and it fixes a class of subtle bugs.

# worker/Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/install -r requirements.txt

FROM python:3.12-slim
RUN useradd --system --no-create-home --uid 10001 worker \
    && apt-get update && apt-get install -y --no-install-recommends dumb-init \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=builder /install /usr/local/lib/python3.12/site-packages
COPY worker.py .
USER worker
ENTRYPOINT ["dumb-init", "--"]
CMD ["python", "-u", "worker.py"]

# requirements.txt
# httpx[http2]==0.27.2
# redis==5.1.1
# psycopg[binary]==3.2.3
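
dumb-init delivers SIGTERM to the Python process, but worker.py still decides what to do with it. The article does not show worker.py, so the following is only a sketch of a graceful-shutdown loop. StopFlag, install_stop_flag, and run_worker are hypothetical names, and fetch_next_job and process_job stand in for the real queue and scraping logic.

```python
import signal
import time


class StopFlag:
    """Records that a stop signal arrived; the instance is the handler."""

    def __init__(self) -> None:
        self.requested = False

    def __call__(self, signum, frame) -> None:
        self.requested = True


def install_stop_flag() -> StopFlag:
    """Register one flag for SIGTERM (docker compose stop) and SIGINT."""
    flag = StopFlag()
    signal.signal(signal.SIGTERM, flag)
    signal.signal(signal.SIGINT, flag)
    return flag


def run_worker(fetch_next_job, process_job, stop: StopFlag,
               idle_wait: float = 1.0) -> int:
    """Process jobs until a stop is requested; return how many were handled."""
    processed = 0
    while not stop.requested:
        job = fetch_next_job()
        if job is None:
            time.sleep(idle_wait)  # queue empty: poll again shortly
            continue
        process_job(job)  # an in-flight job finishes before the loop exits
        processed += 1
    return processed
```

The flag is only checked between jobs, so a job that is already running completes before the worker exits. That works as long as a single job finishes well inside the stop timeout.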

Schema bootstrap

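The SQL file mounted into /docker-entrypoint-initdb.d/ runs only when Postgres initializes an empty data directory, which in this stack means the first boot with a fresh pg_data volume. It is not re-run on later starts, so subsequent migrations should be managed by your application (Alembic, Flyway, etc.).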

-- sql/init.sql
CREATE TABLE IF NOT EXISTS fetched_pages (
  id          BIGSERIAL PRIMARY KEY,
  url         TEXT NOT NULL,
  status_code INT  NOT NULL,
  body_len    INT  NOT NULL,
  fetched_at  TIMESTAMPTZ DEFAULT now(),
  UNIQUE (url, fetched_at)
);
CREATE INDEX IF NOT EXISTS idx_fetched_pages_url
  ON fetched_pages (url);
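
A worker could write to this table as sketched below. This is an illustration rather than the article's worker.py: record_page and page_row are hypothetical helpers, and conn is assumed to be a psycopg 3 connection, which provides execute and commit.

```python
INSERT_PAGE_SQL = """
    INSERT INTO fetched_pages (url, status_code, body_len)
    VALUES (%s, %s, %s)
"""


def page_row(url: str, status_code: int, body: bytes) -> tuple[str, int, int]:
    """Map one fetch result onto the columns of fetched_pages."""
    return (url, status_code, len(body))


def record_page(conn, url: str, status_code: int, body: bytes) -> None:
    """Insert one row; fetched_at is filled in by the column default."""
    # Parameters are passed separately so the driver handles quoting.
    conn.execute(INSERT_PAGE_SQL, page_row(url, status_code, body))
    conn.commit()
```

Only the body length is stored, matching the body_len column. The schema above has no column for the page body itself.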

Health checks and dependency ordering

The condition: service_healthy entries in the worker's depends_on are the other trick. They tell Compose not to start a worker until Postgres, Redis, and the proxy router all report healthy. Without them, workers start before Postgres is ready, crash on the first connection attempt, and restart in a loop until Postgres catches up. With them, startup is clean and deterministic.
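
depends_on governs startup ordering. It does not by itself help a worker whose connection drops later, for example when Postgres or Redis restarts. A small retry wrapper covers that case. This is a sketch under assumptions: retry_with_backoff is a hypothetical helper, and the attempt count and delays are illustrative defaults rather than values from this stack.

```python
import time


def retry_with_backoff(operation, attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call operation(); on a listed error, wait and retry with doubling delay."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow 0.5s, 1s, 2s, ... and are capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

Connection errors from the standard library derive from OSError. A database driver usually raises its own exception types, so pass those in through retry_on when wrapping driver calls.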

Scaling and replacement

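Use docker compose up -d --scale scraper-worker=10 to bump the worker count without editing the file; the flag overrides the replicas value set there. To roll out new worker code, use docker compose up -d --no-deps --build scraper-worker, which rebuilds and recreates only that service and leaves Postgres, Redis, and the proxy router running. Compose recreates the worker containers rather than performing a rolling update, so expect a brief pause in processing while jobs wait in the Redis queue.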

When to graduate to Kubernetes

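Compose manages containers on a single Docker host, so running this stack on several machines means operating an independent copy on each one. That stays workable up to about 3-5 hosts. Beyond that, you need something that handles multi-host scheduling, rolling updates, and failover, which is Kubernetes. See our Kubernetes sidecar pattern guide for the next step up.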

Hex Proxies plugs into this stack via the HEX_GATEWAY environment variable on the router service. Pricing starts at $2.08/IP for ISP proxies.