A Docker Compose Stack for Proxy-Based Scraping
For teams that need more than a script but less than Kubernetes, Docker Compose is the sweet spot. You get isolated services, health checks, declarative networking, and restart policies — all in a single YAML file you can version control and spin up with docker compose up.
This guide assembles a four-service scraping stack: Postgres for persistent storage, Redis for a work queue, a proxy router service that forwards requests to Hex Proxies, and a pool of scraper workers. Network isolation ensures the scraper workers cannot reach the internet except through the proxy router.
The full stack
# docker-compose.yml
name: scraper-stack

services:
  postgres:
    image: postgres:17-alpine
    restart: unless-stopped
    environment:
      POSTGRES_USER: scraper
      POSTGRES_PASSWORD_FILE: /run/secrets/pg_password
      POSTGRES_DB: scraping
    secrets: [pg_password]
    volumes:
      - pg_data:/var/lib/postgresql/data
      - ./sql/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U scraper -d scraping"]
      interval: 5s
      timeout: 5s
      retries: 5
    networks: [backend]
    deploy:
      resources:
        limits: { cpus: "1.0", memory: 1G }

  redis:
    image: redis:7.4-alpine
    restart: unless-stopped
    command: ["redis-server", "--save", "60", "1", "--appendonly", "yes"]
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks: [backend]

  proxy-router:
    build:
      context: ./proxy-router
      dockerfile: Dockerfile
    restart: unless-stopped
    environment:
      - HEX_USER_FILE=/run/secrets/hex_user
      - HEX_PASS_FILE=/run/secrets/hex_pass
      - HEX_GATEWAY=gate.hexproxies.com:7777
    secrets: [hex_user, hex_pass]
    ports:
      - "127.0.0.1:3128:3128"   # only bound to loopback on host
    healthcheck:
      # python:3.12-slim ships neither wget nor curl, so probe with the interpreter
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:3128/health', timeout=3)"]
      interval: 10s
      timeout: 5s
      retries: 3
    networks: [backend, egress]

  scraper-worker:
    build:
      context: ./worker
      dockerfile: Dockerfile
    restart: unless-stopped
    depends_on:
      postgres: { condition: service_healthy }
      redis: { condition: service_healthy }
      proxy-router: { condition: service_healthy }
    environment:
      - DATABASE_URL=postgresql://scraper@postgres:5432/scraping
      - DATABASE_PASSWORD_FILE=/run/secrets/pg_password
      - REDIS_URL=redis://redis:6379/0
      - HTTP_PROXY=http://proxy-router:3128
      - HTTPS_PROXY=http://proxy-router:3128
      - NO_PROXY=postgres,redis,proxy-router,localhost
    secrets: [pg_password]
    deploy:
      replicas: 4
      resources:
        limits: { cpus: "0.5", memory: 512M }
    networks: [backend]

volumes:
  pg_data:
  redis_data:

secrets:
  pg_password:
    file: ./secrets/pg_password.txt
  hex_user:
    file: ./secrets/hex_user.txt
  hex_pass:
    file: ./secrets/hex_pass.txt

networks:
  backend:
    driver: bridge
    internal: true   # no direct internet
  egress:
    driver: bridge   # only proxy-router uses this
The network design is the interesting part. backend is marked internal: true, so Docker creates it without external routing: containers attached only to this network cannot initiate outbound internet connections. proxy-router is the only service attached to both backend and egress, so it is the only path out. If a scraper worker gets compromised, it cannot phone home directly.
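The isolation is easy to verify from inside a worker container. The sketch below is one way to do that with the standard library; the probe target 1.1.1.1:443 is an arbitrary assumption (any public address works), and check_isolation is a hypothetical helper, not part of the stack described here.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks, DNS failures.
        return False


def check_isolation() -> dict[str, bool]:
    """Probe one direct internet target and the in-network proxy router."""
    return {
        # Expected False inside a worker: backend has no route to the outside.
        "direct_internet": can_connect("1.1.1.1", 443),
        # Expected True: the router is reachable by service name on backend.
        "proxy_router": can_connect("proxy-router", 3128),
    }
```

Calling check_isolation() in a Python shell opened with docker compose exec scraper-worker python should report the direct probe as False and the router probe as True if the networks are wired as intended.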
The secrets mechanism mounts credential files under /run/secrets/ inside each service. This is cleaner than environment variables because docker inspect does not reveal secrets and child processes don't inherit them by accident.
Proxy router Dockerfile
The Dockerfile is a multi-stage build that runs as a non-root user. The router itself is a tiny aiohttp application (about 80 lines of Python, not shown here) that accepts HTTP CONNECT requests and forwards them to the upstream Hex gateway. You can drop in any forwarding proxy instead (tinyproxy, HAProxy), as long as it can read credentials from /run/secrets/.
# proxy-router/Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/install -r requirements.txt
FROM python:3.12-slim
RUN useradd --system --no-create-home router
WORKDIR /app
COPY --from=builder /install /usr/local/lib/python3.12/site-packages
COPY router.py .
USER router
EXPOSE 3128
CMD ["python", "router.py"]
# requirements.txt
# aiohttp==3.10.10
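Since router.py is not shown, here is a rough sketch of the one piece every such router needs: the authenticated CONNECT request it sends upstream. It assumes the gateway accepts HTTP Basic proxy authentication, which this guide does not confirm, and both function names are hypothetical.

```python
import base64


def proxy_authorization(user: str, password: str) -> str:
    """Build a Proxy-Authorization header value using the Basic scheme."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def connect_request(target_host: str, target_port: int, user: str, password: str) -> bytes:
    """Serialize the CONNECT request a router could send to the upstream gateway."""
    authority = f"{target_host}:{target_port}"
    lines = [
        f"CONNECT {authority} HTTP/1.1",
        f"Host: {authority}",
        f"Proxy-Authorization: {proxy_authorization(user, password)}",
        "",  # blank line terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

The router would write these bytes to a socket opened to HEX_GATEWAY, wait for a 200 response, and then relay bytes in both directions between the worker and the gateway.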
Worker Dockerfile
Running dumb-init as PID 1 is important. Without it, Python itself runs as PID 1, and the kernel does not apply default signal actions to PID 1: unless worker.py installs its own SIGTERM handler, the signal is ignored and docker compose stop waits out the default 10-second timeout before killing the container. dumb-init forwards signals and reaps zombie processes. It's 20KB and it fixes a class of subtle bugs.
# worker/Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/install -r requirements.txt
FROM python:3.12-slim
RUN useradd --system --no-create-home --uid 10001 worker \
    && apt-get update && apt-get install -y --no-install-recommends dumb-init \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=builder /install /usr/local/lib/python3.12/site-packages
COPY worker.py .
USER worker
ENTRYPOINT ["dumb-init", "--"]
CMD ["python", "-u", "worker.py"]
# requirements.txt
# httpx[http2]==0.27.2
# redis==5.1.1
# psycopg[binary]==3.2.3
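dumb-init delivers SIGTERM to the worker process; what happens next is up to worker.py. The sketch below shows one way to finish the in-flight job and exit cleanly instead of dying mid-request. process_next_job is a hypothetical callable standing in for whatever pulls a job from Redis, and none of these names come from the actual worker.

```python
import signal
import threading

stop_requested = threading.Event()


def _handle_stop(signum, frame):
    """Signal handler: ask the main loop to finish its current job and exit."""
    stop_requested.set()


def install_signal_handlers() -> None:
    # Must run in the main thread; signal.signal() raises ValueError elsewhere.
    signal.signal(signal.SIGTERM, _handle_stop)
    signal.signal(signal.SIGINT, _handle_stop)


def run(process_next_job, idle_wait: float = 1.0) -> int:
    """Process jobs until a stop signal arrives; return the number handled.

    process_next_job() handles one job and returns False when the queue is empty.
    """
    handled = 0
    while not stop_requested.is_set():
        if process_next_job():
            handled += 1
        else:
            # Event.wait doubles as a sleep that ends early on a stop request.
            stop_requested.wait(idle_wait)
    return handled
```

Because the loop only checks the flag between jobs, a single job must finish within the stop timeout, or the container is still killed.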
Schema bootstrap
The SQL file mounted at /docker-entrypoint-initdb.d/ runs only when the Postgres container initializes an empty data directory, which in practice means the first boot with a fresh pg_data volume. After that, migrations should be managed by your application (Alembic, Flyway, etc.).
-- sql/init.sql
CREATE TABLE IF NOT EXISTS fetched_pages (
    id          BIGSERIAL PRIMARY KEY,
    url         TEXT NOT NULL,
    status_code INT NOT NULL,
    body_len    INT NOT NULL,
    fetched_at  TIMESTAMPTZ DEFAULT now(),
    UNIQUE (url, fetched_at)
);

CREATE INDEX IF NOT EXISTS idx_fetched_pages_url
    ON fetched_pages (url);
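For orientation, this is roughly how a worker could write into that table with psycopg 3, the driver pinned in the worker's requirements. It is a sketch, not the actual worker code: page_row and save_page are hypothetical names, conninfo is assumed to come from DATABASE_URL, and password from the mounted secret file.

```python
INSERT_PAGE = """
    INSERT INTO fetched_pages (url, status_code, body_len)
    VALUES (%s, %s, %s)
"""


def page_row(url: str, status_code: int, body: bytes) -> tuple[str, int, int]:
    """Map one fetch result onto the columns of fetched_pages."""
    return (url, status_code, len(body))


def save_page(conninfo: str, password: str, url: str, status_code: int, body: bytes) -> None:
    """Insert one row; fetched_at is filled in by the column default."""
    import psycopg  # imported lazily so page_row works without the driver installed

    # The connection context manager commits on success and rolls back on error.
    with psycopg.connect(conninfo, password=password) as conn:
        conn.execute(INSERT_PAGE, page_row(url, status_code, body))
```

Opening a connection per row keeps the sketch short; a long-running worker would more likely hold one connection open and reuse it.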
Health checks and dependency ordering
The condition: service_healthy in the worker's depends_on is the other trick. It tells Compose not to start the worker until Postgres, Redis, and the proxy router all report healthy. Without it, workers start before Postgres is ready, crash on the first connection attempt, and restart in a loop until Postgres catches up. With it, startup is clean and deterministic.
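The depends_on conditions only govern startup order. If Postgres or Redis restarts later, running workers still see connection errors, so it can help to wrap connection attempts in a retry. The helper below is a generic sketch with exponential backoff; its name, defaults, and the choice of OSError as the default retryable error are assumptions. With psycopg or redis-py you would pass their own exception classes (for example psycopg.OperationalError) through retry_on.

```python
import time


def retry(operation, attempts: int = 5, base_delay: float = 0.5,
          max_delay: float = 10.0, retry_on: tuple = (OSError,), sleep=time.sleep):
    """Call operation() until it succeeds, backing off exponentially between tries.

    Re-raises the last exception once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            # Delays grow as base_delay * 1, 2, 4, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

The sleep parameter exists so the delay can be replaced in tests; in the worker the default time.sleep is fine.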
Scaling and replacement
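Use docker compose up --scale scraper-worker=10 to bump worker count without editing the file; the flag takes precedence over the replica count in the file for that invocation. To roll out a new worker build without touching the other services, use docker compose up -d --no-deps --build scraper-worker, which rebuilds the image and recreates only that service's containers. Compose does not do rolling updates, so processing pauses briefly while the workers are recreated; jobs still waiting in the Redis queue are picked up once the new containers start.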
When to graduate to Kubernetes
Compose is the right tool up to about 3-5 hosts, each running its own copy of the stack, since Compose itself manages one Docker host at a time. Beyond that, you need something that handles multi-host scheduling, rolling updates, and failover, which is Kubernetes. See our Kubernetes sidecar pattern guide for the next step up.
Hex Proxies plugs into this stack via the HEX_GATEWAY environment variable on the router service. Pricing starts at $2.08/IP for ISP proxies.