v1.10.90-0e025b8
Skip to main content
TutorialNode.jsCode

Scraping With the Node.js cluster Module and a Shared Proxy Pool

12 min read

By Hex Proxies Engineering Team

Scraping With the Node.js cluster Module and a Shared Proxy Pool

Node is single-threaded by default. For a CPU-bound workload you need more than one process, and the cluster module is still the cleanest way to get them. This guide builds a clustered scraper where a primary process spawns N workers, each owning a disjoint slice of the proxy session pool, with graceful shutdown, automatic worker respawn, and periodic stats reporting.

The other option is Worker Threads. Use those when you need shared memory (SharedArrayBuffer), parallel work inside a single request, or tighter coupling between workers. For scraping, which is embarrassingly parallel and process-isolated by nature, cluster is simpler and more fault-tolerant.

package.json

// package.json
{
  "name": "scraper-cluster",
  "type": "module",
  "engines": { "node": ">=22.0.0" },
  "dependencies": {
    "undici": "6.20.1",
    "p-queue": "8.0.1",
    "pino": "9.5.0"
  }
}

undici is the modern HTTP client bundled with Node 22 and is ~2x faster than the built-in http.request for the proxy use case. It also has a proper ProxyAgent that supports keep-alive against upstream proxies — something the legacy http module handles badly.

Primary process

The primary's only jobs are: spawn N workers, give each worker its slice of the session pool, listen for exit events and respawn dead workers, forward shutdown signals. No scraping logic, no HTTP calls. This separation is why cluster is easier to reason about than threads — primary death is catastrophic but rare; worker death is routine and the primary handles it.

// src/primary.mjs
import cluster from "node:cluster";
import os from "node:os";
import { fileURLToPath } from "node:url";
import path from "node:path";
import pino from "pino";

const log = pino({ name: "primary" });
const WORKER_COUNT = Math.min(Number(process.env.WORKERS) || os.cpus().length, 16);
const SHUTDOWN_TIMEOUT_MS = 10_000;

// Shared proxy assignment: each worker owns a disjoint slice of sessions.
const SESSION_POOL_SIZE = 40;

cluster.setupPrimary({
  exec: fileURLToPath(new URL("./worker.mjs", import.meta.url)),
});

const workers = new Map();

function spawnWorker(slot) {
  const perWorker = SESSION_POOL_SIZE / WORKER_COUNT;
  const start = Math.floor(slot * perWorker);
  const end = Math.floor((slot + 1) * perWorker);
  const env = {
    ...process.env,
    WORKER_SLOT: String(slot),
    SESSION_RANGE_START: String(start),
    SESSION_RANGE_END: String(end),
  };
  const w = cluster.fork(env);
  workers.set(w.id, { worker: w, slot });

  w.on("message", (msg) => {
    if (msg?.type === "stats") {
      log.info({ slot, ...msg.payload }, "worker stats");
    }
  });

  w.on("exit", (code, signal) => {
    log.warn({ slot, code, signal }, "worker exited");
    workers.delete(w.id);
    if (!shuttingDown) {
      log.info({ slot }, "respawning worker");
      spawnWorker(slot);
    }
  });
}

for (let i = 0; i < WORKER_COUNT; i++) spawnWorker(i);

let shuttingDown = false;
function shutdown(signal) {
  if (shuttingDown) return;
  shuttingDown = true;
  log.info({ signal }, "primary shutting down");

  for (const { worker } of workers.values()) {
    worker.send({ type: "shutdown" });
  }

  const deadline = Date.now() + SHUTDOWN_TIMEOUT_MS;
  const interval = setInterval(() => {
    if (workers.size === 0) {
      clearInterval(interval);
      process.exit(0);
    }
    if (Date.now() > deadline) {
      log.warn("force killing remaining workers");
      for (const { worker } of workers.values()) worker.kill("SIGKILL");
      clearInterval(interval);
      process.exit(1);
    }
  }, 250);
}

process.on("SIGTERM", () => shutdown("SIGTERM"));
process.on("SIGINT", () => shutdown("SIGINT"));

Worker exits with code 0 are treated the same as crashes from the respawn perspective. If you want drained-queue workers to stop cleanly, set a flag in the worker exit message and check it in the exit handler.

Worker process

Each worker runs its own p-queue with bounded concurrency (25 in-flight requests per worker). One undici ProxyAgent per session ID, cached in a Map — this gives you keep-alive connection reuse for each sticky session, which in turn gives you 5-10x throughput on repeated fetches vs fresh connections every time.

// src/worker.mjs
import { ProxyAgent, Agent, request } from "undici";
import PQueue from "p-queue";
import pino from "pino";

const log = pino({ name: `worker-${process.env.WORKER_SLOT}` });

const GATEWAY = "http://gate.hexproxies.com:7777";
const USER = process.env.HEX_USER;
const PASS = process.env.HEX_PASS;
const SLOT_START = Number(process.env.SESSION_RANGE_START);
const SLOT_END = Number(process.env.SESSION_RANGE_END);

// One ProxyAgent per session id, each with its own keep-alive pool.
const agents = new Map();
function agentFor(sessionId) {
  let agent = agents.get(sessionId);
  if (agent) return agent;
  const user = `${USER}-session-${sessionId}`;
  const auth = Buffer.from(`${user}:${PASS}`).toString("base64");
  agent = new ProxyAgent({
    uri: GATEWAY,
    token: `Basic ${auth}`,
    connections: 10,
    pipelining: 1,
    keepAliveTimeout: 60_000,
    keepAliveMaxTimeout: 600_000,
  });
  agents.set(sessionId, agent);
  return agent;
}

// Round-robin through the worker's slice of sessions.
let rr = SLOT_START;
function nextSession() {
  const id = `s${String(rr).padStart(3, "0")}`;
  rr = rr + 1;
  if (rr >= SLOT_END) rr = SLOT_START;
  return id;
}

const queue = new PQueue({ concurrency: 25 });
let successCount = 0;
let failureCount = 0;

async function fetchUrl(url) {
  const session = nextSession();
  const dispatcher = agentFor(session);
  const started = Date.now();
  try {
    const res = await request(url, {
      dispatcher,
      method: "GET",
      headersTimeout: 10_000,
      bodyTimeout: 20_000,
      maxRedirections: 3,
    });
    // Drain body to free the connection for keep-alive.
    await res.body.text();
    if (res.statusCode >= 400) throw new Error(`status ${res.statusCode}`);
    successCount++;
    log.debug({ url, session, ms: Date.now() - started }, "ok");
  } catch (err) {
    failureCount++;
    log.warn({ url, session, err: err.message }, "failed");
  }
}

// Simulate an incoming job stream. In a real system, pull from Redis/Kafka.
const urls = Array.from({ length: 1000 }, (_, i) => `https://httpbin.org/ip?n=${i}`);
for (const url of urls) queue.add(() => fetchUrl(url));

// Periodic stats reporting to primary.
const statsTimer = setInterval(() => {
  process.send?.({
    type: "stats",
    payload: { success: successCount, failure: failureCount, queued: queue.size },
  });
}, 5_000);

let shutdownRequested = false;
process.on("message", (msg) => {
  if (msg?.type === "shutdown") {
    shutdownRequested = true;
    log.info("draining queue before exit");
    queue.onIdle().then(async () => {
      clearInterval(statsTimer);
      for (const agent of agents.values()) await agent.close();
      process.exit(0);
    });
  }
});

// Safety: if the queue is empty and no shutdown came, still exit eventually.
queue.onIdle().then(() => {
  if (!shutdownRequested) {
    log.info("queue drained, exiting");
    clearInterval(statsTimer);
    process.exit(0);
  }
});

Memory leak prevention

The two sneaky leak sources in clustered scrapers are (1) undrained response bodies and (2) unbounded agent caches. Draining with await res.body.text() is essential — without it, undici keeps the response object alive waiting for you to read, and after a few thousand requests you'll see the worker RSS climb to 2GB. The agent Map in this worker is bounded by SLOT_END - SLOT_START (typically 5-20 entries) so it will never leak.

For extra safety in long-running workers, periodically call agent.close() on idle agents. The stats timer is a good place to hook that.

Graceful shutdown

The shutdown protocol is: primary receives SIGTERM, sends {type: "shutdown"} to each worker via IPC, workers stop accepting new work but finish the in-flight queue, then exit. Primary gives them 10 seconds; after that, SIGKILL. This is the standard pattern and it works with Kubernetes, systemd, and Docker.

When cluster is the wrong choice

If your scraping workload is a single long request per URL (e.g. Playwright rendering with 3-5s pages), cluster is overkill and you should just use Promise.all with a semaphore. Cluster earns its keep when you have high request volume, need process isolation for stability, or want to use every CPU core on a multi-core box.

For more on architectures, see our distributed scraping pipeline guide. Hex Proxies sticky sessions work natively with the username-suffix pattern in the worker code — view plans.