A Typesafe GraphQL Scraper API With DataLoader and Apollo Server
REST scraper APIs always end up the same: one endpoint per shape of data, clients over-fetch or under-fetch, and the server grows a rat's nest of ad-hoc query parameters. GraphQL gives you a typed schema and lets clients ask for exactly what they need — which is ideal for scraping, where the underlying fetch cost is high and you don't want to do redundant work.
This guide builds a scraper API with Apollo Server 4, TypeScript, graphql-codegen for typesafe resolvers, and DataLoader for request-scoped batching and deduplication. A single GraphQL query fetches multiple URLs in parallel through Hex Proxies with automatic deduplication of repeated URLs.
Dependencies
# package.json deps
{
  "dependencies": {
    "@apollo/server": "4.11.0",
    "graphql": "16.9.0",
    "graphql-tag": "2.12.6",
    "dataloader": "2.2.2",
    "undici": "6.20.1",
    "zod": "3.23.8"
  },
  "devDependencies": {
    "@graphql-codegen/cli": "5.0.3",
    "@graphql-codegen/typescript": "4.1.0",
    "@graphql-codegen/typescript-resolvers": "4.4.0",
    "typescript": "5.6.3"
  }
}
graphql-codegen turns your schema into TypeScript types, which means resolvers are checked against the schema at compile time. This eliminates the class of bugs where you rename a field in the schema and forget to update the resolver — TypeScript will fail the build instead of the server returning nulls at runtime.
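To make the compile-time guarantee concrete, here is a hand-written miniature of the kind of types the typescript-resolvers plugin produces. The names MiniPage, MiniContext and MiniQueryResolvers are illustrative stand-ins, not the plugin's real output, which is considerably more elaborate.

```typescript
// Hand-written stand-ins for generated types (illustrative only).
interface MiniPage {
  url: string;
  statusCode: number;
  title: string | null;
}

interface MiniContext {
  load: (url: string) => Promise<MiniPage>;
}

// Each resolver's argument and return types mirror the schema definition.
interface MiniQueryResolvers {
  page: (
    parent: unknown,
    args: { url: string },
    ctx: MiniContext,
  ) => Promise<MiniPage | null>;
}

const queryResolvers: MiniQueryResolvers = {
  // If the schema renamed the `url` argument to `href`, the regenerated
  // `args` type would be { href: string } and this destructuring would
  // no longer compile.
  page: async (_parent, { url }, ctx) => ctx.load(url),
};
```

The resolver body never annotates its parameters; they are inferred from the resolver map's type, which is why a schema change surfaces as a build error rather than a runtime null.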
Schema
Two queries: pages for batch fetches and page for single. Both go through the same DataLoader, so calling pages(urls: ["a","b","a"]) only fetches a once.
# schema.graphql
type Query {
  """Fetch and parse multiple URLs in a single call."""
  pages(urls: [String!]!): [Page!]!

  """Fetch a single URL (usually called via batch internally)."""
  page(url: String!): Page
}

type Page {
  url: String!
  statusCode: Int!
  title: String
  fetchedAt: String!
  proxyLabel: String!
  bodyLength: Int!
}
Code generation
Run graphql-codegen every time you touch the schema. Wire it into your build with a prebuild script so it runs automatically. The generated Resolvers type is what makes the whole thing typesafe.
// codegen.ts
import type { CodegenConfig } from "@graphql-codegen/cli";

const config: CodegenConfig = {
  schema: "./schema.graphql",
  generates: {
    "./src/generated/types.ts": {
      plugins: ["typescript", "typescript-resolvers"],
      config: {
        useIndexSignature: true,
        contextType: "../context#GraphQLContext",
      },
    },
  },
};

export default config;
// Run: npx graphql-codegen --config codegen.ts
Context and DataLoader
The context is rebuilt per request, which means the DataLoader cache is per-request too. That is the right scope: within one GraphQL query, pages(urls: [...]) deduplicates identical URLs; between queries, each gets a fresh cache. If you want cross-request caching, wire in Redis explicitly — don't abuse DataLoader for it.
// src/context.ts
import DataLoader from "dataloader";
import { ProxyAgent, request } from "undici";
import { z } from "zod";

const PROXIES = [
  { url: "http://USER:PASS@gate.hexproxies.com:7777", label: "hex-us" },
  { url: "http://USER:PASS@gate-eu.hexproxies.com:7777", label: "hex-eu" },
];

// One ProxyAgent per gateway, created once and shared by all requests.
const agents = PROXIES.map((p) => ({
  label: p.label,
  dispatcher: new ProxyAgent({ uri: p.url, connections: 25 }),
}));

// Round-robin over the configured gateways.
let rr = 0;
function pickAgent() {
  const a = agents[rr % agents.length];
  rr++;
  return a;
}

const PageSchema = z.object({
  url: z.string(),
  statusCode: z.number(),
  title: z.string().nullable(),
  fetchedAt: z.string(),
  proxyLabel: z.string(),
  bodyLength: z.number(),
});

export type Page = z.infer<typeof PageSchema>;

async function fetchOne(url: string): Promise<Page> {
  const agent = pickAgent();
  const res = await request(url, {
    dispatcher: agent.dispatcher,
    method: "GET",
    headersTimeout: 8_000,
    bodyTimeout: 15_000,
    maxRedirections: 3,
  });
  const body = await res.body.text();
  const titleMatch = body.match(/<title[^>]*>([^<]*)<\/title>/i);
  return PageSchema.parse({
    url,
    statusCode: res.statusCode,
    title: titleMatch?.[1]?.trim() ?? null,
    fetchedAt: new Date().toISOString(),
    proxyLabel: agent.label,
    bodyLength: body.length,
  });
}

export interface GraphQLContext {
  pageLoader: DataLoader<string, Page>;
}

export function buildContext(): GraphQLContext {
  return {
    pageLoader: new DataLoader<string, Page>(
      async (urls) => {
        // Batch: start every fetch in this batch in parallel. Concurrency is
        // NOT bounded here; see "Rate limiting the batch" for a limiter.
        const results = await Promise.allSettled(
          urls.map((u) => fetchOne(u)),
        );
        // DataLoader expects one value or Error per key, in key order.
        return results.map((r, i) =>
          r.status === "fulfilled"
            ? r.value
            : new Error(`fetch failed for ${urls[i]}: ${r.reason}`),
        );
      },
      {
        // Cache only within a single request (new context per request).
        cache: true,
        maxBatchSize: 20,
      },
    ),
  };
}
Zod validates the parsed page shape at the boundary. This is belt-and-braces since the TypeScript types already guarantee the shape in-process, but Zod gives you a runtime check that catches subtle bugs (e.g. regex match returning undefined when you expected a string).
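The title handling is the part of fetchOne most likely to surprise you, so it helps to look at it in isolation. The sketch below pulls the same regex into a standalone function; extractTitle is a hypothetical helper name, not something the code above defines.

```typescript
// Hypothetical helper: the same regex fetchOne uses, isolated for testing.
function extractTitle(html: string): string | null {
  const match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  // `match` is null when the page has no <title>. Optional chaining turns
  // that into undefined, and ?? converts it to the null the schema expects.
  return match?.[1]?.trim() ?? null;
}

console.log(extractTitle("<html><title> Example Domain </title></html>")); // "Example Domain"
console.log(extractTitle("<html><body>no title here</body></html>")); // null
```

Note that an empty `<title></title>` yields an empty string rather than null, which still passes the z.string().nullable() check.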
Resolvers and server
The resolvers are a thin layer over the DataLoader. All the actual work — batching, deduplication, concurrency limits — happens in the loader. Resolvers stay trivial, which is exactly what you want in a GraphQL service.
// src/resolvers.ts
import type { Resolvers } from "./generated/types";

export const resolvers: Resolvers = {
  Query: {
    pages: async (_parent, { urls }, ctx) => {
      // Deduplicate at the DataLoader layer: identical URLs in one
      // request reuse the same in-flight fetch.
      // Promise.all rejects if any single load fails, which fails the
      // whole field; that matches the non-null [Page!]! return type.
      const results = await Promise.all(urls.map((u) => ctx.pageLoader.load(u)));
      return results;
    },
    page: async (_parent, { url }, ctx) => {
      // The schema allows null here, so a failed fetch becomes null
      // instead of a GraphQL error.
      try {
        return await ctx.pageLoader.load(url);
      } catch {
        return null;
      }
    },
  },
};
// src/server.ts
import { ApolloServer } from "@apollo/server";
import { startStandaloneServer } from "@apollo/server/standalone";
import { readFileSync } from "node:fs";
import { buildContext, type GraphQLContext } from "./context";
import { resolvers } from "./resolvers";

const typeDefs = readFileSync("./schema.graphql", "utf8");

const server = new ApolloServer<GraphQLContext>({
  typeDefs,
  resolvers,
  introspection: process.env.NODE_ENV !== "production",
});

const { url } = await startStandaloneServer(server, {
  listen: { port: Number(process.env.PORT) || 4000 },
  context: async () => buildContext(),
});

console.log(`graphql ready at ${url}`);
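With the server running, a client can exercise the batch query over plain HTTP. The following is a minimal sketch that assumes Node 18+ (for the built-in fetch) and the default endpoint http://localhost:4000/; buildPagesRequest and fetchPages are hypothetical helper names.

```typescript
// Builds the JSON body for a `pages` query. Pure, so it is easy to unit-test.
function buildPagesRequest(urls: string[]): {
  query: string;
  variables: { urls: string[] };
} {
  const query = `
    query Pages($urls: [String!]!) {
      pages(urls: $urls) { url statusCode title proxyLabel }
    }
  `;
  return { query, variables: { urls } };
}

// Sends the query with the fetch API that ships with Node 18+.
async function fetchPages(endpoint: string, urls: string[]): Promise<unknown> {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(buildPagesRequest(urls)),
  });
  return res.json();
}
```

Calling fetchPages("http://localhost:4000/", ["https://example.com", "https://example.org", "https://example.com"]) should return three entries under data.pages while the server fetches the repeated URL only once.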
Why DataLoader matters
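DataLoader is best known as the fix for the N+1 problem. Without it, every resolver that needs a page would call fetchOne on its own, and a query that touches the same URL from several fields or aliases would fetch it several times. With it, all load() calls made in the same tick of the event loop are collected and passed to the batch function as a single array. For a database backend, that array usually becomes one round-trip. Here each URL still needs its own HTTP request, so the gain is different: the batch function is the one place where fetches are started in parallel, deduplicated, and, once you add a limiter, throttled. For GraphQL APIs fronting expensive backends, whether databases or scraped pages, that central choke point is hard to do without.
The other benefit is request-scoped memoization: load("https://example.com") called twice in the same request returns the same cached promise, so the second call costs nothing even if the first fetch is still in flight. On a GraphQL query that resolves many fields across many types, this prevents the same URL from being fetched multiple times during a single query execution.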
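The two mechanisms, same-tick batching and per-key memoization, fit in a few dozen lines. The MiniLoader class below is a deliberately simplified stand-in written for illustration; it is not the DataLoader library's implementation and omits options such as maxBatchSize.

```typescript
type BatchFn<K, V> = (keys: readonly K[]) => Promise<(V | Error)[]>;

interface Pending<K, V> {
  key: K;
  resolve: (value: V) => void;
  reject: (error: Error) => void;
}

class MiniLoader<K, V> {
  private batchFn: BatchFn<K, V>;
  private cache = new Map<K, Promise<V>>();
  private queue: Pending<K, V>[] = [];

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  load(key: K): Promise<V> {
    const cached = this.cache.get(key);
    if (cached) return cached; // memoization: same key, same promise

    const promise = new Promise<V>((resolve, reject) => {
      // The first load of a tick schedules a single flush for that tick.
      if (this.queue.length === 0) {
        Promise.resolve().then(() => this.flush());
      }
      this.queue.push({ key, resolve, reject });
    });
    this.cache.set(key, promise);
    return promise;
  }

  private async flush(): Promise<void> {
    const pending = this.queue;
    this.queue = [];
    try {
      const results = await this.batchFn(pending.map((p) => p.key));
      pending.forEach((p, i) => {
        const r = results[i];
        if (r instanceof Error) p.reject(r);
        else p.resolve(r as V);
      });
    } catch (err) {
      // A failing batch function rejects every load in the batch.
      const error = err instanceof Error ? err : new Error(String(err));
      pending.forEach((p) => p.reject(error));
    }
  }
}

// Three loads, two distinct keys: the batch function runs once with ["a", "bb"].
async function demo(): Promise<{ values: number[]; batches: string[][] }> {
  const batches: string[][] = [];
  const loader = new MiniLoader<string, number>(async (keys) => {
    batches.push([...keys]);
    return keys.map((k) => k.length);
  });
  const values = await Promise.all(["a", "bb", "a"].map((k) => loader.load(k)));
  return { values, batches };
}
```

Because the cache stores promises rather than resolved values, a second load of the same key issued before the first fetch finishes still joins the in-flight request.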
Rate limiting the batch
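DataLoader batches but does not rate-limit. If a client sends pages(urls: [...200 urls]), maxBatchSize: 20 splits the keys into ten batches, but those batches are dispatched without waiting for one another, so roughly 200 fetches start at once and can saturate the ProxyAgent connection pools. Fix this with a concurrency limiter around fetchOne, for example a p-limit pool of 25 (p-limit is not in the dependency list above, so add it if you go that route). Create the limiter at module scope rather than inside the batch function: a per-batch limit of 25 would never engage when no batch holds more than 20 keys. For aggressive rate limiting, see our rate limiting strategies guide.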
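If you would rather not add a dependency, the limiter is small enough to write by hand. The sketch below is one possible shape, not the p-limit API; createLimiter and fetchAllLimited are hypothetical names, and the cap of 25 simply mirrors the connections setting used for each ProxyAgent.

```typescript
// Returns a function that runs async tasks with at most `max` in flight.
function createLimiter(max: number) {
  let active = 0;
  const waiting: (() => void)[] = [];

  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= max) {
      // Park until a finishing task hands over its slot.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) next(); // hand the slot over; `active` stays unchanged
      else active--;
    }
  };
}

// Module scope, so the cap holds across batches and across requests.
const limit = createLimiter(25);

// Drop-in replacement for the Promise.allSettled call in the batch function.
// `fetchOne` is passed in so the sketch stays self-contained.
async function fetchAllLimited<T>(
  urls: readonly string[],
  fetchOne: (url: string) => Promise<T>,
): Promise<PromiseSettledResult<T>[]> {
  return Promise.allSettled(urls.map((u) => limit(() => fetchOne(u))));
}
```

Results still come back in key order because Promise.allSettled preserves the order of its input, so the mapping back to DataLoader keys is unchanged.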
Production notes
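Turn off introspection in production (the server above already does this when NODE_ENV is production). Add query complexity analysis (graphql-query-complexity) to reject requests that would fetch thousands of URLs. Run behind an authentication layer. startStandaloneServer does not accept middleware, so either verify the token inside the context function, which receives the incoming request, or switch to expressMiddleware from @apollo/server/express4 and mount your usual JWT handler in front of it. Log resolver errors to Sentry or equivalent; Apollo Server omits stack traces from error responses when NODE_ENV is production, so capture them server-side, for example in formatError or a plugin.
The Hex Proxies gateway slots into the ProxyAgent constructor as a plain proxy URL, with no other code changes. For the broader architecture, see our distributed scraping pipeline guide and pricing.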
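A cheap first line of defence against oversized requests is a hard cap on the urls argument, checked at the top of the pages resolver. This is a sketch under assumptions: MAX_URLS_PER_QUERY, UrlBudgetError and assertUrlBudget are hypothetical names, and the cap of 50 is arbitrary. It does not replace complexity analysis, since a client can alias the pages field several times in one query.

```typescript
// Hypothetical cap; tune it to your proxy capacity.
const MAX_URLS_PER_QUERY = 50;

class UrlBudgetError extends Error {
  constructor(requested: number, max: number) {
    super(`requested ${requested} distinct URLs, limit is ${max}`);
    this.name = "UrlBudgetError";
  }
}

// Throws when a single `pages` call asks for more distinct URLs than allowed.
function assertUrlBudget(
  urls: readonly string[],
  max: number = MAX_URLS_PER_QUERY,
): void {
  // Count distinct URLs: the loader deduplicates repeats anyway.
  const distinct = new Set(urls).size;
  if (distinct > max) throw new UrlBudgetError(distinct, max);
}
```

In the pages resolver this would be a single call, assertUrlBudget(urls), placed before the loads are issued, so an over-budget request fails without starting any fetch.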