
Curating Quality LLM Training Data from the Web: Deduplication, Filtering, and Licensing

13 min read

By Hex Proxies Engineering Team


Pretraining corpora are the foundation of every modern LLM. C4, The Pile, RedPajama, FineWeb, and Dolma are all derived from web crawls, and each one took months of filtering work to reach a usable state. The raw crawl is not the dataset: once you strip near-duplicates, low-quality pages, and content without compatible licensing, the training set is maybe 5-15% of what you crawled.

This post is about what happens between a raw crawl and a training-ready token stream: dedupe algorithms, quality classifiers, license handling, and where distributed collection with proxies fits in. It is not a guide to training models on copyrighted material without permission. Use published corpora and properly licensed data.

Published Corpora You Should Start With

Before crawling anything, check whether an existing corpus covers your need:

  • CommonCrawl: ~250 billion pages since 2008, free, WARC format. The raw ingredient for almost every open dataset.
  • C4 (Raffel et al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, arXiv:1910.10683): filtered CommonCrawl snapshot used to train T5. ~750GB English.
  • The Pile (Gao et al., arXiv:2101.00027): 825GB across 22 domains including PubMed, ArXiv, GitHub, and books.
  • RedPajama-v2: 30 trillion tokens across 84 CommonCrawl snapshots, with quality signals precomputed.
  • FineWeb and FineWeb-Edu (Penedo et al., 2024): 15T tokens filtered with educational-content classifiers; notable for strong ablation results.
  • Dolma (AI2): 3T tokens, permissively licensed, documented provenance.

If an existing corpus covers 90% of your need, use it: you save months of work and inherit dedupe and quality filtering that has already been validated. The remaining 10% is where custom collection makes sense: very recent content, underrepresented languages, and niche domains.

Licensing and Provenance

Treat licensing as a first-class property of every document. Minimum metadata per record (a schema sketch follows the list):

  • Source URL and HTTP response headers at fetch time
  • robots.txt policy at fetch time
  • Detected license (Creative Commons, public domain, proprietary)
  • Timestamp and crawler identity
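
A minimal provenance record might look like the sketch below; the field names are illustrative, not a standard schema:

import json
from dataclasses import dataclass, asdict

@dataclass
class ProvenanceRecord:
    # Captured at fetch time so licensing decisions can be audited later.
    url: str
    fetched_at: str               # ISO 8601 timestamp
    crawler: str                  # crawler identity, e.g. "CorpusBot/1.0"
    response_headers: dict        # HTTP response headers as received
    robots_policy: str            # the robots.txt rules that applied at fetch time
    detected_license: str | None  # e.g. "CC-BY-4.0", "public-domain", "proprietary"

record = ProvenanceRecord(
    url="https://example.com/docs/page",
    fetched_at="2025-01-15T12:00:00Z",
    crawler="CorpusBot/1.0",
    response_headers={"content-type": "text/html"},
    robots_policy="User-agent: *\nAllow: /docs/",
    detected_license="CC-BY-4.0",
)
print(json.dumps(asdict(record), indent=2))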

The C2PA content credentials spec and the emerging ai.txt / llms.txt conventions give site owners a way to declare training-data preferences. Respecting these is the difference between a dataset you can publish and one you cannot.

Deduplication: Exact, Near, and Semantic

Dedupe is the single most valuable filter. Lee et al., Deduplicating Training Data Makes Language Models Better (arXiv:2107.06499), showed up to 10% perplexity improvements and substantial reductions in memorization from aggressive dedupe. Three levels:

  1. Exact dedupe: SHA-256 of normalized text. Catches perfect copies; misses everything else.
  2. Near-dedupe via MinHash/LSH: shingle the document into n-grams (typically 5-grams), compute MinHash signatures, bucket with locality-sensitive hashing. Jaccard threshold around 0.8 is standard. This is what GPT-3, The Pile, and FineWeb used.
  3. Semantic dedupe: embed documents, cluster by cosine similarity. Catches paraphrases. SemDeDup (Abbas et al., arXiv:2303.09540) showed you can remove 50% of LAION and improve model quality.

For 1TB+ corpora, MinHash+LSH in Spark or Ray is the practical choice. datasketch and the Dolma toolkit both provide production implementations.
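
A single-machine sketch of the MinHash/LSH step using datasketch; the 5-gram shingling and 0.8 Jaccard threshold match the settings above, and corpus is assumed to be an iterable of (id, text) pairs:

from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def shingles(text, n=5):
    # Word-level 5-grams, the shingle unit used by most web-corpus pipelines.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Threshold 0.8: documents with Jaccard similarity >= ~0.8 land in the same bucket.
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
kept = []
for doc_id, text in corpus:          # corpus: iterable of (id, text) pairs
    sh = shingles(text)
    if not sh:                       # too short to shingle; route to a separate check
        continue
    m = MinHash(num_perm=NUM_PERM)
    for s in sh:
        m.update(s.encode("utf-8"))
    if lsh.query(m):                 # an earlier near-duplicate already exists
        continue
    lsh.insert(doc_id, m)
    kept.append(doc_id)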

Quality Filtering

Not all text is worth training on. Typical filters, applied in roughly this order (the rule-based step is sketched after the list):

  • Language ID: fastText lid.176 or CLD3. Drop documents below confidence 0.65.
  • Rule-based heuristics (Gopher filters, Rae et al., arXiv:2112.11446): mean word length 3-10, ratio of alphabetic characters > 0.8, fewer than 10% of lines ending with an ellipsis, symbol-to-word ratio under 0.1.
  • Perplexity filters: score with a small KenLM or reference LM, drop top and bottom percentiles.
  • Classifier filters: train a fastText classifier on "good" vs "bad" seed documents. FineWeb-Edu used this pattern with GPT-4-labeled educational content as positives.
  • Toxicity and PII: Detoxify, Presidio for PII removal.
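
A dependency-free sketch of the rule-based step, using the thresholds listed above; the hash-and-ellipsis symbol set follows the Gopher paper's convention:

def passes_gopher_rules(text: str) -> bool:
    words = text.split()
    if not words:
        return False
    # Mean word length between 3 and 10 characters.
    if not 3 <= sum(len(w) for w in words) / len(words) <= 10:
        return False
    # Ratio of alphabetic characters above 0.8 (whitespace excluded).
    chars = "".join(text.split())
    if not chars or sum(c.isalpha() for c in chars) / len(chars) <= 0.8:
        return False
    # Fewer than 10% of non-empty lines end with an ellipsis.
    lines = [l for l in text.splitlines() if l.strip()]
    if lines and sum(l.rstrip().endswith(("...", "…")) for l in lines) / len(lines) >= 0.1:
        return False
    # Symbol-to-word ratio (hashes and ellipses) under 0.1.
    if (text.count("#") + text.count("...")) / len(words) >= 0.1:
        return False
    return True

docs = [d for d in docs if passes_gopher_rules(d)]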

Where Distributed Collection with Proxies Fits

Most training-data work is downloading existing crawls, not crawling from scratch. But there are legitimate reasons to run your own collection:

  • Building a domain-specific corpus from public technical documentation
  • Refreshing a regional news corpus on a weekly cadence
  • Collecting multilingual content where CommonCrawl coverage is thin
  • Capturing open-license community content (forums, wikis) at a point in time

In each case the crawl hits the same operational problems as any large-scale ingestion: rate limits, geographic variability, CDN tarpits. A proxy pool distributes the fetch load and lets you respect per-host concurrency limits without sacrificing aggregate throughput. A typical setup: at most one request per host from any single IP every 5 seconds, 500+ IPs in rotation, and polite back-off on 429 responses.

import asyncio, httpx
from urllib import robotparser

UA = "CorpusBot/1.0 (+contact@example.com)"

# One RobotFileParser per host, fetched once before crawling that host.
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

async def polite_fetch(url, proxy, rp):
    # Honor robots.txt before spending a request.
    if not rp.can_fetch("CorpusBot/1.0", url):
        return None
    # httpx >= 0.26 takes a single proxy URL via the proxy argument.
    async with httpx.AsyncClient(proxy=proxy, timeout=30,
            headers={"User-Agent": UA}) as c:
        r = await c.get(url)
        if r.status_code == 429:
            # Polite back-off: honor Retry-After, then let the scheduler
            # requeue the URL on a different IP.
            await asyncio.sleep(int(r.headers.get("Retry-After", "60")))
            return None
        return r.text if r.status_code == 200 else None

Contamination and Eval Leakage

One underappreciated risk: eval contamination. If your training corpus contains MMLU questions, GSM8K problems, or HumanEval solutions, your benchmarks are meaningless. Dodge et al., Documenting Large Webtext Corpora (arXiv:2104.08758), quantified this for C4; the fix is to hash-match and remove known benchmark strings before training.
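
A sketch of that hash-match step: collect normalized n-gram hashes from the benchmark suites, then drop any training document that contains one. The 13-token window is a common choice for contamination checks, not a fixed standard, and benchmark_texts is assumed to be loaded already:

import hashlib

N = 13  # n-gram window size commonly used for contamination checks

def ngram_hashes(text, n=N):
    words = text.lower().split()
    return {
        hashlib.sha1(" ".join(words[i:i + n]).encode()).digest()
        for i in range(max(len(words) - n + 1, 0))
    }

# benchmark_texts: iterable of MMLU / GSM8K / HumanEval strings.
contaminated = set()
for t in benchmark_texts:
    contaminated |= ngram_hashes(t)

def is_contaminated(doc_text):
    # True if the document shares any n-gram with a known benchmark.
    return not contaminated.isdisjoint(ngram_hashes(doc_text))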

Tokenization and Format

The final training artifact is not raw text; it is tokenized shards. Most teams use SentencePiece BPE or tiktoken-compatible vocabularies, pack documents into fixed-length sequences with <eos> separators, and write to Arrow/Parquet or Mosaic StreamingDataset format. Document boundaries matter: sequence packing across documents is fine, but make sure attention masks respect boundaries if your model training supports it.
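
A minimal packing sketch with a tiktoken vocabulary; the 2048-token sequence length is illustrative, and the Arrow/Parquet writeout and boundary-aware attention masks are left out:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
SEQ_LEN = 2048  # illustrative; match your model's context length

def pack(documents, seq_len=SEQ_LEN):
    # Concatenate tokenized docs with <eos>-style separators, then slice
    # the token stream into fixed-length training sequences.
    buf = []
    for doc in documents:
        buf.extend(enc.encode(doc))
        buf.append(enc.eot_token)  # document boundary token
        while len(buf) >= seq_len:
            yield buf[:seq_len]
            buf = buf[seq_len:]
    # The final partial buffer is dropped here; real pipelines pad it instead.

shards = list(pack(["first document text", "second document text"]))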

A Realistic Pipeline

  1. Seed URL discovery from sitemaps, RSS, and known good hubs
  2. Distributed fetch with proxies, respecting robots.txt and rate limits
  3. WARC writeout for reproducibility
  4. Text extraction (trafilatura or resiliparse; a sketch follows the list)
  5. Language ID and licensing check
  6. Rule-based quality filters
  7. MinHash LSH near-dedupe
  8. Classifier-based quality filter
  9. Semantic dedupe on survivors
  10. PII scrubbing and toxicity filtering
  11. Benchmark-contamination removal
  12. Tokenization and shard packing
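
For step 4, a minimal extraction sketch with trafilatura (resiliparse offers a similar one-call interface); warc_records is assumed to be an iterable of (record_id, html) pairs from step 3:

import trafilatura

def extract_text(html: str) -> str | None:
    # Main-content extraction; returns None when the page is boilerplate-only
    # or too short to pass trafilatura's internal thresholds.
    return trafilatura.extract(html, include_comments=False)

# Applied per WARC record: keep only documents with usable extracted text.
docs = [(rid, extract_text(html)) for rid, html in warc_records]
docs = [(rid, text) for rid, text in docs if text]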

Expect to keep 5-15% of the raw crawl. A 10TB crawl becomes a 500GB-1.5TB training set.

Closing

Training-data curation is unglamorous but determines the ceiling of your model. The recent FineWeb-Edu ablations made this concrete: better filtering on the same base crawl produced measurable downstream gains with no architectural change. Whether you use published corpora or augment with your own collection, the discipline is the same: dedupe aggressively, filter by quality, document provenance, and never mix eval into train.

If you are running distributed collection for corpus work, see our distributed scraping pipeline guide for worker orchestration patterns.