Creating Fine-Tuning Datasets from Public Web Content
Fine-tuning is the cheapest way to bend a general-purpose LLM toward a specific task: formatting conventions, domain vocabulary, response style, tool-use patterns. The model weights come free; the dataset is the expensive part. Done well, a 10k-50k example dataset can move a model's task-specific performance by 10-30 percentage points. Done carelessly, the same dataset will overfit, collapse style diversity, or quietly teach the model to hallucinate.
This post covers dataset assembly from public sources, license compliance, format conversion, and LLM-judge quality scoring.
How Much Data You Actually Need
The empirical rule from LIMA (Zhou et al., LIMA: Less Is More for Alignment, arXiv:2305.11206) and subsequent work: 1,000-10,000 high-quality examples beat 100,000 mediocre ones for instruction fine-tuning. For task-specific fine-tuning (e.g. converting user questions to SQL), 5,000-20,000 is the usual sweet spot: aim lower if the base model is already close to the task, higher if it is far from it.
This changes your economics. Instead of scraping indiscriminately, you can afford to be selective and pay for quality.
License Compliance First
Before assembling anything, confirm the source allows training use. Common licenses and what they mean:
- Public domain / CC0: no restrictions
- CC-BY: attribution required; include source URLs in your dataset card
- CC-BY-SA: attribution + derivatives under the same license; viral for derived datasets
- CC-BY-NC: non-commercial only; usually disqualifying for commercial models
- Proprietary / ToS-restricted: do not use for training without explicit permission
Openly licensed sources worth knowing: Wikipedia (CC-BY-SA), Wikidata (CC0), Project Gutenberg (mostly public domain in the US), the PubMed Central Open Access subset (license varies per article), US federal documents (public domain in the US), arXiv (varies per paper), Stack Exchange data dumps (CC-BY-SA), and GitHub code under permissive licenses (check the per-repo LICENSE).
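The license list above can be turned into a gate that runs before any record enters the dataset. This is a minimal sketch, assuming each record carries a license identifier string; the identifiers in `RULES` and the `admit` helper are illustrative names, not a standard, and the mapping is not legal advice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenseRule:
    commercial_ok: bool   # may be used to train a commercial model
    attribution: bool     # source must be credited in the dataset card
    share_alike: bool     # derived datasets inherit the license


# Hypothetical identifiers: map whatever your sources report onto these keys.
RULES = {
    "cc0":      LicenseRule(commercial_ok=True,  attribution=False, share_alike=False),
    "cc-by":    LicenseRule(commercial_ok=True,  attribution=True,  share_alike=False),
    "cc-by-sa": LicenseRule(commercial_ok=True,  attribution=True,  share_alike=True),
    "cc-by-nc": LicenseRule(commercial_ok=False, attribution=True,  share_alike=False),
}


def admit(license_id: Optional[str], commercial: bool) -> bool:
    """Return True only for recognised licenses that fit the intended use."""
    rule = RULES.get((license_id or "").strip().lower())
    if rule is None:
        # Unknown, proprietary, or missing license: reject by default.
        return False
    return rule.commercial_ok or not commercial
```

Rejecting unknown identifiers by default keeps a mislabeled or unlabeled source from slipping into a commercial training set.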
Collection Architecture
For permissively licensed sources with bulk exports (Wikipedia dumps, Stack Exchange dumps, arXiv bulk data), download the dumps and skip crawling entirely. For openly licensed sources without dumps, crawl politely: respect robots.txt, identify yourself in the User-Agent, and rate-limit per host. If you distribute requests across a proxy pool, keep the per-host limit in place, so that concurrency against any single host stays reasonable while aggregate throughput across hosts is high enough to finish in days, not weeks.
import asyncio
from urllib.robotparser import RobotFileParser

import httpx

UA = "DatasetBot/1.0 (+contact@example.com)"

async def fetch(url, proxy, rp: RobotFileParser):
    # Skip anything robots.txt disallows for our user agent.
    if not rp.can_fetch(UA, url):
        return None
    # Recent httpx releases take `proxy=`; older ones used `proxies=`.
    async with httpx.AsyncClient(proxy=proxy, timeout=30,
                                 headers={"User-Agent": UA}) as c:
        r = await c.get(url)
        return r.text if r.status_code == 200 else None
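Per-host rate limiting is not shown in the fetch function above. One way to add it is a small throttle that enforces a minimum gap between requests to the same host. This is a sketch only: `HostThrottle` and the one-second default are assumptions, and a real crawler should also honour any Crawl-delay the site publishes.

```python
import asyncio
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._next_ok = {}   # host -> earliest time the next request may start
        self._locks = {}     # host -> lock that serialises waiters for that host

    async def wait(self, url: str) -> None:
        host = urlsplit(url).netloc
        lock = self._locks.setdefault(host, asyncio.Lock())
        async with lock:
            delay = self._next_ok.get(host, 0.0) - time.monotonic()
            if delay > 0:
                await asyncio.sleep(delay)
            # Reserve the next slot for this host only; other hosts are unaffected.
            self._next_ok[host] = time.monotonic() + self.min_interval
```

Calling `await throttle.wait(url)` immediately before each request keeps the limit tied to the target host rather than to whichever proxy carries the request.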
Format Conversion
Fine-tuning APIs expect specific formats. The main ones:
Alpaca
{
  "instruction": "Summarize the passage in one sentence.",
  "input": "The capital of France is Paris.",
  "output": "Paris is the capital of France."
}
ShareGPT / conversation format
{
  "conversations": [
    {"from": "human", "value": "What is X?"},
    {"from": "gpt", "value": "X is …"}
  ]
}
OpenAI chat format
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is X?"},
    {"role": "assistant", "content": "X is …"}
  ]
}
Write one canonical intermediate representation, then convert to target format at export time. Keep raw source text and license metadata with each record so the dataset stays auditable.
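As an illustration of that advice, here is one possible intermediate record with exporters for the three formats shown earlier. The `Record` fields and function names are assumptions made for this sketch; the output shapes follow the examples above.

```python
from dataclasses import dataclass


@dataclass
class Record:
    instruction: str
    response: str
    context: str = ""      # optional input passage
    system: str = ""       # optional system prompt
    source_url: str = ""   # provenance, kept for attribution
    license: str = ""      # license identifier of the source
    raw_text: str = ""     # untouched source text, kept for audits


def _user_turn(r: Record) -> str:
    # Chat formats have no separate "input" field, so fold the context in.
    return f"{r.instruction}\n\n{r.context}" if r.context else r.instruction


def to_alpaca(r: Record) -> dict:
    return {"instruction": r.instruction, "input": r.context, "output": r.response}


def to_sharegpt(r: Record) -> dict:
    return {"conversations": [
        {"from": "human", "value": _user_turn(r)},
        {"from": "gpt", "value": r.response},
    ]}


def to_openai_chat(r: Record) -> dict:
    messages = []
    if r.system:
        messages.append({"role": "system", "content": r.system})
    messages.append({"role": "user", "content": _user_turn(r)})
    messages.append({"role": "assistant", "content": r.response})
    return {"messages": messages}
```

Provenance fields stay on the record and are simply left out at export time, so the same source of truth can feed both the training file and the dataset card.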
Quality Scoring with LLM Judges
Human labeling is accurate but slow. LLM-as-judge scoring is fast, good enough for bulk filtering, and can be calibrated against a small human-labeled sample. A workable recipe:
- Draft 5-10 scoring dimensions (correctness, clarity, instruction-following, format, specificity)
- Write a rubric prompt with a 1-5 scale and concrete anchors per score
- Run a capable judge model (GPT-4o, Claude Sonnet) over your candidates
- Calibrate: human-label 200 examples, compute correlation with judge scores, iterate the rubric until Spearman ρ > 0.7
- Keep examples above a threshold (commonly 4+ on a 5-point scale)
RUBRIC = """Rate the following instruction-response pair on a 1-5 scale.
5 = Correct, specific, well-formatted, directly follows instructions
4 = Correct and specific with minor issues
3 = Mostly correct but vague or poorly formatted
2 = Partially incorrect or ignores part of the instruction
1 = Incorrect or off-topic
Return only the integer.
Instruction: {instr}
Response: {resp}
"""
Known biases (Zheng et al., arXiv:2306.05685): verbosity bias (longer = higher score), position bias in pairwise comparisons, self-preference bias. Mitigate with explicit length-neutral instructions, randomized ordering, and a different model as judge than as generator.
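The calibration step in the recipe can be checked without extra dependencies. The sketch below parses the judge's reply strictly (the rubric asks for a bare integer) and computes Spearman's ρ as the Pearson correlation of tie-averaged ranks. The function names are illustrative, and the judge call itself is left out because it depends on the provider.

```python
import re
from typing import Optional, Sequence


def parse_score(reply: str) -> Optional[int]:
    """Return the 1-5 score, or None if the judge did not reply with a bare integer."""
    m = re.fullmatch(r"\s*([1-5])\s*", reply)
    return int(m.group(1)) if m else None


def _ranks(values: Sequence[float]) -> list:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation between human labels and judge scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized samples with at least 2 items")
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / (vx * vy) ** 0.5
```

Treating unparseable replies as `None` and dropping or re-querying them is safer than guessing a score from free text, since a judge that ignores the output format may also have ignored the rubric.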
Deduplication
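Fine-tuning datasets are extraordinarily sensitive to duplicates. A single example repeated 100 times is likely to be memorized verbatim. Run the same dedupe stack used for pretraining: SHA-256 exact matching plus MinHash LSH at a 0.8 Jaccard threshold. For instruction-response pairs, also hash the normalized instruction on its own; that catches pairs that share a prompt but differ in the response, which a hash over the whole pair misses. Reworded prompts are the MinHash pass's job, not the exact hash's.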
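A small, dependency-free sketch of the two passes follows. It assumes word 3-gram shingles and 64 hash functions, and it compares signatures pairwise instead of using LSH banding, so it only suits small sets; for real volumes a library such as datasketch provides the banded index. All helper names here are illustrative.

```python
import hashlib
import re


def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_key(text: str) -> str:
    """SHA-256 of the normalized text, for exact-duplicate detection."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    words = normalize(text).split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(text: str, num_perm: int = 64) -> list:
    """One minimum per seeded hash function over the shingle set."""
    sh = shingles(text)
    return [
        min(int.from_bytes(hashlib.sha256(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in sh)
        for seed in range(num_perm)
    ]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedupe(texts: list, threshold: float = 0.8) -> list:
    """Return indices of texts to keep (first occurrence wins)."""
    seen, kept = set(), []
    for i, text in enumerate(texts):
        key = exact_key(text)
        if key in seen:
            continue                       # exact duplicate after normalization
        seen.add(key)
        sig = minhash(text)
        if any(estimated_jaccard(sig, other) >= threshold for _, other in kept):
            continue                       # near duplicate of something already kept
        kept.append((i, sig))
    return [i for i, _ in kept]
```

Running `dedupe` once over full instruction-response pairs and once over instructions alone covers both cases described above.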
Diversity
Diversity matters more than raw count. Metrics worth tracking:
- Instruction type distribution: classify each example into a taxonomy (summarization, classification, extraction, reasoning, code, etc.) and look at the histogram
- Length distribution: avoid a corpus where every response is 200 tokens
- Topic distribution: cluster embeddings of instructions and inspect the clusters
- Source diversity: how many distinct domains or authors
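Two of these metrics reduce to a few lines once each example has a category label and a token count. The sketch below assumes those labels and counts already exist (from a classifier and a tokenizer that are not shown); normalized entropy is one possible summary of the histogram, not the only one.

```python
import math
from collections import Counter


def normalized_entropy(labels: list) -> float:
    """0.0 when every example has the same label, 1.0 when labels are uniform."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))


def length_quantiles(lengths: list, qs=(0.1, 0.5, 0.9)) -> dict:
    """Nearest-rank quantiles of response lengths, to spot a collapsed distribution."""
    s = sorted(lengths)
    return {q: s[min(len(s) - 1, int(q * len(s)))] for q in qs}
```

A low entropy over instruction types, or 10th and 90th percentile lengths that sit close together, both suggest the corpus is narrower than its raw count implies.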
Self-Instruct (Wang et al., arXiv:2212.10560) is a common way to bootstrap diversity: seed with a few hundred human-written examples, use an LLM to generate variants, filter with a judge. Works well when the seed set is broad.
Contamination and Holdouts
Hold out evaluation data before the fine-tuning cut. Hash-check that none of your fine-tune examples match eval prompts (or the target-response suffixes). Also check that fine-tune examples do not contain reserved system prompts or tool definitions that would confuse the chat template.
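The hash check can be sketched as follows, assuming training examples are dicts with an `instruction` key and the eval set is available as a list of prompt strings. The normalization (lowercase, punctuation collapsed) is an assumption; exact matching after normalization will not catch paraphrased leaks.

```python
import hashlib
import re


def prompt_key(text: str) -> str:
    """Hash of the prompt with case, punctuation, and spacing differences removed."""
    norm = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()


def drop_contaminated(train: list, eval_prompts: list) -> tuple:
    """Split training examples into (clean, dropped) by eval-prompt overlap."""
    blocked = {prompt_key(p) for p in eval_prompts}
    clean, dropped = [], []
    for example in train:
        bucket = dropped if prompt_key(example["instruction"]) in blocked else clean
        bucket.append(example)
    return clean, dropped
```

Keeping the dropped examples, rather than discarding them silently, makes it possible to report the contamination rate in the dataset card.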
Running the Fine-Tune
Once the dataset is clean, the fine-tune itself is the cheap part. OpenAI's SFT API, Anthropic's fine-tuning (where available), and open tools like Axolotl, LLaMA-Factory, and Together fine-tuning all take a JSONL file in the same rough shape. Typical hyperparameters: 2-4 epochs, LR 1e-5 to 5e-5 for full fine-tuning (LoRA runs often use a higher rate, around 1e-4), cosine schedule, LoRA rank 16-32 for parameter-efficient training.
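For readers unfamiliar with the cosine schedule, this is the shape it gives the learning rate over a run. The linear warmup and the default values are assumptions for illustration; training frameworks ship their own implementations, which should be preferred in practice.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-5,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Learning rate at `step`: optional linear warmup, then cosine decay to min_lr."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The rate starts at the peak, falls slowly at first, fastest around the midpoint, and flattens out near the end of training.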
Evaluation After Fine-Tune
Always compare against the base model on:
- Task-specific golden set
- General capability suite (a subset of MMLU or similar) to check for regression
- Safety and instruction-following battery
- Held-out examples from the same source distribution
If task performance improved and general performance did not drop more than a few points, the fine-tune is a net win.
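That decision rule is easy to encode so that it runs on every training job. In the sketch below, scores are percentages keyed by suite name, and the 3-point tolerance stands in for "a few points"; both the function name and the tolerance are assumptions to adjust per project.

```python
def is_net_win(base: dict, tuned: dict, task_key: str, max_drop: float = 3.0) -> bool:
    """True if the task score improved and no other suite regressed beyond max_drop."""
    if tuned[task_key] <= base[task_key]:
        return False
    drops = {name: base[name] - tuned[name] for name in base if name != task_key}
    return all(drop <= max_drop for drop in drops.values())
```

Checking each suite separately, instead of averaging them, prevents a large safety regression from hiding behind a small gain elsewhere.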
Dataset Card
Ship every dataset with a dataset card covering: sources and licenses, collection dates, filtering criteria, dedupe procedure, PII handling, known biases, and intended use. This is the single most important artifact for reproducibility and audit.
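One way to keep the card from drifting out of date is to generate it from structured metadata and refuse to publish when a section is empty. The section keys below mirror the list above; the key names and the plain-text layout are assumptions for this sketch.

```python
REQUIRED_SECTIONS = (
    "sources_and_licenses",
    "collection_dates",
    "filtering_criteria",
    "dedupe_procedure",
    "pii_handling",
    "known_biases",
    "intended_use",
)


def missing_sections(card: dict) -> list:
    """Names of required sections that are absent or blank."""
    return [key for key in REQUIRED_SECTIONS if not str(card.get(key, "")).strip()]


def render_card(name: str, card: dict) -> str:
    """Render the card as plain text, failing loudly if anything is missing."""
    missing = missing_sections(card)
    if missing:
        raise ValueError("dataset card is incomplete: " + ", ".join(missing))
    lines = [f"Dataset card: {name}", ""]
    for key in REQUIRED_SECTIONS:
        lines.append(key.replace("_", " ").capitalize())
        lines.append(str(card[key]).strip())
        lines.append("")
    return "\n".join(lines)
```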
Closing
Quality, license compliance, diversity, and deduplication do more for fine-tuned model performance than dataset size. Collect from permissively licensed sources, respect robots.txt and rate limits, score aggressively with a calibrated judge, and keep provenance on every row. The result is a dataset you can trust and publish, not just use once.