Validating Synthetic Data Against Real-World Distributions

Synthetic data is cheap, controllable, and privacy-friendly. It is also, by default, wrong. A generator trained on one distribution and used to augment another will shift your feature distributions, and a model trained on the mixture can degrade on production traffic without any obvious failure signal. Validation against real-world distributions is the step that keeps synthetic data honest.

This post covers the statistical tests that matter, distribution drift detection, and why ground-truth collection benefits from distributed sampling.

Why Synthetic Data Gets Used

Four recurring motivations:

Privacy: regulated data (health, finance) cannot leave the cleanroom, but a synthetic copy with similar statistical properties can.
Class imbalance: rare classes (fraud, defects, minority languages) can be oversampled synthetically.
Counterfactual coverage: generate conditions that rarely occur naturally (edge-case weather for autonomous driving, unusual transaction combinations).
Labeling cost: synthetic data comes with free ground truth.

Each of these assumes the synthetic distribution is close enough to real data. Validation is how you check.

Distribution Similarity Tests

Kolmogorov-Smirnov (KS)

For continuous univariate features, the two-sample KS test computes the maximum distance between empirical CDFs. Cheap, nonparametric, and well-understood. Implementation:

from scipy.stats import ks_2samp

stat, p = ks_2samp(real["price"], synthetic["price"])
# reject null if p < 0.05: distributions differ

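With large samples the p-value will flag even negligible differences, so track the KS statistic itself as an effect size rather than relying on the 0.05 cutoff alone. KS is also most sensitive near the center of the distribution and weak in the tails; for tail-sensitive applications use Anderson-Darling instead.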
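For the tail-sensitive case, SciPy provides a k-sample Anderson-Darling test. A minimal sketch on made-up data: real_x and syn_x are illustrative stand-ins (a normal sample and a heavy-tailed Student-t sample), not columns from any dataset discussed here.

```python
import numpy as np
from scipy.stats import anderson_ksamp, ks_2samp

rng = np.random.default_rng(0)
real_x = rng.normal(size=2000)             # hypothetical stand-in for a real feature
syn_x = rng.standard_t(df=2, size=2000)    # same center, much heavier tails

ks_stat, ks_p = ks_2samp(real_x, syn_x)
# anderson_ksamp returns (statistic, critical_values, p-value). SciPy caps the
# reported p-value to the range [0.001, 0.25] and warns when it does so.
ad_stat, ad_crit, ad_p = anderson_ksamp([real_x, syn_x])

print(f"KS statistic={ks_stat:.3f}  p={ks_p:.3g}")
print(f"AD statistic={ad_stat:.2f}  p={ad_p:.3g}")
```

Because the Anderson-Darling p-value is capped, it is usually more informative to compare the statistic against the returned critical values than to read the p-value alone.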

Chi-Square

For categorical features, chi-square tests whether observed frequencies match expected. Implementation:

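from scipy.stats import chisquare

# Count each category and align synthetic counts to the real categories.
# Categories that appear only in synthetic are dropped here; check for them separately.
real_counts = real["category"].value_counts().sort_index()
syn_counts  = synthetic["category"].value_counts().reindex(real_counts.index, fill_value=0)
# Scale the real proportions to the synthetic sample size so both totals match.
expected = syn_counts.sum() * real_counts / real_counts.sum()
chi, p = chisquare(syn_counts, f_exp=expected)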

Chi-square needs expected counts ≥5 per cell; pool rare categories first.
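One way to do the pooling is sketched below. The helper name pooled_chisquare, the "__other__" bucket label, and the min_expected parameter are illustrative choices, not part of SciPy or pandas. The sketch assumes the pooled bucket ends up with a nonzero expected count; it does not handle the case where the bucket itself stays below the threshold.

```python
import pandas as pd
from scipy.stats import chisquare

def pooled_chisquare(real_col, syn_col, min_expected=5, other="__other__"):
    """Chi-square goodness of fit after pooling low-expected-count categories."""
    real_counts = real_col.value_counts()
    syn_counts = syn_col.value_counts()
    # Align on the union so synthetic-only categories are not silently dropped.
    cats = real_counts.index.union(syn_counts.index)
    real_counts = real_counts.reindex(cats, fill_value=0)
    syn_counts = syn_counts.reindex(cats, fill_value=0)

    expected = syn_counts.sum() * real_counts / real_counts.sum()
    rare = expected < min_expected
    if rare.any():
        # Collapse every rare category into one bucket in both samples.
        real_pooled = real_counts[~rare].copy()
        syn_pooled = syn_counts[~rare].copy()
        real_pooled[other] = real_counts[rare].sum()
        syn_pooled[other] = syn_counts[rare].sum()
        real_counts, syn_counts = real_pooled, syn_pooled
        expected = syn_counts.sum() * real_counts / real_counts.sum()

    stat, p = chisquare(syn_counts, f_exp=expected)
    return stat, p, syn_counts.index.tolist()
```

Calling pooled_chisquare(real["category"], synthetic["category"]) returns the statistic, the p-value, and the category labels that survived pooling, which makes it easy to see what was merged.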

Wasserstein (Earth Mover's)

Wasserstein distance measures how much "work" is needed to transform one distribution into the other. Unlike KL divergence, it stays finite and meaningful when the supports do not overlap. It is scale-sensitive, so normalize features first. Use scipy.stats.wasserstein_distance for 1D features and the POT library for multidimensional data.
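A minimal sketch of the normalize-then-measure step for a single feature. The helper name standardized_wasserstein and the lognormal "prices" sample are made up for illustration; scaling by the real sample's mean and standard deviation is one reasonable normalization, not the only one.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def standardized_wasserstein(real_x, syn_x):
    """1D Wasserstein distance in units of the real sample's standard deviation."""
    real_x = np.asarray(real_x, dtype=float)
    syn_x = np.asarray(syn_x, dtype=float)
    mu, sigma = real_x.mean(), real_x.std()
    # Scale both samples with the *real* statistics so the reference frame is fixed.
    return wasserstein_distance((real_x - mu) / sigma, (syn_x - mu) / sigma)

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=3.0, sigma=0.5, size=5000)   # made-up "real" feature
shifted = prices + 2 * prices.std()                      # synthetic copy, off by 2 sigma

print(standardized_wasserstein(prices, shifted))              # approximately 2
print(standardized_wasserstein(prices * 100, shifted * 100))  # same value: unit-free
```

Reporting the distance in standard deviations makes values comparable across features measured in different units.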

Maximum Mean Discrepancy (MMD)

MMD embeds both samples into a reproducing kernel Hilbert space and measures the distance between their mean embeddings. It works well on high-dimensional continuous data. Gretton et al., "A Kernel Two-Sample Test" (JMLR, 2012), is the canonical reference.
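The estimator is short enough to write out in NumPy. This is a sketch under stated assumptions: a Gaussian (RBF) kernel, the median heuristic for the bandwidth (a common default, not a tuned choice), and continuous data without exact duplicates so the median distance is nonzero. The function names are illustrative.

```python
import numpy as np

def _sq_dists(a, b):
    # Pairwise squared Euclidean distances, clipped at 0 against rounding error.
    d = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(d, 0.0)

def mmd2_unbiased(x, y, gamma=None):
    """Unbiased estimate of squared MMD with kernel exp(-gamma * ||a - b||^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if gamma is None:
        # Median heuristic on the pooled sample.
        z = np.vstack([x, y])
        d = _sq_dists(z, z)
        gamma = 1.0 / (2.0 * np.median(d[np.triu_indices_from(d, k=1)]))
    m, n = len(x), len(y)
    k_xx = np.exp(-gamma * _sq_dists(x, x))
    k_yy = np.exp(-gamma * _sq_dists(y, y))
    k_xy = np.exp(-gamma * _sq_dists(x, y))
    # Drop the diagonal (self-similarity) terms for the unbiased estimator.
    xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return xx + yy - 2.0 * k_xy.mean()
```

The estimate hovers around zero (and can be slightly negative) when both samples come from the same distribution. To turn it into a test, shuffle the real/synthetic labels many times and compare the observed value against that permutation distribution.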

Discriminative Two-Sample Tests

Train a classifier to distinguish real from synthetic. If held-out AUC is near 0.5, the classifier cannot tell the two samples apart; near 1.0 they are trivially separable. This is the most practical multivariate test and catches interactions that univariate tests miss.

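from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import pandas as pd

# Stack both samples; label real rows 0 and synthetic rows 1.
# Assumes both frames share the same numeric columns.
X = pd.concat([real, synthetic], ignore_index=True)
y = [0] * len(real) + [1] * len(synthetic)
# Shuffle within stratified folds so row order cannot leak into the score.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(GradientBoostingClassifier(), X, y,
                      scoring="roc_auc", cv=cv).mean()
# auc ~= 0.5 is ideal; >0.7 is a red flag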

Beyond Marginals: Joint Structure

Matching marginal distributions is necessary but not sufficient. A synthetic dataset can pass KS on every column and still misrepresent correlations. Check:

Pairwise correlations: compare Pearson/Spearman matrices, report Frobenius norm of the difference.
Mutual information: per-pair MI differences catch nonlinear dependence that correlation coefficients miss.
Conditional distributions: for important feature pairs (X|Y), compare the conditional shape, not just the marginals.
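The pairwise-correlation check can be sketched as follows. The helper name correlation_gap and the toy frames are illustrative. The synthetic frame is built by shuffling one real column, so every marginal matches exactly while the dependence between columns is destroyed.

```python
import numpy as np
import pandas as pd

def correlation_gap(real, synthetic, method="spearman"):
    """Frobenius norm of the difference between two correlation matrices."""
    cols = real.select_dtypes("number").columns
    diff = real[cols].corr(method=method) - synthetic[cols].corr(method=method)
    return float(np.linalg.norm(diff.to_numpy(), ord="fro")), diff

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
real_df = pd.DataFrame({"x": x, "y": x + 0.1 * rng.normal(size=2000)})
# Shuffle one column: each marginal is identical to real, the dependence is gone.
syn_df = real_df.assign(y=rng.permutation(real_df["y"].to_numpy()))

gap, diff = correlation_gap(real_df, syn_df)
print(round(gap, 2))
print(diff.round(2))
```

A per-column KS test would pass on this pair, since the column values are the same up to ordering. The correlation gap is what exposes the problem, and the returned diff matrix shows which pairs are responsible.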

Drift Detection Over Time

Real distributions drift. Synthetic data generated against last quarter's distribution may be stale by next quarter. Monitor:

Feature drift: PSI (Population Stability Index) or KS against a rolling baseline
Label drift: class frequencies over time
Concept drift: relationship between X and y, typically detected via model performance decay on a held-out stream

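PSI is the credit-risk standard. By the usual rule of thumb, values under 0.1 are stable, 0.1-0.25 is drift to investigate, and over 0.25 is significant drift. With actual_i and expected_i as the fractions of the current and baseline samples that fall in bin i: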

PSI = ∑_i (actual_i - expected_i) · ln(actual_i / expected_i)
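A minimal NumPy sketch of the formula. The function name psi, the choice of ten quantile bins taken from the baseline, and the eps floor for empty bins are assumptions made for illustration; binning conventions vary between teams.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges at baseline quantiles; the two outer bins are open-ended.
    inner = np.quantile(expected, np.linspace(0, 1, bins + 1)[1:-1])
    e_idx = np.searchsorted(inner, expected, side="right")
    a_idx = np.searchsorted(inner, actual, side="right")
    e_frac = np.bincount(e_idx, minlength=bins) / len(expected)
    a_frac = np.bincount(a_idx, minlength=bins) / len(actual)
    # Floor empty bins at eps so the log term stays finite.
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

Taking the bin edges from the baseline keeps the expected fractions near 1/bins, so the result is driven by how the current sample redistributes across those fixed bins.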

Why Ground Truth Is Hard

Validation only works if you have a trustworthy real-world sample. For web-derived features (prices, listings, inventory, reviews), collecting that sample uniformly is the hard part. Three failure modes:

IP bias: a single-IP collection sees the site as one visitor with a frozen history. Prices, product availability, and A/B bucketing will skew.
Geographic bias: a collection run from one region misses regional pricing, currency, language.
Temporal bias: a batch collected on one day is a snapshot, not a distribution. Spread collection across time windows.

Distributed collection via a diverse proxy pool mitigates the first two: you get samples from many apparent IPs and many regions, averaging out the IP-conditioned views sites return. It does not fix temporal bias; that requires scheduling collection across time windows.

A Validation Checklist

Per-feature KS (continuous) and chi-square (categorical)
Pairwise correlation delta
Discriminative AUC: train GBM on real vs synthetic, target ~0.5
Downstream task performance: train the actual production model on synthetic, evaluate on real holdout, and compare against the same model trained on real data. The gap is the real cost of synthesis.
Slice validation: compute metrics per demographic, geographic, and temporal slice. Aggregate similarity can mask subgroup failure.
Drift monitors over time
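The downstream check in the list above can be sketched like this. The helper name tstr_gap, the use of LogisticRegression as a stand-in for the production model, and the toy data generator are all assumptions for illustration; the sketch covers binary classification scored by AUC only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def tstr_gap(model, real_train, synthetic, real_test):
    """Each data argument is an (X, y) pair. Returns (auc_real, auc_syn, gap)."""
    X_test, y_test = real_test
    aucs = []
    for X, y in (real_train, synthetic):
        fitted = clone(model).fit(X, y)   # fresh, unfitted copy per training set
        aucs.append(roc_auc_score(y_test, fitted.predict_proba(X_test)[:, 1]))
    return aucs[0], aucs[1], aucs[0] - aucs[1]

rng = np.random.default_rng(0)

def sample(n, label_col):
    # Toy data: the label depends on the sign of one feature.
    X = rng.normal(size=(n, 2))
    return X, (X[:, label_col] > 0).astype(int)

real_train, real_test = sample(2000, 0), sample(1000, 0)
good_syn = sample(2000, 0)   # generator preserved the X -> y relationship
bad_syn = sample(2000, 1)    # generator tied the label to the wrong feature

good = tstr_gap(LogisticRegression(), real_train, good_syn, real_test)
bad = tstr_gap(LogisticRegression(), real_train, bad_syn, real_test)
print(good)
print(bad)
```

Both synthetic sets here have identical feature marginals and class balance, so only the downstream comparison separates them.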

Generator-Specific Caveats

Different synthetic-data methods fail differently:

SMOTE and variants: interpolate between existing minority samples, so every new point lies inside their convex hull; cannot generate out-of-distribution edge cases.
CTGAN/TVAE: capture nonlinear structure but can memorize, producing near-duplicates of real rows. Always run a nearest-neighbor check against the training rows.
LLM-generated text: risk of style collapse (everything sounds like the model) and topic collapse (sampling diversity can fall off after a few hundred examples).
Diffusion for tabular: strong on marginals and correlations but slow.
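The nearest-neighbor check for memorization can be sketched with scikit-learn. The helper name memorization_flags and the 1% quantile threshold are illustrative assumptions. The sketch also assumes numeric features on comparable scales (standardize first) and no exact duplicate rows in the real data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def memorization_flags(real, synthetic, quantile=0.01):
    """Flag synthetic rows closer to a real row than real rows usually are to each other."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    index = NearestNeighbors(n_neighbors=1).fit(real)
    # Baseline: with no query argument, kneighbors() returns each training
    # row's nearest neighbor excluding the row itself.
    real_to_real = index.kneighbors()[0][:, 0]
    # Distance from each synthetic row to its closest real row.
    syn_to_real = index.kneighbors(synthetic)[0][:, 0]
    threshold = np.quantile(real_to_real, quantile)
    return syn_to_real < threshold, syn_to_real
```

A generator that is not memorizing should produce roughly the same share of flagged rows as the chosen quantile. A much larger share, or distances at or near zero, points to copied training rows.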

Privacy Validation

When privacy is the motivation, statistical similarity is not enough. You also need to test, or formally bound, what the synthetic data leaks:

Membership inference tests: can an adversary tell whether a specific record was in the training set?
Attribute inference: can sensitive attributes be recovered from non-sensitive ones?
Differential privacy budget: if the generator was trained with DP-SGD, report ε and δ explicitly.

Closing

Synthetic data is a useful tool, not a free lunch. Every synthetic dataset should ship with a validation report: marginal tests, joint structure tests, discriminative AUC, and downstream task parity. Pair that with an honest, distributed real-world sample for ground truth and synthetic data becomes a reliable part of the data stack rather than a silent source of regressions.