LLM Evaluation Pipelines with Geo-Aware Testing
Evaluation is the discipline that separates an LLM feature from an LLM product. The canonical advice is to build a golden set, run it on every model change, and gate deploys on metric movement. What that advice undersells is the geographic dimension: once an application fetches external data at inference time, the same model behind the same prompt can return different answers depending on the request's apparent location. If your users are global, your evals need to be too.
This post covers eval pipeline design, the frameworks worth using, and the specific patterns for geo-aware testing of LLM applications that touch external data.
What an LLM Eval Actually Measures
Eval is a catch-all for several distinct things:
- Task accuracy: exact match, F1, BLEU, ROUGE on a labeled dataset
- Faithfulness: does the output stick to the provided context (RAG)
- Hallucination rate: does the output assert facts not grounded in source
- Instruction following: does the output obey format and constraint rules
- Safety: toxicity, PII leakage, jailbreak susceptibility
- Stability: output consistency under paraphrase and adversarial prompts
Most production systems care most about faithfulness and instruction following. Academic leaderboards (MMLU, HellaSwag, GPQA) measure general knowledge and reasoning on fixed benchmarks, and their scores correlate only weakly with quality on your own task.
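As a concrete reference for the task-accuracy bullet, here is a minimal sketch of exact match and token-level F1. The whitespace tokenizer and lowercasing are simplifying assumptions; real pipelines usually also strip punctuation and articles.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple normalizer)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Two empty strings count as a match; one empty string does not.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match is all-or-nothing, so it suits short canonical answers; token F1 gives partial credit when the prediction carries extra or missing words.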
Frameworks
Promptfoo
Promptfoo (promptfoo.dev) is a YAML-driven eval harness with first-class support for regression testing across prompts and models. Good for matrix testing: N prompts × M models × K test cases. Integrates with CI.
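To make the size of that matrix concrete, the sketch below enumerates the combinations in plain Python. It is not Promptfoo's API or config format; the prompt templates and model names are placeholders.

```python
from itertools import product

prompts = ["Answer concisely: {question}", "Answer with a citation: {question}"]
models = ["model-a", "model-b", "model-c"]  # placeholder model names
questions = ["What is the VAT rate in Germany?", "What is the VAT rate in France?"]

# Every (prompt, model, question) combination is one cell in the test matrix.
matrix = [
    {"model": model, "prompt": prompt.format(question=question)}
    for prompt, model, question in product(prompts, models, questions)
]

print(len(matrix))  # 2 prompts x 3 models x 2 questions = 12 runs
```

The run count is multiplicative, which is why a harness that handles caching and parallelism earns its keep quickly.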
DeepEval
DeepEval (confident-ai.com) is the pytest-style Python framework. It ships with metric implementations for faithfulness, answer relevancy, contextual precision, hallucination, and G-Eval (the LLM-as-judge pattern from Liu et al., arXiv:2303.16634).
RAGAS
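RAGAS (arXiv:2309.15217) is focused on RAG systems. The key metrics -- faithfulness, answer relevance, context precision, context recall -- are computed using LLM-based judges. Faithfulness and answer relevance need no reference answer; context recall does, because it checks the retrieved context against a ground-truth answer.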
Inspect AI and lm-eval-harness
UK AISI's Inspect and EleutherAI's lm-eval-harness target academic-style evaluations. Useful for model selection, less useful for application-level eval.
A Minimal Eval Loop with DeepEval
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# my_rag_pipeline and fetched_doc_text stand in for your own application code.
question = "What is the VAT rate in Germany?"
cases = [
    LLMTestCase(
        input=question,
        actual_output=my_rag_pipeline(question),
        expected_output="19% standard rate",
        retrieval_context=[fetched_doc_text],
    ),
    # ...
]

# Both metrics call an LLM judge; a case passes when its score meets the threshold.
metrics = [
    FaithfulnessMetric(threshold=0.8, model="gpt-4o"),
    AnswerRelevancyMetric(threshold=0.75, model="gpt-4o"),
]

results = evaluate(test_cases=cases, metrics=metrics)
Geo-Aware Testing: Why It Matters
Any LLM feature that fetches web data at inference time -- RAG with a live retriever, agents with browser tools, functions that call geo-sensitive APIs -- produces outputs that depend on the egress location of the request. Common failure modes:
- A search-augmented assistant answering a legal question returns US-centric rules when the user is in the EU.
- A shopping agent quotes prices in the wrong currency because the retailer detected a US IP and defaulted to USD.
- A RAG system pulls the wrong language edition of Wikipedia because the fetch went out through an English-speaking region.
- A multilingual chat assistant responds in English because no locale signal reached the retriever.
Running the eval from a single region hides all of these. You need to parameterize the eval run by egress location and rerun the golden set per locale.
Geo-Eval Pattern
The minimum setup:
- Define a list of target locales with matching proxy regions. Example: (en-US, US), (en-GB, UK), (de-DE, DE), (fr-FR, FR), (ja-JP, JP), (pt-BR, BR).
- For each locale, route all external fetches through a proxy with egress in that region.
- Run the full golden set and compute metrics per locale.
- Alert on cross-locale divergence beyond a threshold.
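import os

LOCALES = [
    ("en-US", "http://us.hexproxies.com:7777"),
    ("en-GB", "http://uk.hexproxies.com:7777"),
    ("de-DE", "http://de.hexproxies.com:7777"),
    ("fr-FR", "http://fr.hexproxies.com:7777"),
    ("ja-JP", "http://jp.hexproxies.com:7777"),
]

# evaluate and metrics come from the DeepEval loop above.
for locale, proxy in LOCALES:
    # Route outbound fetches through this locale's proxy. Some HTTP clients
    # read these variables only when the client is created.
    os.environ["HTTP_PROXY"] = proxy
    os.environ["HTTPS_PROXY"] = proxy
    # Rebuild the cases inside the loop: actual_output must be regenerated
    # while the proxy is active, or every locale scores the same outputs.
    cases = build_cases()  # hypothetical helper that wraps the case list above
    results = evaluate(test_cases=cases, metrics=metrics)
    publish(f"eval.{locale}", results)  # publish is a placeholder for your metrics sink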
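The last step in the pattern, alerting on cross-locale divergence, can be sketched as a comparison against a baseline locale. The choice of en-US as baseline, the 0.10 gap, and the scores themselves are illustrative assumptions, not values from any framework.

```python
def divergent_locales(
    scores: dict[str, float],
    baseline: str = "en-US",
    max_gap: float = 0.10,
) -> dict[str, float]:
    """Return locales whose score trails the baseline locale by more than max_gap."""
    reference = scores[baseline]
    return {
        locale: round(reference - score, 4)
        for locale, score in scores.items()
        if locale != baseline and reference - score > max_gap
    }


# Hypothetical per-locale faithfulness scores from one eval run.
faithfulness = {"en-US": 0.91, "en-GB": 0.90, "de-DE": 0.84, "ja-JP": 0.72}
alerts = divergent_locales(faithfulness)
print(alerts)  # only ja-JP trails en-US by more than 0.10
```

Comparing against a single baseline keeps the alert easy to read. If no locale is clearly primary, the spread between the best and worst locale is a reasonable alternative.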
Hallucination Detection Per Locale
Hallucination rate often varies by locale. Two reasons: pretraining data is English-dominant, and retrieval quality is worse in lower-resource languages. Measure hallucination rate per locale with a grounded-faithfulness judge -- ask a separate model whether each factual claim in the answer is supported by the retrieved context. RAGAS implements this as faithfulness.
A useful signal is claim coverage: decompose the answer into atomic claims, count how many have a supporting span in the context. Low coverage in one locale but high in another means your retriever is failing for that locale, not your generator.
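A rough sketch of claim coverage follows. Two loud assumptions: claims are approximated by sentence splitting, and support is approximated by token overlap with the context. In practice both steps would be done by an LLM judge; the lexical check here only keeps the example self-contained.

```python
import re


def split_claims(answer: str) -> list[str]:
    """Naive decomposition: treat each sentence as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def is_supported(claim: str, context: str, min_overlap: float = 0.5) -> bool:
    """Stand-in for a judge: share of claim tokens that also appear in the context."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    context_tokens = set(re.findall(r"\w+", context.lower()))
    if not claim_tokens:
        return False
    return len(claim_tokens & context_tokens) / len(claim_tokens) >= min_overlap


def claim_coverage(answer: str, context: str) -> float:
    """Fraction of claims in the answer that the context supports."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    return sum(is_supported(claim, context) for claim in claims) / len(claims)


context = "Germany applies a standard VAT rate of 19% and a reduced rate of 7%."
answer = "The standard VAT rate in Germany is 19%. The rate was raised in 2021."
print(claim_coverage(answer, context))  # first claim supported, second not
```

Tracking this number per locale alongside retrieval metrics helps separate retriever failures from generator failures.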
Multilingual Response Testing
Whether the model responds in the right language is a distinct axis from whether it is correct. Detect response language with fastText lid.176 or CLD3 and compute a language-match rate per locale. A 90%+ match rate is the usual bar; below 80%, at least one response in five is in the wrong language.
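The match-rate computation itself is small. The sketch below takes the detector as a parameter so that fastText or CLD3 can be plugged in; the toy_detect function is a placeholder for the demo only and is not a usable language identifier.

```python
from collections import defaultdict
from typing import Callable


def language_match_rate(
    responses: list[tuple[str, str]],
    detect: Callable[[str], str],
) -> dict[str, float]:
    """responses holds (locale, response_text); returns the match rate per locale."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for locale, text in responses:
        expected = locale.split("-")[0].lower()  # "de-DE" -> "de"
        totals[locale] += 1
        hits[locale] += int(detect(text) == expected)
    return {locale: hits[locale] / totals[locale] for locale in totals}


def toy_detect(text: str) -> str:
    """Placeholder detector for the demo; swap in fastText or CLD3."""
    german_markers = {"der", "die", "das", "und", "ist"}
    return "de" if set(text.lower().split()) & german_markers else "en"


responses = [
    ("de-DE", "Der Regelsatz liegt bei 19 Prozent."),
    ("de-DE", "The standard rate is 19%."),
    ("en-US", "The standard rate is 19%."),
]
rates = language_match_rate(responses, toy_detect)
print(rates)  # de-DE matches half the time, en-US every time
```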
Judge Reliability
LLM-as-judge has known biases (Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, arXiv:2306.05685): position bias, verbosity bias, and self-preference (the paper calls it self-enhancement bias). Mitigations:
- Randomize answer order when comparing pairs
- Use a different model as judge than as generator
- Calibrate the judge against 100-200 human-labeled examples and report the correlation
- For high-stakes metrics, use an ensemble of judges
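Two of these mitigations can be sketched briefly. The first function evaluates a pair in both orders and accepts only a consistent verdict, a deterministic variant of randomizing order. The second reports raw agreement with human labels as a simpler stand-in for a correlation statistic. The judge signature is an assumption made for this example.

```python
from typing import Callable

# Assumed judge interface: (question, first_answer, second_answer) -> "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(judge: PairwiseJudge, question: str, a: str, b: str) -> str:
    """Ask the judge twice with the order swapped; only a consistent verdict counts."""
    a_wins_forward = judge(question, a, b) == "first"    # a shown first
    a_wins_backward = judge(question, b, a) == "second"  # a shown second
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # verdict flipped with position, so treat it as unreliable


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Share of examples where the judge's label equals the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge with pure verbosity bias."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_preference(always_first, "q", "short", "a longer answer"))    # tie
print(debiased_preference(prefers_longer, "q", "short", "a longer answer"))  # b
```

The swap exposes the position-biased judge as a tie, while the verbosity-biased judge stays consistent and goes undetected. That is why calibration against human labels is still needed.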
Regression Gates in CI
Eval only pays off when it blocks bad deploys. Pattern:
- On every PR that touches prompts, model configs, or retrievers, run the eval suite
- Compare aggregate metrics to main branch baseline
- Fail the CI job on regression beyond a threshold (e.g. ≥2% drop in faithfulness)
- Require explicit override with a reviewer comment to merge
Running the full multi-locale suite in CI is slow (minutes to hours). Split into a fast smoke suite per PR and a nightly full suite.
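A minimal sketch of the gate itself, assuming the threshold is an absolute drop in score and that metrics arrive as plain name-to-score dictionaries. The metric names and values are illustrative.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.02,
) -> dict[str, float]:
    """Return metrics whose score fell by at least max_drop versus the baseline."""
    regressions = {}
    for name, base_score in baseline.items():
        # A metric missing from the candidate run scores 0.0, which fails the gate.
        drop = base_score - candidate.get(name, 0.0)
        if drop >= max_drop - 1e-9:  # small tolerance for float rounding
            regressions[name] = round(drop, 4)
    return regressions


main_branch = {"faithfulness": 0.90, "answer_relevancy": 0.85}
pull_request = {"faithfulness": 0.86, "answer_relevancy": 0.86}

regressions = find_regressions(main_branch, pull_request)
exit_code = 1 if regressions else 0  # a CI script would pass this to sys.exit
print(regressions, exit_code)
```

Running the same function per locale lets the nightly suite block on a regression in one market even when the aggregate score holds steady.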
Dataset Maintenance
Eval datasets decay. Prompts that used to fail start to pass; prompts that used to pass become trivial; new failure modes are not represented. Discipline:
- Add every production failure to the golden set (with permission and redaction)
- Version the dataset; never silently modify historical cases
- Rotate out cases that no model fails
- Keep a held-out adversarial slice for red-team findings
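Two of these habits are easy to automate. The sketch below derives a dataset version from a content hash and flags saturated cases as rotation candidates. The case schema, the ids, and the three-model minimum are assumptions for illustration.

```python
import hashlib
import json


def dataset_version(cases: list[dict]) -> str:
    """Content hash of the dataset; editing or reordering any case changes it."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def saturated_cases(history: dict[str, list[bool]], min_models: int = 3) -> list[str]:
    """Case ids that every evaluated model passed, given at least min_models results."""
    return sorted(
        case_id
        for case_id, passes in history.items()
        if len(passes) >= min_models and all(passes)
    )


cases = [{"id": "vat-de", "input": "What is the VAT rate in Germany?"}]
edited = [{"id": "vat-de", "input": "What is the VAT rate in Germany today?"}]
print(dataset_version(cases) != dataset_version(edited))  # True

# One pass/fail entry per model evaluated on that case.
history = {
    "vat-de": [True, True, True],
    "vat-fr": [True, False, True],
    "vat-jp": [True, True],
}
print(saturated_cases(history))  # ['vat-de']
```

Storing the version string next to every eval result makes a silent dataset edit show up as a version change rather than as an unexplained metric shift.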
Closing
Geo-aware eval is not a nice-to-have for global products; it is the only way to notice the failure modes that users in non-primary markets actually experience. The engineering cost is low: a proxy-per-region configuration and a loop around your existing eval harness. The payoff is catching regressions that single-region CI cannot see.