LLM Evaluation Pipelines with Geo-Aware Testing

By Hex Proxies Engineering Team

Evaluation is the discipline that separates an LLM feature from an LLM product. The canonical advice is to build a golden set, run it on every model change, and gate deploys on metric movement. What that advice undersells is the geographic dimension: once an application fetches external data, the same model behind the same prompt can return different answers depending on the request's apparent location. If your users are global, your evals need to be too.

This post covers eval pipeline design, the frameworks worth using, and the specific patterns for geo-aware testing of LLM applications that touch external data.

What an LLM Eval Actually Measures

Eval is a catch-all for several distinct things:

  • Task accuracy: exact match, F1, BLEU, ROUGE on a labeled dataset
  • Faithfulness: does the output stick to the provided context (RAG)
  • Hallucination rate: does the output assert facts not grounded in source
  • Instruction following: does the output obey format and constraint rules
  • Safety: toxicity, PII leakage, jailbreak susceptibility
  • Stability: output consistency under paraphrase and adversarial prompts
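
As a reference point for the task-accuracy bullet, here is a minimal sketch of exact match and token-level F1. The lowercase-and-whitespace normalization is an assumption made for brevity; real harnesses usually also strip punctuation and articles.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase and split on whitespace only.
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> bool:
    # True when both strings normalize to the same token sequence.
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    # Harmonic mean of token precision and recall, counting repeated tokens.
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("19% Standard Rate", "19% standard rate"))
    print(round(token_f1("the rate is 19%", "19% standard rate"), 3))
```

Exact match is strict and cheap; token F1 gives partial credit, which is why the two are usually reported together.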

Most production systems care most about faithfulness and instruction following. Academic benchmarks (MMLU, HellaSwag, GPQA) measure general knowledge and reasoning instead, and their scores correlate weakly with production quality.

Frameworks

Promptfoo

Promptfoo (promptfoo.dev) is a YAML-driven eval harness with first-class support for regression testing across prompts and models. Good for matrix testing: N prompts × M models × K test cases. Integrates with CI.

DeepEval

DeepEval (confident-ai.com) is a pytest-style Python framework. It ships with metric implementations for faithfulness, answer relevancy, contextual precision, hallucination, and G-Eval (the LLM-as-judge pattern from Liu et al., arXiv:2303.16634).

RAGAS

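RAGAS (arXiv:2309.15217) is focused on RAG systems. The key metrics -- faithfulness, answer relevance, context precision, context recall -- are computed by LLM-based judges using metric-specific prompts. Most of them are reference-free; context recall is the exception, since it compares the retrieved context against a ground-truth answer.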

Inspect AI and lm-eval-harness

UK AISI's Inspect and EleutherAI's lm-eval-harness target academic-style evaluations. Useful for model selection, less useful for application-level eval.

A Minimal Eval Loop with DeepEval

from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# my_rag_pipeline and fetched_doc_text stand in for your own application code:
# the function under test and the document text it retrieved for the question.
question = "What is the VAT rate in Germany?"

cases = [
    LLMTestCase(
        input=question,
        actual_output=my_rag_pipeline(question),
        expected_output="19% standard rate",
        retrieval_context=[fetched_doc_text],
    ),
    # ...
]

# Each metric is scored by an LLM judge; a case passes a metric when its
# score meets the threshold.
metrics = [
    FaithfulnessMetric(threshold=0.8, model="gpt-4o"),
    AnswerRelevancyMetric(threshold=0.75, model="gpt-4o"),
]

results = evaluate(cases, metrics)

Geo-Aware Testing: Why It Matters

Any LLM feature that fetches web data at inference time -- RAG with a live retriever, agents with browser tools, functions that call geo-sensitive APIs -- produces outputs that depend on the egress location of the request. Common failure modes:

  • A search-augmented assistant answering a legal question returns US-centric rules when the user is in the EU.
  • A shopping agent quotes prices in the wrong currency because the retailer detected a US IP and defaulted to USD.
  • A RAG system pulls the wrong language edition of Wikipedia because the fetch went out through an English-speaking region.
  • A multilingual chat assistant responds in English because no locale signal reached the retriever.

Running the eval from a single region hides all of these. You need to parameterize the eval run by egress location and rerun the golden set per locale.
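
One way to parameterize a single fetch by egress location, sketched with the standard library only. The proxy URL below is a placeholder, and sending an Accept-Language header alongside the proxy is an assumption about how the locale signal should reach the target site; adapt both to your retriever's HTTP client.

```python
import urllib.request


def make_opener(locale: str, proxy_url: str) -> urllib.request.OpenerDirector:
    # Scope the proxy to this opener instead of the whole process, so other
    # traffic (for example calls to a judge model) is left untouched.
    proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(proxy)
    # Assumption: the target site also looks at Accept-Language.
    opener.addheaders = [("Accept-Language", locale)]
    return opener


def fetch(opener: urllib.request.OpenerDirector, url: str, timeout: float = 10.0) -> bytes:
    # Perform the request through the locale-specific opener.
    with opener.open(url, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Placeholder proxy endpoint; no request is sent here.
    opener = make_opener("de-DE", "http://proxy.example:8080")
    print(dict(opener.addheaders))
```

Keeping the proxy on a per-retriever opener makes it explicit which traffic is supposed to look local and which is not.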

Geo-Eval Pattern

The minimum setup:

  1. Define a list of target locales with matching proxy regions. Example: (en-US, US), (en-GB, UK), (de-DE, DE), (fr-FR, FR), (ja-JP, JP), (pt-BR, BR).
  2. For each locale, route all external fetches through a proxy with egress in that region.
  3. Run the full golden set and compute metrics per locale.
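  4. Alert on cross-locale divergence beyond a threshold.

import os

LOCALES = [
    ("en-US", "http://us.hexproxies.com:7777"),
    ("en-GB", "http://uk.hexproxies.com:7777"),
    ("de-DE", "http://de.hexproxies.com:7777"),
    ("fr-FR", "http://fr.hexproxies.com:7777"),
    ("ja-JP", "http://jp.hexproxies.com:7777"),
]

for locale, proxy in LOCALES:
    # Route outbound fetches through the regional proxy. This only affects
    # HTTP clients that honor these variables, and it may also reroute the
    # judge model's API calls unless that client is configured separately.
    os.environ["HTTP_PROXY"] = proxy
    os.environ["HTTPS_PROXY"] = proxy
    # build_cases is a placeholder for your own helper that re-runs the
    # pipeline on the golden set. Outputs generated before the proxy change
    # would score the same answers for every locale.
    cases = build_cases()
    results = evaluate(cases, metrics)
    # publish is a placeholder for your metrics sink.
    publish(f"eval.{locale}", results)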
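
Step 4 of the pattern, alerting on cross-locale divergence, can be sketched as a comparison against the best-scoring locale. The function name, the 0.05 default gap, and the assumption that each locale's results have already been reduced to one aggregate score are all illustrative choices, not part of any framework.

```python
def divergent_locales(scores: dict[str, float], max_gap: float = 0.05) -> list[str]:
    # Flag every locale whose aggregate score trails the best locale
    # by more than max_gap (an absolute difference, not a percentage).
    if not scores:
        return []
    best = max(scores.values())
    return sorted(locale for locale, score in scores.items() if best - score > max_gap)


if __name__ == "__main__":
    faithfulness_by_locale = {"en-US": 0.91, "de-DE": 0.89, "ja-JP": 0.78}
    print(divergent_locales(faithfulness_by_locale))  # ['ja-JP']
```

Comparing against the best locale rather than a fixed floor keeps the alert meaningful when all locales improve or degrade together.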

Hallucination Detection Per Locale

Hallucination rate often varies by locale. Two reasons: pretraining data is English-dominant, and retrieval quality is worse in lower-resource languages. Measure hallucination rate per locale with a grounded-faithfulness judge -- ask a separate model whether each factual claim in the answer is supported by the retrieved context. RAGAS implements this as faithfulness.

A useful signal is claim coverage: decompose the answer into atomic claims, count how many have a supporting span in the context. Low coverage in one locale but high in another means your retriever is failing for that locale, not your generator.
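
A minimal sketch of the claim-coverage calculation. It assumes the answer has already been decomposed into atomic claims (typically by an LLM) and takes the support check as a pluggable function; the token-overlap check shown here is a crude stand-in for a real entailment or judge call, and its 0.8 threshold is arbitrary.

```python
from typing import Callable


def token_overlap_supported(claim: str, context: str, min_overlap: float = 0.8) -> bool:
    # Crude stand-in for a judge: a claim counts as supported when most of
    # its tokens also appear somewhere in the context.
    claim_tokens = set(claim.lower().split())
    context_tokens = set(context.lower().split())
    if not claim_tokens:
        return False
    return len(claim_tokens & context_tokens) / len(claim_tokens) >= min_overlap


def claim_coverage(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool] = token_overlap_supported,
) -> float:
    # Fraction of claims with support in the context. An answer with no
    # claims is scored 0.0 here; that is a design choice, not a standard.
    if not claims:
        return 0.0
    return sum(1 for claim in claims if is_supported(claim, context)) / len(claims)


if __name__ == "__main__":
    context = "germany applies a standard vat rate of 19% and a reduced rate of 7%"
    claims = ["standard vat rate of 19%", "reduced rate of 5%"]
    print(claim_coverage(claims, context))  # 0.5
```

Computing this per locale with the same support check isolates the retriever: the generator and the judge stay fixed while only the retrieved context changes.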

Multilingual Response Testing

Whether the model responds in the right language is a distinct axis from whether it is correct. Detect response language with fastText lid.176 or CLD3 and compute a language-match rate per locale. A 90%+ match rate is the usual bar; below 80% the system is routinely answering in the wrong language.
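
The language-match rate can be sketched as follows. The detect argument is a stand-in for whatever wrapper you write around fastText or CLD3; it is assumed to return a lowercase ISO 639-1 code such as "de". Deriving the expected language from the first part of the locale tag is also an assumption that breaks down for locales where script or region matters.

```python
from typing import Callable


def expected_language(locale: str) -> str:
    # "de-DE" -> "de". Assumes a simple language-REGION locale tag.
    return locale.split("-")[0].lower()


def language_match_rate(
    responses: list[str],
    expected_lang: str,
    detect: Callable[[str], str],
) -> float:
    # Fraction of responses whose detected language equals the expected one.
    if not responses:
        raise ValueError("need at least one response to compute a match rate")
    matches = sum(1 for text in responses if detect(text) == expected_lang)
    return matches / len(responses)


if __name__ == "__main__":
    # Toy detector for demonstration only; swap in a real language-ID model.
    def toy_detect(text: str) -> str:
        return "de" if text.startswith("Der") else "en"

    responses = ["Der Regelsatz liegt bei 19%", "The standard rate is 19%"]
    print(language_match_rate(responses, expected_language("de-DE"), toy_detect))  # 0.5
```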

Judge Reliability

LLM-as-judge has known biases (Zheng et al., Judging LLM-as-a-Judge, arXiv:2306.05685): position bias, verbosity bias, self-preference. Mitigations:

  • Randomize answer order when comparing pairs
  • Use a different model as judge than as generator
  • Calibrate the judge against 100-200 human-labeled examples and report the correlation
  • For high-stakes metrics, use an ensemble of judges
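
The first mitigation, randomizing answer order, can be sketched like this. The judge callable and its "first"/"second" return convention are hypothetical; in practice it would wrap a call to the judge model.

```python
import random
from typing import Callable


def judge_pair(
    question: str,
    answer_a: str,
    answer_b: str,
    judge: Callable[[str, str, str], str],
    rng: random.Random,
) -> str:
    # Show the two answers in random order, then map the judge's positional
    # verdict ("first" or "second") back to "a" or "b".
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(question, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    winner_is_first = verdict == "first"
    if swapped:
        return "b" if winner_is_first else "a"
    return "a" if winner_is_first else "b"


if __name__ == "__main__":
    # A fully position-biased judge that always prefers the first answer.
    def biased_judge(question: str, first: str, second: str) -> str:
        return "first"

    rng = random.Random(0)
    wins = [judge_pair("q", "answer A", "answer B", biased_judge, rng) for _ in range(200)]
    print(wins.count("a"), wins.count("b"))
```

With random ordering, a judge that only looks at position produces roughly even wins for both answers instead of a systematic preference, so position bias shows up as noise rather than as a false signal.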

Regression Gates in CI

Eval only pays off when it blocks bad deploys. Pattern:

  1. On every PR that touches prompts, model configs, or retrievers, run the eval suite
  2. Compare aggregate metrics to main branch baseline
  3. Fail the CI job on regression beyond a threshold (e.g. ≥2% drop in faithfulness)
  4. Require explicit override with a reviewer comment to merge
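
The gate in steps 2 and 3 can be sketched as a small script whose exit code fails the CI job. The metric names and the reading of the 2% threshold as an absolute drop of 0.02 on a 0-1 scale are assumptions; how baseline and candidate scores are stored is left to your pipeline.

```python
import sys


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.02,
) -> dict[str, float]:
    # Return the metrics whose score dropped by max_drop or more
    # (absolute difference) relative to the baseline.
    regressions = {}
    for name, base_score in baseline.items():
        drop = base_score - candidate[name]
        if drop >= max_drop:
            regressions[name] = drop
    return regressions


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    # Exit code for CI: 0 when clean, 1 when any metric regressed.
    regressions = find_regressions(baseline, candidate)
    for name, drop in regressions.items():
        print(f"REGRESSION {name}: dropped by {drop:.3f}", file=sys.stderr)
    return 1 if regressions else 0


if __name__ == "__main__":
    main_branch = {"faithfulness": 0.90, "answer_relevancy": 0.80}
    pull_request = {"faithfulness": 0.85, "answer_relevancy": 0.81}
    exit_code = gate(main_branch, pull_request)
    print("exit code:", exit_code)  # exit code: 1
```

In a real job the last line would be sys.exit(exit_code) so that the CI runner marks the step as failed.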

Running the full multi-locale suite in CI is slow (minutes to hours). Split into a fast smoke suite per PR and a nightly full suite.
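
One possible way to carve out the smoke suite is a deterministic hash-based sample, so that every PR runs the same subset and results stay comparable between runs. The case-ID scheme and the 10% fraction are assumptions.

```python
import hashlib


def in_smoke_suite(case_id: str, fraction: float = 0.1) -> bool:
    # Map the case ID to a stable bucket in [0, 1] and keep the lowest
    # `fraction` of buckets. The same ID always lands in the same bucket.
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


if __name__ == "__main__":
    case_ids = [f"case-{i}" for i in range(1000)]
    smoke = [cid for cid in case_ids if in_smoke_suite(cid)]
    print(f"{len(smoke)} of {len(case_ids)} cases in the smoke suite")
```

Hashing rather than random sampling means a newly added case does not reshuffle which existing cases are in the smoke suite.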

Dataset Maintenance

Eval datasets decay. Prompts that used to fail start to pass; prompts that used to pass become trivial; new failure modes are not represented. Discipline:

  • Add every production failure to the golden set (with permission and redaction)
  • Version the dataset; never silently modify historical cases
  • Rotate out cases that no model fails
  • Keep a held-out adversarial slice for red-team findings
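
The rotation rule can be sketched as a query over pass/fail history. The history structure (case ID mapped to one boolean per evaluated model or run) and the minimum of five runs before a case counts as saturated are assumptions.

```python
def saturated_cases(history: dict[str, list[bool]], min_runs: int = 5) -> list[str]:
    # A case is a rotation candidate when it has enough recorded runs
    # and every one of them passed.
    return sorted(
        case_id
        for case_id, runs in history.items()
        if len(runs) >= min_runs and all(runs)
    )


if __name__ == "__main__":
    history = {
        "vat-germany": [True, True, True, True, True],
        "vat-japan": [True, False, True, True, True],
        "vat-brazil": [True, True],  # too few runs to judge
    }
    print(saturated_cases(history))  # ['vat-germany']
```

Candidates are best moved to an archived dataset version rather than deleted, which keeps the rule consistent with never silently modifying historical cases.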

Closing

Geo-aware eval is not a nice-to-have for global products; it is the only way to notice the failure modes that users in non-primary markets actually experience. The engineering cost is low: a proxy-per-region configuration and a loop around your existing eval harness. The payoff is catching regressions that single-region CI cannot see.

For related content on global testing, see our SEO rank tracking guide, which covers similar geo-routing patterns.