Detecting LLM Hallucinations by Cross-Referencing Real-Time Web Sources

Hallucination is the LLM failure mode users notice most. The model sounds confident, the output is plausible, and the factual content is subtly wrong. Every mitigation -- better prompts, higher-quality training data, RAG, tool use -- reduces the rate without eliminating it. At some point you need a detector: a layer that reads the model's output and decides whether to trust it.

This post covers grounding scores, citation verification, and RAG-based fact-checking, with the practical engineering patterns that make real-time web cross-referencing workable.

What Hallucination Means Technically

Ji et al., Survey of Hallucination in Natural Language Generation (arXiv:2202.03629), splits hallucinations into two types:

Intrinsic: the output contradicts the source context provided to the model (e.g. a RAG answer that misrepresents its retrieved documents).
Extrinsic: the output adds information not in the source, which may or may not be true.

Both need checking, but with different detectors. Intrinsic hallucinations are easier to detect: you only need the output and the source. Extrinsic ones require an external knowledge source, because the added information may be correct even though the source does not support it.

Grounding Scores: The Intrinsic Case

A grounding score measures how much of the output is supported by a given context. The general pattern:

Decompose the output into atomic claims
For each claim, ask: is there a span in the context that entails this claim?
Report the fraction of claims supported

This is implementable with an NLI model (a DeBERTa-v3-large checkpoint fine-tuned on MNLI is a strong choice) or an LLM judge. The LLM judge is more flexible but more expensive:

CHECK_PROMPT = """You will be given a CONTEXT and a CLAIM.
Decide whether the CLAIM is directly supported by the CONTEXT.
Return SUPPORTED, CONTRADICTED, or NOT_MENTIONED.

CONTEXT:
{context}

CLAIM:
{claim}

Answer (one word):"""

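FActScore (Min et al., arXiv:2305.14251) formalized the decompose-and-verify approach: it scores a long-form generation by the fraction of its atomic facts that a knowledge source supports, and the paper reports that the automated estimate tracks human judgments closely on biography generation. RAGAS's faithfulness metric is a production implementation of the same idea for RAG contexts.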
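The three steps above reduce to a few lines once a judge exists. The sketch below assumes a hypothetical `judge(context, claim)` callable that wraps either an NLI model or an LLM call using CHECK_PROMPT and returns one of its three labels; `toy_judge` is only a stand-in so the example runs.

```python
from typing import Callable, Iterable

# Label taken from CHECK_PROMPT above.
SUPPORTED = "SUPPORTED"


def grounding_score(
    claims: Iterable[str],
    context: str,
    judge: Callable[[str, str], str],
) -> float:
    """Fraction of claims the judge labels SUPPORTED given the context.

    `judge(context, claim)` is a hypothetical callable (NLI model or LLM
    call) that returns SUPPORTED, CONTRADICTED, or NOT_MENTIONED.
    """
    claims = list(claims)
    if not claims:
        # Assumption: an output with no checkable claims has nothing unsupported.
        return 1.0
    supported = sum(
        1 for claim in claims
        if judge(context, claim).strip().upper() == SUPPORTED
    )
    return supported / len(claims)


def toy_judge(context: str, claim: str) -> str:
    # Stand-in for a real model: case-insensitive substring match only.
    return SUPPORTED if claim.lower() in context.lower() else "NOT_MENTIONED"
```

With `toy_judge`, a context of "Paris is the capital of France." and the claims "paris is the capital of france" and "paris has 5 million residents" score 0.5. A real judge would replace the substring match with entailment.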

Claim Decomposition

The decomposition step is where quality is made or lost. Good claims are:

Atomic -- one assertion each
Self-contained -- no pronouns or context-dependent references
Factually checkable -- not opinions or hedges

A single call to a small LLM usually handles this well:

DECOMP_PROMPT = """Decompose the following text into a list of atomic,
self-contained factual claims. Skip opinions and hedges.

Text: {text}

Claims (one per line):"""
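Models often add bullets, numbering, or blank lines even when asked for one claim per line, so the raw completion needs light cleanup before it reaches the judge. A minimal sketch of that parsing step follows; `parse_claims` is an illustrative helper, not part of any library.

```python
import re

# Leading bullet ("-", "*", "•") or numbering ("1.", "2)") followed by whitespace.
# Requiring the whitespace keeps claims such as "3.5 million people..." intact.
_PREFIX = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")


def parse_claims(raw: str) -> list[str]:
    """Turn one-claim-per-line model output into a clean, de-duplicated list."""
    claims: list[str] = []
    seen: set[str] = set()
    for line in raw.splitlines():
        claim = _PREFIX.sub("", line).strip()
        key = claim.lower()
        if claim and key not in seen:
            seen.add(key)
            claims.append(claim)
    return claims
```

De-duplication here is exact-match on lowercased text, which only removes verbatim repeats; paraphrased duplicates pass through and are simply judged twice.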

Citation Verification

When the model produces citations, verify them. A surprising number of model-generated URLs do not exist or point to content unrelated to the claim. Pattern:

Extract cited URLs from the output
Fetch each URL with a proxy to avoid rate limits
Check HTTP status; flag 404s and redirects to unrelated content
Extract text, run grounding check against the claim that cited it
Score the output on (supported claims) / (claims with citations)

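import httpx
import trafilatura

async def verify_citation(url, claim, proxy):
    # Recent httpx releases take `proxy=`; older ones spell it `proxies=`.
    async with httpx.AsyncClient(proxy=proxy, timeout=20) as c:
        try:
            r = await c.get(url, follow_redirects=True)
        except (httpx.HTTPError, httpx.InvalidURL):
            # DNS failures, timeouts, malformed URLs: the usual fate of an invented link.
            return "unreachable"
        if r.status_code != 200:
            return "unreachable"
        # trafilatura.extract returns None when it finds no main text.
        text = trafilatura.extract(r.text)
        if not text:
            return "no_content"
        # grounding_check is the judge from the grounding section (not defined here).
        return grounding_check(text, claim)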

Proxies matter here because verification is a high-volume workload. At 1000 outputs/hour with 3 citations each, you are fetching 3000 URLs/hour across arbitrary domains; single-IP throttling will tank the verification rate.
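The first and last steps of the pattern above, extracting URLs and scoring, need no network access. The sketch below assumes `grounding_check` returns the CHECK_PROMPT labels, so a citation counts as supported only when `verify_citation` returns "SUPPORTED"; the regex is a deliberately simple approximation of URL syntax.

```python
import re
from typing import Optional

# Simple approximation: stops at whitespace, quotes, angle brackets, ")" and "]".
# URLs that legitimately contain ")" (some Wikipedia titles) will be truncated.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> list[str]:
    """Return cited URLs in order of first appearance, minus trailing punctuation."""
    urls: list[str] = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:")
        if url not in urls:
            urls.append(url)
    return urls


def citation_score(verdicts: list[str]) -> Optional[float]:
    """(supported claims) / (claims with citations); None if nothing was cited."""
    if not verdicts:
        return None
    return sum(v == "SUPPORTED" for v in verdicts) / len(verdicts)
```

Returning `None` for an output without citations keeps "nothing to verify" distinct from "everything failed verification", which would otherwise both read as 0.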

RAG-Based Fact-Checking for Extrinsic Claims

When the claim is extrinsic -- no context was provided -- you need an external knowledge source. The pattern:

Decompose output into claims
For each claim, issue a search query (Google, Bing, Brave, a vertical API)
Fetch the top 3-5 results through a proxy
Extract text, run NLI or LLM judge to check support
Aggregate into a confidence score

This is effectively a small per-claim RAG system. Latency is the issue -- several seconds per claim is normal. That makes it suitable for batch post-hoc checking, or as a pre-response filter on high-value responses where the latency can be absorbed.
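The aggregation step is where design choices hide, since the top results for one claim often disagree. The sketch below shows one possible rule, an assumption rather than an established method: a claim is only called supported or contradicted when at least `min_sources` pages agree, and confidence is the agreeing share of pages that took a position.

```python
from collections import Counter


def aggregate_verdicts(verdicts: list[str], min_sources: int = 2) -> tuple[str, float]:
    """Combine per-page labels for one claim into (label, confidence).

    `verdicts` holds one CHECK_PROMPT-style label per fetched page.
    The voting rule and the `min_sources` default are illustrative.
    """
    counts = Counter(verdicts)
    sup, con = counts["SUPPORTED"], counts["CONTRADICTED"]
    decided = sup + con
    if decided == 0:
        return "unverified", 0.0  # no page took a position
    if sup == con:
        return "disputed", 0.5  # sources split evenly
    label, votes = ("supported", sup) if sup > con else ("contradicted", con)
    if votes < min_sources:
        return "unverified", 0.0  # too few agreeing pages to trust
    return label, votes / decided
```

Requiring two agreeing pages guards against a single low-quality result deciding the outcome, at the price of leaving more claims unverified.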

Self-Consistency as a Cheap Detector

Before reaching for external sources, try self-consistency. Wang et al., Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv:2203.11171), showed that sampling multiple reasoning paths and taking the majority answer improves accuracy on arithmetic and commonsense reasoning benchmarks. The same idea generalizes to hallucination detection: generate N answers, compare their factual content, and flag claims that appear in some but not all. High-variance claims are more likely to be hallucinations.

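SelfCheckGPT (Manakul et al., arXiv:2303.08896) formalized this as a zero-resource, black-box check and showed that it can detect non-factual sentences without an external knowledge base. The cost is a handful of extra samples from the same model rather than a retrieval pipeline.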
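The flagging rule can be sketched as a support count per claim across samples. This is a strong simplification: it assumes each sample has already been decomposed into claims and normalized so that the same fact yields an identical string, whereas SelfCheckGPT compares sentences against samples with softer measures such as NLI.

```python
def inconsistent_claims(
    samples: list[set[str]],
    min_support: float = 1.0,
) -> dict[str, float]:
    """Map each low-support claim to the fraction of samples containing it.

    `samples` holds one set of normalized claims per sampled answer.
    Claims whose support is below `min_support` are returned as suspects.
    """
    if not samples:
        return {}
    n = len(samples)
    counts: dict[str, int] = {}
    for claims in samples:
        for claim in claims:
            counts[claim] = counts.get(claim, 0) + 1
    return {c: k / n for c, k in counts.items() if k / n < min_support}
```

With the default `min_support=1.0`, any claim missing from at least one sample is flagged; lowering the threshold trades recall for fewer false alarms on answers that simply vary in detail.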

Uncertainty from Model Internals

If you control the generation, token-level log-probabilities give a cheap signal. Low-confidence tokens in factual positions (entities, numbers, dates) correlate with hallucination. Kadavath et al., Language Models (Mostly) Know What They Know (arXiv:2207.05221), demonstrated that models can be trained or prompted to report their own confidence with useful calibration.

For API-only models you typically get top-k logprobs per token (OpenAI) or nothing (Anthropic). With logprobs you can:

Identify tokens below a threshold
Compute perplexity over named-entity spans
Flag responses with unusually low mean logprob
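All three signals are simple arithmetic over per-token logprobs. The sketch below assumes the tokens arrive as `(text, logprob)` pairs with natural-log probabilities, and that entity spans come from a separate NER step not shown here; the default threshold of log(0.5) is an arbitrary starting point.

```python
import math

Token = tuple[str, float]  # (token text, natural-log probability)


def low_confidence_tokens(tokens: list[Token], threshold: float = math.log(0.5)) -> list[Token]:
    """Tokens whose logprob falls below the threshold (default: p < 0.5)."""
    return [(text, lp) for text, lp in tokens if lp < threshold]


def mean_logprob(tokens: list[Token]) -> float:
    """Average logprob over the whole response."""
    if not tokens:
        raise ValueError("no tokens")
    return sum(lp for _, lp in tokens) / len(tokens)


def span_perplexity(tokens: list[Token], start: int, end: int) -> float:
    """Perplexity over tokens[start:end], e.g. a named-entity span."""
    span = tokens[start:end]
    if not span:
        raise ValueError("empty span")
    return math.exp(-sum(lp for _, lp in span) / len(span))
```

A single token with probability 0.2 has span perplexity 1 / 0.2 = 5, so a date or number the model was unsure about stands out even when the response-level mean looks healthy.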

A Composite Detector

No single detector catches everything. A workable stack:

Self-consistency (cheap, catches high-variance claims)
Grounding score on provided context (if any)
Citation verification on any URLs in the output
Web fact-check on high-value claims (entities, numbers, dates)
LLM-judge final pass with access to all signals

Log every stage's output so you can analyze which signals matter for your workload and drop the ones that do not.
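One way to get that logging for free is to route every stage through a single runner. In this sketch each stage is a hypothetical callable that takes the model output and returns a score in [0, 1], higher meaning more trustworthy; the returned dict is what a final LLM judge would receive alongside the output.

```python
import json
import logging
import time
from typing import Callable

log = logging.getLogger("hallucination_detector")

Stage = tuple[str, Callable[[str], float]]  # (stage name, scoring callable)


def run_stack(output: str, stages: list[Stage]) -> dict[str, float]:
    """Run stages in order and emit one JSON log record per stage."""
    signals: dict[str, float] = {}
    for name, stage in stages:
        started = time.perf_counter()
        signals[name] = stage(output)
        elapsed_ms = round((time.perf_counter() - started) * 1000, 1)
        log.info(json.dumps({"stage": name, "score": signals[name], "ms": elapsed_ms}))
    return signals
```

Logging the score and the elapsed time together makes the later pruning decision concrete: a stage that rarely changes the final verdict and costs a large share of the budget is the first candidate to drop.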

Latency Budgets

The detector stack only fits the latency budget if you parallelize. For a 5-second tolerance:

Claim decomposition: ~500ms (one small LLM call)
Self-consistency sampling: ~1-2s (if done during original generation)
Parallel fetches for verification: ~2s
Grounding judge: ~500ms-1s

Stream to the user optimistically, then flag or retract specific claims after the detector finishes. UX-wise, this looks like a delayed warning badge rather than a blocked response.

Evaluating the Detector

Detectors need their own evaluation. Metrics:

Precision: of claims flagged as hallucinations, how many actually are
Recall: of actual hallucinations, how many were flagged
AUROC: threshold-independent discrimination

Benchmarks: HaluEval (Li et al., arXiv:2305.11747), TruthfulQA (Lin et al., arXiv:2109.07958), FActScore's own biography set. None is sufficient alone; build a domain-specific set from your production traffic.
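For a small labeled set, all three metrics fit in plain Python. In this sketch `labels` marks which claims are truly hallucinated, `flags` holds the detector's binary decisions, and `scores` holds its raw hallucination scores; AUROC is computed by pairwise comparison, which is quadratic and only meant for evaluation sets of modest size.

```python
def precision_recall(flags: list[bool], labels: list[bool]) -> tuple[float, float]:
    """Precision and recall of binary flags against ground-truth labels."""
    tp = sum(f and l for f, l in zip(flags, labels))
    fp = sum(f and not l for f, l in zip(flags, labels))
    fn = sum(l and not f for f, l in zip(flags, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def auroc(scores: list[float], labels: list[bool]) -> float:
    """Probability that a random hallucination outscores a random
    non-hallucination; ties count as half a win."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For labels `[True, True, False, False]` and scores `[0.9, 0.4, 0.5, 0.1]`, three of the four positive/negative pairs are ranked correctly, so AUROC is 0.75; thresholding at 0.45 flags the first and third claims, giving precision 0.5 and recall 0.5.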

Proxy Role in Cross-Referencing

Real-time web cross-referencing is a high-volume, geographically distributed fetch workload. Single-IP setups cap out quickly against search APIs, Wikipedia mirrors, and arbitrary citation targets. A rotating proxy pool keeps verification throughput matched to generation throughput, which is what lets the detector actually run in production rather than offline.
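The simplest rotation policy is round-robin. The sketch below is a minimal pool whose `next()` result can be passed as the `proxy` argument of `verify_citation`; the proxy URLs are placeholders, and a production pool would also need health checks and per-domain backoff, which are omitted here.

```python
import itertools


class ProxyPool:
    """Round-robin rotation over a fixed list of proxy URLs."""

    def __init__(self, proxies: list[str]) -> None:
        if not proxies:
            raise ValueError("proxy list is empty")
        self._cycle = itertools.cycle(proxies)

    def next(self) -> str:
        """Return the next proxy URL, wrapping around at the end of the list."""
        return next(self._cycle)


# Placeholder addresses for illustration only.
pool = ProxyPool(["http://proxy1.example:8080", "http://proxy2.example:8080"])
```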

Closing

Hallucination detection is not one technique; it is a layered pipeline of cheap checks that escalate to expensive ones. Start with self-consistency and grounding on provided context, add citation verification, and reach for external fact-checking only on high-value claims. Built this way, a detector can catch a large share of hallucinations at a fraction of the cost of verifying everything; how large is something only your own evaluation set can tell you.

For related reading see our RAG pipeline guide.