Building a Proxy Health Monitor with Prometheus and Grafana

Last updated: April 2026 | Author: Hex Proxies Team

TL;DR: Monitoring proxy health is essential for maintaining reliable scraping and data collection operations. This guide walks through building a complete proxy health monitoring system using Prometheus for metrics collection and Grafana for visualization. It covers latency tracking, success rate monitoring, bandwidth usage, and automatic alerting, with production-ready code for monitoring both residential ($4.25/GB) and ISP ($2.08/IP) proxies through gate.hexproxies.com:8080.

Running proxy infrastructure without monitoring is like driving without a dashboard — you have no idea when something is going wrong until it is too late. Proxy health degrades gradually: success rates drop, latency increases, and bandwidth costs creep up. Without monitoring, these issues compound silently until your data pipeline fails in production.

This guide builds a complete proxy health monitoring system using the industry-standard Prometheus and Grafana stack. By the end, you will have real-time dashboards, automatic alerting, and historical data for capacity planning.

Architecture Overview

┌──────────────────────────────────────────────────┐
│             Proxy Health Prober                    │
│  Sends test requests through proxy infrastructure │
│  Exposes metrics on :8000/metrics                 │
└─────────────────┬────────────────────────────────┘
                  │ Prometheus scrapes every 15s
                  ▼
┌──────────────────────────────────────────────────┐
│             Prometheus                             │
│  Stores time-series metrics                       │
│  Evaluates alerting rules                         │
└─────────────────┬────────────────────────────────┘
                  │ Query interface
                  ▼
┌──────────────────────────────────────────────────┐
│             Grafana                                │
│  Dashboards for proxy health visualization        │
│  Alert notifications via Slack/email/PagerDuty    │
└──────────────────────────────────────────────────┘

Step 1: The Proxy Health Prober

The prober is a Python service that sends periodic test requests through your proxy infrastructure and exposes the results as Prometheus metrics.

"""Proxy Health Prober — exposes Prometheus metrics."""
import time
from dataclasses import dataclass
from typing import List

import httpx
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# --- Prometheus Metrics ---

PROXY_REQUEST_DURATION = Histogram(
    "proxy_request_duration_seconds",
    "Time spent on proxy requests",
    ["proxy_type", "country", "target"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)

PROXY_REQUEST_TOTAL = Counter(
    "proxy_requests_total",
    "Total proxy requests",
    ["proxy_type", "country", "target", "status"]
)

PROXY_SUCCESS_RATE = Gauge(
    "proxy_success_rate",
    "Current success rate (0-1)",
    ["proxy_type", "country"]
)

PROXY_BANDWIDTH_BYTES = Counter(
    "proxy_bandwidth_bytes_total",
    "Total bandwidth through proxies",
    ["proxy_type", "country", "direction"]
)

PROXY_ACTIVE_CONNECTIONS = Gauge(
    "proxy_active_connections",
    "Current active proxy connections",
    ["proxy_type"]
)

# --- Configuration ---

@dataclass(frozen=True)
class ProbeTarget:
    url: str
    name: str
    expected_status: int = 200

@dataclass(frozen=True)
class ProxyConfig:
    proxy_type: str  # 'residential' or 'isp'
    country: str
    proxy_url: str

DEFAULT_TARGETS = [
    ProbeTarget("https://httpbin.org/ip", "httpbin"),
    ProbeTarget("https://api.ipify.org?format=json", "ipify"),
    ProbeTarget("https://www.google.com", "google"),
    ProbeTarget("https://www.amazon.com", "amazon"),
]

def create_proxy_configs(
    username: str, password: str, countries: List[str]
) -> List[ProxyConfig]:
    """Create proxy configs for monitoring."""
    configs = []
    for country in countries:
        resi_url = (
            f"http://{username}-country-{country}:{password}"
            f"@gate.hexproxies.com:8080"
        )
        configs.append(ProxyConfig("residential", country, resi_url))
    return configs
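
One caveat with building the proxy URL by string interpolation: a password containing characters such as "@", ":" or "/" produces a URL that parses incorrectly. The sketch below percent-encodes the credentials first. The helper name build_proxy_url is hypothetical, and it assumes the HTTP client decodes percent-encoded userinfo, which is the usual URL convention.

```python
from urllib.parse import quote


def build_proxy_url(
    username: str,
    password: str,
    country: str,
    host: str = "gate.hexproxies.com",
    port: int = 8080,
) -> str:
    """Build a proxy URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" makes quote() encode "/" as well, which it leaves alone by default.
    user = quote(f"{username}-country-{country}", safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{host}:{port}"


print(build_proxy_url("YOUR_USER", "p@ss:w/rd", "us"))
# http://YOUR_USER-country-us:p%40ss%3Aw%2Frd@gate.hexproxies.com:8080
```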

The Probe Loop

class ProxyHealthProber:
    def __init__(
        self,
        proxy_configs: List[ProxyConfig],
        targets: List[ProbeTarget],
        interval: float = 30.0
    ):
        self.proxy_configs = proxy_configs
        self.targets = targets
        self.interval = interval
        self._success_counts: dict = {}
        self._total_counts: dict = {}

    def probe_once(
        self, proxy: ProxyConfig, target: ProbeTarget
    ) -> None:
        """Send a single probe request and record metrics."""
        labels = {
            "proxy_type": proxy.proxy_type,
            "country": proxy.country,
            "target": target.name
        }
        start = time.monotonic()
        try:
            response = httpx.get(
                target.url,
                proxy=proxy.proxy_url,  # httpx 0.26+; the older "proxies" argument was removed in 0.28
                timeout=15.0,
                follow_redirects=True
            )
            duration = time.monotonic() - start
            status = "success" if response.status_code == target.expected_status else "error"

            PROXY_REQUEST_DURATION.labels(**labels).observe(duration)
            PROXY_REQUEST_TOTAL.labels(**labels, status=status).inc()

            # Track bandwidth
            resp_size = len(response.content)
            PROXY_BANDWIDTH_BYTES.labels(
                proxy_type=proxy.proxy_type,
                country=proxy.country,
                direction="download"
            ).inc(resp_size)

            self._update_success_rate(
                proxy, status == "success"
            )

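        except httpx.RequestError as exc:
            duration = time.monotonic() - start
            # Label only real timeouts as "timeout"; other transport
            # failures (DNS, refused connection, proxy errors) are "error".
            status = (
                "timeout" if isinstance(exc, httpx.TimeoutException)
                else "error"
            )
            PROXY_REQUEST_DURATION.labels(**labels).observe(duration)
            PROXY_REQUEST_TOTAL.labels(**labels, status=status).inc()
            self._update_success_rate(proxy, False)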

    def _update_success_rate(
        self, proxy: ProxyConfig, success: bool
    ) -> None:
        """Update the cumulative success rate (counted since process start)."""
        key = f"{proxy.proxy_type}:{proxy.country}"
        if key not in self._total_counts:
            self._success_counts[key] = 0
            self._total_counts[key] = 0
        self._total_counts[key] += 1
        if success:
            self._success_counts[key] += 1
        rate = self._success_counts[key] / self._total_counts[key]
        PROXY_SUCCESS_RATE.labels(
            proxy_type=proxy.proxy_type,
            country=proxy.country
        ).set(rate)

    def run(self) -> None:
        """Main probe loop."""
        while True:
            for proxy in self.proxy_configs:
                for target in self.targets:
                    self.probe_once(proxy, target)
            time.sleep(self.interval)


def main():
    # Start Prometheus metrics server
    start_http_server(8000)
    print("Prometheus metrics server on :8000")

    # Configure proxies to monitor
    configs = create_proxy_configs(
        username="YOUR_USER",
        password="YOUR_PASS",
        countries=["us", "gb", "de", "jp"]
    )

    # Start probing
    prober = ProxyHealthProber(
        proxy_configs=configs,
        targets=DEFAULT_TARGETS,
        interval=30.0
    )
    prober.run()


if __name__ == "__main__":
    main()
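
Note that _update_success_rate divides running totals that never reset, so after the prober has run for hours a sudden outage barely moves the gauge. One possible alternative, sketched here under the assumption that a fixed-size window of recent probes is an acceptable definition of "current", is to keep only the last N outcomes per key. The class name RollingSuccessRate and the window size are illustrative, not part of the original prober.

```python
from collections import deque
from typing import Deque, Dict


class RollingSuccessRate:
    """Success rate over the most recent `window` probes for each key."""

    def __init__(self, window: int = 100) -> None:
        self.window = window
        self._results: Dict[str, Deque[bool]] = {}

    def record(self, key: str, success: bool) -> float:
        # deque(maxlen=...) silently drops the oldest outcome once full.
        results = self._results.setdefault(key, deque(maxlen=self.window))
        results.append(success)
        return sum(results) / len(results)


tracker = RollingSuccessRate(window=4)
for outcome in [True, True, False, False, False, False]:
    rate = tracker.record("residential:us", outcome)
print(rate)  # 0.0, whereas the cumulative rate would still be 2/6
```

Inside the prober, the value returned by record() could be passed straight to PROXY_SUCCESS_RATE.labels(...).set(...).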

Step 2: Prometheus Configuration

Configure Prometheus to scrape the prober's metrics endpoint:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

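rule_files:
  - "proxy_alerts.yml"

# Send firing alerts to Alertmanager (service name from docker-compose.yml)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]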

scrape_configs:
  - job_name: "proxy-health-prober"
    static_configs:
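      # Compose service name; use "localhost:8000" only when Prometheus
      # runs on the same host as the prober, outside a container.
      - targets: ["proxy-prober:8000"]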
    scrape_interval: 15s
    metrics_path: /metrics

Step 3: Alerting Rules

Define Prometheus alerting rules to catch proxy health issues before they impact your data pipeline:

# proxy_alerts.yml
groups:
  - name: proxy_health
    interval: 30s
    rules:
      # Alert when success rate drops below 85%
      - alert: ProxySuccessRateLow
        expr: proxy_success_rate < 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Proxy success rate below 85%"
          description: >-
            {{ $labels.proxy_type }} proxy in
            {{ $labels.country }} has success rate
            {{ $value | humanizePercentage }}

      # Alert when success rate is critical
      - alert: ProxySuccessRateCritical
        expr: proxy_success_rate < 0.70
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Proxy success rate critically low"
          description: >-
            {{ $labels.proxy_type }} proxy in
            {{ $labels.country }} at
            {{ $value | humanizePercentage }} success rate

      # Alert on high latency
      - alert: ProxyLatencyHigh
        expr: >
          histogram_quantile(0.95,
            rate(proxy_request_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Proxy P95 latency above 5 seconds"

      # Alert on excessive bandwidth (cost control)
      - alert: ProxyBandwidthHigh
        expr: >
          rate(proxy_bandwidth_bytes_total[1h]) * 3600
          > 1073741824
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Proxy bandwidth exceeding 1 GB/hour"

Step 4: Grafana Dashboard

Create a Grafana dashboard with a panel for each key metric. The query, visualization type, and main settings for the core panels are listed below:

Panel 1: Success Rate by Proxy Type and Country

Query: proxy_success_rate
Visualization: Stat panel
Thresholds: Red < 0.80, Yellow < 0.90, Green >= 0.90

Panel 2: Request Latency Distribution

Query: histogram_quantile(0.50, rate(proxy_request_duration_seconds_bucket[5m]))
       histogram_quantile(0.95, rate(proxy_request_duration_seconds_bucket[5m]))
       histogram_quantile(0.99, rate(proxy_request_duration_seconds_bucket[5m]))
Visualization: Time series graph
Legend: P50 / P95 / P99
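
These percentiles are estimated from cumulative bucket counts rather than from raw samples, so their precision is limited by the bucket boundaries configured in the prober. The following is a simplified Python sketch of that kind of bucket interpolation, not Prometheus's actual implementation; the function name and the sample counts are invented for illustration.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The highest bucket is expected to be +Inf, as in a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Rank lies beyond the largest finite bucket: report that bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented cumulative counts for 100 probes.
sample = [(0.5, 50), (1.0, 80), (2.0, 95), (5.0, 100), (float("inf"), 100)]
print(estimate_quantile(0.95, sample))
```

With these counts, the 95th observation falls exactly at the top of the 1.0 to 2.0 second bucket, so the estimate is about 2 seconds even if the real probes in that bucket all finished in 1.1 seconds. Finer buckets around the alert threshold give tighter estimates.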

Panel 3: Bandwidth Consumption

Query: rate(proxy_bandwidth_bytes_total[5m]) * 300
Visualization: Time series, unit: bytes
Group by: proxy_type, country

Panel 4: Request Rate and Error Distribution

Query: rate(proxy_requests_total[5m])
Visualization: Stacked bar chart
Group by: status (success, error, timeout)

Panel 5: Cost Estimation

Query: increase(proxy_bandwidth_bytes_total{proxy_type="residential"}[24h]) / 1073741824 * 4.25
Visualization: Stat panel, unit: currency (USD)
Title: Estimated Daily Residential Proxy Cost

Step 5: Docker Compose Deployment

Deploy the entire monitoring stack with Docker Compose:

# docker-compose.yml
version: "3.8"

services:
  proxy-prober:
    build: ./prober
    ports:
      - "8000:8000"
    environment:
      - PROXY_USERNAME=YOUR_USER
      - PROXY_PASSWORD=YOUR_PASS
      - PROBE_INTERVAL=30
      - PROBE_COUNTRIES=us,gb,de,jp,br
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.52.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./proxy_alerts.yml:/etc/prometheus/proxy_alerts.yml
      - prometheus_data:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.0.0
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
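
The Compose file passes PROXY_USERNAME, PROXY_PASSWORD, PROBE_INTERVAL, and PROBE_COUNTRIES to the prober container, while main() in Step 1 uses hardcoded values. A minimal sketch of reading those variables follows; the ProberSettings class, the load_settings helper, and the fallback defaults ("us" and 30 seconds) are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class ProberSettings:
    username: str
    password: str
    interval: float
    countries: List[str]


def load_settings(env: Mapping[str, str] = os.environ) -> ProberSettings:
    """Read prober settings from environment variables (hypothetical helper)."""
    raw_countries = env.get("PROBE_COUNTRIES", "us")  # default is an assumption
    countries = [c.strip().lower() for c in raw_countries.split(",") if c.strip()]
    return ProberSettings(
        username=env["PROXY_USERNAME"],  # required: KeyError if missing
        password=env["PROXY_PASSWORD"],  # required: KeyError if missing
        interval=float(env.get("PROBE_INTERVAL", "30")),
        countries=countries,
    )


settings = load_settings({
    "PROXY_USERNAME": "YOUR_USER",
    "PROXY_PASSWORD": "YOUR_PASS",
    "PROBE_INTERVAL": "30",
    "PROBE_COUNTRIES": "us,gb,de,jp,br",
})
print(settings.countries, settings.interval)
```

The resulting values could then be passed to create_proxy_configs and ProxyHealthProber in place of the literals in main().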

Step 6: Alertmanager Configuration

Route alerts to Slack, email, or PagerDuty:

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "proxy_type"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "slack-notifications"
  routes:
    - matchers:  # "match" is deprecated in current Alertmanager releases
        - severity="critical"
      receiver: "pagerduty-critical"

receivers:
  - name: "slack-notifications"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#proxy-alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_KEY"

Monitoring ISP vs Residential Proxies

Different proxy types need different monitoring strategies:

Metric	Residential Proxies	ISP Proxies
Success rate baseline	92-97% (varies by target)	98-99.9%
Latency P95 baseline	500ms-3s	100-500ms
Key alert	Bandwidth cost spike	Uptime drop
Rotation monitoring	IP diversity per session	N/A (static IP)
Cost metric	$/GB consumed	$/IP/month (fixed)

For ISP proxies at $2.08/IP, uptime monitoring is the primary concern — the cost is fixed regardless of bandwidth. For residential proxies at $4.25/GB, bandwidth tracking is essential for cost control. Visit our pricing page to plan your proxy budget.
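
The difference between the two cost models can be sketched as a small function. This assumes the $2.08/IP figure is billed monthly, as the "$/IP/month" row in the table suggests; the function name and the example volumes are hypothetical.

```python
RESIDENTIAL_USD_PER_GB = 4.25
ISP_USD_PER_IP = 2.08  # assumed to be a monthly price


def monthly_cost_usd(proxy_type: str, gb_used: float = 0.0, ip_count: int = 0) -> float:
    """Estimate monthly proxy cost under the two pricing models."""
    if proxy_type == "residential":
        return gb_used * RESIDENTIAL_USD_PER_GB  # scales with traffic
    if proxy_type == "isp":
        return ip_count * ISP_USD_PER_IP  # fixed, independent of traffic
    raise ValueError(f"unknown proxy type: {proxy_type}")


print(monthly_cost_usd("residential", gb_used=100))  # 425.0
print(round(monthly_cost_usd("isp", ip_count=10), 2))  # 20.8
```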

Advanced: Custom Metrics for Your Use Case

Extend the prober with custom metrics specific to your scraping operation:

# Add to the prober for scraping-specific monitoring

SCRAPE_PAGES_TOTAL = Counter(
    "scrape_pages_total",
    "Total pages scraped",
    ["proxy_type", "target_site", "status"]
)

SCRAPE_DATA_QUALITY = Gauge(
    "scrape_data_quality_score",
    "Data quality score (0-1) for scraped content",
    ["target_site"]
)

PROXY_COST_ESTIMATE_USD = Gauge(
    "proxy_cost_estimate_usd",
    "Estimated proxy cost in USD",
    ["proxy_type", "period"]
)

def update_cost_metrics(bandwidth_bytes: int, proxy_type: str):
    """Update cost estimation metrics."""
    gb = bandwidth_bytes / (1024 ** 3)
    if proxy_type == "residential":
        cost = gb * 4.25  # $4.25/GB
    else:
        cost = 0  # ISP is fixed cost per IP
    PROXY_COST_ESTIMATE_USD.labels(
        proxy_type=proxy_type, period="daily"
    ).set(cost)

Operational Runbook

When Success Rate Drops

Check which targets are failing (some sites may have changed their anti-bot rules)
Verify proxy credentials are valid (test manually with curl)
Check if the issue is geographic (one country may be affected)
If residential, try rotating to a new session
If ISP, check if the specific IP has been blocked and request a replacement
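
The first and third checks amount to grouping recent probe outcomes by target and by country and seeing where the failures cluster. A rough sketch, assuming probe outcomes are available as (country, target, success) tuples; the data below is invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

ProbeResult = Tuple[str, str, bool]  # (country, target, success)


def failure_rate_by(results: Iterable[ProbeResult], index: int) -> Dict[str, float]:
    """Failure rate grouped by field 0 (country) or field 1 (target)."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for result in results:
        key = result[index]
        totals[key] += 1
        if not result[2]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


results = [
    ("us", "httpbin", True), ("us", "amazon", False),
    ("de", "httpbin", True), ("de", "amazon", False),
    ("jp", "httpbin", True), ("jp", "amazon", True),
]
print(failure_rate_by(results, 1))  # by target
print(failure_rate_by(results, 0))  # by country
```

In this invented data every failure is on one target while the failures are spread across countries, which points at the target's anti-bot rules rather than at a regional proxy problem.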

When Latency Increases

Check if the issue is target-specific or proxy-wide
Compare latency across countries — a single country may have routing issues
Verify the target site has not added JavaScript challenges (which increase response time)
Check your network connectivity independently of the proxy

When Bandwidth Spikes

Identify which collector or agent is consuming excessive bandwidth
Check for infinite loops in scraping logic
Verify you are not downloading large assets (images, videos) unintentionally
Review your rate limiting configuration

Frequently Asked Questions

How often should I probe proxy health?

Every 30-60 seconds per proxy-target combination provides a good balance between monitoring granularity and probe traffic. At 30-second intervals with 4 proxy configurations and 4 targets, that is 16 probes per 30 seconds — minimal bandwidth impact. For ISP proxies monitoring critical operations, probe every 15 seconds. For residential proxies used in batch operations, 60-second intervals are sufficient.
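
The effective probe period depends on how long the probes themselves take, because the loop in Step 1 runs them one after another and then sleeps for the interval. A quick worst-case calculation, using the 15-second timeout from probe_once (the function name is illustrative):

```python
from typing import Tuple


def cycle_stats(
    num_proxies: int, num_targets: int, interval_s: float, timeout_s: float
) -> Tuple[int, float, float]:
    """Return (probes per cycle, worst-case probe time, worst-case period)."""
    probes = num_proxies * num_targets
    # Worst case: every sequential probe runs until it hits the timeout.
    worst_case_probe_time = probes * timeout_s
    # The loop sleeps for the full interval after each pass.
    worst_case_period = worst_case_probe_time + interval_s
    return probes, worst_case_probe_time, worst_case_period


print(cycle_stats(4, 4, interval_s=30.0, timeout_s=15.0))
# (16, 240.0, 270.0)
```

During a full outage the sequential loop would therefore refresh each series only every 270 seconds rather than every 30, which is worth keeping in mind when choosing alert "for" durations or deciding whether to probe concurrently.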

Does the prober itself consume proxy bandwidth?

Yes, though the amount depends on the targets you probe. Each probe request is a small page load (typically 5-50 KB). With 16 probes every 30 seconds (46,080 probes per day), daily probe bandwidth is approximately 230 MB to 2.3 GB, or roughly $1 to $10 per day at $4.25/GB for residential proxies. Probing lightweight endpoints such as httpbin or ipify keeps the figure near the low end. This is a small cost for the operational visibility it provides. ISP proxy probing has zero additional cost since bandwidth is unlimited.
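
The daily probe volume follows directly from the probe rate and the response size. A sketch using the figures from this answer (16 probes per 30 seconds, 5-50 KB per response, $4.25/GB), taking 1 KB as 1,000 bytes and 1 GB as 10^9 bytes; the function name is illustrative.

```python
from typing import Tuple


def daily_probe_cost(
    probes_per_cycle: int,
    cycle_s: float,
    bytes_per_probe: float,
    usd_per_gb: float = 4.25,
) -> Tuple[float, float]:
    """Return (GB per day, USD per day) consumed by the prober itself."""
    probes_per_day = probes_per_cycle * 86_400 / cycle_s
    gb_per_day = probes_per_day * bytes_per_probe / 1e9
    return gb_per_day, gb_per_day * usd_per_gb


for size in (5_000, 50_000):
    gb, usd = daily_probe_cost(16, 30, size)
    print(f"{size} bytes/probe: {gb:.3f} GB/day, ${usd:.2f}/day")
# 5000 bytes/probe: 0.230 GB/day, $0.98/day
# 50000 bytes/probe: 2.304 GB/day, $9.79/day
```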

Can I monitor proxies from multiple providers in the same dashboard?

Yes. Add additional ProxyConfig entries for each provider, using a label to distinguish them. The Prometheus metrics model handles multi-provider monitoring naturally. This also enables automatic failover — if your monitoring detects one provider's success rate dropping, your scraping infrastructure can automatically shift traffic to the healthier provider.
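
One way this might look is a config type with an extra provider field, plus a simple rule for choosing where to send traffic. The class name, the second provider, its URL, and the success rates below are all hypothetical; the metric definitions in Step 1 would also need "provider" added to their label lists.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ProviderProxyConfig:
    provider: str
    proxy_type: str  # 'residential' or 'isp'
    country: str
    proxy_url: str

    def metric_labels(self) -> Dict[str, str]:
        """Labels to attach to every metric recorded for this proxy."""
        return {
            "provider": self.provider,
            "proxy_type": self.proxy_type,
            "country": self.country,
        }


def pick_healthiest(success_rates: Dict[str, float]) -> str:
    """Return the provider with the highest current success rate."""
    return max(success_rates, key=success_rates.get)


configs = [
    ProviderProxyConfig("hex", "residential", "us",
                        "http://YOUR_USER-country-us:YOUR_PASS@gate.hexproxies.com:8080"),
    ProviderProxyConfig("other", "residential", "us",
                        "http://user:pass@proxy.example.com:8000"),  # placeholder URL
]
print(configs[0].metric_labels())
print(pick_healthiest({"hex": 0.96, "other": 0.81}))  # hex
```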

What Grafana plugins are useful for proxy monitoring?

The built-in Stat, Time Series, and Bar Chart panels cover most needs. For geographic visualization, the Geomap panel can show success rates on a world map using country labels. The Alerting integration supports Slack, PagerDuty, email, and webhooks out of the box. See our residential proxy and ISP proxy pages for the infrastructure these dashboards monitor.

How long should I retain proxy health metrics?

Retain raw metrics for 15-30 days and downsampled metrics for 6-12 months. Raw metrics let you investigate recent incidents in detail. Downsampled metrics enable long-term trend analysis: you can track how proxy performance changes over months and plan capacity accordingly. Prometheus keeps data for 15 days by default (configurable with the --storage.tsdb.retention.time flag) and does not downsample on its own, so the long-term tier typically requires recording rules or a remote storage backend.