Building a Proxy Health Monitor with Prometheus and Grafana
Last updated: April 2026 | Author: Hex Proxies Team
Running proxy infrastructure without monitoring is like driving without a dashboard — you have no idea when something is going wrong until it is too late. Proxy health degrades gradually: success rates drop, latency increases, and bandwidth costs creep up. Without monitoring, these issues compound silently until your data pipeline fails in production.
This guide builds a complete proxy health monitoring system using the industry-standard Prometheus and Grafana stack. By the end, you will have real-time dashboards, automatic alerting, and historical data for capacity planning.
Architecture Overview
┌──────────────────────────────────────────────────┐
│               Proxy Health Prober                │
│ Sends test requests through proxy infrastructure │
│ Exposes metrics on :8000/metrics                 │
└─────────────────┬────────────────────────────────┘
                  │ Prometheus scrapes every 15s
                  ▼
┌──────────────────────────────────────────────────┐
│                    Prometheus                    │
│ Stores time-series metrics                       │
│ Evaluates alerting rules                         │
└─────────────────┬────────────────────────────────┘
                  │ Query interface
                  ▼
┌──────────────────────────────────────────────────┐
│                     Grafana                      │
│ Dashboards for proxy health visualization        │
│ Alert notifications via Slack/email/PagerDuty    │
└──────────────────────────────────────────────────┘
Step 1: The Proxy Health Prober
The prober is a Python service that sends periodic test requests through your proxy infrastructure and exposes the results as Prometheus metrics.
"""Proxy Health Prober — exposes Prometheus metrics."""
import time
from dataclasses import dataclass
from typing import List

import httpx
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# --- Prometheus Metrics ---
PROXY_REQUEST_DURATION = Histogram(
    "proxy_request_duration_seconds",
    "Time spent on proxy requests",
    ["proxy_type", "country", "target"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)
PROXY_REQUEST_TOTAL = Counter(
    "proxy_requests_total",
    "Total proxy requests",
    ["proxy_type", "country", "target", "status"]
)
PROXY_SUCCESS_RATE = Gauge(
    "proxy_success_rate",
    "Current success rate (0-1)",
    ["proxy_type", "country"]
)
PROXY_BANDWIDTH_BYTES = Counter(
    "proxy_bandwidth_bytes_total",
    "Total bandwidth through proxies",
    ["proxy_type", "country", "direction"]
)
PROXY_ACTIVE_CONNECTIONS = Gauge(
    "proxy_active_connections",
    "Current active proxy connections",
    ["proxy_type"]
)

# --- Configuration ---
@dataclass(frozen=True)
class ProbeTarget:
    url: str
    name: str
    expected_status: int = 200


@dataclass(frozen=True)
class ProxyConfig:
    proxy_type: str  # 'residential' or 'isp'
    country: str
    proxy_url: str


DEFAULT_TARGETS = [
    ProbeTarget("https://httpbin.org/ip", "httpbin"),
    ProbeTarget("https://api.ipify.org?format=json", "ipify"),
    ProbeTarget("https://www.google.com", "google"),
    ProbeTarget("https://www.amazon.com", "amazon"),
]


def create_proxy_configs(
    username: str, password: str, countries: List[str]
) -> List[ProxyConfig]:
    """Create proxy configs for monitoring."""
    configs = []
    for country in countries:
        resi_url = (
            f"http://{username}-country-{country}:{password}"
            f"@gate.hexproxies.com:8080"
        )
        configs.append(ProxyConfig("residential", country, resi_url))
    return configs
The Probe Loop
class ProxyHealthProber:
    def __init__(
        self,
        proxy_configs: List[ProxyConfig],
        targets: List[ProbeTarget],
        interval: float = 30.0
    ):
        self.proxy_configs = proxy_configs
        self.targets = targets
        self.interval = interval
        self._success_counts: dict = {}
        self._total_counts: dict = {}

    def probe_once(
        self, proxy: ProxyConfig, target: ProbeTarget
    ) -> None:
        """Send a single probe request and record metrics."""
        labels = {
            "proxy_type": proxy.proxy_type,
            "country": proxy.country,
            "target": target.name
        }
        start = time.monotonic()
        try:
            response = httpx.get(
                target.url,
                # httpx 0.26+ takes `proxy=`; the older `proxies=`
                # argument was removed in 0.28.
                proxy=proxy.proxy_url,
                timeout=15.0,
                follow_redirects=True
            )
            duration = time.monotonic() - start
            status = "success" if response.status_code == target.expected_status else "error"
            PROXY_REQUEST_DURATION.labels(**labels).observe(duration)
            PROXY_REQUEST_TOTAL.labels(**labels, status=status).inc()
            # Track bandwidth (decoded response body size)
            resp_size = len(response.content)
            PROXY_BANDWIDTH_BYTES.labels(
                proxy_type=proxy.proxy_type,
                country=proxy.country,
                direction="download"
            ).inc(resp_size)
            self._update_success_rate(
                proxy, status == "success"
            )
        except httpx.RequestError as exc:
            # Timeouts get their own status; other transport failures
            # (refused connections, proxy errors) are counted as "error".
            status = "timeout" if isinstance(exc, httpx.TimeoutException) else "error"
            duration = time.monotonic() - start
            PROXY_REQUEST_DURATION.labels(**labels).observe(duration)
            PROXY_REQUEST_TOTAL.labels(
                **labels, status=status
            ).inc()
            self._update_success_rate(proxy, False)

    def _update_success_rate(
        self, proxy: ProxyConfig, success: bool
    ) -> None:
        """Update the cumulative success rate since the prober started."""
        key = f"{proxy.proxy_type}:{proxy.country}"
        if key not in self._total_counts:
            self._success_counts[key] = 0
            self._total_counts[key] = 0
        self._total_counts[key] += 1
        if success:
            self._success_counts[key] += 1
        rate = self._success_counts[key] / self._total_counts[key]
        PROXY_SUCCESS_RATE.labels(
            proxy_type=proxy.proxy_type,
            country=proxy.country
        ).set(rate)

    def run(self) -> None:
        """Main probe loop: probe every proxy/target pair, then sleep."""
        while True:
            for proxy in self.proxy_configs:
                for target in self.targets:
                    self.probe_once(proxy, target)
            time.sleep(self.interval)


def main():
    # Start Prometheus metrics server
    start_http_server(8000)
    print("Prometheus metrics server on :8000")
    # Configure proxies to monitor
    configs = create_proxy_configs(
        username="YOUR_USER",
        password="YOUR_PASS",
        countries=["us", "gb", "de", "jp"]
    )
    # Start probing
    prober = ProxyHealthProber(
        proxy_configs=configs,
        targets=DEFAULT_TARGETS,
        interval=30.0
    )
    prober.run()


if __name__ == "__main__":
    main()
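The success-rate gauge in the prober divides by every probe since startup, so an outage late in a long run barely moves the value. A minimal sketch of a sliding-window alternative, assuming a fixed window of recent probes per proxy key; `RollingSuccessRate` and `window_size` are hypothetical names and not part of the prober above:

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RollingSuccessRate:
    """Success rate over the most recent `window_size` probes per key."""

    def __init__(self, window_size: int = 20) -> None:
        self._windows: Dict[str, Deque[bool]] = defaultdict(
            lambda: deque(maxlen=window_size)
        )

    def record(self, key: str, success: bool) -> float:
        """Store one probe outcome and return the updated rate (0-1)."""
        window = self._windows[key]
        window.append(success)  # a full deque drops its oldest entry
        return sum(window) / len(window)


# Four successes followed by two failures, with a window of four probes.
tracker = RollingSuccessRate(window_size=4)
for outcome in (True, True, True, True, False, False):
    rate = tracker.record("residential:us", outcome)
print(rate)  # rate over the last four outcomes only
```

The returned value could be passed to `PROXY_SUCCESS_RATE.labels(...).set(rate)` in place of the cumulative ratio. An alternative that needs no in-process state is to compute the ratio in PromQL from `proxy_requests_total`.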
Step 2: Prometheus Configuration
Configure Prometheus to scrape the prober's metrics endpoint:
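# prometheus.yml
# Inside the Step 5 Compose network, replace localhost with the
# service names (proxy-prober:8000, alertmanager:9093).
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "proxy_alerts.yml"

# Without this block, firing alerts never reach Alertmanager (Step 6).
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  - job_name: "proxy-health-prober"
    static_configs:
      - targets: ["localhost:8000"]
    scrape_interval: 15s
    metrics_path: /metrics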
Step 3: Alerting Rules
Define Prometheus alerting rules to catch proxy health issues before they impact your data pipeline:
# proxy_alerts.yml
groups:
  - name: proxy_health
    interval: 30s
    rules:
      # Alert when success rate drops below 85%
      - alert: ProxySuccessRateLow
        expr: proxy_success_rate < 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Proxy success rate below 85%"
          description: >-
            {{ $labels.proxy_type }} proxy in
            {{ $labels.country }} has success rate
            {{ $value | humanizePercentage }}
      # Alert when success rate is critical
      - alert: ProxySuccessRateCritical
        expr: proxy_success_rate < 0.70
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Proxy success rate critically low"
          description: >-
            {{ $labels.proxy_type }} proxy in
            {{ $labels.country }} at
            {{ $value | humanizePercentage }} success rate
      # Alert on high latency
      - alert: ProxyLatencyHigh
        expr: >
          histogram_quantile(0.95,
            rate(proxy_request_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Proxy P95 latency above 5 seconds"
      # Alert on excessive bandwidth (cost control)
      - alert: ProxyBandwidthHigh
        expr: >
          rate(proxy_bandwidth_bytes_total[1h]) * 3600
          > 1073741824
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Proxy bandwidth exceeding 1 GB/hour"
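The `for:` clause in these rules means the expression has to stay true across consecutive evaluations for the whole duration before the alert fires; a single bad evaluation only moves it to a pending state. A simplified sketch of that behaviour for the 85% rule, assuming one evaluation per `(timestamp, value)` sample. `alert_states` is a hypothetical helper for illustration, not how Prometheus is implemented internally:

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 0.85,
    for_seconds: float = 300.0,
) -> List[str]:
    """Return 'inactive', 'pending' or 'firing' for each (timestamp, value)."""
    states: List[str] = []
    breach_started: Optional[float] = None
    for timestamp, value in samples:
        if value >= threshold:
            breach_started = None  # condition cleared, so the timer resets
            states.append("inactive")
            continue
        if breach_started is None:
            breach_started = timestamp
        held_for = timestamp - breach_started
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# One sample per minute: a brief dip, a recovery, then a sustained drop.
samples = [(0, 0.95), (60, 0.80), (120, 0.90), (180, 0.80), (240, 0.80),
           (300, 0.80), (360, 0.80), (420, 0.80), (480, 0.80)]
print(alert_states(samples))
```

In this run the one-minute dip never fires; only the drop that persists for a full five minutes does, which is the noise filtering the `for:` values above are meant to provide.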
Step 4: Grafana Dashboard
Create a Grafana dashboard with a panel for each key metric. The query, visualization, and display settings for the core panels are listed below:
Panel 1: Success Rate by Proxy Type and Country
Query: proxy_success_rate
Visualization: Stat panel
Thresholds: Red < 0.80, Yellow < 0.90, Green >= 0.90
Panel 2: Request Latency Distribution
Query: histogram_quantile(0.50, rate(proxy_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(proxy_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(proxy_request_duration_seconds_bucket[5m]))
Visualization: Time series graph
Legend: P50 / P95 / P99
Panel 3: Bandwidth Consumption
Query: rate(proxy_bandwidth_bytes_total[5m]) * 300
Visualization: Time series, unit: bytes
Group by: proxy_type, country
Panel 4: Request Rate and Error Distribution
Query: rate(proxy_requests_total[5m])
Visualization: Stacked bar chart
Group by: status (success, error, timeout)
Panel 5: Cost Estimation
Query: increase(proxy_bandwidth_bytes_total{proxy_type="residential"}[24h]) / 1073741824 * 1.70
Visualization: Stat panel, unit: currency (USD)
Title: Estimated Daily Residential Proxy Cost
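The latency panel and the ProxyLatencyHigh alert both rely on `histogram_quantile`, which estimates a percentile by interpolating linearly inside the bucket that contains the target rank. A minimal sketch of that interpolation over cumulative bucket counts, written as a plain-Python illustration; PromQL applies the same idea to per-second bucket rates, and the sample counts below are made up:

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, prev_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - prev_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, prev_count = upper_bound, count
    return math.nan


# Hypothetical cumulative counts using the prober's bucket boundaries.
observed = [(0.1, 0), (0.25, 2), (0.5, 10), (1.0, 30), (2.0, 40),
            (5.0, 40), (10.0, 40), (30.0, 40), (math.inf, 40)]
print(histogram_quantile(0.50, observed))
print(histogram_quantile(0.95, observed))
```

Because the estimate can never be more precise than the bucket it lands in, bucket boundaries near your alert threshold (5 seconds in the rule above) matter more than the number of buckets overall.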
Step 5: Docker Compose Deployment
Deploy the entire monitoring stack with Docker Compose:
# docker-compose.yml
version: "3.8"
services:
  proxy-prober:
    build: ./prober
    ports:
      - "8000:8000"
    environment:
      - PROXY_USERNAME=YOUR_USER
      - PROXY_PASSWORD=YOUR_PASS
      - PROBE_INTERVAL=30
      - PROBE_COUNTRIES=us,gb,de,jp,br
    restart: unless-stopped
  prometheus:
    image: prom/prometheus:v2.52.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./proxy_alerts.yml:/etc/prometheus/proxy_alerts.yml
      - prometheus_data:/prometheus
    restart: unless-stopped
  grafana:
    image: grafana/grafana:11.0.0
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
    restart: unless-stopped
  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped
volumes:
  prometheus_data:
  grafana_data:
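The Compose file passes `PROXY_USERNAME`, `PROXY_PASSWORD`, `PROBE_INTERVAL`, and `PROBE_COUNTRIES` to the prober container, whereas `main()` in Step 1 hard-codes its credentials and country list. A minimal sketch of reading those variables at startup; `ProberSettings` and `load_settings` are hypothetical names, and the fallback values are assumptions:

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class ProberSettings:
    username: str
    password: str
    interval: float
    countries: List[str]


def load_settings(env: Mapping[str, str] = os.environ) -> ProberSettings:
    """Build prober settings from environment variables."""
    countries = [
        code.strip().lower()
        for code in env.get("PROBE_COUNTRIES", "us").split(",")
        if code.strip()
    ]
    return ProberSettings(
        username=env["PROXY_USERNAME"],  # required: KeyError if missing
        password=env["PROXY_PASSWORD"],  # required: KeyError if missing
        interval=float(env.get("PROBE_INTERVAL", "30")),
        countries=countries,
    )
```

In `main()`, the result could feed `create_proxy_configs(settings.username, settings.password, settings.countries)` and the prober's `interval` argument, which keeps credentials out of the source file.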
Step 6: Alertmanager Configuration
Route alerts to Slack, email, or PagerDuty:
# alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ["alertname", "proxy_type"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "slack-notifications"
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
receivers:
  - name: "slack-notifications"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#proxy-alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_KEY"
Monitoring ISP vs Residential Proxies
Different proxy types need different monitoring strategies:
| Metric | Residential Proxies | ISP Proxies |
|---|---|---|
| Success rate baseline | 92-97% (varies by target) | 98-99.9% |
| Latency P95 baseline | 500ms-3s | 100-500ms |
| Key alert | Bandwidth cost spike | Uptime drop |
| Rotation monitoring | IP diversity per session | N/A (static IP) |
| Cost metric | $/GB consumed | $/IP/month (fixed) |
For ISP proxies at $0.83/IP, uptime monitoring is the primary concern — the cost is fixed regardless of bandwidth. For residential proxies at $1.70/GB, bandwidth tracking is essential for cost control. Visit our pricing page to plan your proxy budget.
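One way to act on these differences is to keep separate thresholds per proxy type rather than a single global value. A minimal sketch that uses the conservative end of each baseline range from the table as a floor or ceiling; the `BASELINES` mapping and `below_baseline` helper are hypothetical, and the exact cut-offs are assumptions to tune for your own targets:

```python
from typing import Dict, List

# Conservative ends of the baseline ranges in the table above.
BASELINES: Dict[str, Dict[str, float]] = {
    "residential": {"min_success_rate": 0.92, "max_p95_latency_s": 3.0},
    "isp": {"min_success_rate": 0.98, "max_p95_latency_s": 0.5},
}


def below_baseline(
    proxy_type: str, success_rate: float, p95_latency_s: float
) -> List[str]:
    """Return the names of the metrics that fall outside the baseline."""
    baseline = BASELINES[proxy_type]
    problems: List[str] = []
    if success_rate < baseline["min_success_rate"]:
        problems.append("success_rate")
    if p95_latency_s > baseline["max_p95_latency_s"]:
        problems.append("p95_latency")
    return problems


# The same measurements are healthy for residential but not for ISP.
print(below_baseline("residential", 0.95, 1.2))
print(below_baseline("isp", 0.95, 1.2))
```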
Advanced: Custom Metrics for Your Use Case
Extend the prober with custom metrics specific to your scraping operation:
# Add to the prober for scraping-specific monitoring
SCRAPE_PAGES_TOTAL = Counter(
    "scrape_pages_total",
    "Total pages scraped",
    ["proxy_type", "target_site", "status"]
)
SCRAPE_DATA_QUALITY = Gauge(
    "scrape_data_quality_score",
    "Data quality score (0-1) for scraped content",
    ["target_site"]
)
PROXY_COST_ESTIMATE_USD = Gauge(
    "proxy_cost_estimate_usd",
    "Estimated proxy cost in USD",
    ["proxy_type", "period"]
)


def update_cost_metrics(bandwidth_bytes: int, proxy_type: str) -> None:
    """Update cost estimation metrics."""
    gb = bandwidth_bytes / (1024 ** 3)
    if proxy_type == "residential":
        cost = gb * 1.70  # $1.70/GB
    else:
        cost = 0.0  # ISP is fixed cost per IP
    PROXY_COST_ESTIMATE_USD.labels(
        proxy_type=proxy_type, period="daily"
    ).set(cost)
Operational Runbook
When Success Rate Drops
- Check which targets are failing (some sites may have changed their anti-bot rules)
- Verify proxy credentials are valid (test manually with curl)
- Check if the issue is geographic (one country may be affected)
- If residential, try rotating to a new session
- If ISP, check if the specific IP has been blocked and request a replacement
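The first and third checks in this list come down to asking whether failures cluster around one target or one country. A minimal sketch of that breakdown, assuming you have exported failed probes as `(target, country)` pairs (for example from `proxy_requests_total` series whose status is not "success"); `failure_breakdown` and the sample data are hypothetical:

```python
from collections import Counter
from typing import Iterable, Tuple


def failure_breakdown(
    failures: Iterable[Tuple[str, str]],
) -> Tuple[Counter, Counter]:
    """Count failed probes per target and per country."""
    by_target: Counter = Counter()
    by_country: Counter = Counter()
    for target, country in failures:
        by_target[target] += 1
        by_country[country] += 1
    return by_target, by_country


# Hypothetical failed probes as (target, country) pairs.
failed = [("amazon", "us"), ("amazon", "de"), ("amazon", "jp"), ("google", "de")]
by_target, by_country = failure_breakdown(failed)
print(by_target.most_common(1))   # target with the most failures
print(by_country.most_common(1))  # country with the most failures
```

If one target dominates while countries are spread evenly, the cause is more likely a change on that site than a proxy problem; the reverse pattern points at a regional issue.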
When Latency Increases
- Check if the issue is target-specific or proxy-wide
- Compare latency across countries — a single country may have routing issues
- Verify the target site has not added JavaScript challenges (which increase response time)
- Check your network connectivity independently of the proxy
When Bandwidth Spikes
- Identify which collector or agent is consuming excessive bandwidth
- Check for infinite loops in scraping logic
- Verify you are not downloading large assets (images, videos) unintentionally
- Review your rate limiting configuration
Frequently Asked Questions
How often should I probe proxy health?
Every 30-60 seconds per proxy-target combination provides a good balance between monitoring granularity and probe traffic. At 30-second intervals with 4 proxy configurations and 4 targets, that is 16 probes per 30 seconds — minimal bandwidth impact. For ISP proxies monitoring critical operations, probe every 15 seconds. For residential proxies used in batch operations, 60-second intervals are sufficient.
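Note that `run()` in Step 1 probes each proxy-target pair sequentially and only then sleeps for `interval`, so the time between two probes of the same pair is the interval plus however long the probes themselves take. A rough sketch of that arithmetic; `cycle_seconds` is a hypothetical helper and the one-second typical probe time is an assumption:

```python
def cycle_seconds(
    n_proxies: int, n_targets: int, interval_s: float, per_probe_s: float
) -> float:
    """Time between consecutive probes of the same proxy-target pair."""
    return n_proxies * n_targets * per_probe_s + interval_s


# 4 proxy configs x 4 targets with a 30-second sleep between cycles.
typical = cycle_seconds(4, 4, 30.0, per_probe_s=1.0)      # assumed ~1 s per probe
worst_case = cycle_seconds(4, 4, 30.0, per_probe_s=15.0)  # every probe hits the 15 s timeout
print(typical, worst_case)
```

During an outage, when every probe waits for the timeout, the effective interval stretches to several minutes. Running probes concurrently or subtracting elapsed time from the sleep would keep the cadence closer to the configured value.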
Does the prober itself consume proxy bandwidth?
Yes, and it is worth sizing. Each probe request is a small page load (typically 5-50 KB). With 16 probes every 30 seconds, that is 46,080 probes per day, or roughly 230 MB to 2.3 GB of daily probe traffic: about $0.40 to $4 per day at $1.70/GB for residential proxies. That is usually a small cost for the operational visibility it provides; if it matters, probe lightweight endpoints or lengthen the interval. ISP proxy probing has zero additional cost since bandwidth is unlimited.
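A back-of-the-envelope check of probe traffic under the assumptions in this answer (16 probes every 30 seconds, 5-50 KB per probe, $1.70/GB). The helper names are hypothetical, and 1 GB is taken as 1024³ bytes to match the cost query used for the dashboard:

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_GB = 1024 ** 3   # same convention as the cost panel query
PRICE_PER_GB_USD = 1.70    # residential rate quoted in this guide


def probes_per_day(n_proxies: int, n_targets: int, interval_s: float) -> int:
    """Number of probe requests sent in 24 hours."""
    return int(n_proxies * n_targets * SECONDS_PER_DAY / interval_s)


def daily_probe_cost_usd(probes: int, bytes_per_probe: int) -> float:
    """Residential bandwidth cost of one day of probing."""
    return probes * bytes_per_probe / BYTES_PER_GB * PRICE_PER_GB_USD


probes = probes_per_day(4, 4, 30)
low = daily_probe_cost_usd(probes, 5 * 1024)    # 5 KB responses
high = daily_probe_cost_usd(probes, 50 * 1024)  # 50 KB responses
print(f"{probes} probes/day, ${low:.2f} to ${high:.2f} per day")
```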
Can I monitor proxies from multiple providers in the same dashboard?
Yes. Add additional ProxyConfig entries for each provider, using a label to distinguish them. The Prometheus metrics model handles multi-provider monitoring naturally. This also enables automatic failover — if your monitoring detects one provider's success rate dropping, your scraping infrastructure can automatically shift traffic to the healthier provider.
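A minimal sketch of the failover decision described here, assuming you already have a per-provider success rate (for example from a `provider` label added to the metrics). `pick_provider`, the provider names, and the 0.85 floor are hypothetical choices for illustration:

```python
from typing import Dict


def pick_provider(
    success_rates: Dict[str, float], current: str, min_rate: float = 0.85
) -> str:
    """Stay on `current` unless it is below `min_rate` and another is healthier."""
    current_rate = success_rates.get(current, 0.0)
    best = max(success_rates, key=lambda name: success_rates[name])
    if current_rate < min_rate and success_rates[best] > current_rate:
        return best
    return current


rates = {"provider_a": 0.62, "provider_b": 0.96}
print(pick_provider(rates, current="provider_a"))
```

Keeping the current provider while it stays above the floor avoids flapping between two providers whose rates differ only slightly.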
What Grafana plugins are useful for proxy monitoring?
The built-in Stat, Time Series, and Bar Chart panels cover most needs. For geographic visualization, the Geomap panel can show success rates on a world map using country labels. The Alerting integration supports Slack, PagerDuty, email, and webhooks out of the box. See our residential proxy and ISP proxy pages for the infrastructure these dashboards monitor.
How long should I retain proxy health metrics?
Retain raw metrics for 15-30 days and downsampled metrics for 6-12 months. Raw metrics let you investigate recent incidents in detail. Downsampled metrics enable long-term trend analysis: you can track how proxy performance changes over months and plan capacity accordingly. Prometheus keeps 15 days of raw data by default (adjustable with --storage.tsdb.retention.time) and does not downsample on its own, so longer horizons need recording rules or a long-term storage layer such as Thanos.