Proxies for Government and Public Records Collection at Scale

By Hex Proxies Engineering Team

Last updated: April 2026 | Author: Hex Proxies Team

TL;DR: Government databases and public records are freely available but difficult to collect at scale due to rate limiting, IP blocking, and inconsistent APIs. ISP proxies ($0.83/IP with unlimited bandwidth) are ideal for government sources that prefer stable access patterns. Residential proxies ($1.70/GB) handle sources that block datacenter IP ranges. This guide covers strategies for collecting court records, property data, corporate filings, and regulatory documents using proxy infrastructure.

Government and public records represent one of the largest and most valuable sources of structured data on the web. Court records, property deeds, corporate filings, environmental permits, campaign finance disclosures, and regulatory enforcement actions are all public by law — but accessing them at scale is a different challenge entirely.

Most government databases were built for individual lookups, not bulk data extraction. They implement rate limits designed for human browsing speeds, block IP addresses that exceed those limits, and offer APIs (if any) that are underfunded and poorly documented. Proxy infrastructure bridges this gap, enabling systematic collection of public data without overwhelming government servers or triggering access restrictions.

The Government Data Landscape

Federal Sources

US federal agencies maintain hundreds of publicly accessible databases:

  • SEC EDGAR: Corporate filings, financial statements, insider trading reports — over 21 million filings
  • USPTO: Patent and trademark applications, assignments, litigation records
  • PACER/RECAP: Federal court records across all 94 district courts
  • SAM.gov: Government contracts, grants, entity registrations
  • FEC: Campaign finance filings, donor records, PAC expenditures
  • EPA: Environmental compliance, Superfund sites, emissions data
  • OSHA: Workplace safety inspections, violations, penalties

State and Local Sources

State-level data is often more granular and more difficult to access:

  • Secretary of State: Business entity filings, UCC records, registered agents
  • County recorders: Property deeds, liens, mortgages, easements
  • State courts: Civil and criminal case records, docket information
  • Licensing boards: Professional licenses, disciplinary actions
  • Tax assessors: Property valuations, tax assessments, ownership history

Why Government Sites Need Proxy Infrastructure

Government websites present unique challenges that proxy infrastructure addresses:

Rate Limiting Without APIs

Many government databases lack proper APIs. The PACER system, for example, was designed for individual case lookups. Collecting data across thousands of cases requires making thousands of individual requests — each subject to rate limits that were set for human browsing speeds.

IP-Based Access Restrictions

Government IT departments often implement aggressive IP blocking. A single IP making systematic requests across a county property database will be blocked within minutes. Distributing requests across multiple IPs prevents any single address from exceeding rate limits.
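As a minimal sketch of that distribution, the snippet below pairs each request URL with a proxy in round-robin order so that no single address carries the whole workload. The proxy addresses (from a documentation-only IP range) and the example.gov URLs are placeholders, not real endpoints.

```python
from collections import Counter
from itertools import cycle

# Hypothetical proxy endpoints; substitute the addresses your provider issues.
PROXY_POOL = [
    "http://USER:PASS@203.0.113.10:8080",
    "http://USER:PASS@203.0.113.11:8080",
    "http://USER:PASS@203.0.113.12:8080",
]


def assign_proxies(urls, proxies):
    """Pair each URL with a proxy, cycling through the pool in order."""
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]


# Placeholder URLs standing in for a batch of parcel lookups.
urls = [f"https://example.gov/parcel/{n}" for n in range(9)]
assignments = assign_proxies(urls, PROXY_POOL)

# Count how many requests each proxy would carry.
per_proxy = Counter(proxy for _, proxy in assignments)
print(per_proxy)
```

Round-robin spreads load evenly but does not by itself slow anything down. The total request rate against the source still needs its own limit.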

Geographic Access Patterns

Some state and local databases restrict access or show different information based on the requester's geographic origin. A county assessor website may require a local IP address to access detailed property records, or a state licensing board may only show full records to in-state requesters.

Proxy Strategy by Source Type

Source Type | Recommended Proxy | Rationale | Rate Limit
Federal databases (SEC, USPTO) | ISP (static) | Consistent access pattern, unlimited bandwidth | 1 req/2-3 sec
State court records | Residential (state-targeted) | In-state IP may unlock more data | 1 req/3-5 sec
County property records | Residential (geo-targeted) | Local IP for full access, bypass geo-blocks | 1 req/5 sec
Federal court (PACER) | ISP (static) | Account-based access, stable IP avoids flags | 1 req/3 sec
Campaign finance (FEC) | ISP (static) | Bulk data available, API rate limited | Per API docs
Municipal permits/licenses | Residential (city-targeted) | City-level targeting for local access | 1 req/5-10 sec
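One way to carry this table into code is a small lookup of per-source policies. This is a sketch only: the key names and the SourcePolicy class are illustrative, each interval uses the slower end of the range in the table, and FEC is left out because its limit comes from the API documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcePolicy:
    proxy_type: str          # "isp" or "residential"
    min_interval_sec: float  # minimum seconds between requests


# Illustrative keys; intervals are the conservative end of each table range.
POLICIES = {
    "federal_database": SourcePolicy("isp", 3.0),
    "state_court": SourcePolicy("residential", 5.0),
    "county_property": SourcePolicy("residential", 5.0),
    "pacer": SourcePolicy("isp", 3.0),
    "municipal": SourcePolicy("residential", 10.0),
}

# Unknown sources get the slowest policy rather than the fastest.
DEFAULT_POLICY = SourcePolicy("residential", 10.0)


def policy_for(source_type):
    """Return the policy for a source type, falling back to the default."""
    return POLICIES.get(source_type, DEFAULT_POLICY)


print(policy_for("pacer"))
print(policy_for("unlisted_source"))
```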

Implementation: SEC EDGAR Collection

SEC EDGAR is one of the most commonly scraped government databases. The SEC explicitly allows automated access but requires identification via User-Agent headers and enforces rate limits of 10 requests per second.

import httpx
import time

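class EdgarCollector:
    """Rate-limited EDGAR client that sends every request through one proxy."""

    def __init__(self, proxy_ip, contact_email):
        # proxy_ip may be an IP address or a gateway hostname
        self.proxy = f"http://USER:PASS@{proxy_ip}:8080"
        self.client = httpx.Client(
            proxy=self.proxy,  # httpx >= 0.26; older releases used `proxies=`
            timeout=30.0,
            headers={
                # The SEC asks automated clients to identify themselves
                "User-Agent": f"CompanyName {contact_email}",
                "Accept-Encoding": "gzip, deflate"
            }
        )
        self.last_request = 0
        self.min_interval = 0.15  # ~6.7 req/sec (under the 10 req/sec limit)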

    def get_filings(self, cik, filing_type="10-K"):
        self._rate_limit()
        url = (
            f"https://efts.sec.gov/LATEST/search-index?"
            f"q=&dateRange=custom&startdt=2024-01-01"
            f"&forms={filing_type}&entities={cik}"
        )
        response = self.client.get(url)
        response.raise_for_status()
        return response.json()

    def _rate_limit(self):
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()

# Using ISP proxy for stable EDGAR access
collector = EdgarCollector(
    proxy_ip="gate.hexproxies.com",
    contact_email="data@yourcompany.com"
)

Implementation: Multi-State Property Records

Collecting property records across multiple counties requires geo-targeted proxies. Each county assessor website may respond differently based on the requester's location:

import time

import httpx

# Example county slugs per state (illustrative only)
STATE_COUNTIES = {
    "california": ["los-angeles", "san-francisco", "san-diego"],
    "texas": ["harris", "dallas", "travis"],
    "florida": ["miami-dade", "broward", "palm-beach"]
}

def collect_property_records(state, county, parcel_ids):
    """Collect property records using a state-targeted proxy."""
    proxy_url = (
        f"http://USER-country-us-st-{state}:PASS"
        f"@gate.hexproxies.com:8080"
    )
    records = []
    # The context manager closes the client even if parsing raises.
    # httpx >= 0.26 takes `proxy=`; older releases used `proxies=`.
    with httpx.Client(proxy=proxy_url, timeout=30.0) as client:
        for parcel_id in parcel_ids:
            time.sleep(5)  # Conservative rate limit for county sites
            try:
                # Placeholder URL pattern; each assessor site has its own
                response = client.get(
                    f"https://{county}.{state}.gov/assessor/parcel/{parcel_id}"
                )
            except httpx.RequestError:
                continue  # Skip parcels that fail at the network level
            if response.status_code == 200:
                # parse_property_record is a source-specific parser you supply
                records.append(parse_property_record(response.text))
    return records

The -st-{state} parameter targets a residential IP in the specified state, which may provide access to records that are restricted or limited for out-of-state requesters.

Ethical and Legal Framework

Public records collection carries a strong legal foundation — these records are public by definition. However, ethical considerations still apply:

Legal Protections

  • Freedom of Information: Federal FOIA and state equivalents establish the right to access government records
  • Public Records Acts: Most states have laws mandating public access to government data
  • hiQ v. LinkedIn (9th Cir. 2022): Held that scraping publicly available data likely does not violate the CFAA; contract and terms-of-service claims can still apply

Ethical Obligations

  • Do not overwhelm government servers: Use conservative rate limits. Government IT budgets are limited, and overloading servers impacts public access for everyone
  • Identify your collection activities: Use descriptive User-Agent headers where possible
  • Respect access restrictions: If a government site explicitly blocks automated access, consider filing a FOIA request instead
  • Handle sensitive records carefully: Court records may contain SSNs, addresses, and other PII that requires secure handling
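As one small piece of secure handling, collected text can be scrubbed of obvious identifiers before it is stored. The sketch below masks Social Security numbers written in the common dashed form. It is an assumption-laden starting point: it will not catch undashed or otherwise formatted numbers, and it does nothing for addresses or other PII.

```python
import re

# Matches SSNs written as 123-45-6789; other formats are not covered.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text):
    """Replace dashed SSNs with a fixed placeholder."""
    return SSN_PATTERN.sub("[REDACTED-SSN]", text)


# Made-up sample text, not taken from any real record.
sample = "Defendant SSN 123-45-6789 appeared on 2024-01-15."
print(redact_ssns(sample))
```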

Scaling Strategies

Distributed Collection Across States

For nationwide data collection, distribute requests across multiple proxies to avoid overwhelming any single source:

  • Assign ISP proxies to federal sources (2-3 IPs per source at $0.83/IP)
  • Use state-targeted residential proxies for state-level sources
  • Implement per-source rate limiting independent of proxy rotation
  • Schedule collection during off-peak hours (nights and weekends) when government servers have more capacity
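The third point, rate limiting per source rather than per proxy, can be sketched as a limiter keyed by source name. The class name, the source labels, and the intervals are illustrative assumptions. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time


class PerSourceLimiter:
    """Enforce a minimum interval between requests to each source,
    regardless of which proxy carries the request."""

    def __init__(self, intervals, clock=time.monotonic, sleep=time.sleep):
        self.intervals = intervals  # source name -> minimum seconds between requests
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}      # source name -> time of last permitted request

    def wait(self, source):
        """Block until `source` may be requested again; return seconds waited."""
        now = self.clock()
        last = self.last_request.get(source)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.intervals[source] - (now - last))
        if delay > 0:
            self.sleep(delay)
        # Record when this request is allowed to go out.
        self.last_request[source] = now + delay
        return delay


# Hypothetical source names and intervals.
limiter = PerSourceLimiter({"sec-edgar": 3.0, "county-assessor": 5.0})
```

Call limiter.wait("sec-edgar") immediately before each request to that source. Because the state is keyed by source, rotating proxies never resets the timer.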

Cost Optimization

Government databases are often text-heavy and relatively low bandwidth. A typical property record page is 50-200 KB. Collecting 100,000 property records therefore consumes approximately 5-20 GB of bandwidth, or $8.50-$34 at residential proxy rates of $1.70/GB. For ISP proxy usage on federal databases, the cost is simply $0.83/IP regardless of bandwidth consumed.
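The bandwidth estimate is page count times page size times the per-GB rate. A small helper makes it easy to rerun for other volumes. The function name is illustrative, and it assumes decimal units (1 GB = 1,000,000 KB) and ignores request overhead and retries.

```python
def residential_cost_usd(pages, kb_per_page, usd_per_gb=1.70):
    """Estimate residential proxy cost for a crawl, using decimal gigabytes."""
    gigabytes = pages * kb_per_page / 1_000_000
    return gigabytes * usd_per_gb


# 100,000 records at several page sizes within the 50-200 KB range.
for kb in (50, 100, 200):
    cost = residential_cost_usd(100_000, kb)
    print(f"{kb} KB/page: ${cost:.2f}")
```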

Handling Common Challenges

CAPTCHAs on Government Sites

Some government databases implement CAPTCHAs for bulk access. Strategies include:

  • Use browser automation (Playwright/Puppeteer) through proxies to handle JavaScript challenges
  • Implement session management with sticky proxies to maintain CAPTCHA-solved sessions
  • Request API access directly from the agency — many agencies provide bulk data access upon request

Inconsistent Data Formats

Every county, state, and agency uses different data formats. Build source-specific parsers and normalize data into a common schema. This is the most time-consuming part of government data collection — proxy infrastructure solves the access problem, but data normalization requires custom engineering per source.
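A common shape for this is one small parser per source, all returning the same record type. The sketch below is hypothetical throughout: the two county layouts, their field names, and the PropertyRecord schema are invented for illustration and do not describe any real assessor site.

```python
from dataclasses import dataclass


@dataclass
class PropertyRecord:
    """Common schema that every source-specific parser must produce."""
    parcel_id: str
    owner: str
    assessed_value_usd: int
    source: str


def parse_county_a(raw):
    # Hypothetical layout: dashed parcel numbers, "$450,000" style values.
    value = raw["AssessedValue"].replace("$", "").replace(",", "")
    return PropertyRecord(
        parcel_id=raw["APN"].replace("-", ""),
        owner=raw["OwnerName"].strip().title(),
        assessed_value_usd=int(value),
        source="county_a",
    )


def parse_county_b(raw):
    # Hypothetical layout: numeric parcel IDs, values stored in cents.
    return PropertyRecord(
        parcel_id=str(raw["parcel"]),
        owner=raw["owner"].strip().title(),
        assessed_value_usd=int(raw["value_cents"]) // 100,
        source="county_b",
    )


# Registry of parsers; adding a source means adding one function and one entry.
PARSERS = {"county_a": parse_county_a, "county_b": parse_county_b}


def normalize(source, raw):
    """Dispatch a raw record to its source-specific parser."""
    return PARSERS[source](raw)


print(normalize("county_a", {"APN": "123-456-789",
                             "OwnerName": " JANE DOE ",
                             "AssessedValue": "$450,000"}))
```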

Frequently Asked Questions

Is it legal to scrape government websites?

Public records are public by law, and collecting them is generally legal. However, the method of collection may be subject to the website's terms of service and computer access laws. Use respectful rate limits, identify your requests with appropriate headers, and consult legal counsel for large-scale operations. The hiQ v. LinkedIn decision supports the legality of scraping publicly available data.

Why not just use government APIs instead of scraping?

Most government databases lack modern APIs. Those that exist are often rate-limited, incomplete, or poorly maintained. SEC EDGAR has a reasonable API, but most county property records, state court systems, and local licensing databases are web-only. Proxy-based collection fills this gap. For sources with APIs, we recommend using the API with ISP proxies for rate limit distribution.

How many ISP proxies do I need for federal database monitoring?

For most federal databases, 2-5 ISP proxies are sufficient. At $0.83/IP with unlimited bandwidth, the total cost is $1.66-$4.15/month per source. Distribute requests across IPs to stay under per-IP rate limits while maintaining collection throughput. Visit our ISP proxy page for details.

Can I collect data from all 50 states simultaneously?

Yes. Using the Hex Proxies residential network with state-level targeting, you can route requests through IPs in all 53 US states and territories simultaneously. Configure each state's collector with the appropriate -st-{state} parameter through the gateway at gate.hexproxies.com:8080. Check our residential proxy page for geo-targeting details.
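A rough sketch of that per-state configuration follows. It builds one proxy URL per state using the username pattern shown earlier in this guide. The helper name is illustrative, USER and PASS are placeholders, and the exact targeting syntax should be confirmed against the provider's documentation.

```python
GATEWAY = "gate.hexproxies.com:8080"


def state_proxy_url(user, password, state):
    """Build a state-targeted proxy URL in the -st-{state} username format."""
    return f"http://{user}-country-us-st-{state}:{password}@{GATEWAY}"


# A short list for illustration; extend to every state you collect from.
STATES = ["california", "texas", "florida"]
proxy_by_state = {s: state_proxy_url("USER", "PASS", s) for s in STATES}

for state, url in proxy_by_state.items():
    print(state, url)
```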

What rate limits should I use for government sites?

Be conservative. Federal databases like SEC EDGAR publish their rate limits (10 req/sec). For state and local sites without published limits, start at 1 request every 5 seconds per source and adjust based on response patterns. Government servers have limited capacity, and overloading them is both unethical and counterproductive — you will get blocked faster. See our pricing page for proxy costs that support these conservative collection strategies.
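"Adjust based on response patterns" can be as simple as backing off when the server signals throttling and easing back slowly when it does not. The function below is a sketch under assumed values: the status codes treated as throttling, the doubling factor, the 10% decay, and the 5-minute ceiling are all choices to tune per source.

```python
def next_interval(current, status_code, floor=5.0, ceiling=300.0):
    """Return the delay (seconds) to use before the next request.

    Doubles the delay on throttling or blocking responses and decays it
    by 10% per success, never dropping below the floor.
    """
    if status_code in (403, 429, 503):
        return min(current * 2, ceiling)
    if 200 <= status_code < 300:
        return max(current * 0.9, floor)
    # Other statuses (for example 404) say nothing about load.
    return current


# Walk through a made-up sequence of responses.
interval = 5.0
for status in (200, 429, 429, 200, 200):
    interval = next_interval(interval, status)
    print(status, round(interval, 2))
```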