Proxies for Government and Public Records Collection at Scale
Last updated: April 2026 | Author: Hex Proxies Team
Government and public records represent one of the largest and most valuable sources of structured data on the web. Court records, property deeds, corporate filings, environmental permits, campaign finance disclosures, and regulatory enforcement actions are all public by law — but accessing them at scale is a different challenge entirely.
Most government databases were built for individual lookups, not bulk data extraction. They implement rate limits designed for human browsing speeds, block IP addresses that exceed those limits, and offer APIs (if any) that are underfunded and poorly documented. Proxy infrastructure, paired with conservative request pacing, bridges this gap: it enables systematic collection of public data without overwhelming government servers or triggering access restrictions.
The Government Data Landscape
Federal Sources
US federal agencies maintain hundreds of publicly accessible databases:
- SEC EDGAR: Corporate filings, financial statements, insider transaction reports (Forms 3, 4, and 5), with over 21 million filings in total
- USPTO: Patent and trademark applications, assignments, litigation records
- PACER/RECAP: Federal court records across all 94 district courts
- SAM.gov: Government contracts, grants, entity registrations
- FEC: Campaign finance filings, donor records, PAC expenditures
- EPA: Environmental compliance, Superfund sites, emissions data
- OSHA: Workplace safety inspections, violations, penalties
State and Local Sources
State-level data is often more granular and more difficult to access:
- Secretary of State: Business entity filings, UCC records, registered agents
- County recorders: Property deeds, liens, mortgages, easements
- State courts: Civil and criminal case records, docket information
- Licensing boards: Professional licenses, disciplinary actions
- Tax assessors: Property valuations, tax assessments, ownership history
Why Government Sites Need Proxy Infrastructure
Government websites present unique challenges that proxy infrastructure addresses:
Rate Limiting Without APIs
Many government databases lack proper APIs. The PACER system, for example, was designed for individual case lookups. Collecting data across thousands of cases requires making thousands of individual requests — each subject to rate limits that were set for human browsing speeds.
IP-Based Access Restrictions
Government IT departments often implement aggressive IP blocking. A single IP making systematic requests across a county property database can be blocked within minutes. Distributing requests across multiple IPs keeps any single address from exceeding per-IP rate limits, provided the total request volume against the source stays modest.
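As a rough illustration of distributing requests across several IPs, the sketch below hands out proxy URLs in round-robin order and counts how often each one is used. The `ProxyRotator` class is hypothetical, and the addresses are placeholders from the documentation IP range, not real endpoints.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxy URLs in round-robin order and count use per proxy."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("at least one proxy URL is required")
        self._pool = cycle(proxy_urls)
        self.usage = {url: 0 for url in proxy_urls}

    def next_proxy(self):
        """Return the next proxy URL in the rotation."""
        url = next(self._pool)
        self.usage[url] += 1
        return url


# Placeholder addresses (192.0.2.0/24 is reserved for documentation).
rotator = ProxyRotator([
    "http://USER:PASS@192.0.2.10:8080",
    "http://USER:PASS@192.0.2.11:8080",
    "http://USER:PASS@192.0.2.12:8080",
])
assignments = [rotator.next_proxy() for _ in range(9)]
```

Rotation only spreads load across addresses. It does not replace a per-source throttle, since the target server still sees the combined request volume.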
Geographic Access Patterns
Some state and local databases restrict access or show different information based on the requester's geographic origin. A county assessor website may require a local IP address to access detailed property records, or a state licensing board may only show full records to in-state requesters.
Proxy Strategy by Source Type
| Source Type | Recommended Proxy | Rationale | Suggested Request Pacing |
|---|---|---|---|
| Federal databases (SEC, USPTO) | ISP (static) | Consistent access pattern, unlimited bandwidth | 1 req/2-3 sec |
| State court records | Residential (state-targeted) | In-state IP may unlock more data | 1 req/3-5 sec |
| County property records | Residential (geo-targeted) | Local IP for full access, bypass geo-blocks | 1 req/5 sec |
| Federal court (PACER) | ISP (static) | Account-based access, stable IP avoids flags | 1 req/3 sec |
| Campaign finance (FEC) | ISP (static) | Bulk data available, API rate limited | Per API docs |
| Municipal permits/licenses | Residential (city-targeted) | City-level targeting for local access | 1 req/5-10 sec |
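The table above could be carried into code as a small lookup, so each collector reads its proxy type and pacing from one place. This is a minimal sketch: the `SourcePolicy` type and the dictionary keys are hypothetical names, and where the table gives a range the slower end is used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourcePolicy:
    proxy_type: str                   # "isp" or "residential"
    geo_target: Optional[str]         # "state", "local", "city", or None
    min_interval_s: Optional[float]   # seconds between requests; None = follow API docs


# Keys are illustrative labels for the rows of the table.
POLICIES = {
    "federal_database": SourcePolicy("isp", None, 3.0),
    "state_court": SourcePolicy("residential", "state", 5.0),
    "county_property": SourcePolicy("residential", "local", 5.0),
    "pacer": SourcePolicy("isp", None, 3.0),
    "fec": SourcePolicy("isp", None, None),
    "municipal": SourcePolicy("residential", "city", 10.0),
}


def policy_for(source_type):
    """Look up the proxy and pacing policy for a source type."""
    return POLICIES[source_type]
```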
Implementation: SEC EDGAR Collection
SEC EDGAR is one of the most commonly scraped government databases. The SEC explicitly allows automated access but requires identification via User-Agent headers and enforces rate limits of 10 requests per second.
```python
import time

import httpx


class EdgarCollector:
    """Rate-limited client for SEC EDGAR full-text search."""

    def __init__(self, proxy_ip, contact_email):
        self.proxy = f"http://USER:PASS@{proxy_ip}:8080"
        self.client = httpx.Client(
            proxy=self.proxy,  # httpx >= 0.26; older releases used proxies=
            timeout=30.0,
            headers={
                # The SEC asks automated clients to identify themselves.
                "User-Agent": f"CompanyName {contact_email}",
                "Accept-Encoding": "gzip, deflate",
            },
        )
        self.last_request = 0.0
        self.min_interval = 0.15  # ~6.7 req/sec (under the 10 req/sec limit)

    def get_filings(self, cik, filing_type="10-K"):
        self._rate_limit()
        url = (
            "https://efts.sec.gov/LATEST/search-index?"
            "q=&dateRange=custom&startdt=2024-01-01"
            f"&forms={filing_type}&entities={cik}"
        )
        response = self.client.get(url)
        response.raise_for_status()
        return response.json()

    def _rate_limit(self):
        """Sleep just long enough to keep min_interval between requests."""
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()


# Using ISP proxy for stable EDGAR access
collector = EdgarCollector(
    proxy_ip="gate.hexproxies.com",
    contact_email="data@yourcompany.com",
)
```
Implementation: Multi-State Property Records
Collecting property records across multiple counties requires geo-targeted proxies. Each county assessor website may respond differently based on the requester's location:
```python
import time

import httpx

STATE_COUNTIES = {
    "california": ["los-angeles", "san-francisco", "san-diego"],
    "texas": ["harris", "dallas", "travis"],
    "florida": ["miami-dade", "broward", "palm-beach"],
}


def parse_property_record(html):
    """Placeholder: replace with a source-specific parser."""
    return {"raw_html": html}


def collect_property_records(state, county, parcel_ids):
    """Collect property records using state-targeted proxy."""
    proxy_url = (
        f"http://USER-country-us-st-{state}:PASS"
        "@gate.hexproxies.com:8080"
    )
    records = []
    # The context manager closes the client even if a request raises.
    with httpx.Client(proxy=proxy_url, timeout=30.0) as client:
        for parcel_id in parcel_ids:
            time.sleep(5)  # Conservative rate limit for county sites
            try:
                response = client.get(
                    # Illustrative URL pattern; real county sites differ.
                    f"https://{county}.{state}.gov/assessor/parcel/{parcel_id}"
                )
                if response.status_code == 200:
                    records.append(parse_property_record(response.text))
            except httpx.RequestError:
                continue  # Skip parcels that fail at the network level
    return records
```
The -st-{state} parameter targets a residential IP in the specified state, which may provide access to records that are restricted or limited for out-of-state requesters.
Ethical and Legal Framework
Public records collection carries a strong legal foundation — these records are public by definition. However, ethical considerations still apply:
Legal Protections
- Freedom of Information: Federal FOIA and state equivalents establish the right to access government records
- Public Records Acts: Most states have laws mandating public access to government data
- hiQ v. LinkedIn (9th Cir. 2022): Held that scraping publicly accessible data likely does not violate the CFAA; the ruling does not settle breach-of-contract or terms-of-service claims
Ethical Obligations
- Do not overwhelm government servers: Use conservative rate limits. Government IT budgets are limited, and overloading servers impacts public access for everyone
- Identify your collection activities: Use descriptive User-Agent headers where possible
- Respect access restrictions: If a government site explicitly blocks automated access, consider filing a FOIA request instead
- Handle sensitive records carefully: Court records may contain SSNs, addresses, and other PII that requires secure handling
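For the last point, one possible safeguard is to mask obvious identifiers before records are stored. The sketch below redacts SSNs written in the common `123-45-6789` form; the `redact_ssns` helper is hypothetical, and a real pipeline would need to cover other formats and other kinds of PII.

```python
import re

# Matches only the hyphenated NNN-NN-NNNN form; unhyphenated SSNs are not caught.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text):
    """Replace hyphenated SSN-like patterns with a fixed marker."""
    return SSN_PATTERN.sub("[REDACTED-SSN]", text)


sample = "Defendant SSN 123-45-6789, case no. 2024-123456"
cleaned = redact_ssns(sample)
```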
Scaling Strategies
Distributed Collection Across States
For nationwide data collection, distribute requests across multiple proxies to avoid overwhelming any single source:
- Assign ISP proxies to federal sources (2-3 IPs per source at $0.83/IP)
- Use state-targeted residential proxies for state-level sources
- Implement per-source rate limiting independent of proxy rotation
- Schedule collection during off-peak hours (nights and weekends) when government servers have more capacity
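A minimal sketch of the last two points, assuming a limiter keyed by source name rather than by proxy, so the throttle holds no matter which IP carries the request. The `PerSourceLimiter` class, the `is_off_peak` helper, and the 20:00 to 06:00 night window are all illustrative choices, not fixed recommendations.

```python
import time
from datetime import datetime


class PerSourceLimiter:
    """Enforce a minimum interval per source, independent of the proxy used."""

    def __init__(self, intervals, clock=time.monotonic, sleep=time.sleep):
        self.intervals = intervals  # source name -> minimum seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = {}

    def wait(self, source):
        """Block until `source` may be requested again; return seconds slept."""
        now = self._clock()
        last = self._last.get(source)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.intervals[source] - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last[source] = now + delay
        return delay


def is_off_peak(moment):
    """True on weekends and between 20:00 and 06:00 (an assumed window)."""
    return moment.weekday() >= 5 or moment.hour < 6 or moment.hour >= 20
```

Calling `limiter.wait("county")` before each request to that source keeps the pacing tied to the target server rather than to whichever proxy happens to be in rotation.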
Cost Optimization
Government databases are often text-heavy and relatively low bandwidth. A typical property record page is 50-200 KB. Collecting 100,000 property records consumes approximately 5-20 GB of bandwidth, or $8.50-$34 at residential proxy rates of $1.70/GB. For ISP proxy usage on federal databases, the cost is simply $0.83/IP regardless of bandwidth consumed.
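The arithmetic is easy to check in a few lines. This sketch uses decimal units (1 GB = 1,000,000 KB) and the rates quoted in this article; the function names are illustrative.

```python
def residential_cost(records, page_kb, usd_per_gb=1.70):
    """Return (bandwidth in GB, cost in USD) for fetching `records` pages."""
    gigabytes = records * page_kb / 1_000_000  # decimal units: 1 GB = 1,000,000 KB
    return gigabytes, gigabytes * usd_per_gb


def isp_cost(ip_count, usd_per_ip=0.83):
    """Flat per-IP cost; bandwidth does not enter the calculation."""
    return ip_count * usd_per_ip


low_gb, low_usd = residential_cost(100_000, 50)     # smallest typical page
high_gb, high_usd = residential_cost(100_000, 200)  # largest typical page
```

With 50 KB pages the job needs 5 GB, and with 200 KB pages it needs 20 GB, so page size is the main driver of residential proxy cost.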
Handling Common Challenges
CAPTCHAs on Government Sites
Some government databases implement CAPTCHAs for bulk access. Strategies include:
- Use browser automation (Playwright/Puppeteer) through proxies to handle JavaScript challenges
- Implement session management with sticky proxies to maintain CAPTCHA-solved sessions
- Request API access directly from the agency — many agencies provide bulk data access upon request
Inconsistent Data Formats
Every county, state, and agency uses different data formats. Build source-specific parsers and normalize data into a common schema. This is the most time-consuming part of government data collection — proxy infrastructure solves the access problem, but data normalization requires custom engineering per source.
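One way to organize that work is a registry of source-specific parsers that all return the same record type. The sketch below assumes two made-up county formats; the field names, the `PropertyRecord` schema, and both parsers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PropertyRecord:
    """Common schema that every source-specific parser must produce."""
    parcel_id: str
    owner: str
    assessed_value_usd: int
    source: str


def parse_county_a(raw):
    # Hypothetical format: dashed parcel numbers and "$450,000" style values.
    value = raw["AssessedValue"].replace("$", "").replace(",", "")
    return PropertyRecord(
        parcel_id=raw["APN"].replace("-", ""),
        owner=raw["OwnerName"].strip().title(),
        assessed_value_usd=int(value),
        source="county_a",
    )


def parse_county_b(raw):
    # Hypothetical format: values reported in cents.
    return PropertyRecord(
        parcel_id=raw["parcel"],
        owner=raw["owner"].strip().title(),
        assessed_value_usd=int(raw["value_cents"]) // 100,
        source="county_b",
    )


PARSERS = {"county_a": parse_county_a, "county_b": parse_county_b}


def normalize(source, raw):
    """Dispatch a raw record to the parser registered for its source."""
    return PARSERS[source](raw)
```

Adding a new county then means writing one parser and registering it, while downstream code keeps working against `PropertyRecord`.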
Frequently Asked Questions
Is it legal to scrape government websites?
Public records are public by law, and collecting them is generally legal. However, the method of collection may be subject to the website's terms of service and computer access laws. Use respectful rate limits, identify your requests with appropriate headers, and consult legal counsel for large-scale operations. The Ninth Circuit's hiQ v. LinkedIn decision supports the view that scraping publicly available data does not violate the CFAA, though terms-of-service claims can still apply.
Why not just use government APIs instead of scraping?
Most government databases lack modern APIs. Those that exist are often rate-limited, incomplete, or poorly maintained. SEC EDGAR has a reasonable API, but most county property records, state court systems, and local licensing databases are web-only. Proxy-based collection fills this gap. For sources with APIs, we recommend using the API with ISP proxies for rate limit distribution.
How many ISP proxies do I need for federal database monitoring?
For most federal databases, 2-5 ISP proxies are sufficient. At $0.83/IP with unlimited bandwidth, the total cost is $1.66-$4.15/month per source. Distribute requests across IPs to stay under per-IP rate limits while maintaining collection throughput. Visit our ISP proxy page for details.
Can I collect data from all 50 states simultaneously?
Yes. Using the Hex Proxies residential network with state-level targeting, you can route requests through IPs in all 50 US states simultaneously. Configure each state's collector with the appropriate -st-{state} parameter through the gateway at gate.hexproxies.com:8080. Check our residential proxy page for geo-targeting details.
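A short sketch of that per-state configuration, reusing the username format from the property records example earlier in this article. The helper name is illustrative, and the slug format for multi-word state names is an assumption to confirm against the geo-targeting documentation.

```python
GATEWAY = "gate.hexproxies.com:8080"


def state_proxy_url(state, user="USER", password="PASS"):
    """Build a state-targeted proxy URL using the -st-{state} username suffix."""
    return f"http://{user}-country-us-st-{state}:{password}@{GATEWAY}"


# Extend this list to every state you collect from.
STATES = ["california", "texas", "florida"]
PROXY_BY_STATE = {state: state_proxy_url(state) for state in STATES}
```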
What rate limits should I use for government sites?
Be conservative. Federal databases like SEC EDGAR publish their rate limits (10 req/sec). For state and local sites without published limits, start at 1 request every 5 seconds per source and adjust based on response patterns. Government servers have limited capacity, and overloading them is both unethical and counterproductive — you will get blocked faster. See our pricing page for proxy costs that support these conservative collection strategies.
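Adjusting based on response patterns could look like the sketch below: start at 5 seconds, double the delay whenever the server signals throttling, and ease back toward the starting pace after successes. The status codes treated as throttling, the 60-second ceiling, and the 10% recovery step are assumptions to tune per source.

```python
THROTTLE_STATUSES = {403, 429, 503}  # assumed signals that the server wants less traffic


class AdaptiveInterval:
    """Track the delay between requests to one source."""

    def __init__(self, start=5.0, ceiling=60.0):
        self.start = start      # never go faster than the starting pace
        self.ceiling = ceiling  # never wait longer than this between requests
        self.current = start

    def record(self, status_code):
        """Update the delay from the latest response and return it."""
        if status_code in THROTTLE_STATUSES:
            self.current = min(self.current * 2, self.ceiling)
        elif 200 <= status_code < 300:
            self.current = max(self.current * 0.9, self.start)
        return self.current
```

Because the delay never drops below the starting value, this policy only ever slows a collector down relative to the conservative default.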