
Real Estate Data Collection with Proxies: Listings, Pricing, and Market Trends

11 min read

By Hex Proxies Engineering Team

Real estate is a $45 trillion asset class that still runs largely on information asymmetry. A property investor who can systematically track listing prices, days on market, price reductions, and sales velocity across 50 markets has a structural edge over investors relying on quarterly reports and manual Zillow searches. Proptech companies that aggregate listing data across hundreds of sources power the comparison tools that millions of homebuyers use daily.

The common infrastructure behind all these use cases is proxy-powered data collection. Real estate websites -- from Zillow and Redfin to regional MLS portals and property management platforms -- deploy increasingly sophisticated anti-bot systems to control automated access to their data. This guide covers how to engineer data collection systems that work reliably against these defenses.

For general proxy background, see our real estate data use case and real estate industry page.

The Real Estate Data Landscape

Data Sources and Their Protection Levels

| Source Type | Examples | Anti-Bot Protection | Best Proxy Type |
|---|---|---|---|
| Major portals | Zillow, Redfin, Realtor.com | Very high (PerimeterX, DataDome) | Residential |
| Regional MLS systems | MRIS, CRMLS, NorthstarMLS | Moderate (basic auth + rate limiting) | ISP |
| International portals | Rightmove (UK), Immobilienscout24 (DE) | High (Cloudflare Enterprise) | Residential |
| County records | Assessor offices, deed records | Low to none | ISP |
| Property management | Apartments.com, Rent.com | Moderate | ISP or Residential |
| Auction sites | Auction.com, Hubzu | High | Residential |

The protection level directly maps to proxy requirements. Zillow's PerimeterX integration actively blocks ISP and datacenter IP ranges, making residential proxies the only viable option. County assessor websites, by contrast, typically have no bot detection and work fine with faster, cheaper ISP proxies.

What Data Points Matter

Depending on your use case, you might collect some or all of these:

Listing data: Address, listing price, property type, square footage, bedrooms, bathrooms, lot size, year built, listing date, listing agent, brokerage.

Market dynamics: Days on market, price changes (with dates), list-to-sale price ratio, number of competing listings in the area, absorption rate.

Financial data: Property tax assessments, previous sale prices and dates, estimated rental income, HOA fees.

Visual data: Property photos (useful for condition assessment via ML models), virtual tour URLs, floor plan availability.
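Concretely, these fields can be collected into a single record type. A minimal sketch -- the class and field names are illustrative, not any portal's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ListingRecord:
    # Listing data
    address: str
    listing_price: int
    property_type: str
    sqft: Optional[int] = None
    bedrooms: Optional[int] = None
    bathrooms: Optional[float] = None
    year_built: Optional[int] = None
    listing_date: Optional[str] = None  # ISO date string
    # Market dynamics
    days_on_market: Optional[int] = None
    price_changes: list = field(default_factory=list)  # [(date, new_price), ...]
    # Financial data
    tax_assessment: Optional[int] = None
    hoa_fee: Optional[int] = None
    # Visual data
    photo_urls: list = field(default_factory=list)

    @property
    def price_per_sqft(self) -> Optional[float]:
        """Derived metric; None when square footage is unknown."""
        if self.sqft:
            return self.listing_price / self.sqft
        return None
```

A normalized record type like this pays off later, when listings from multiple portals must be deduplicated and compared.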

Building a Listing Aggregation Pipeline

Step 1: Define Your Geographic Scope

Real estate is inherently local. A system that monitors "all US listings" must track approximately 1.5 million active for-sale listings plus more than 2 million rental listings -- a massive data collection effort. Most operations start with specific markets.

A typical starting scope: 5-10 target markets (metro areas), covering the top listing portals and local MLS for each.
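A scope like this is convenient to express as configuration. A sketch -- the market keys and MLS names below are illustrative examples, not a recommended portfolio:

```python
# Hypothetical starting scope: metro markets mapped to the sources covering them
TARGET_MARKETS = {
    "austin-tx":  {"portals": ["zillow", "redfin"], "mls": "ACTRIS"},
    "phoenix-az": {"portals": ["zillow", "redfin"], "mls": "ARMLS"},
    "tampa-fl":   {"portals": ["zillow", "realtor"], "mls": "Stellar MLS"},
}

def sources_for(market: str) -> list:
    """All sources to monitor for a given market: portals plus the local MLS."""
    cfg = TARGET_MARKETS[market]
    return cfg["portals"] + [cfg["mls"]]
```

Keeping scope in one place makes it cheap to add a market later: one new entry, and the collection loop picks it up.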

Step 2: Proxy Configuration by Source

import requests
import random
import time

# Placeholder pool of desktop browser User-Agent strings --
# rotate real, current values in production
BROWSER_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
]

# Proxy routing based on source protection level
def get_proxy(source_type: str, market_country: str = "us") -> dict:
    """Route to the appropriate proxy type based on source protection."""
    if source_type in ("major_portal", "international_portal", "auction"):
        # Residential for heavily protected sources
        proxy_url = (
            f"http://USER-country-{market_country}:PASS"
            f"@gate.hexproxies.com:8080"
        )
    else:
        # ISP for lightly protected sources (faster, unlimited bandwidth)
        proxy_url = "http://USER:PASS@gate.hexproxies.com:8080"
    
    return {"http": proxy_url, "https": proxy_url}


def collect_listing(url: str, source_type: str) -> dict:
    """Collect a single listing page with appropriate proxy routing."""
    proxy = get_proxy(source_type)
    
    headers = {
        "User-Agent": random.choice(BROWSER_USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": "https://www.google.com/",
    }
    
    response = requests.get(url, proxies=proxy, headers=headers, timeout=25)
    
    if response.status_code == 200:
        return parse_listing(response.text, source_type)
    
    return {"error": response.status_code, "url": url}

Step 3: Search Result Pagination

Listing portals organize data by search results. To collect all listings in a market, you paginate through search results for that geographic area.

def collect_market_listings(
    portal: str,
    location: str,
    max_pages: int = 50,
) -> list:
    """
    Paginate through a portal's search results for a location.
    Uses per-request rotation -- each page load gets a fresh IP.
    """
    all_listings = []
    
    for page in range(1, max_pages + 1):
        search_url = build_search_url(portal, location, page=page)
        proxy = get_proxy("major_portal")
        
        headers = {
            "User-Agent": random.choice(BROWSER_USER_AGENTS),
            "Accept-Language": "en-US,en;q=0.9",
        }
        
        response = requests.get(
            search_url, proxies=proxy, headers=headers, timeout=25
        )
        
        if response.status_code != 200:
            break
        
        listings = parse_search_results(response.text, portal)
        
        if not listings:
            break  # No more results
        
        all_listings.extend(listings)
        
        # Human-like delay between page loads
        time.sleep(random.uniform(3, 8))
    
    return all_listings

Why per-request rotation matters here: Paginating through 50 pages of search results from a single IP is a clear bot signal. With per-request rotation, each page load comes from a different residential IP. The portal sees 50 different "users" each viewing one page of results, which is normal traffic.

Step 4: Listing Detail Collection

After collecting listing URLs from search results, fetch each listing's detail page for comprehensive data extraction.

This is the most bandwidth-intensive step. A typical Zillow listing page transfers 1-3 MB of data (HTML + inline data). For 100,000 listings across your target markets, that is 100-300 GB of residential proxy bandwidth.

Optimization: target the API, not the page. Most portals load listing data from internal APIs. Zillow, for instance, uses a GraphQL API that returns structured JSON (~50 KB per listing vs ~2 MB for the full page). Identifying and targeting these APIs reduces bandwidth by over 95%.

# Example: targeting a portal's internal API instead of full page loads
def fetch_listing_api(listing_id: str) -> dict | None:
    """
    Fetch listing data from the portal's internal API.
    Dramatically reduces bandwidth vs full page loads.
    """
    proxy = get_proxy("major_portal")
    
    headers = {
        "User-Agent": random.choice(BROWSER_USER_AGENTS),
        "Accept": "application/json",
        "Referer": f"https://www.portal.com/listing/{listing_id}",
    }
    
    api_url = f"https://www.portal.com/api/listing/{listing_id}"
    response = requests.get(api_url, proxies=proxy, headers=headers, timeout=20)
    
    if response.status_code == 200:
        return response.json()
    
    return None

Use Case 1: Investment Analysis

Property investors use listing data to identify undervalued properties, track market trends, and model investment returns.

Key Metrics to Track

Price per square foot trends: Track median $/sqft by zip code over time. Rising $/sqft with flat inventory signals appreciation. Rising $/sqft with rising inventory signals a potential peak.

Days on market distribution: A market where 80% of listings sell in under 14 days is a seller's market. When this shifts to 30+ days, the market is cooling.

List-to-sale ratio: Track the percentage of asking price that properties actually sell for. A ratio above 100% indicates bidding wars; below 95% indicates negotiating power for buyers.

Price reduction frequency: What percentage of listings experience a price reduction before selling? An increasing rate signals a softening market.
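These four metrics can be computed directly from collected listing records. A minimal sketch, assuming each record is a dict carrying price, sqft, days_on_market, price_reductions (count), and, for sold records, sale_price:

```python
from statistics import median

def market_metrics(listings: list) -> dict:
    """Compute core investment metrics from collected listing records."""
    ppsf = [l["price"] / l["sqft"] for l in listings if l.get("sqft")]
    dom = [l["days_on_market"] for l in listings]
    sold = [l for l in listings if l.get("sale_price")]
    return {
        # Median $/sqft across the market snapshot
        "median_price_per_sqft": median(ppsf),
        # Share of listings moving in under 14 days (seller's-market signal)
        "pct_sold_under_14_days": sum(d < 14 for d in dom) / len(dom),
        # Median sale price as a fraction of asking price (>1.0 = bidding wars)
        "list_to_sale_ratio": (
            median(l["sale_price"] / l["price"] for l in sold) if sold else None
        ),
        # Share of listings that cut price at least once (softening signal)
        "price_reduction_rate": sum(
            l.get("price_reductions", 0) > 0 for l in listings
        ) / len(listings),
    }
```

Run this per zip code per day, and the time series of each metric becomes the trend signal described above.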

Data Collection Frequency

For investment analysis, daily updates are sufficient for most metrics. Weekly updates work for trend analysis. The proxy cost for daily monitoring of 10 markets (approximately 50,000 active listings per refresh):

  • API-targeted approach: ~2.5 GB/day = 75 GB/month
  • At $4.25-$4.75/GB: $318.75-$356.25/month
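The arithmetic above generalizes into a quick estimator. A sketch assuming ~50 KB of JSON per listing and the residential per-GB pricing quoted in this guide; actual figures depend on payload size and how many listings change per refresh:

```python
def monthly_proxy_cost(
    listings: int,
    kb_per_listing: float = 50,       # ~50 KB of JSON per API-targeted listing
    refreshes_per_day: int = 1,
    price_per_gb: tuple = (4.25, 4.75),  # residential low/high $/GB
) -> tuple:
    """Return the (low, high) monthly proxy cost in dollars."""
    gb_per_month = listings * kb_per_listing * refreshes_per_day * 30 / 1_000_000
    return tuple(round(gb_per_month * p, 2) for p in price_per_gb)
```

For the 10-market example (50,000 listings refreshed daily), this reproduces the 75 GB/month and $318.75-$356.25 figures above.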

Use Case 2: Proptech Product Data

Proptech companies that build consumer-facing tools (home valuation estimators, market comparison dashboards, investment calculators) need comprehensive, current listing data.

Challenges at Proptech Scale

Coverage requirements: Consumers expect to see every listing. Missing 5% of listings in a market undermines trust. This requires monitoring multiple sources per market with deduplication.

Data freshness: Listings go under contract within hours in hot markets. If your data is 24 hours stale, you are showing properties that no longer exist. This drives the need for frequent refresh cycles.

Photo collection: Property photos are essential for consumer-facing products but expensive to collect through proxies. Each listing has 20-50 photos averaging 200 KB each, or roughly 4-10 MB per listing. For 100,000 listings, that is 400 GB to 1 TB of photo data alone.

Recommendation: Collect metadata (price, address, features) via API targeting with residential proxies for protected portals. Collect photos separately on a less frequent schedule, or source them from data partners.
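The cross-source deduplication mentioned under coverage requirements usually keys on a normalized address. A simplified sketch -- real pipelines add unit numbers, geocoding, and fuzzy matching:

```python
import re

def dedupe_key(listing: dict) -> str:
    """Normalize an address into a cross-source deduplication key.

    Assumes each record carries raw 'address' and 'zip' fields.
    """
    addr = listing["address"].lower()
    addr = re.sub(r"\b(street|st\.?)\b", "st", addr)   # canonicalize suffixes
    addr = re.sub(r"\b(avenue|ave\.?)\b", "ave", addr)
    addr = re.sub(r"[^a-z0-9 ]", "", addr)             # strip punctuation
    addr = re.sub(r"\s+", " ", addr).strip()
    return f"{addr}|{listing['zip']}"

def dedupe(listings: list) -> list:
    """Keep the most recently updated record per normalized address."""
    best = {}
    for l in listings:
        key = dedupe_key(l)
        if key not in best or l["updated_at"] > best[key]["updated_at"]:
            best[key] = l
    return list(best.values())
```

The key design choice is which record wins a collision; preferring the freshest update directly serves the data-freshness requirement above.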

Use Case 3: Market Research and Appraisal

Appraisers, lenders, and market researchers need historical and current comp data for property valuations. This requires both active listing data and sold/closed transaction records.

Collecting Sold Data

Sold property data is more difficult to collect than active listings because:

  • Some portals restrict sold data behind authentication walls
  • County assessor records (the authoritative source) have wildly varying website quality
  • MLS sold data is typically restricted to licensed agents

County assessor strategy: County assessor websites are generally unprotected and serve structured data about property transactions. ISP proxies work well here, providing fast, unlimited-bandwidth access.

def collect_county_records(county_url: str, parcel_ids: list) -> list:
    """
    Collect property records from county assessor websites.
    ISP proxies provide fast, cost-effective access for unprotected sites.
    """
    proxy = get_proxy("county_records")  # Routes to ISP proxy
    records = []
    
    for parcel_id in parcel_ids:
        record_url = f"{county_url}/parcel/{parcel_id}"
        response = requests.get(record_url, proxies=proxy, timeout=15)
        
        if response.status_code == 200:
            record = parse_assessor_record(response.text)
            records.append(record)
        
        time.sleep(random.uniform(1, 3))
    
    return records

Handling Zillow and Redfin Specifically

These two portals deserve specific attention because they are the most commonly targeted and the most aggressively protected.

Zillow

Protection: PerimeterX with aggressive JavaScript challenges. Detects and blocks headless browsers, analyzes mouse movement patterns, and maintains an IP reputation database.

What works: Residential proxies with per-request rotation. Target Zillow's internal APIs (the /api/ endpoints) rather than scraping full HTML pages. Maintain realistic request headers including a current Chrome User-Agent.

What does not work: ISP proxies (blocked at high rates). Headless browsers without fingerprint evasion (detected immediately). Rapid request rates (more than 5 requests/minute to Zillow triggers rate limiting even with rotation).
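The sub-5-requests-per-minute pacing can be enforced with a small throttle. A generic sketch, not Zillow-specific:

```python
import time

class DomainThrottle:
    """Enforce a minimum interval between requests to one domain.

    At 13 seconds between requests, throughput stays under ~5 requests/minute.
    """

    def __init__(self, min_interval: float = 13.0):
        self.min_interval = min_interval
        self.last_request = 0.0

    def wait(self) -> float:
        """Sleep until the interval has elapsed; return seconds actually slept."""
        elapsed = time.monotonic() - self.last_request
        delay = max(0.0, self.min_interval - elapsed)
        if delay:
            time.sleep(delay)
        self.last_request = time.monotonic()
        return delay
```

One instance per target domain; call `wait()` immediately before each request to that domain.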

Success rate with residential proxies: 89-93% (Hex Proxies internal testing, January 2026).

Redfin

Protection: Custom anti-bot with Cloudflare. Less aggressive than Zillow but still effective against datacenter and ISP proxies.

What works: Residential proxies with moderate pacing (2-4 second delays). Redfin's mobile API endpoints are generally less protected than the web interface.

Success rate with residential proxies: 92-96% (Hex Proxies internal testing, January 2026).

Cost Planning by Scale

| Scale | Listings Monitored | Sources | Refresh Frequency | Monthly Bandwidth (API-targeted) | Proxy Cost/Month |
|---|---|---|---|---|---|
| Small investor | 5,000 | 2 | Daily | 1.5 GB | $6.38-$7.13 |
| Regional operator | 50,000 | 5 | Daily | 25 GB | $106.25-$118.75 |
| Proptech startup | 500,000 | 10 | Twice daily | 500 GB | $2,125-$2,375 |
| National platform | 2,000,000+ | 15+ | 4x daily | 2,400+ GB | $10,200-$11,400+ |

For small investors, the proxy cost is trivial -- less than the cost of a single real estate database subscription. For proptech companies, the cost scales with data requirements but remains a fraction of the value the data creates.

Legal and Ethical Considerations

Real estate data collection operates in a specific legal context:

Public records are public. Property tax records, deed transfers, and assessor data are public government records. Collecting this data is generally uncontroversial.

Listing data has copyright considerations. MLS listing data is copyrighted by the listing agent and the MLS. Repackaging this data without authorization may violate copyright. However, displaying factual data (prices, addresses, square footage) extracted from public-facing websites is generally permissible under US law.

Terms of service vary. Zillow's terms of service prohibit scraping, but the legal enforceability of such terms against data scraping has been weakened by court decisions (notably hiQ v. LinkedIn). Consult legal counsel for your specific situation.

Rate limiting is respectful. Regardless of legality, maintaining reasonable request rates (2-5 second delays between requests to the same domain) reduces server impact and reduces the likelihood of IP blocks.

Frequently Asked Questions

How many residential proxy IPs do I need for real estate data collection?

With per-request rotation, you do not need a specific number of IPs. Hex Proxies' residential pool provides access to millions of IPs. You pay per GB of bandwidth, not per IP. The pool size ensures sufficient IP diversity for any real estate data collection workload.

Can I use ISP proxies for Zillow?

Not reliably. Zillow's PerimeterX protection actively blocks ISP-range IPs. Residential proxies are required for consistent access. See our proptech industry page for source-specific recommendations.

How do I handle listings that require JavaScript rendering?

For listing detail pages that load data via JavaScript, use Playwright or Puppeteer with proxy configuration. This increases bandwidth consumption by 5-10x compared to API targeting. Reserve browser rendering for sources where API endpoints cannot be identified.
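Playwright accepts proxy credentials separately from the server address at browser launch. A sketch that converts the user:pass gateway URL format used earlier in this guide into Playwright's proxy dict:

```python
from urllib.parse import urlsplit

def playwright_proxy(proxy_url: str) -> dict:
    """Convert a user:pass proxy URL into Playwright's launch proxy dict.

    Usage (with playwright installed):
        browser = p.chromium.launch(proxy=playwright_proxy(url))
    """
    parts = urlsplit(proxy_url)
    return {
        "server": f"{parts.scheme}://{parts.hostname}:{parts.port}",
        "username": parts.username,
        "password": parts.password,
    }
```

Because browser rendering pulls every page asset through the proxy, reserve it for sources where no API endpoint can be found.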

What is the best strategy for collecting property photos?

Photos are bandwidth-intensive (200 KB average per image, 20-50 images per listing). Consider: (1) collect metadata first, photos only for listings that meet your criteria; (2) use ISP proxies for photo collection from portals that allow it (unlimited bandwidth advantage); (3) collect lower-resolution thumbnails initially and full-resolution only on demand.
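Strategies (1) and (3) combine into a simple gating function. A sketch with illustrative field names ('thumb_urls', 'photo_urls', 'price'):

```python
def photo_plan(listing: dict, max_price: int = 500_000) -> list:
    """Decide which photo URLs to fetch for a listing.

    Thumbnail-first: always queue thumbnails; fetch full-resolution images
    only for listings that pass the (hypothetical) investment filter.
    """
    urls = list(listing.get("thumb_urls", []))
    if listing["price"] <= max_price:
        urls += listing.get("photo_urls", [])
    return urls
```

Swap the price check for whatever criteria define "interesting" in your pipeline; the point is that full-resolution fetches happen only after the cheap metadata pass.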


Start collecting real estate data with Hex Proxies. Residential proxies at $4.25-$4.75/GB provide the IP trust needed for protected portals like Zillow and Redfin. ISP proxies at $2.08-$2.47/IP with unlimited bandwidth handle county records and unprotected sources. Visit our real estate data use case for more setup guidance.
