
Web Scraping Ethics and Compliance: A Practical Guide

Last updated: April 2026

By Hex Proxies Engineering Team

The definitive resource on ethical web scraping and data collection compliance. Covers the legal landscape (public data doctrine, CFAA, GDPR, CCPA), robots.txt best practices, rate limiting, terms-of-service analysis, and responsible scraping code templates.

Intermediate · 18 minutes

Steps

1

Understand the legal framework

Learn the key legal principles: public data doctrine, CFAA, GDPR/CCPA requirements, and how they apply to web scraping.

2

Implement robots.txt compliance

Parse and respect robots.txt directives, including crawl-delay, disallow rules, and sitemap references.

3

Apply rate limiting and politeness

Implement request rate limits, retry backoff, and session management that respects server resources.

4

Build a responsible scraping pipeline

Use the code template to create a scraping pipeline with built-in ethics checks, logging, and compliance controls.

Web scraping is legal when done responsibly, but the ethical and compliance landscape is nuanced. The foundational legal principle in the US is that scraping publicly accessible data does not constitute unauthorized access under the CFAA, established in *hiQ Labs v. LinkedIn* (a Ninth Circuit ruling reaffirmed in 2022 after remand from the Supreme Court). However, legality depends on what you scrape (public vs. private data), how you scrape it (respecting robots.txt and server resources), where the data subjects are located (GDPR, CCPA), and what you do with the data (purpose limitation). Approximately 62% of enterprise organizations now include web scraping in their data strategy, yet fewer than half have a formal scraping compliance policy (Hex Proxies survey of 400 enterprise data teams, February 2026). This guide provides a practical framework for ethical, compliant web scraping.

Quick Answer

| Question | Short Answer |
|---|---|
| **Is web scraping legal?** | Generally yes for publicly accessible data in the US (hiQ v. LinkedIn). More restricted in the EU due to GDPR. Always depends on method and purpose. |
| **Do I need to follow robots.txt?** | Legally ambiguous (not a law), but ethically mandatory and treated as evidence of good faith in legal disputes. |
| **Can I scrape personal data?** | In the EU, only with a lawful basis under GDPR (usually legitimate interest with a documented assessment). In the US, publicly available personal data is generally scrapable. |
| **What about Terms of Service?** | ToS violations alone do not create criminal liability (hiQ ruling), but can expose you to breach-of-contract civil claims. |
| **How fast should I scrape?** | Default to 1 req/sec per domain. Respect crawl-delay in robots.txt. Never degrade the target's service for real users. |
| **Do I need proxies for ethical scraping?** | Proxies distribute load across IPs, reducing the impact on target servers. They are an ethical tool when used to avoid overloading single exit points. |

---

The Legal Landscape

The Public Data Doctrine (United States)

The most significant legal development for web scraping in recent years is *hiQ Labs, Inc. v. LinkedIn Corp.* In 2022, after the US Supreme Court vacated and remanded the case in light of *Van Buren v. United States* (2021), the Ninth Circuit reaffirmed its ruling that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (CFAA), which criminalizes "unauthorized access" to computer systems.

**Key principles from hiQ v. LinkedIn:**

1. **Publicly accessible = authorized access.** If data is available to any visitor without authentication (no login required), accessing it is not "unauthorized" under the CFAA.
2. **Technical barriers are not legal barriers.** The fact that a website uses rate limiting, CAPTCHAs, or anti-bot measures does not make circumventing them a CFAA violation — though it may raise other legal claims.
3. **ToS violations are not criminal.** Violating a website's Terms of Service alone does not create criminal liability under the CFAA. However, it may create civil liability for breach of contract.

**Important limitations:**

- This ruling applies specifically to the CFAA. Other laws (state computer crime statutes, copyright, trade secrets) may still apply.
- The ruling covers publicly accessible data. Scraping content behind authentication (login walls, paid subscriptions) has different legal implications.
- The ruling is US-specific. Other jurisdictions have different frameworks.

CFAA: The Computer Fraud and Abuse Act

The CFAA (18 U.S.C. § 1030) prohibits accessing a computer "without authorization" or "exceeding authorized access." For web scraping, the key question is whether visiting a public website constitutes "authorized access."

**Post-hiQ interpretation:** Courts have increasingly held that accessing public web pages is authorized because the website makes those pages available to the general public. However, some aggressive interpretations still exist in different circuits.

**What crosses the line:**

- Bypassing authentication mechanisms (logging in with stolen credentials)
- Accessing areas of a website explicitly protected by technical access controls
- Continuing to access a system after receiving a cease-and-desist and a technical block (IP ban) — this area is legally gray and varies by jurisdiction

GDPR: General Data Protection Regulation (EU)

GDPR applies to scraping when you collect personal data (names, email addresses, photos, location data, IP addresses, or any data that can identify a person) of individuals located in the EU — regardless of where your company is based.

**Key GDPR requirements for scrapers:**

1. **Lawful basis (Article 6):** You need a legal justification for processing personal data. For scraping, the most applicable basis is usually "legitimate interest" (Article 6(1)(f)), which requires:
   - A documented Legitimate Interest Assessment (LIA)
   - Demonstrating that your interest outweighs the data subject's rights
   - Providing a way for data subjects to object

2. **Purpose limitation (Article 5(1)(b)):** You must define the specific purpose for collecting the data and not use it for incompatible purposes later.

3. **Data minimization (Article 5(1)(c)):** Collect only the data you need. If you are scraping product prices, do not also collect reviewer names and profile information.

4. **Transparency (Articles 13-14):** Data subjects have the right to know you are processing their data. For scraped data, you must provide notice "within a reasonable period" — typically interpreted as within one month.

5. **Right to erasure (Article 17):** Data subjects can request deletion of their personal data from your database.

**Practical GDPR compliance for scraping:**

- Document your legitimate interest assessment before starting any scraping that involves personal data
- Minimize data collection: if you only need prices, do not collect user reviews with names
- Maintain a data inventory showing what personal data you hold and where it came from
- Implement a process for handling data subject access and deletion requests
- Set retention periods — do not keep personal data indefinitely
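Data minimization is easiest to enforce at parse time with an allow-list. A minimal sketch (the field names are illustrative assumptions, not from any specific site):

```python
# Hypothetical allow-list: only the fields the documented purpose requires.
ALLOWED_FIELDS = {"product_name", "price", "currency", "availability"}

def minimize(record: dict) -> dict:
    """Drop every field not on the allow-list before storage."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

raw = {
    "product_name": "Widget",
    "price": 19.99,
    "currency": "USD",
    "reviewer_name": "Jane Doe",           # personal data: never stored
    "reviewer_email": "jane@example.com",  # personal data: never stored
}
clean = minimize(raw)
```

An allow-list beats a deny-list here: if the site adds new personal-data fields, they are dropped by default instead of silently collected.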

CCPA / CPRA: California Consumer Privacy Act

California's privacy law applies to companies that collect personal information of California residents and meet certain size thresholds ($25M+ annual revenue, or buying/selling/sharing personal information of 100K+ consumers or households under the CPRA amendments, or 50%+ of revenue from selling or sharing personal information).

**Key CCPA requirements for scrapers:**

- Right to know: California consumers can request to know what personal information you have collected about them
- Right to delete: California consumers can request deletion of their personal information
- Right to opt out of sale: If you sell or share scraped personal data, consumers can opt out
- Privacy policy: Must disclose the categories of personal information collected and their sources

**CCPA vs. GDPR:** CCPA is narrower — it applies to companies of a certain size and focuses on the right to know and delete. GDPR is broader — it requires a lawful basis before any processing and has stricter data minimization requirements.

Copyright Considerations

Copyright protects creative expression, not facts. This distinction is critical for web scraping:

**What you can generally scrape (factual data):**

- Product prices, specifications, and availability
- Business names, addresses, and operating hours
- Stock prices, weather data, and sports scores
- Publicly posted job listings

**What carries copyright risk (creative works):**

- Full articles, blog posts, and written content
- Photographs, illustrations, and other media
- Software code and documentation
- Database compilations with creative selection/arrangement

**The safe approach:** Scrape facts and data, not creative works. If you need content (for AI training, for example), evaluate fair use doctrine or obtain a license.

---

robots.txt Compliance

What robots.txt Is (and Is Not)

The robots.txt file (formally, the Robots Exclusion Protocol) is a text file at the root of a website that tells web crawlers which parts of the site they should not access. It is a convention, not a law — there is no statute that makes violating robots.txt illegal. However:

  1. **Courts treat it as evidence of intent.** If a website's robots.txt says "Disallow: /api/" and you scrape /api/ anyway, courts view this as evidence that you knew the site owner did not want that area scraped.
  2. **It is an industry standard.** Search engines (Google, Bing) respect robots.txt. Following the same standard establishes you as a responsible crawler.
  3. **It protects the site owner's resources.** robots.txt often disallows paths that are resource-intensive to serve. Respecting it prevents you from accidentally DDoS-ing a site.

How to Read and Parse robots.txt

Every domain's robots.txt is located at `https://example.com/robots.txt`. The file contains directives for different user agents:

```
# Example robots.txt
User-agent: *
Crawl-delay: 2
Disallow: /api/
Disallow: /admin/

User-agent: Googlebot
Allow: /api/search
Crawl-delay: 0.5

Sitemap: https://example.com/sitemap.xml
```

**Key directives:**

- **User-agent:** Which crawler the rules apply to. `*` means all crawlers.
- **Disallow:** Paths that should not be crawled. `/api/` means anything under /api/.
- **Allow:** Exceptions to Disallow rules (takes precedence).
- **Crawl-delay:** Seconds to wait between requests. Not all parsers support this, but you should.
- **Sitemap:** Location of the XML sitemap — useful for discovering all crawlable URLs.

Implementation Example

```python
import time
import urllib.robotparser
from urllib.parse import urlparse


class EthicalScraper:
    """A scraper that respects robots.txt and rate limits."""

    def __init__(self, user_agent: str = "HexProxyScraper/1.0"):
        self.user_agent = user_agent
        self.robot_parsers: dict = {}
        self.default_delay = 1.0  # 1 second between requests

    def can_fetch(self, url: str) -> bool:
        """Check if robots.txt allows scraping this URL."""
        parsed = urlparse(url)
        domain = f"{parsed.scheme}://{parsed.netloc}"

        if domain not in self.robot_parsers:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{domain}/robots.txt")
            try:
                rp.read()
            except Exception:
                # If robots.txt is unreachable, default to allowed
                return True
            self.robot_parsers[domain] = rp

        return self.robot_parsers[domain].can_fetch(self.user_agent, url)

    def get_crawl_delay(self, url: str) -> float:
        """Get the crawl delay for this domain."""
        parsed = urlparse(url)
        domain = f"{parsed.scheme}://{parsed.netloc}"

        if domain in self.robot_parsers:
            delay = self.robot_parsers[domain].crawl_delay(self.user_agent)
            if delay is not None:
                return float(delay)

        return self.default_delay

    def scrape(self, url: str) -> dict:
        """Scrape a URL ethically."""
        if not self.can_fetch(url):
            return {"status": "blocked_by_robots_txt", "url": url}

        delay = self.get_crawl_delay(url)
        time.sleep(delay)

        # Proceed with request using proxy
        # proxy_url = "http://USER:PASS@gate.hexproxies.com:8080"
        # response = requests.get(url, proxies={"https": proxy_url})

        return {"status": "success", "url": url, "delay_applied": delay}
```

---

Rate Limiting and Politeness

The Politeness Principle

Ethical scraping means your crawler should be invisible to the target site's real users. If your scraping degrades the site's performance for legitimate visitors, you are causing harm regardless of legality.

**Politeness standards:**

| Aspect | Recommended Practice | Aggressive Practice (Avoid) |
|---|---|---|
| **Request rate** | 1 req/sec per domain (default) | 10+ req/sec to a single domain |
| **Concurrent connections** | 1-3 per domain | 50+ per domain |
| **Crawl-delay respect** | Always follow | Ignore |
| **Peak hours** | Reduce rate during business hours | Same rate 24/7 |
| **Error handling** | Back off on 429/503 | Retry immediately |
| **User-Agent** | Identify yourself truthfully | Mimic Googlebot |

Implementing Rate Limits

**Default: 1 request per second per domain.** This is the industry standard and sufficient for most scraping operations. At 1 req/sec, you can collect 86,400 pages per day from a single domain — more than enough for most use cases.
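That default can be enforced with a small per-domain limiter. A minimal sketch (the class name and structure are illustrative):

```python
import time
from collections import defaultdict


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_request = defaultdict(float)  # domain -> last monotonic timestamp

    def wait(self, domain: str) -> float:
        """Block until the domain's interval has elapsed; return seconds waited."""
        elapsed = time.monotonic() - self.last_request[domain]
        delay = max(0.0, self.min_interval - elapsed)
        if delay:
            time.sleep(delay)
        self.last_request[domain] = time.monotonic()
        return delay
```

Because state is keyed by domain, requests to different domains never block each other; only same-domain traffic is throttled.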

**Adjusting up:** If the target site's robots.txt specifies a crawl-delay of less than 1 second, or if the site is a high-capacity CDN-fronted service (e.g., a major e-commerce platform), you may carefully increase to 2-5 req/sec while monitoring response times for degradation.

**Adjusting down:** If you receive 429 (Too Many Requests) or 503 (Service Unavailable) responses, immediately reduce your rate by 50% and implement exponential backoff:

```python
import random


def backoff_delay(attempt: int, base_delay: float = 1.0) -> float:
    """Calculate exponential backoff with jitter."""
    delay = base_delay * (2 ** attempt)
    jitter = random.uniform(0, delay * 0.3)
    return min(delay + jitter, 60.0)  # Cap at 60 seconds
```

Using Proxies Ethically for Rate Distribution

Proxies serve an ethical purpose in scraping: they distribute requests across multiple IP addresses, reducing the load on any single exit point. This is not about "hiding" your scraping — it is about being a polite distributed client rather than hammering a target from one IP.

**Ethical proxy usage for scraping:**

- Distribute requests across IPs to reduce per-IP load on the target
- Use proxies to match the geographic location of the data you need (a US price check from a US IP)
- Maintain consistent behavior per IP (do not rapidly cycle through IPs, which looks like an attack)

**Unethical proxy usage (avoid):**

- Using proxies solely to circumvent rate limits and scrape faster than the site allows
- Rotating IPs rapidly to avoid detection while violating robots.txt
- Using proxies to access content you are not authorized to view (behind login walls)

---

Terms of Service Analysis

What ToS Can and Cannot Do

Website Terms of Service are contracts. Violating them can create civil liability (breach of contract) but, per the hiQ ruling, does not create criminal liability under the CFAA for public data.

**Common ToS restrictions on scraping:**

1. "You may not use automated means to access the site" — the broadest restriction
2. "You may not scrape, crawl, or index any content" — specifically targeting data collection
3. "You may not use the site for commercial purposes without permission" — restricting commercial scraping

**Legal reality:** Courts have been reluctant to enforce ToS against scraping of public data, especially when the scraped data is factual (prices, listings, public profiles). However, if you scrape and then compete directly with the website using their data (e.g., building a competing product listing site), the legal risk increases.

**Practical approach:**

- Read the ToS before scraping
- If the ToS explicitly prohibits scraping and the site actively enforces it (sending C&Ds, blocking IPs), assess the legal risk with counsel before proceeding
- Focus on factual data, not creative content
- Do not redistribute scraped content in a way that competes directly with the source

---

GDPR-Compliant Scraping Checklist

If you are scraping data that includes personal information of EU residents, follow this checklist:

  1. [ ] **Legitimate Interest Assessment documented** — write down your purpose, why it requires this personal data, and why your interest outweighs the individuals' privacy expectations
  2. [ ] **Data minimization applied** — collect only the personal data fields you specifically need
  3. [ ] **Purpose limitation defined** — document exactly what you will use the data for
  4. [ ] **Data inventory maintained** — know what personal data you hold, where it came from, and when you collected it
  5. [ ] **Retention period set** — define when you will delete the data
  6. [ ] **Subject access process** — have a way to respond to individuals who ask what data you hold about them
  7. [ ] **Subject deletion process** — have a way to delete an individual's data upon request
  8. [ ] **Transparency notice** — provide information about your data processing to data subjects within a reasonable time
  9. [ ] **Security measures** — encrypt stored personal data, restrict access, log access events
  10. [ ] **Third-party sharing documented** — if you share scraped personal data with others, document the sharing and ensure recipients have their own lawful basis
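The retention item above is easy to automate. A minimal sketch, assuming each stored record carries a `collected_at` Unix timestamp (the field name and 90-day window are illustrative assumptions):

```python
import time

RETENTION_SECONDS = 90 * 24 * 3600  # assumed 90-day retention policy


def purge_expired(records, now=None):
    """Keep only records still inside the retention window."""
    now = time.time() if now is None else now
    return [r for r in records if now - r["collected_at"] <= RETENTION_SECONDS]
```

Run this on a schedule (e.g. a nightly job) so deletion is systematic rather than best-effort.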

---

Building a Responsible Scraping Pipeline

A responsible scraping pipeline builds ethics and compliance checks into the infrastructure, not as an afterthought.

Architecture

URL Queue → robots.txt Check → Rate Limiter → Proxy Router → Request → Response Parser → Data Store
                ↓                     ↓                                          ↓
         Skip + Log          Respect Delay                              Minimize Data

Key Components

**1. Pre-scrape compliance check:**
- Fetch and parse robots.txt for every new domain
- Reject URLs that are disallowed
- Extract crawl-delay and apply it to the rate limiter
- Log all compliance decisions for an audit trail

**2. Rate limiter (per-domain):**
- Default: 1 request per second per domain
- Override with robots.txt crawl-delay if specified
- Exponential backoff on 429/503 responses
- Time-of-day adjustment (slower during business hours)

**3. Proxy integration:**
- Use ISP proxies for session-based scraping (account-linked, consistent IP needed)
- Use residential rotating proxies for large-scale data collection (geo-diversity needed)
- Distribute load across IPs to reduce per-IP impact on targets

**4. Data minimization at parse time:**
- Extract only the specific fields you need
- Strip personal data that is not required for your purpose
- Do not store raw HTML if you only need structured data

**5. Audit logging:**
- Log every request: timestamp, URL, proxy used, response code, data fields extracted
- Log compliance decisions: robots.txt checks, rate limit applications, data field omissions
- Retain logs for legal defense purposes
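Structured JSON Lines is a convenient format for that audit trail. A minimal sketch (the record fields are illustrative):

```python
import io
import json
import time


def log_request(logfile, url, proxy, status, fields):
    """Append one audit record per request as a JSON line; return the record."""
    record = {
        "ts": time.time(),
        "url": url,
        "proxy": proxy,
        "status": status,
        "fields_extracted": fields,
    }
    logfile.write(json.dumps(record) + "\n")
    return record


# Demo against an in-memory buffer; production code would use an append-only file.
buf = io.StringIO()
rec = log_request(buf, "https://example.com/p/1", "pool-a", 200, ["price"])
```

One record per line keeps the log greppable and trivially parseable during a later audit or legal review.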

---

Industry Best Practices

What Responsible Companies Do

Based on conversations with data teams at companies that scrape at scale (Hex Proxies customer survey, February 2026):

  1. **72% respect robots.txt strictly** (never scrape disallowed paths)
  2. **85% implement rate limiting** (1-5 req/sec per domain range)
  3. **43% have a formal scraping policy document** (increasing from 28% in 2024)
  4. **67% use proxies** for load distribution
  5. **38% have legal counsel review their scraping practices** annually
  6. **55% provide an identifiable User-Agent string**
  7. **31% have automated robots.txt compliance checks** in their scraping pipelines

The Scraping Ethics Maturity Model

| Level | Description | Characteristics |
|---|---|---|
| **1 — Ad Hoc** | No formal scraping policy | Developers scrape as needed with no guidelines |
| **2 — Aware** | Basic awareness of scraping ethics | robots.txt checked manually, rate limiting inconsistent |
| **3 — Defined** | Formal scraping policy exists | Written guidelines, rate limiting implemented, robots.txt automated |
| **4 — Managed** | Compliance built into infrastructure | Automated compliance checks, audit logging, legal review |
| **5 — Optimized** | Industry-leading ethics program | Published scraping policy, transparency reports, community contribution |

Most companies operate at Level 2-3. Moving to Level 4 (compliance built into infrastructure) is the goal for any company that scrapes at scale.

---

How Hex Proxies Handles This

Hex Proxies is committed to responsible proxy usage and provides infrastructure designed to support ethical scraping at scale.

**What this means for ethical scrapers:**

  • **No log retention of customer scraping targets.** We do not monitor, log, or retain records of what URLs our customers access through our proxies. Your scraping targets are your business.
  • **Rate limiting tools built in.** The Hex Proxies API supports configurable rate limits per target domain, helping you enforce politeness at the infrastructure level rather than relying solely on application code.
  • **Ethical sourcing of residential IPs.** Our residential IP pool is sourced through SDK partnerships where users explicitly consent to bandwidth sharing and receive compensation. We do not use deceptive practices to acquire residential IPs.
  • **ISP proxies require no residential sourcing.** Our ISP proxy infrastructure uses IPs purchased directly from carriers — no consumer involvement, no sourcing ethics concerns. For customers where ethical sourcing is a priority, ISP proxies eliminate the issue entirely.
  • **Compliance documentation support.** We provide documentation about our infrastructure, IP sourcing, and data handling practices to help customers meet their own compliance requirements (GDPR vendor assessments, SOC 2 questionnaires).
  • **Acceptable use policy.** We prohibit using Hex Proxies for illegal activities, accessing systems without authorization, or activities that cause harm. Our AUP is published and enforced.

---

Methodology

Data in this guide is sourced from:

  • **Legal analysis:** Based on published court rulings (hiQ Labs v. LinkedIn, Van Buren v. United States), GDPR enforcement actions, and legal commentary from privacy law practitioners. This guide does not constitute legal advice — consult qualified legal counsel for your specific situation.
  • **Survey data:** Hex Proxies survey of 400 enterprise data teams, conducted February 2026. Survey covered scraping practices, compliance policies, proxy usage, and ethical approaches. Response rate: 32% (from 1,250 invitations to data engineering leaders).
  • **Industry practices:** Based on published scraping policies from major data providers, open-source scraping framework documentation, and community best practices.
  • **Last updated:** April 2026.

Frequently Asked Questions

**Is web scraping legal?** In the US, scraping publicly accessible data is generally legal, as established by *hiQ Labs v. LinkedIn* (a Ninth Circuit ruling reaffirmed in 2022 after Supreme Court remand). The CFAA does not criminalize accessing publicly available websites. However, legality depends on what you scrape (public vs. authenticated content), how you scrape (respecting rate limits and robots.txt), and what jurisdiction's laws apply. In the EU, GDPR adds requirements when personal data is involved. Always consult legal counsel for your specific situation.

**Do I have to follow robots.txt?** robots.txt is not legally binding — there is no law that requires compliance. However, respecting robots.txt is the industry standard for ethical scraping and is treated as evidence of good faith in legal disputes. Courts have cited robots.txt compliance (or non-compliance) when evaluating whether scraping was conducted responsibly. Our recommendation: always follow robots.txt unless you have a specific, documented legal basis for deviating.

**Can I scrape data protected by login?** Scraping behind authentication carries significantly higher legal risk. The hiQ ruling specifically covered publicly accessible data. Logging in with your own legitimate credentials and scraping may be permissible, but logging in with scraped/purchased credentials or circumventing technical access controls likely violates the CFAA and equivalent laws in other jurisdictions.

**How does GDPR affect web scraping?** GDPR requires a lawful basis for processing personal data of EU residents. If your scraping collects personal information (names, email addresses, photos, location data), you need to document a Legitimate Interest Assessment, implement data minimization, set retention periods, and provide mechanisms for data subjects to exercise their rights (access, deletion, objection). GDPR applies regardless of where your company is located — it follows the data subject.

**What is the safest approach to web scraping?** The safest approach combines five elements: (1) scrape only publicly accessible data, (2) respect robots.txt strictly, (3) rate limit to 1 request per second per domain, (4) collect only factual data (prices, specifications, listings) rather than creative content, and (5) document your methodology. If you follow all five, you are operating within widely accepted ethical norms.

**How does using proxies affect scraping compliance?** Proxies are neutral tools — they do not make scraping more or less ethical. Using proxies to distribute request load across IPs (reducing impact on target servers) is an ethical use. Using proxies solely to evade detection while violating robots.txt or rate limits does not make the violation more ethical. The scraping behavior, not the tools, determines compliance.

**What is robots.txt crawl-delay and should I respect it?** Crawl-delay is a robots.txt directive that specifies the number of seconds a crawler should wait between requests. Not all search engines support it (Google does not), but ethical scrapers should. A crawl-delay of 10 means wait 10 seconds between requests to that domain. If no crawl-delay is specified, default to 1 second between requests.

**Can I scrape prices and product data from competitor websites?** In most jurisdictions, scraping publicly displayed prices and product specifications is legal because this is factual data, not protected expression. Price comparison services and market research are well-established use cases. The risk increases if you republish the scraped content in a way that directly competes with the source or if you violate the source's ToS in a jurisdiction that enforces ToS contractually.

**What should I do if I receive a cease-and-desist letter?** Take it seriously. Stop scraping the specified site immediately, review the claims with legal counsel, document your scraping practices and compliance measures, and respond through counsel if appropriate. A cease-and-desist is not a legal order (only a court can issue that), but ignoring it can weaken your legal position if the matter goes to court.

**How do I scrape responsibly at large scale (millions of pages)?** Large-scale responsible scraping requires infrastructure-level compliance: automated robots.txt parsing for every new domain, per-domain rate limiting enforced at the proxy/queue level, exponential backoff on error responses, request distribution across IPs to minimize per-source impact, data minimization at parse time, and comprehensive audit logging. Use ISP or residential proxies to distribute load ethically.

**What is the difference between scraping and crawling?** Crawling is the process of discovering and visiting web pages (following links). Scraping is extracting structured data from those pages. Most data collection involves both: the crawler discovers URLs, and the scraper extracts the data. Both are subject to the same ethical and legal considerations — robots.txt applies to crawling, and data handling laws apply to scraping.

**How often do scraping laws change?** Frequently. Major developments occur every 1-2 years. The hiQ ruling (2022) was the most significant recent change in the US. The EU's Digital Markets Act and AI Act (both effective 2024-2025) are shaping how scraped data can be used for AI training. State-level privacy laws in the US (Texas, Oregon, Montana) are expanding CCPA-like requirements. Review your scraping compliance annually.

Tips

  • Always check robots.txt before scraping any site. Over 85% of the top 1 million sites have a robots.txt file.
  • The hiQ v. LinkedIn ruling established that scraping publicly accessible data is not unauthorized access under the CFAA.
  • GDPR applies to scraping if you collect personal data from EU residents — even if your company is not in the EU.
  • Rate limit to 1 request per second per domain as a default. Increase only if robots.txt allows it or the site can handle more.
  • Document your scraping methodology. If challenged, showing a documented, responsible approach is your best defense.

Ready to Get Started?

Put this guide into practice with Hex Proxies.
