GDPR Compliance for Public Data Collection via Proxies
This article is for informational purposes only and does not constitute legal advice. Consult qualified counsel for guidance specific to your situation.
A common misconception is that the General Data Protection Regulation (Regulation (EU) 2016/679) does not apply to data that is already public. It does. Article 4(1) defines personal data as "any information relating to an identified or identifiable natural person," with no carve-out for public visibility. The Court of Justice of the European Union took this position in Google Spain SL v. Agencia Española de Protección de Datos, Case C-131/12 (2014), decided under the predecessor Directive 95/46/EC, where it held that a search engine processes personal data even when that data has already been published by third parties. Later decisions and regulatory guidance on search engines, directories, and aggregators have followed the same line.
This article walks through the territorial scope and lawful basis analysis, data minimization obligations, notice and DPIA requirements, the research exception, and the role of the proxy provider when an organization uses proxy infrastructure to collect personal data from publicly accessible web sources.
Does GDPR Apply to Your Scraping?
Article 3 sets the territorial scope. GDPR applies when:
- The controller is established in the EU (regardless of where processing happens), or
- The processing activities relate to offering goods or services to data subjects in the EU, or
- The processing monitors the behavior of data subjects in the EU.
A US-based scraper collecting the names and job titles of EU residents from LinkedIn for sales prospecting is likely to fall under Article 3(2)(a) where the prospecting supports offering goods or services to those individuals in the EU. A US scraper collecting US residents' data through an exit node in Germany generally does not trigger GDPR on the basis of the proxy's location alone; the relevant question is where the data subjects are, not where the traffic is routed.
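The Article 3 triage above can be recorded per scraping job. The following is a minimal sketch, not a legal determination: the `ScrapeJob` fields and the `article3_triggers` helper are hypothetical names introduced for illustration, and the boolean answers are assumed to come from a human review. The proxy exit country is stored for audit purposes but deliberately plays no part in the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapeJob:
    """Hypothetical record of the facts relevant to Article 3."""
    controller_established_in_eu: bool
    offers_goods_or_services_to_eu_subjects: bool
    monitors_behavior_of_eu_subjects: bool
    proxy_exit_country: str  # kept for audit only; not used in the decision


def article3_triggers(job: ScrapeJob) -> list[str]:
    """Return the Article 3 limbs that the recorded facts point to."""
    triggers = []
    if job.controller_established_in_eu:
        triggers.append("Art. 3(1) establishment")
    if job.offers_goods_or_services_to_eu_subjects:
        triggers.append("Art. 3(2)(a) offering goods or services")
    if job.monitors_behavior_of_eu_subjects:
        triggers.append("Art. 3(2)(b) monitoring behavior")
    return triggers


# A US controller, US data subjects, German exit node: no limb is triggered.
us_job = ScrapeJob(False, False, False, proxy_exit_country="DE")
print(article3_triggers(us_job))
```

An empty list only means that none of the recorded facts points to Article 3; it does not replace a review of whether those facts were recorded correctly.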
Article 6: Lawful Basis
Assuming GDPR applies, every processing activity needs a lawful basis under Article 6(1). For public data collection, the realistic options are:
- Consent (Article 6(1)(a)): Impractical. Obtaining GDPR-valid consent from data subjects whose data is scraped without their knowledge is not possible: consent must be freely given, specific, informed, and unambiguous (Article 4(11)), and the controller must be able to demonstrate it (Article 7).
- Legitimate interests (Article 6(1)(f)): The primary basis for most commercial scraping operations. Requires a three-part test.
The other four bases (contract, legal obligation, vital interests, public task) rarely fit scraping fact patterns.
The legitimate interests test
Drawn from Article 6(1)(f) and Recital 47, the controller must show:
- Purpose test: A real, articulated legitimate interest exists. Market research, fraud prevention, direct marketing (subject to ePrivacy), and journalism have been accepted as legitimate interests by EU supervisory authorities.
- Necessity test: The processing is necessary to achieve the purpose, meaning there is no less-intrusive alternative. If aggregate or anonymized data would suffice, using personal data fails this test.
- Balancing test: The controller's interests are not overridden by the "interests or fundamental rights and freedoms of the data subject." The more sensitive the data or the more unexpected the processing, the heavier the data subject's side of the scale.
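The Court of Justice in Fashion ID GmbH v. Verbraucherzentrale NRW, Case C-40/17 (2019), restated these three conditions as cumulative. Under the accountability principle in Article 5(2), the controller must be able to demonstrate that the balancing was actually performed and documented, not merely asserted.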
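One way to keep the three-part test from being merely asserted is to store it as a structured record and refuse to start a job while any part is blank. The sketch below assumes a hypothetical `LegitimateInterestsAssessment` class; the field names are illustrative and the check only verifies that something was written down, not that the reasoning is sound.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LegitimateInterestsAssessment:
    """Hypothetical record of a three-part legitimate interests test."""
    purpose: str
    necessity_rationale: str
    balancing_rationale: str
    alternatives_considered: list[str] = field(default_factory=list)
    assessed_on: date = field(default_factory=date.today)

    def missing_parts(self) -> list[str]:
        """Name each part of the test that has not been documented."""
        missing = []
        if not self.purpose.strip():
            missing.append("purpose")
        # Necessity needs both a rationale and at least one alternative weighed.
        if not self.necessity_rationale.strip() or not self.alternatives_considered:
            missing.append("necessity")
        if not self.balancing_rationale.strip():
            missing.append("balancing")
        return missing


draft = LegitimateInterestsAssessment(
    purpose="B2B market research on published job titles",
    necessity_rationale="",
    balancing_rationale="Business contact data, low sensitivity, opt-out offered",
)
print(draft.missing_parts())  # the necessity part is still undocumented
```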
Data Minimization: Article 5(1)(c)
Article 5(1)(c) requires data to be "adequate, relevant and limited to what is necessary" for the stated purpose. In scraping terms, this means:
- Do not collect fields you do not need. If your use case is price monitoring, do not also store seller names and addresses.
- Do not retain raw payloads indefinitely. Extract the fields you need, then discard or anonymize the rest.
- Do not scrape at a frequency beyond what the purpose justifies. Hourly collection of a personal profile for sales prospecting is hard to justify as proportionate; weekly collection is easier to defend.
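The first two points above can be enforced in the ingestion pipeline. This is a minimal sketch for the price-monitoring example: the field names in `ALLOWED_FIELDS` and the seven-day `RAW_PAYLOAD_RETENTION` window are assumptions chosen for illustration, not values taken from the regulation.

```python
from datetime import datetime, timedelta, timezone

# Assumed allowlist for a price-monitoring use case; everything else is dropped.
ALLOWED_FIELDS = frozenset({"product_id", "price", "currency", "observed_at"})

# Assumed retention window for raw payloads; set this from your own policy.
RAW_PAYLOAD_RETENTION = timedelta(days=7)


def minimize(record: dict) -> dict:
    """Keep only allowlisted fields, so unneeded personal data is never stored."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}


def raw_payload_expired(fetched_at: datetime, now: datetime) -> bool:
    """True when a raw payload has outlived the retention window."""
    return now - fetched_at > RAW_PAYLOAD_RETENTION


scraped = {
    "product_id": "A-100",
    "price": "19.99",
    "currency": "EUR",
    "observed_at": "2024-05-01T10:00:00Z",
    "seller_name": "Jane Example",      # personal data the use case does not need
    "seller_address": "1 Example Street",
}
print(minimize(scraped))
```

An allowlist is used instead of a blocklist so that a new field appearing on the target site is excluded by default.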
Enforcement is not theoretical. The Italian Garante fined Clearview AI €20 million in 2022 for scraping face images without a lawful basis, citing violations that included the purpose limitation and storage limitation principles. The CNIL in France fined Clearview AI €20 million later in 2022 on similar facts. The 2024 Dutch DPA investigation into a data enrichment vendor cited Article 5(1)(c) explicitly.
Article 14: Notice When Data Is Not Obtained From the Data Subject
When a controller collects personal data from a source other than the data subject (as scraping always does), Article 14 requires providing the data subject with a set of information including the controller's identity, the purposes of processing, the categories of data, the source, and the data subject's rights. The notice must be given within a "reasonable period" and at the latest within one month.
Article 14(5)(b) provides an exemption where "the provision of such information proves impossible or would involve a disproportionate effort." Scraping at scale often falls within this exemption, but the controller must document why contacting each data subject would be disproportionate and must take appropriate measures to protect the data subjects' rights, including making the information publicly available. The Article 29 Working Party's Guidelines on transparency (WP260 rev.01), endorsed by the European Data Protection Board, make clear that the exemption is not automatic and must be justified case by case.
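Where no exemption applies, the outer time limit for the notice can be tracked per record. The sketch below is a simplification under stated assumptions: it treats "one month" as the same day of the following calendar month (clamped to the month's last day), and it models only one of the earlier triggers in Article 14(3), namely first communication with the data subject. The function names are hypothetical.

```python
import calendar
from datetime import date
from typing import Optional


def add_one_month(d: date) -> date:
    """Same day next month, clamped to that month's last day."""
    year = d.year + 1 if d.month == 12 else d.year
    month = 1 if d.month == 12 else d.month + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def latest_notice_date(obtained_on: date,
                       first_contact_on: Optional[date] = None) -> date:
    """Latest date for the Article 14 notice under the assumptions above."""
    deadline = add_one_month(obtained_on)
    if first_contact_on is not None:
        # Notice is due no later than the first communication with the person.
        deadline = min(deadline, first_contact_on)
    return deadline


print(latest_notice_date(date(2024, 1, 31)))
print(latest_notice_date(date(2024, 3, 10), first_contact_on=date(2024, 3, 20)))
```

The one-month figure is an outer limit; the "reasonable period" standard may require earlier notice depending on the circumstances.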
DPIA Requirements: Article 35
A Data Protection Impact Assessment is required under Article 35 when processing is "likely to result in a high risk to the rights and freedoms of natural persons," particularly for large-scale processing, automated decision-making, or systematic monitoring. Large-scale scraping of personal data usually meets at least two of the nine criteria in the Article 29 Working Party's Guidelines WP248 rev.01 (endorsed by the EDPB), which the guidelines treat as the point at which a DPIA is generally required.
A DPIA must include: a systematic description of processing, an assessment of necessity and proportionality, an assessment of risks to data subjects, and the measures envisaged to address those risks. If the DPIA shows residual high risk, prior consultation with the supervisory authority under Article 36 is required before processing begins.
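The nine-criteria screening from WP248 can be run as a first-pass check before a full DPIA is scoped. This sketch uses the guidelines' rule of thumb that processing meeting two or more criteria will in most cases require a DPIA. The identifier strings and the `dpia_likely_required` helper are hypothetical labels for the criteria, and a result of `False` does not rule out a DPIA, since the guidelines note that a single criterion can suffice in some cases.

```python
# Short labels for the nine WP248 criteria (labels are illustrative).
WP248_CRITERIA = frozenset({
    "evaluation_or_scoring",
    "automated_decision_with_legal_or_similar_effect",
    "systematic_monitoring",
    "sensitive_or_highly_personal_data",
    "large_scale_processing",
    "matching_or_combining_datasets",
    "vulnerable_data_subjects",
    "innovative_technology_or_organisation",
    "prevents_exercising_right_or_using_service",
})


def dpia_likely_required(criteria_met: set[str]) -> bool:
    """Rule-of-thumb screen: two or more criteria met."""
    unknown = criteria_met - WP248_CRITERIA
    if unknown:
        raise ValueError(f"Unknown criteria: {sorted(unknown)}")
    return len(criteria_met) >= 2


# A large-scale profile scraper that merges results with a CRM export.
print(dpia_likely_required({"large_scale_processing", "matching_or_combining_datasets"}))
```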
Article 89: The Research Exception
Article 89 provides derogations for processing for scientific research, historical research, statistical purposes, and archiving in the public interest, provided appropriate safeguards are in place (pseudonymization, technical and organizational measures). The research exception is real but narrow. It does not cover commercial market research dressed up as science, and it does not exempt the controller from Articles 5 and 6; it allows specific data subject rights (access, rectification, restriction, objection) to be relaxed, subject to Union or member state law.
Recital 159 clarifies that scientific research should be "interpreted in a broad manner" and can include privately funded research, but the purpose must be genuine scientific inquiry, not commercial exploitation of data labeled as research.
Where Proxies Fit
A proxy network moves bits between client and target. The proxy provider is a processor under Article 28 only if it handles personal data on behalf of the controller, which for pass-through IP routing is not ordinarily the case. Most proxy providers cannot read the payloads they route (HTTPS encrypts the content, although destination hosts and connection metadata remain visible to them), do not store the content, and do not make decisions about processing purposes. The controller-processor analysis therefore depends heavily on what the provider logs, whether it inspects traffic, and what it retains.
When engaging a proxy vendor for GDPR-scope workloads, the controller should confirm:
- What the provider logs (connection metadata, timestamps, source/destination IPs) and for how long.
- Whether the provider processes traffic in a way that would make it a processor requiring an Article 28 data processing agreement (DPA).
- Where the infrastructure is located and whether the transfer mechanisms in Articles 44-49 apply.
Hex Proxies publishes its privacy posture and offers a Data Processing Addendum on request for customers in GDPR scope.
Practical Compliance Checklist
- Identify whether GDPR applies (Article 3 analysis).
- Document the lawful basis. For most commercial scraping, perform and record the legitimate interests balancing test.
- Minimize fields collected and retention periods.
- Address Article 14 notice, or document the disproportionate effort exemption with specificity.
- Run a DPIA if the processing is large-scale or sensitive; consult the supervisory authority under Article 36 if residual risk is high.
- Maintain a record of processing activities (Article 30).
- Execute a data processing agreement (Article 28) with any processor that handles personal data on your behalf.
Key Sources
- Regulation (EU) 2016/679 (GDPR), particularly Articles 3, 4, 5, 6, 7, 14, 28, 30, 35, 36, 44-49, 89.
- EDPB Guidelines 05/2020 on consent; Guidelines 3/2018 on territorial scope.
- Article 29 WP Guidelines WP248 rev.01 on DPIA; Guidelines WP260 rev.01 on transparency.
- Google Spain SL v. AEPD, Case C-131/12 (2014).
- Fashion ID GmbH v. Verbraucherzentrale NRW, Case C-40/17 (2019).
- Garante per la Protezione dei Dati Personali, Decision 50/2022 (Clearview AI).