Building a Data Moat: When Scraped Data Becomes Competitive Advantage
The phrase "data moat" appears in roughly every third software company's fundraising deck, and it is almost always wrong. Data alone is not a moat; most datasets commoditize within a product cycle. What creates defensibility is a data flywheel: a feedback loop in which customer usage produces data that improves the product, which attracts more customers, who produce more data. In the terms of Hamilton Helmer's 7 Powers (2016), a working flywheel is a form of "Network Economies" or "Scale Economies" depending on mechanism, and neither arises automatically just because data exists.
This article walks through the three canonical examples of data flywheels that actually worked, identifies the conditions that made them work, and describes when and how externally collected web data can become part of a defensible product.
The Three Canonical Examples
Tesla: sensor data as closed-loop training corpus
Tesla's Full Self-Driving program collects sensor data from roughly 7 million vehicles on the road (as of Q4 2024 earnings). Each vehicle surfaces corner cases, labeled implicitly by driver intervention, which feed neural network training. The feedback loop is tight: more vehicles produce more miles, more miles produce more edge cases, more edge cases produce better models, and better models produce better autonomy, which attracts more buyers.
The moat is not the raw sensor data. It is the combination of fleet scale, the data infrastructure to ingest and label it, and the model training pipeline that converts it into product improvement. Waymo, operating on a much smaller fleet, cannot match the corner-case coverage regardless of per-mile engineering quality. Mobileye, with a larger data volume through OEM partnerships, cannot match the end-to-end closed loop because it does not own the vehicle software. The moat is the loop, not the volume.
Waze: crowdsourced routing
Waze (acquired by Google for $966 million in 2013) built a live traffic and incident graph from user reports and passive GPS telemetry. Every driver's trip produced speed and routing data, which improved the routing engine, which made the app more useful, which attracted more drivers. The flywheel produced moat strength proportional to daily active users in each metropolitan area, which is why Waze retained its utility advantage over Google Maps in mid-sized cities long after the acquisition.
The structural lesson from Waze: the moat was geographic and density-dependent. The 100,000th user in Tel Aviv improved Waze's Tel Aviv product, while the same user in Kansas City would have added nothing to it, because incident coverage depends on local density. Flywheels compound within a scope; they do not transfer across scopes automatically.
Zillow: the Zestimate and its limits
Zillow's Zestimate, launched in 2006, was one of the first consumer-facing attempts at a proprietary estimate-of-value engine, built on public record data plus user-submitted corrections. The flywheel worked for discovery (more users checking Zestimates meant more searches, better ranking, and more corrections), but it famously broke when Zillow tried to convert the estimates into actual buy/sell decisions via Zillow Offers. The company wrote down $304 million in inventory and announced in November 2021 that it would wind the program down.
The Zillow lesson is that flywheels can produce a useful product feature without producing actual pricing accuracy strong enough to bet capital on. Data scale compounds noise as well as signal. A data moat that works for engagement metrics does not necessarily work when the same model is asked to make high-stakes decisions.
What Made These Flywheels Work
Three conditions appear in every data moat that held up:
- Proprietary collection: The data is produced by the company's own product usage and cannot be replicated by a competitor without building a comparable user base first. Public-record or purchased data does not clear this bar.
- Closed-loop improvement: The data directly improves the product experience, creating a mechanical reason for users to prefer the better product.
- Increasing marginal value: Each additional data point produces more value than the previous one, up to a plateau. If the 10 millionth data point adds nothing that the 1 millionth did not, there is no moat advantage to scale.
Flywheels that fail usually fail on condition one. A company that trains on publicly scrapeable data can be out-trained by any competitor that scrapes the same sources. There is no structural reason one competitor should own the advantage over time unless it compounds a second asset (usage telemetry, proprietary labels, domain experts) on top.
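Of the three conditions, increasing marginal value is the easiest to check empirically. The sketch below is one possible way to do it, not an established method: evaluate the same model at several dataset sizes and ask whether the latest step still improved quality. The function names, the example scores, and the 0.005 gain-per-doubling threshold are all illustrative assumptions.

```python
import math

def marginal_gains(sizes, scores):
    """Quality gained per added data point between consecutive checkpoints."""
    return [
        (s1 - s0) / (n1 - n0)
        for n0, n1, s0, s1 in zip(sizes, sizes[1:], scores, scores[1:])
    ]

def has_plateaued(sizes, scores, min_gain_per_doubling=0.005):
    """True if the latest step gained less than the threshold per doubling of data.

    The threshold is a hypothetical cutoff; pick one that matches what a
    quality point is worth in your own product.
    """
    n0, n1 = sizes[-2], sizes[-1]
    s0, s1 = scores[-2], scores[-1]
    doublings = math.log2(n1 / n0)
    return (s1 - s0) / doublings < min_gain_per_doubling

# Hypothetical evaluation scores at 100k, 1M, and 10M data points.
sizes = [100_000, 1_000_000, 10_000_000]
flat_curve = [0.71, 0.80, 0.801]    # the last 9M points added almost nothing
steep_curve = [0.60, 0.70, 0.80]    # still improving at 10M points

print(has_plateaued(sizes, flat_curve))   # True
print(has_plateaued(sizes, steep_curve))  # False
```

A curve like the first one means the 10 millionth data point adds nothing the 1 millionth did not, so additional scale confers no advantage.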
Where Externally Collected Data Fits
This is the crucial point for any data team funding a scraping operation. Externally collected web data is almost never the moat itself. It is an input that enables the product to exist, in the same way that cloud computing is an input that enables most software to exist. The moat has to be built elsewhere.
Three models describe how externally collected data can contribute to defensibility:
Model 1: Data as table stakes plus proprietary enrichment
Competitive intelligence platforms like Crayon, Klue, and Kompyte all scrape similar public sources (news, job boards, product pages, pricing). The differentiation comes from how they enrich that raw data with customer-specific context (battle cards, win/loss outcomes, sales conversations). The scraped data is commodity; the enrichment loop is proprietary.
If you are building in this space, do not count the scraped data as your moat. Count the structured judgments your customers produce on top of it.
Model 2: Data as real-time coverage advantage
For use cases where freshness matters more than historical depth (threat intelligence, sports betting, flight price alerts), the moat is operational: can you scrape every source every 30 seconds, globally, without triggering rate limits, at a cost structure that supports your pricing? This is a systems and network engineering advantage, not a data advantage. Competitors can eventually build the same infrastructure, but the operational lead time is real and measurable in years.
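The cost-structure question can be made concrete with simple arithmetic. The sketch below assumes a hypothetical workload of 500 sources and a hypothetical all-in cost of $0.50 per 1,000 requests; neither figure comes from a real price list, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400

def requests_per_day(num_sources, interval_seconds):
    """Requests needed to poll every source once per interval for a full day."""
    return num_sources * SECONDS_PER_DAY // interval_seconds

def monthly_cost(num_sources, interval_seconds, cost_per_1k_requests, days=30):
    """Collection cost for a month at a flat per-request price (an assumption)."""
    daily = requests_per_day(num_sources, interval_seconds)
    return daily * days * cost_per_1k_requests / 1000

# 500 sources polled every 30 seconds at an assumed $0.50 per 1,000 requests.
print(requests_per_day(500, 30))      # 1440000
print(monthly_cost(500, 30, 0.50))    # 21600.0
```

Halving the interval doubles the bill, which is why freshness has to be priced into the product rather than treated as a free property of the data.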
Model 3: Data as bootstrap into a network product
Glassdoor started by scraping company review content from older job boards, then converted users who came for that content into contributors who wrote new reviews. At the moment of conversion, the product stopped being a scraping company and became a user-generated content network. The scraped data got the flywheel started; the UGC kept it spinning. LinkedIn did something similar with its early importer of contacts from Outlook and Gmail.
Zillow's Zestimate falls into a variant of this: public records provided the initial listings, and user corrections became part of the data flywheel. That flywheel held until Zillow tried to monetize it through iBuying.
When Scraped Data Is Not a Moat
Three fact patterns indicate that scraped data will not produce defensibility:
- The source is public and accessible to any competitor with modest engineering.
- The data changes slowly enough that a competitor can catch up with a single bulk crawl.
- The product depends on the data being comprehensive rather than fresh, and the sources are finite.
Pharmaceutical databases, patent databases, and SEC filing databases are the canonical examples. Every competitor has access to the same sources. The winners in those categories differentiated on interface, domain expertise, and distribution, not on data possession.
The Spreadsheet Test
Before funding a scraping operation as a strategic asset, run this test. Write down the answers to four questions:
- What is the dataset? (Specific sources, fields, cadence.)
- Could a competent competitor reproduce it in 90 days with a $500,000 budget? If yes, it is not a moat.
- What proprietary signal does your product add on top that the raw data does not contain? That signal is your actual moat candidate.
- Does usage of your product produce more of that signal over time? That is the flywheel test.
If the answers to questions three and four are "nothing" and "no," the scraping operation is a cost of doing business, not a moat. That is fine as long as you budget it that way.
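The test is simple enough to encode as a checklist. The following is a minimal sketch, assuming the four answers are recorded as fields on a small record; the class name, field names, and the verdict labels for the intermediate cases are illustrative choices, not part of any standard framework.

```python
from dataclasses import dataclass

@dataclass
class SpreadsheetTest:
    dataset: str                    # Q1: specific sources, fields, cadence
    reproducible_in_90_days: bool   # Q2: by a competent competitor with $500,000
    proprietary_signal: str         # Q3: leave empty if the answer is "nothing"
    usage_grows_signal: bool        # Q4: the flywheel test

    def assess(self) -> dict:
        has_signal = bool(self.proprietary_signal.strip())
        if not has_signal:
            # With no proprietary signal there is nothing for usage to compound.
            verdict = "cost of doing business"
        elif self.usage_grows_signal:
            verdict = "flywheel candidate"
        else:
            verdict = "moat candidate, no flywheel yet"
        return {
            # Q2 can only rule the raw dataset out; it never rules it in.
            "raw_data_ruled_out_as_moat": self.reproducible_in_90_days,
            "verdict": verdict,
        }

# Hypothetical example: a pricing scraper with no enrichment layer.
scraper = SpreadsheetTest(
    dataset="public pricing pages, daily crawl",
    reproducible_in_90_days=True,
    proprietary_signal="",
    usage_grows_signal=False,
)
print(scraper.assess()["verdict"])  # cost of doing business
```

Recording the answers in a structure like this makes it harder to skip question two, which is the one teams are most tempted to answer optimistically.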
Implications for Infrastructure Choices
Companies building a genuine data flywheel have different proxy infrastructure needs than companies running scraping as a supporting function. Flywheel companies optimize for breadth of sources, uninterrupted coverage, and low failure rates, because a gap in the collection erodes the very asset they are compounding. Supporting-function scrapers optimize for cost per GB and good-enough availability.
Hex Proxies serves both profiles. Flywheel customers typically run on ISP proxies with dedicated allocations for key sources plus a residential layer for rotation-sensitive targets. Supporting-function customers tend to run mixed workloads on shared pools. The product page outlines the SLA and IP trust score differences.
Further Reading
- Hamilton Helmer, 7 Powers: The Foundations of Business Strategy (2016).
- Jerry Chen (Greylock), "The New Moats" (2017), covering data and systems of intelligence as sources of defensibility.
- Tesla Q4 2024 Shareholder Update, vehicle fleet and Autopilot metrics.
- Zillow Group Q3 2021 earnings, Zillow Offers wind-down disclosure.
- NFX, "The Network Effects Manual" (2019).