Feeding Vector Databases from Web Sources at Scale
Vector databases -- Pinecone, Weaviate, Qdrant, Milvus, pgvector -- are the working memory of modern AI applications. Their query performance is well understood; the operational challenge is on the ingest side. Keeping a billion-vector index fresh, deduplicated, and consistent when the source is the public web is where most teams lose weeks.
This post walks through batch processing, incremental updates, dedupe, and the index-tuning decisions that matter at scale.
Pick the Right Store
A quick comparison of the mainstream options:
- Pinecone: serverless, proprietary, strong filtering, no infra to run. Cost scales with storage and reads.
- Weaviate: open-source, built-in hybrid search, GraphQL API, supports named vectors per object.
- Qdrant: open-source, Rust, excellent filtering and payload indexing, strong scalar-quantization story.
- Milvus: open-source, C++, designed for large indexes (10B+), richer index type selection.
- pgvector: Postgres extension, good up to tens of millions of vectors, trivial to operate if you already run Postgres.
For web-sourced corpora in the 10M-500M vector range with heavy metadata filtering, Qdrant and Weaviate are the common picks. Above 1B vectors, Milvus or a sharded Pinecone setup is the more usual choice.
Embedding Batch Processing
With a hosted embedding API, throughput is dominated by request latency, not local compute. At 1000 documents per API call with text-embedding-3-large, expect roughly 200-500 docs/sec from a single process. Throughput scales roughly linearly with concurrent connections until you hit your provider's rate limit. A minimal pattern:
import asyncio
from openai import AsyncOpenAI
client = AsyncOpenAI()
sem = asyncio.Semaphore(20)  # cap on concurrent requests
async def embed_batch(texts: list[str]) -> list[list[float]]:
    # Hold a semaphore slot only for the duration of the API call
    async with sem:
        r = await client.embeddings.create(
            model="text-embedding-3-large",
            input=texts,
            dimensions=1024,  # matryoshka truncation
        )
    return [d.embedding for d in r.data]
Three tunables dominate throughput and cost: batch size (up to 2048 inputs per request for OpenAI), concurrency, and dimension. Truncating the native 3072 dimensions of text-embedding-3-large to 1024 or 512 via matryoshka cuts storage 3-6x with small retrieval loss.
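To show how batch size and concurrency fit together, here is a minimal sketch of a driver that splits a corpus into batches and embeds them concurrently. The names make_batches and embed_all are illustrative, not part of any library, and the sketch assumes the concurrency cap lives inside the embedding function you pass in (as the semaphore does in embed_batch above).

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Any coroutine function that maps a batch of texts to a batch of vectors.
EmbedFn = Callable[[list[str]], Awaitable[list[list[float]]]]

def make_batches(texts: Sequence[str], batch_size: int) -> list[list[str]]:
    """Split texts into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [list(texts[i:i + batch_size]) for i in range(0, len(texts), batch_size)]

async def embed_all(
    texts: Sequence[str], embed_fn: EmbedFn, batch_size: int = 1000
) -> list[list[float]]:
    """Embed every text and return vectors in input order.

    Assumption: embed_fn limits its own concurrency (for example with a
    semaphore), so scheduling all batches at once is safe.
    """
    batches = make_batches(texts, batch_size)
    # asyncio.gather returns results in the order the awaitables were passed.
    per_batch = await asyncio.gather(*(embed_fn(b) for b in batches))
    return [vec for batch in per_batch for vec in batch]
```

Usage would look like vectors = asyncio.run(embed_all(texts, embed_batch)). For very large corpora you would likely feed batches from a queue instead of materializing them all in memory.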
Ingestion Pipeline Stages
- Source fetch: fetch documents with proxies to avoid per-IP rate limits
- Extract: convert HTML/PDF to clean text (trafilatura, pdfplumber)
- Dedupe: SHA-256 exact and MinHash near-dedupe
- Chunk: 512-1024 token chunks with metadata preserved
- Embed: batched API calls
- Upsert: write to vector DB with stable IDs
- Verify: sample a handful of queries, compare to prior index
Keep each stage idempotent and restartable. Use a work queue (Redis, SQS, Kafka) between stages so a failure in embedding does not force a re-fetch.
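A minimal sketch of what idempotent and restartable can mean for a single stage follows. It uses in-memory deques as stand-ins for a real queue (Redis, SQS, Kafka) and a plain set as a stand-in for a durable record of processed IDs; run_stage and the item shape are assumptions made for illustration.

```python
from collections import deque
from typing import Callable

def run_stage(
    inbox: deque, outbox: deque, done: set[str], work: Callable[[dict], dict]
) -> int:
    """Drain inbox, skipping items whose id was already processed.

    Returns the number of items actually processed on this run.
    """
    processed = 0
    while inbox:
        item = inbox[0]                  # peek; remove ("ack") only after success
        if item["id"] not in done:
            outbox.append(work(item))    # if work raises, the item stays queued
            done.add(item["id"])         # mark done once the output is enqueued
            processed += 1
        inbox.popleft()
    return processed
```

Because an item is removed only after its output is enqueued, a crash mid-stage leaves the item in place for the next run, and replayed messages with a known ID become no-ops.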
Stable IDs and Upserts
Every chunk needs a stable ID derived from content and source:
import hashlib
def chunk_id(source_url: str, chunk_index: int, chunk_text: str) -> str:
    h = hashlib.sha256()
    # Separate fields with a NUL byte so ("a1", 2) and ("a", 12) cannot collide
    for part in (source_url, str(chunk_index), chunk_text):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()
This lets you upsert safely on re-runs. If the content changed, the hash changes, you get a new ID, and the old chunk can be garbage-collected by URL.
Incremental Updates
Full re-ingestion becomes impractical above tens of millions of vectors: every run pays the full fetch and embedding cost again. Incremental patterns:
- ETag/Last-Modified check: skip unchanged sources
- Sitemap lastmod: use the source's own freshness signal
- Content-hash check: normalize → hash → compare to stored hash
- Delta index: write new vectors to a shadow index, swap atomically
When a document changes, delete all prior chunks for that URL before writing the new ones. Most vector DBs support filter-based delete (where source_url = …) which makes this cheap.
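The content-hash check and the delete-before-write rule can be sketched together as follows. InMemoryIndex is a hypothetical stand-in for a vector DB client, not a real API, and the light normalization (whitespace collapse plus casefold) is only one possible choice; chunk IDs are simplified to url#index here to keep the example short.

```python
import hashlib
import re

def content_hash(text: str) -> str:
    """Hash text after light normalization (collapse whitespace, casefold)."""
    normalized = re.sub(r"\s+", " ", text).strip().casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class InMemoryIndex:
    """Hypothetical stand-in for a vector DB client; stores chunks by id."""

    def __init__(self) -> None:
        self.chunks: dict[str, dict] = {}

    def delete_by_url(self, url: str) -> int:
        # Mirrors a filter-based delete on the source_url payload field.
        stale = [cid for cid, c in self.chunks.items() if c["source_url"] == url]
        for cid in stale:
            del self.chunks[cid]
        return len(stale)

    def upsert(self, cid: str, url: str, text: str) -> None:
        self.chunks[cid] = {"source_url": url, "text": text}

def sync_document(
    index: InMemoryIndex, seen: dict[str, str], url: str, text: str, chunks: list[str]
) -> bool:
    """Re-index url only if its normalized content changed. True if written."""
    digest = content_hash(text)
    if seen.get(url) == digest:
        return False                     # unchanged: skip embed and upsert
    index.delete_by_url(url)             # drop every prior chunk for this URL
    for i, chunk in enumerate(chunks):
        index.upsert(f"{url}#{i}", url, chunk)
    seen[url] = digest                   # record the hash only after writing
    return True
```

In a real pipeline the seen mapping would live in durable storage, and the embedding call would sit between the hash check and the upsert so unchanged documents never incur embed cost.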
Dedupe at Vector Scale
Hash-level dedupe catches byte-exact duplicates. Near-duplicates (mirror sites, templated pages, boilerplate headers) require more work. Two approaches:
- MinHash at ingest: compute MinHash signatures before embedding; reject docs with >0.8 Jaccard against an existing doc. Cheap, avoids wasted embed cost.
- Semantic dedupe at embed time: after embedding, compare to nearest neighbor; if cosine > 0.97, drop or link rather than insert. SemDeDup (arXiv:2303.09540) formalized this approach.
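For the MinHash option, a minimal pure-Python sketch is shown below. It uses word-level 5-shingles and 128 salted BLAKE2b hashes; both are illustrative choices rather than recommendations, and the function names are hypothetical. A production pipeline would more likely use a dedicated library and LSH banding instead of the linear scan in is_near_duplicate.

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Word-level k-shingles of lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(text: str, num_perm: int = 128, k: int = 5) -> list[int]:
    """Signature: for each salted hash function, the minimum hash over shingles."""
    encoded = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")   # a distinct salt acts as a distinct hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big"
            )
            for s in encoded
        ))
    return signature

def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def is_near_duplicate(
    candidate: list[int], existing: list[list[int]], threshold: float = 0.8
) -> bool:
    """Linear scan over stored signatures; fine for a sketch, slow at scale."""
    return any(estimated_jaccard(candidate, sig) > threshold for sig in existing)
```

The estimate carries sampling error that shrinks as num_perm grows, so a threshold of 0.8 on the estimate is not an exact cut on the true Jaccard similarity.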
Index Configuration
Vector index tuning is the second-biggest lever after embedding quality. The common families:
- HNSW (Malkov & Yashunin, arXiv:1603.09320): hierarchical navigable small world, default in most DBs. Parameters M (graph degree) and efConstruction affect build cost and recall.
- IVF_PQ: inverted file with product quantization. Smaller memory footprint, lower recall at the same speed.
- DiskANN: disk-resident graph index; good for budgets where RAM cannot hold the full set.
For most web-corpus workloads, HNSW with M=16-32 and efConstruction=200 is a reasonable starting point. At query time, ef trades latency for recall.
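Tuning these parameters needs a recall measurement to tune against. One common approach, sketched here under the assumption that you can afford exact brute-force search on a small sample of queries, is recall@k: the share of the true top-k neighbors that the approximate index also returns. The function name and input shape are illustrative.

```python
def recall_at_k(
    approx_ids: list[list[str]], exact_ids: list[list[str]], k: int
) -> float:
    """Mean fraction of the exact top-k ids that the ANN index also returned.

    approx_ids[i] and exact_ids[i] are ranked result ids for query i.
    """
    if len(approx_ids) != len(exact_ids) or not exact_ids:
        raise ValueError("need one result list per query on both sides")
    total = 0.0
    for approx, exact in zip(approx_ids, exact_ids):
        truth = set(exact[:k])
        if not truth:
            raise ValueError("exact results must not be empty")
        total += len(truth & set(approx[:k])) / len(truth)
    return total / len(exact_ids)
```

Sweeping ef over a few values and recording recall@k next to p95 latency gives a concrete curve from which to pick an operating point.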
Quantization
Raw storage for 100M 1536-dim float32 vectors is about 614 GB (100M x 1536 x 4 bytes), before any index overhead. Quantization options:
- Scalar quantization (int8): 4x smaller, ~1% recall loss.
- Binary quantization: 32x smaller, ~5-10% recall loss, can rerank with full-precision top-K.
- Product quantization: tunable compression with variable recall cost.
- Matryoshka truncation: store only the first 512 or 1024 dims of a larger embedding.
Combine approaches: starting from a 3072-dim float32 embedding, matryoshka truncation to 1024 dims (3x) plus scalar quantization (4x) gives 12x compression, often with under 2% nDCG loss.
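The storage figures in this section follow from simple arithmetic, which the sketch below reproduces. It counts raw vector bytes only and ignores graph links, payloads, and other index overhead; the 12x ratio assumes the starting point is the full 3072-dim float32 output of text-embedding-3-large. The helper name index_bytes is illustrative.

```python
def index_bytes(n_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw vector storage in bytes, excluding any index or payload overhead."""
    return n_vectors * dims * bits_per_dim // 8

N = 100_000_000

float32_1536 = index_bytes(N, 1536, 32)   # 614_400_000_000 bytes, about 614 GB
int8_1536 = index_bytes(N, 1536, 8)       # scalar quantization
binary_1536 = index_bytes(N, 1536, 1)     # binary quantization

scalar_ratio = float32_1536 / int8_1536   # 4.0
binary_ratio = float32_1536 / binary_1536 # 32.0

# Matryoshka truncation plus int8, measured against full 3072-dim float32.
float32_3072 = index_bytes(N, 3072, 32)
int8_1024 = index_bytes(N, 1024, 8)
combined_ratio = float32_3072 / int8_1024 # 12.0
```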
Metadata and Filtering
Retrieval without filters is rarely enough. Typical payload fields: source URL, domain, language, published date, category, region. Index the fields you filter on -- Qdrant's payload index and Weaviate's inverted index both handle high-cardinality fields well. Watch the extremes, though: a very selective filter (one that matches only a handful of vectors) can make filtered graph traversal slower than brute-force scoring of just the matching vectors, so check how your engine plans such queries.
Ingest Throughput and Proxies
The fetch stage is usually the bottleneck when the corpus comes from the web. Per-source rate limits often cap single-IP throughput at a few dozen requests per minute. Distributing requests across a proxy pool can push aggregate throughput into the thousands per minute while staying polite per host.
import asyncio
import itertools
import httpx
PROXIES = [f"http://user-{i}:pass@gate.hexproxies.com:7777" for i in range(200)]
proxy_cycle = itertools.cycle(PROXIES)
async def fetch(url):
    # httpx binds the proxy to the client, so per-request rotation means one
    # client per fetch. The proxy= keyword needs httpx 0.26 or newer.
    async with httpx.AsyncClient(proxy=next(proxy_cycle), timeout=30) as c:
        r = await c.get(url)
    return r.text if r.status_code == 200 else None
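Rotating proxies spreads load across IPs, but it does not by itself keep the crawl polite toward each target host. A minimal sketch of a per-host limiter follows; PerHostLimiter and polite_gather are hypothetical names, and the one-second default interval is an arbitrary assumption that should be set per site.

```python
import asyncio
import time
from typing import Awaitable, Callable
from urllib.parse import urlsplit

class PerHostLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._next_ok: dict[str, float] = {}       # host -> earliest next start
        self._locks: dict[str, asyncio.Lock] = {}  # host -> serialization lock

    async def wait(self, url: str) -> None:
        host = urlsplit(url).hostname or ""
        lock = self._locks.setdefault(host, asyncio.Lock())
        async with lock:                 # callers for one host queue up here
            now = time.monotonic()
            delay = self._next_ok.get(host, now) - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._next_ok[host] = time.monotonic() + self.min_interval

async def polite_gather(
    limiter: PerHostLimiter,
    urls: list[str],
    fetch: Callable[[str], Awaitable],
) -> list:
    """Fetch all URLs concurrently while spacing requests per host."""
    async def one(url: str):
        await limiter.wait(url)
        return await fetch(url)
    return await asyncio.gather(*(one(u) for u in urls))
```

Requests to different hosts proceed in parallel; only requests to the same host are spaced out. The fetch coroutine above can be passed in directly as the fetch argument.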
Observability
Instrument per stage:
- Fetch: URLs/sec, HTTP status histogram, proxy region health
- Extract: docs/sec, extraction failures by MIME type
- Dedupe: duplication ratio
- Embed: tokens/sec, API error rate, cost/hour
- Upsert: vectors/sec, index size, index build time
- Retrieval quality: offline eval on a fixed query set, tracked per ingest run
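Most of the per-stage metrics above reduce to a count, a duration, and a histogram of outcomes. A minimal sketch of such a recorder is shown below, assuming nothing about your metrics backend; StageMetrics is an illustrative name, and a real deployment would export these values to whatever monitoring system is already in place.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class StageMetrics:
    """Accumulates item count, busy time, and outcome counts for one stage."""
    name: str
    items: int = 0
    seconds: float = 0.0
    outcomes: Counter = field(default_factory=Counter)

    def record(self, outcome: str, elapsed: float) -> None:
        """Record one processed item, e.g. outcome="200" or "duplicate"."""
        self.items += 1
        self.seconds += elapsed
        self.outcomes[outcome] += 1

    @property
    def rate(self) -> float:
        """Items per second of summed busy time (not wall-clock time)."""
        return self.items / self.seconds if self.seconds else 0.0

    def ratio(self, outcome: str) -> float:
        """Share of items with the given outcome, e.g. the duplication ratio."""
        return self.outcomes[outcome] / self.items if self.items else 0.0
```

Note that rate divides by summed per-item time, so under concurrency it understates wall-clock throughput; track wall-clock separately if that is the number you alert on.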
Closing
The retrieval side of vector databases is largely solved. The ingest side is where production work lives: idempotent stages, stable IDs, smart incremental updates, aggressive dedupe, and a fetch layer that does not fall over when the target site throttles. Get those right and the database mostly takes care of itself.
For related patterns see our RAG data pipeline guide.