Vector quantization
Vector quantization compresses embedding vectors (for example, float32 → int8 or binary) to cut memory, disk, and bandwidth use by 4-32×, usually at a modest cost in recall. It is a standard technique for billion-vector workloads in 2026.
A 1536-dimensional float32 embedding takes about 6 KB (1536 × 4 bytes); at 1B vectors that is roughly 6 TB of RAM for the raw vectors alone. Quantization shrinks this, and the three common schemes trade size against recall (the recall figures are typical and vary with the embedding model and dataset):
- Scalar quantization: float32 → int8, 4× smaller, ~99% recall.
- Product quantization (PQ): codebook-based, 8-32× smaller, ~95% recall.
- Binary quantization: 1 bit per dimension, 32× smaller, ~85% recall, but viable with reranking.
Modern vector DBs (Qdrant, Pinecone, Milvus, LanceDB) ship quantization as a config knob. The usual production pattern is to run the ANN search over the quantized vectors, retrieve the top-K candidates, then rerank them with the full-precision vectors or with a [[reranker]]. A well-tuned pipeline of this kind can give < 5 ms p99 query latency on billion-vector workloads with 99%+ recall@10, figures that are hard to reach at that scale without quantization.
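To make the simplest of these schemes concrete, here is a minimal sketch of scalar quantization in plain Python. It assumes one global min/max range fitted over the whole dataset. Real vector DBs may use per-dimension or percentile-based ranges, so treat the function names (`fit_scalar_quantizer`, `quantize`, `dequantize`) as illustrative rather than as any library's API.

```python
import random

def fit_scalar_quantizer(vectors):
    """Return the global (lo, hi) float range that will be mapped onto int8 codes."""
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    return lo, hi

def quantize(vector, lo, hi):
    """Map each float to an integer code in [-128, 127], i.e. one byte per dimension."""
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 if the data is constant
    return [max(-128, min(127, round((x - lo) / scale) - 128)) for x in vector]

def dequantize(codes, lo, hi):
    """Approximate the original floats from the int8 codes."""
    scale = (hi - lo) / 255.0 or 1.0
    return [(c + 128) * scale + lo for c in codes]

if __name__ == "__main__":
    random.seed(0)
    data = [[random.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(100)]
    lo, hi = fit_scalar_quantizer(data)
    codes = quantize(data[0], lo, hi)
    restored = dequantize(codes, lo, hi)
    worst = max(abs(a - b) for a, b in zip(data[0], restored))
    print("codes:", codes)
    print("max reconstruction error:", worst)
```

Each dimension is stored in one byte instead of four, which is where the 4× saving comes from. The reconstruction error per dimension is bounded by half a quantization step, `(hi - lo) / 255 / 2`, for values inside the fitted range.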
When to use vector quantization
- Vector DBs holding more than roughly 10M vectors.
- Memory-constrained or disk-constrained deployments.
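Whether a deployment counts as memory-constrained is mostly arithmetic. The sketch below estimates raw vector storage at different precisions. The helper `index_bytes` is hypothetical and deliberately ignores index overhead (graph links, codebooks, metadata), so real footprints will be somewhat larger.

```python
def index_bytes(num_vectors, dim, bits_per_dim):
    """Raw vector storage in bytes, ignoring index overhead such as graph links."""
    return num_vectors * dim * bits_per_dim // 8

if __name__ == "__main__":
    # 1B vectors of 1536 dimensions, matching the example in the text above.
    for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
        gb = index_bytes(1_000_000_000, 1536, bits) / 1e9
        print(f"{name:8s} {gb:8.0f} GB")
```

For 1B vectors at 1536 dimensions this works out to 6144 GB at float32, 1536 GB at int8, and 192 GB at 1 bit per dimension, which is the 4× and 32× reduction quoted above.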
Common mistakes
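- Quantizing without reranking: nothing corrects the quantization error, so the recall loss carries into the final results and compounds in any downstream stage.
- Picking binary quantization for low-dimensional vectors: the absolute savings are small and do not justify the recall hit.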
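The first mistake can be illustrated with a small experiment: shortlist candidates by Hamming distance over binary codes, then rerank the shortlist with full-precision dot products. This is a brute-force sketch under stated assumptions (sign-based binarization, random Gaussian data, no ANN index), and the names `binarize`, `search`, and `oversample` are illustrative, not taken from any particular vector DB.

```python
import random

def binarize(vector):
    """Pack the sign of each dimension into one Python int (1 bit per dimension)."""
    code = 0
    for x in vector:
        code = (code << 1) | (1 if x > 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query, vectors, k):
    """Ground truth: top-k ids by full-precision dot product."""
    ids = range(len(vectors))
    return sorted(ids, key=lambda i: -dot(query, vectors[i]))[:k]

def search(query, vectors, codes, k, oversample=1):
    """Shortlist k * oversample ids by Hamming distance, then rerank by exact dot product."""
    q_code = binarize(query)
    ids = range(len(vectors))
    shortlist = sorted(ids, key=lambda i: hamming(q_code, codes[i]))[: k * oversample]
    return sorted(shortlist, key=lambda i: -dot(query, vectors[i]))[:k]

def recall(found, truth):
    return len(set(found) & set(truth)) / len(truth)

if __name__ == "__main__":
    random.seed(42)
    dim, n, k = 64, 2000, 10
    data = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    codes = [binarize(v) for v in data]
    query = [random.gauss(0.0, 1.0) for _ in range(dim)]
    truth = exact_top_k(query, data, k)
    for oversample in (1, 5, 20):
        found = search(query, data, codes, k, oversample)
        print(f"oversample={oversample:2d} recall@{k}={recall(found, truth):.2f}")
```

With `oversample=1` the rerank step cannot add any candidate the binary search missed, so it behaves like quantizing without reranking. Larger shortlists can only keep or raise recall in this setup, at the cost of more full-precision comparisons per query.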
Related terms
- ANN index: An ANN (approximate nearest neighbor) index is the data structure inside a vector DB that returns near-best matches very quickly, often in under a millisecond. HNSW, IVF, ScaNN, and DiskANN are popular implementations in 2026.
- Embeddings — Embeddings are dense numeric vectors that represent the meaning of text, images, or other data, allowing similarity to be measured as vector distance.
- Reranker: A reranker is a small cross-encoder model that takes a query and a candidate document and outputs a relevance score. It is used as the second stage after embedding retrieval to push the right answer to the top.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/vector-quantization.md.