
Vector quantization

Vector quantization compresses embedding vectors (e.g., float32 → int8 or binary) to cut memory, disk, and bandwidth use by 4-32×, usually at the cost of a small recall loss. It is essential for billion-vector workloads in 2026.

A 1536-dimensional float32 embedding takes about 6 KB (1536 × 4 bytes); at 1B vectors that is roughly 6 TB of RAM. Quantization shrinks this footprint in three common ways: scalar quantization (float32 → int8, 4× smaller, ~99% recall), product quantization (PQ, codebook-based, 8-32× smaller, ~95% recall), and binary quantization (1 bit/dim, 32× smaller, ~85% recall, but viable with reranking). Modern vector DBs (Qdrant, Pinecone, Milvus, LanceDB) ship quantization as a config knob. The usual production pattern is to store quantized vectors for the ANN search, retrieve the top-K candidates, then rerank them with the full-precision vectors or with a [[reranker]]. This pipeline can deliver < 5ms p99 query latency on billion-vector workloads with 99%+ recall@10, which is impractical without quantization.
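The quantize, search, and rerank pattern can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not how any particular vector DB implements it: the function names (`fit_scalar_quantizer`, `quantize`, `dequantize`, `search_with_rerank`) and the `oversample` parameter are hypothetical, the codes are stored as unsigned 8-bit values (still one byte per dimension), and a brute-force scan stands in for a real ANN index.

```python
import numpy as np

def fit_scalar_quantizer(vectors):
    """Return (lo, scale) mapping each dimension's range onto 0..255."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # guard against constant dimensions
    return lo, scale

def quantize(vectors, lo, scale):
    """float32 -> uint8 codes, one byte per dimension (4x smaller)."""
    codes = np.round((vectors - lo) / scale)
    return np.clip(codes, 0, 255).astype(np.uint8)

def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original float32 vectors."""
    return codes.astype(np.float32) * scale + lo

def search_with_rerank(query, codes, lo, scale, full_vectors, k=10, oversample=4):
    """Two-stage search: quantized candidates, then full-precision rerank."""
    # Stage 1: approximate inner-product scores from the quantized store.
    # (Brute force here; an ANN index would play this role in production.)
    approx = dequantize(codes, lo, scale) @ query
    n_candidates = min(k * oversample, len(codes))
    candidates = np.argpartition(-approx, n_candidates - 1)[:n_candidates]
    # Stage 2: rescore only the candidates with the full-precision vectors.
    exact = full_vectors[candidates] @ query
    return candidates[np.argsort(-exact)[:k]]

# Toy data standing in for real embeddings (assumption: inner-product similarity).
rng = np.random.default_rng(0)
full_vectors = rng.standard_normal((1000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)

lo, scale = fit_scalar_quantizer(full_vectors)
codes = quantize(full_vectors, lo, scale)
top_ids = search_with_rerank(query, codes, lo, scale, full_vectors, k=10)
print(codes.nbytes, full_vectors.nbytes)  # quantized store vs. full-precision store
```

Fetching more candidates than `k` in the first stage (here 4×) gives the rerank step room to recover neighbors that the quantized scores ranked slightly too low. In a real deployment the full-precision vectors would typically live on disk or in slower storage, since only a handful are read per query.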

When to use vector quantization

Common mistakes

FAQ

What is vector quantization?

Vector quantization compresses embedding vectors (e.g., float32 → int8 or binary) to cut memory, disk, and bandwidth use by 4-32×, usually at the cost of a small recall loss. It is essential for billion-vector workloads in 2026.

When should I use vector quantization?

Use it when a vector DB holds more than about 10M vectors, or when the deployment is memory-constrained or disk-constrained.
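To judge whether a deployment is memory-constrained, a quick estimate of raw vector storage helps. The sketch below is plain arithmetic; `footprint_bytes` is a hypothetical helper, and it counts only the vectors themselves, not index overhead such as graph links. Product quantization is left out because its size depends on the codebook configuration.

```python
def footprint_bytes(num_vectors, dim, bits_per_dim):
    """Raw storage for the vectors alone, excluding any index structures."""
    return num_vectors * dim * bits_per_dim // 8

# Bits per dimension for the encodings with a fixed size per dimension.
ENCODINGS = [("float32", 32), ("int8", 8), ("binary", 1)]

for num_vectors in (10_000_000, 1_000_000_000):
    for name, bits in ENCODINGS:
        size_gb = footprint_bytes(num_vectors, 1536, bits) / 1e9
        print(f"{num_vectors:>13,} vectors  {name:8s} {size_gb:10.2f} GB")
```

For 1536-dimensional vectors this works out to about 61 GB of float32 data at 10M vectors and about 6.1 TB at 1B; int8 brings the 1B case down to roughly 1.5 TB and binary to roughly 0.19 TB.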

What are the most common mistakes with vector quantization?

Quantizing without reranking, which lets the recall loss compound. Picking binary quantization for tiny vectors, where the savings don't justify the recall hit.
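The first mistake can be seen in a small experiment. The sketch below is an illustration under assumptions, not a benchmark: it uses sign-based binary codes, Hamming distance, and random toy data, and the names `binarize`, `hamming_distances`, `binary_search`, and `recall_at_k` are hypothetical. It compares the neighbors returned by binary codes alone against the same search followed by a full-precision rerank.

```python
import numpy as np

# Lookup table: number of set bits for every possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def binarize(vectors):
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    return np.packbits(vectors > 0, axis=-1)

def hamming_distances(query_code, codes):
    """Hamming distance from one packed code to every row of `codes`."""
    return POPCOUNT[np.bitwise_xor(codes, query_code)].sum(axis=1)

def binary_search(query, codes, full_vectors, k=10, rerank=True, oversample=4):
    """Top-k by Hamming distance, optionally reranked at full precision."""
    dists = hamming_distances(binarize(query), codes)
    n_candidates = min(k * oversample, len(codes)) if rerank else k
    candidates = np.argsort(dists, kind="stable")[:n_candidates]
    if not rerank:
        return candidates
    # Rescore the oversampled candidates with the original float32 vectors.
    exact = full_vectors[candidates] @ query
    return candidates[np.argsort(-exact)[:k]]

def recall_at_k(found, truth):
    """Fraction of the true top-k that the search returned."""
    return len(set(found.tolist()) & set(truth.tolist())) / len(truth)

# Toy data standing in for real embeddings (assumption: inner-product similarity).
rng = np.random.default_rng(0)
full_vectors = rng.standard_normal((2000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)
codes = binarize(full_vectors)

truth = np.argsort(-(full_vectors @ query))[:10]
no_rerank = binary_search(query, codes, full_vectors, rerank=False)
with_rerank = binary_search(query, codes, full_vectors, rerank=True)
print("recall@10 without rerank:", recall_at_k(no_rerank, truth))
print("recall@10 with rerank:   ", recall_at_k(with_rerank, truth))
```

Because the reranked search draws from a superset of the candidates the binary-only search returns, its recall cannot be lower; how much it gains depends on the data and on `oversample`. Random Gaussian vectors are a hard case for binary codes, so the numbers printed here should not be read as typical for real embeddings.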

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/vector-quantization.md.