Embedding Quantization
Reading time: 22 min | Relevance: AI Engineer, ML Engineer
The Terabyte Vector Store
You're building a semantic search system for a document platform with 500 million documents. Each document produces a 1536-dimensional float32 embedding. Storage: 500M × 1536 × 4 bytes = 3 terabytes for embeddings alone. Vector similarity search over 3 TB is expensive, slow, and requires specialized hardware. Your engineering team estimates $50,000/month for the vector database cluster.
Now apply binary quantization. Each 1536-dim float32 embedding (6,144 bytes) becomes a 1536-bit binary vector (192 bytes) - a 32× compression. Storage drops to 94 GB. Binary similarity search (Hamming distance) is implemented in hardware as POPCOUNT operations - 25-40× faster than float32 dot products. The database cost drops to $3,000/month. Quality loss: approximately 2% if you add a float32 rescoring step over the top-k binary results.
Quantization is one of the highest-leverage optimizations available for embedding systems. It requires no retraining (for float16 and int8) or minimal modification (for binary). This lesson covers the full quantization stack from float16 through binary, with practical implementation in FAISS and Qdrant.
Why Quantize?
Embedding vectors are stored as 32-bit floats (float32) by default. Each float32 uses 4 bytes. A 1536-dimensional embedding uses 6,144 bytes (6 KB). At scale:
| Documents | float32 storage | GPU memory for search |
|---|---|---|
| 1 million | 6 GB | 6 GB (GPU) |
| 10 million | 60 GB | 60 GB (multiple GPUs) |
| 100 million | 600 GB | Distributed cluster |
| 1 billion | 6 TB | Impractical on GPU |
Storage cost is one concern. Search speed is another. ANN (Approximate Nearest Neighbor) search speed scales with both the number of dimensions and the number of bytes per dimension. Quantization reduces both.
Float16: The Free Lunch
The easiest quantization step: convert float32 to float16 (16-bit floating point). Float16 uses 2 bytes vs 4 bytes - exactly 2× compression. Quality loss is negligible for most embedding applications.
Why it's essentially free
Float32 uses 23 bits of mantissa precision. Float16 uses 10 bits. For embeddings, the extra precision in float32 is rarely meaningful. The rank ordering of similarity scores changes almost never when converting between float32 and float16.
import numpy as np
def float32_to_float16(embeddings: np.ndarray) -> np.ndarray:
"""
Convert float32 embeddings to float16.
2× storage reduction, negligible quality loss.
"""
return embeddings.astype(np.float16)
def measure_quality_loss_fp16(
embeddings_f32: np.ndarray, # (n_docs, dim) float32
n_test_queries: int = 100,
) -> dict:
"""
Measure the quality loss from float16 quantization.
Compares top-k rankings between float32 and float16.
"""
embeddings_f16 = float32_to_float16(embeddings_f32)
rng = np.random.default_rng(42)
test_indices = rng.choice(len(embeddings_f32), n_test_queries, replace=False)
rank_changes = []
for idx in test_indices:
q32 = embeddings_f32[idx]
q16 = embeddings_f16[idx]
sims_f32 = (embeddings_f32 @ q32) / (
np.linalg.norm(embeddings_f32, axis=1) * np.linalg.norm(q32) + 1e-10
)
sims_f16 = (embeddings_f16.astype(np.float32) @ q16.astype(np.float32)) / (
np.linalg.norm(embeddings_f16.astype(np.float32), axis=1) *
np.linalg.norm(q16.astype(np.float32)) + 1e-10
)
rank_f32 = np.argsort(-sims_f32)[:10]
rank_f16 = np.argsort(-sims_f16)[:10]
overlap = len(set(rank_f32) & set(rank_f16))
rank_changes.append(overlap / 10)
return {
"top10_overlap_mean": float(np.mean(rank_changes)),
"storage_bytes_f32": embeddings_f32.nbytes,
"storage_bytes_f16": embeddings_f16.nbytes,
"compression_ratio": embeddings_f32.nbytes / embeddings_f16.nbytes,
}
# In practice: almost always > 0.99 overlap
# float16 is a drop-in replacement for embeddings
Using float16 in FAISS
import faiss
import numpy as np
# Standard float32 index
def build_f32_index(embeddings: np.ndarray) -> faiss.IndexFlatIP:
dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim)
index.add(embeddings.astype(np.float32))
return index
# Quantized IVF index with float16 storage
def build_ivf_f16_index(
embeddings: np.ndarray,
n_cells: int = 1024, # Number of Voronoi cells
) -> faiss.Index:
"""
IVF index with float16 internal storage.
~2× storage savings vs float32 with minimal quality loss.
"""
dim = embeddings.shape[1]
# Quantizer: flat IP index for the coarse quantizer
quantizer = faiss.IndexFlatIP(dim)
# IVFPQ: IVF with product quantization (further compression)
# IndexIVFFlat with float16 training
index = faiss.IndexIVFFlat(quantizer, dim, n_cells)
index.train(embeddings.astype(np.float32))
index.add(embeddings.astype(np.float32))
index.nprobe = 64 # Number of cells to search (quality/speed trade-off)
return index
Int8 Quantization: 4× Compression
Int8 quantization maps float32 values to 8-bit integers in the range . This achieves 4× storage compression with roughly 1-2% quality loss on retrieval tasks.
The quantization scheme
Linear quantization maps a float32 range to :
For embeddings, the standard approach uses the empirical range of the embedding values:
import numpy as np
from dataclasses import dataclass
@dataclass
class QuantizationParams:
scale: float
zero_point: float
def quantize_embeddings_int8(
embeddings: np.ndarray, # (n_docs, dim) float32
percentile_clip: float = 99.5, # Clip outliers before quantizing
) -> tuple[np.ndarray, QuantizationParams]:
"""
Quantize float32 embeddings to int8.
Returns quantized embeddings and parameters needed for dequantization.
"""
# Clip extreme values (outliers can hurt quantization quality)
max_val = np.percentile(np.abs(embeddings), percentile_clip)
clipped = np.clip(embeddings, -max_val, max_val)
# Compute scale: map [-max_val, max_val] to [-128, 127]
scale = max_val / 127.0
zero_point = 0.0
# Quantize
quantized = np.round(clipped / scale).astype(np.int8)
params = QuantizationParams(scale=scale, zero_point=zero_point)
return quantized, params
def dequantize_embeddings(
quantized: np.ndarray, # (n_docs, dim) int8
params: QuantizationParams,
) -> np.ndarray:
"""Convert int8 back to float32."""
return quantized.astype(np.float32) * params.scale + params.zero_point
def int8_similarity_search(
query_f32: np.ndarray, # (dim,) float32 query
corpus_int8: np.ndarray, # (n_docs, dim) int8 corpus
params: QuantizationParams,
k: int = 10,
) -> tuple[np.ndarray, np.ndarray]:
"""
Similarity search using int8 embeddings.
Dequantize to float32 for dot product computation.
"""
# Dequantize and compute dot products
corpus_f32 = dequantize_embeddings(corpus_int8, params)
sims = corpus_f32 @ query_f32
top_k_indices = np.argsort(-sims)[:k]
return top_k_indices, sims[top_k_indices]
# Storage comparison
n_docs = 1_000_000
dim = 1536
bytes_f32 = n_docs * dim * 4 # 4 bytes per float32
bytes_int8 = n_docs * dim * 1 # 1 byte per int8
print(f"1M documents, 1536 dims:")
print(f" float32: {bytes_f32 / 1e9:.2f} GB")
print(f" int8: {bytes_int8 / 1e9:.2f} GB (4× reduction)")
Int8 with FAISS and Qdrant
import qdrant_client
from qdrant_client.http import models as qdrant_models
# Qdrant with int8 quantization
def setup_qdrant_int8(
collection_name: str = "embeddings",
dim: int = 1536,
):
"""Configure Qdrant with int8 scalar quantization."""
client = qdrant_client.QdrantClient("localhost", port=6333)
client.create_collection(
collection_name=collection_name,
vectors_config=qdrant_models.VectorParams(
size=dim,
distance=qdrant_models.Distance.COSINE,
),
quantization_config=qdrant_models.ScalarQuantization(
scalar=qdrant_models.ScalarQuantizationConfig(
type=qdrant_models.ScalarType.INT8,
quantile=0.99, # Clip 1% of outliers
always_ram=True, # Keep quantized vectors in RAM
)
),
)
return client
Binary Quantization: 32× Compression
Binary quantization is the most aggressive common quantization strategy: convert each float32 value to a single bit based on sign.
A 1536-dimensional float32 embedding (6,144 bytes) becomes a 1536-bit binary vector (192 bytes) - 32× compression.
The Hamming distance trick
With binary embeddings, similarity is measured using Hamming distance: the number of bit positions where two vectors differ.
where is XOR and popcount counts set bits. This is the complement of dot product in binary space (more matching bits = higher similarity).
POPCOUNT is a hardware instruction on modern CPUs and GPUs - extremely fast. This makes binary embedding search 25-40× faster than float32 dot product search.
import numpy as np
def binarize_embeddings(embeddings: np.ndarray) -> np.ndarray:
"""
Binarize float32 embeddings by taking the sign.
Returns boolean array (True = positive, False = negative).
Packs bits for memory efficiency using np.packbits.
"""
# Sign-based binarization: positive → 1, non-positive → 0
binary = (embeddings > 0)
# Pack bits: 8 booleans → 1 uint8
packed = np.packbits(binary, axis=-1) # (n_docs, ceil(dim/8))
return packed
def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
"""
Compute Hamming distance between two packed binary vectors.
Uses XOR + popcount (bit counting).
"""
xor = np.bitwise_xor(a, b)
# Count set bits in each byte
bit_count = np.unpackbits(xor).sum()
return int(bit_count)
def batch_hamming_similarity(
query_packed: np.ndarray, # (n_bytes,) packed query
corpus_packed: np.ndarray, # (n_docs, n_bytes) packed corpus
) -> np.ndarray:
"""
Compute Hamming similarity (1 - normalized hamming distance)
between a query and all corpus vectors.
"""
# XOR query with each corpus vector
xor = np.bitwise_xor(corpus_packed, query_packed[np.newaxis, :])
# Count set bits (Hamming distance)
hamming_dists = np.unpackbits(xor, axis=-1).sum(axis=-1)
dim = corpus_packed.shape[1] * 8
# Convert to similarity: 1 - (hamming_dist / dim)
return 1.0 - hamming_dists / dim
# FAISS binary index
def build_binary_faiss_index(binary_packed: np.ndarray) -> faiss.IndexBinaryFlat:
"""
Build a FAISS binary index for Hamming distance search.
"""
dim_bits = binary_packed.shape[1] * 8 # Total bits
index = faiss.IndexBinaryFlat(dim_bits)
index.add(binary_packed)
return index
def binary_search(
query_binary: np.ndarray,
index: faiss.IndexBinaryFlat,
k: int,
) -> tuple[np.ndarray, np.ndarray]:
"""
Search binary index. Returns Hamming distances and indices.
"""
hamming_dists, indices = index.search(
query_binary.reshape(1, -1), k
)
return indices[0], hamming_dists[0]
Why binary works: the sign preserves direction
Binary quantization loses almost all magnitude information. How can it possibly work for retrieval?
The key insight: for normalized embeddings (L2-normalized to unit sphere), cosine similarity only depends on angle - not magnitude. The angle between two vectors is approximately preserved when you project them to binary space via sign:
This relationship holds reasonably well for the types of embeddings produced by modern contrastive learning - the sign carries most of the directional information.
However, "approximately preserved" doesn't mean "perfectly preserved." Binary Hamming similarity has ~30-40% lower absolute accuracy than float32 cosine similarity on standard benchmarks. This is where rescoring saves us.
The Rescoring Pattern: Binary First Pass + Float32 Reranking
The production pattern for binary quantization:
- Binary first pass: Search the binary index for top-1000 candidates using Hamming distance. This is 25-40× faster than float32 search.
- Float32 rescore: Load the float32 embeddings for these 1000 candidates and compute exact cosine similarity with the float32 query embedding.
- Return top-k: Return the top-k from the rescored candidates.
class BinaryRescoreIndex:
"""
Two-stage retrieval: binary Hamming search → float32 reranking.
Achieves near float32 quality at ~5-10% of the search time.
"""
def __init__(self, first_stage_k: int = 1000):
self.first_stage_k = first_stage_k
self.binary_index = None
self.float32_embeddings = None
def build(self, embeddings_f32: np.ndarray):
"""
Index embeddings in both binary and float32 format.
"""
# Normalize before binarization (critical for quality)
norms = np.linalg.norm(embeddings_f32, axis=1, keepdims=True)
embeddings_normalized = embeddings_f32 / norms.clip(min=1e-10)
# Binarize
binary = (embeddings_normalized > 0).astype(np.uint8)
binary_packed = np.packbits(binary, axis=-1) # (n_docs, dim/8)
# Store both
dim_bits = binary_packed.shape[1] * 8
self.binary_index = faiss.IndexBinaryFlat(dim_bits)
self.binary_index.add(binary_packed)
# Store normalized float32 for rescoring
self.float32_embeddings = embeddings_normalized
print(f"Indexed {len(embeddings_f32)} documents")
print(f"Binary index: {binary_packed.nbytes / 1e9:.3f} GB")
print(f"Float32 backup: {embeddings_normalized.nbytes / 1e9:.3f} GB")
print(f"Total: {(binary_packed.nbytes + embeddings_normalized.nbytes) / 1e9:.3f} GB")
print(f"vs Float32-only: {embeddings_f32.nbytes / 1e9:.3f} GB")
def search(
self,
query_f32: np.ndarray, # (dim,) float32 query
k: int = 10,
) -> tuple[np.ndarray, np.ndarray]:
"""
Two-stage search: binary first pass → float32 reranking.
"""
# Normalize query
query_normalized = query_f32 / (np.linalg.norm(query_f32) + 1e-10)
# Stage 1: Binary search
query_binary = (query_normalized > 0).astype(np.uint8)
query_packed = np.packbits(query_binary).reshape(1, -1)
candidate_indices, _ = self.binary_index.search(
query_packed, self.first_stage_k
)
candidates = candidate_indices[0]
# Stage 2: Float32 rescore on candidates
candidate_embs = self.float32_embeddings[candidates] # (first_stage_k, dim)
float32_sims = candidate_embs @ query_normalized # (first_stage_k,)
# Sort candidates by float32 similarity
rerank_order = np.argsort(-float32_sims)
top_k_indices = candidates[rerank_order[:k]]
top_k_sims = float32_sims[rerank_order[:k]]
return top_k_indices, top_k_sims
Storage Comparison: 1M Vectors at Different Quantizations
def storage_comparison_table():
n_docs = 1_000_000
dim = 1536
configs = [
("float32", 4, "Full precision"),
("float16", 2, "Half precision, no quality loss"),
("int8", 1, "~1-2% quality loss"),
("binary", 1/32, "~30% quality loss alone, ~2% with rescoring"),
]
print(f"{'Dtype':<12} {'Bytes/dim':<12} {'Total GB':>10} {'Vs float32':>12} Notes")
print("-" * 70)
f32_size = n_docs * dim * 4
for dtype, bytes_per_dim, notes in configs:
total_bytes = n_docs * dim * bytes_per_dim
ratio = f32_size / total_bytes
print(f"{dtype:<12} {bytes_per_dim:<12} {total_bytes/1e9:>10.2f} "
f"{ratio:>11.1f}× {notes}")
storage_comparison_table()
# Output:
# Dtype Bytes/dim Total GB Vs float32 Notes
# ----------------------------------------------------------------------
# float32 4 6.14 1.0× Full precision
# float16 2 3.07 2.0× Half precision, no quality loss
# int8 1 1.54 4.0× ~1-2% quality loss
# binary 0.031 0.19 32.0× ~30% loss alone, ~2% with rescoring
Qdrant Binary Quantization
Qdrant (a popular vector database) has native binary quantization support with automatic rescoring:
from qdrant_client import QdrantClient
from qdrant_client.http import models as qdrant_models
def setup_qdrant_binary_quantization(
collection_name: str = "binary-embeddings",
dim: int = 1536,
):
"""
Configure Qdrant with binary quantization and automatic rescoring.
"""
client = QdrantClient("localhost", port=6333)
client.create_collection(
collection_name=collection_name,
vectors_config=qdrant_models.VectorParams(
size=dim,
distance=qdrant_models.Distance.COSINE,
),
quantization_config=qdrant_models.BinaryQuantization(
binary=qdrant_models.BinaryQuantizationConfig(
always_ram=True, # Keep binary index in RAM for fast search
),
),
)
return client
def search_qdrant_with_rescoring(
client: QdrantClient,
collection_name: str,
query_vector: list[float],
top_k: int = 10,
oversampling_factor: int = 5, # Retrieve 5× more candidates for rescoring
) -> list:
"""
Search Qdrant collection with binary quantization and float32 rescoring.
oversampling_factor: how many extra candidates to retrieve for rescoring.
"""
results = client.search(
collection_name=collection_name,
query_vector=query_vector,
limit=top_k,
search_params=qdrant_models.SearchParams(
quantization=qdrant_models.QuantizationSearchParams(
ignore=False, # Use binary quantization for search
rescore=True, # Rescore with float32 after binary search
oversampling=oversampling_factor, # Retrieve 5*top_k candidates
)
)
)
return results
When to Use Each Quantization Level
float16: Default choice. Always use instead of float32 for storage. No quality loss. Supported by all major vector databases.
int8: Good default for large-scale deployments. 4× savings, 1-2% quality loss is acceptable for most applications.
binary: For very large corpora (100M+) where search latency is critical. Always use with rescoring. The 2% quality loss with rescoring is acceptable for most applications; without rescoring (~30% loss), use only for early-stage filtering.
Common Mistakes
:::danger Using binary quantization without rescoring Binary Hamming distance without float32 rescoring loses ~30-40% retrieval quality. This is rarely acceptable for production. Always implement the two-stage pattern: binary first pass to find 100-1000 candidates, float32 rescore to find the final top-k. :::
:::danger Not normalizing before binarization Binary quantization sign-preserves the direction of the vector. For this to work, the vector must represent direction meaningfully - which requires unit L2 normalization. Always normalize embeddings before binarizing. Unnormalized embeddings have magnitudes that corrupt the direction signal. :::
:::warning Quantizing without evaluating quality impact The quality loss from quantization varies by embedding model, domain, and query type. Always measure the quality impact on your specific dataset before deploying quantization. Use your domain evaluation set (Lesson 06) to compare float32 vs quantized retrieval on your actual queries. :::
:::tip Keep float32 embeddings alongside binary for rescoring Don't discard your float32 embeddings after binarizing. The rescoring pattern requires float32 for high-quality final rankings. Store binary embeddings in fast RAM for first-pass search and float32 on cheaper storage for rescoring. The total storage cost (binary + float32) is 1.03× float32 alone - essentially free, because binary is so small. :::
Interview Q&A
Q1: What is embedding quantization and why is it used?
Embedding quantization reduces the numerical precision of embedding vectors to decrease storage cost and search latency. float32 (4 bytes/dimension) is the standard; quantization reduces this to float16 (2 bytes), int8 (1 byte), or binary (1 bit per dimension, i.e., 1/32 bytes). At 1B documents with 1536-dim embeddings, the savings are: float32 = 6.1 TB, float16 = 3 TB, int8 = 1.5 TB, binary = 190 GB. Search speed also scales with quantization level - binary Hamming distance search (implemented as POPCOUNT) is 25-40× faster than float32 dot product.
Q2: How does binary quantization work and what is the Hamming distance trick?
Binary quantization converts each float32 dimension to a single bit based on sign: positive values → 1, negative → 0. A 1536-dim float32 vector (6,144 bytes) becomes 1536 bits (192 bytes). Similarity is then measured by Hamming distance: the count of bit positions where two vectors differ. This equals POPCOUNT(XOR(a, b)) - the number of set bits in the XOR of two binary vectors. Modern CPUs and GPUs implement POPCOUNT as a hardware instruction, making binary similarity search extremely fast. The reason this works for normalized embeddings: cosine similarity depends only on angle, and the sign of each dimension approximately captures the directional relationship between vectors.
Q3: Why is rescoring necessary for binary quantization?
Without rescoring, binary Hamming distance has only ~60-70% retrieval quality (nDCG@10) compared to float32 cosine similarity. This is because the sign quantization loses detailed information about the relative magnitude of each dimension, degrading the precision of similarity ranking. Rescoring adds a second stage: after retrieving the top-1000 candidates using fast binary search, load their float32 embeddings and compute exact cosine similarity. This restores quality to ~98-99% of full float32 search while retaining 90%+ of binary search's speed advantage.
Q4: When would you choose int8 over binary quantization?
Choose int8 over binary when: you want higher quality assurance (int8 loses 1-2% vs binary's 2% with rescoring - similar but int8 doesn't require the two-stage architecture), your corpus is smaller (binary's 32× compression is compelling for 100M+ documents; at 1M documents, int8's 4× compression may be sufficient), or you want simpler infrastructure (int8 is a straightforward compression without the two-stage search complexity of binary with rescoring).
Choose binary when: corpus size exceeds 100M documents and storage is the primary constraint, or search latency is critical and you need the speed of Hamming distance computation.
Summary
Embedding quantization reduces storage cost and search latency without model retraining:
- float16: 2× compression, zero quality loss. Always prefer over float32.
- int8: 4× compression, 1-2% quality loss. Good default for large deployments.
- Binary: 32× compression, 2% quality loss with rescoring (30% without). Best for 100M+ document corpora with latency constraints.
The rescoring pattern is key to production binary quantization: use binary Hamming distance for fast first-pass search over the full corpus, then rescore the top-1000 candidates with float32 embeddings for near-full quality final results.
Supported natively by Qdrant (binary + int8), FAISS (binary via IndexBinaryFlat), and Cohere Embed API (returns all quantization levels in one call).
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Quantisation Effects demo on the EngineersOfAI Playground - no code required.
:::
