
Embedding Quantization

Reading time: 22 min | Relevance: AI Engineer, ML Engineer


The Terabyte Vector Store

You're building a semantic search system for a document platform with 500 million documents. Each document produces a 1536-dimensional float32 embedding. Storage: 500M × 1536 × 4 bytes = 3 terabytes for embeddings alone. Vector similarity search over 3 TB is expensive, slow, and requires specialized hardware. Your engineering team estimates $50,000/month for the vector database cluster.

Now apply binary quantization. Each 1536-dim float32 embedding (6,144 bytes) becomes a 1536-bit binary vector (192 bytes) - a 32× compression. Storage drops to 96 GB (500M × 192 bytes). Binary similarity search (Hamming distance) is implemented in hardware as POPCOUNT operations - 25-40× faster than float32 dot products. The database cost drops to $3,000/month. Quality loss: approximately 2% if you add a float32 rescoring step over the top-k binary results.

Quantization is one of the highest-leverage optimizations available for embedding systems. It requires no retraining (for float16 and int8) or minimal modification (for binary). This lesson covers the full quantization stack from float16 through binary, with practical implementation in FAISS and Qdrant.


Why Quantize?

Embedding vectors are stored as 32-bit floats (float32) by default. Each float32 uses 4 bytes. A 1536-dimensional embedding uses 6,144 bytes (6 KB). At scale:

| Documents | float32 storage | GPU memory for search |
|---|---|---|
| 1 million | 6 GB | 6 GB (GPU) |
| 10 million | 60 GB | 60 GB (multiple GPUs) |
| 100 million | 600 GB | Distributed cluster |
| 1 billion | 6 TB | Impractical on GPU |

Storage cost is one concern. Search speed is another. ANN (Approximate Nearest Neighbor) search speed scales with both the number of dimensions and the number of bytes per dimension. Quantization reduces the bytes per dimension while leaving the dimension count unchanged.


Float16: The Free Lunch

The easiest quantization step: convert float32 to float16 (16-bit floating point). Float16 uses 2 bytes vs 4 bytes - exactly 2× compression. Quality loss is negligible for most embedding applications.

Why it's essentially free

Float32 carries 23 bits of mantissa precision; float16 carries 10. For embeddings, the extra precision in float32 is rarely meaningful: the rank ordering of similarity scores almost never changes when converting from float32 to float16.

import numpy as np

def float32_to_float16(embeddings: np.ndarray) -> np.ndarray:
    """
    Convert float32 embeddings to float16.
    2× storage reduction, negligible quality loss.
    """
    return embeddings.astype(np.float16)


def measure_quality_loss_fp16(
    embeddings_f32: np.ndarray,  # (n_docs, dim) float32
    n_test_queries: int = 100,
) -> dict:
    """
    Measure the quality loss from float16 quantization.
    Compares top-k rankings between float32 and float16.
    """
    embeddings_f16 = float32_to_float16(embeddings_f32)
    rng = np.random.default_rng(42)
    test_indices = rng.choice(len(embeddings_f32), n_test_queries, replace=False)

    rank_changes = []
    for idx in test_indices:
        q32 = embeddings_f32[idx]
        q16 = embeddings_f16[idx]

        # Cosine similarity against the full corpus at each precision
        sims_f32 = (embeddings_f32 @ q32) / (
            np.linalg.norm(embeddings_f32, axis=1) * np.linalg.norm(q32) + 1e-10
        )
        sims_f16 = (embeddings_f16.astype(np.float32) @ q16.astype(np.float32)) / (
            np.linalg.norm(embeddings_f16.astype(np.float32), axis=1) *
            np.linalg.norm(q16.astype(np.float32)) + 1e-10
        )

        rank_f32 = np.argsort(-sims_f32)[:10]
        rank_f16 = np.argsort(-sims_f16)[:10]

        # Fraction of the float32 top-10 that survives in the float16 top-10
        overlap = len(set(rank_f32) & set(rank_f16))
        rank_changes.append(overlap / 10)

    return {
        "top10_overlap_mean": float(np.mean(rank_changes)),
        "storage_bytes_f32": embeddings_f32.nbytes,
        "storage_bytes_f16": embeddings_f16.nbytes,
        "compression_ratio": embeddings_f32.nbytes / embeddings_f16.nbytes,
    }


# In practice: almost always > 0.99 overlap
# float16 is a drop-in replacement for embeddings
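The mantissa argument can also be checked directly. Float16 keeps 11 significant bits (10 stored plus the implicit leading bit), so rounding a value changes it by a relative amount of at most 2^-11, roughly 0.05%. The sketch below is an illustration on synthetic data, not a benchmark: it assumes random unit-length vectors (real embeddings may behave differently) and measures the largest change in dot-product similarity after a float16 round trip. The function name `fp16_similarity_error` is made up for this example.

```python
import numpy as np

def fp16_similarity_error(n_docs: int = 2000, dim: int = 256, seed: int = 0) -> float:
    """Largest absolute change in dot-product similarity after a float16 round trip."""
    rng = np.random.default_rng(seed)
    # Assumption: synthetic Gaussian vectors, normalized to unit length.
    docs = rng.standard_normal((n_docs, dim)).astype(np.float32)
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    query = docs[0]

    exact = docs @ query

    # Store as float16, then compute in float32 as the lesson's code does.
    docs_rt = docs.astype(np.float16).astype(np.float32)
    approx = docs_rt @ docs_rt[0]
    return float(np.max(np.abs(exact - approx)))

max_err = fp16_similarity_error()
print(f"max |sim_f32 - sim_f16| = {max_err:.2e}")
```

For unit vectors the worst case is about 2 × 2^-11 ≈ 0.001, and typical errors are much smaller, which is why top-k rankings rarely move unless two candidates are nearly tied.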

Using float16 in FAISS

import faiss
import numpy as np

# Standard float32 index
def build_f32_index(embeddings: np.ndarray) -> faiss.IndexFlatIP:
    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings.astype(np.float32))
    return index

# Quantized IVF index with float16 storage
def build_ivf_f16_index(
    embeddings: np.ndarray,
    n_cells: int = 1024,  # Number of Voronoi cells
) -> faiss.Index:
    """
    IVF index with float16 internal storage.
    ~2× storage savings vs float32 with minimal quality loss.
    """
    dim = embeddings.shape[1]
    # Coarse quantizer: flat inner-product index that assigns vectors to cells
    quantizer = faiss.IndexFlatIP(dim)
    # IndexIVFFlat would keep the vectors as float32. The scalar quantizer
    # with QT_fp16 stores them as float16; FAISS still takes float32 input.
    index = faiss.IndexIVFScalarQuantizer(
        quantizer,
        dim,
        n_cells,
        faiss.ScalarQuantizer.QT_fp16,
        faiss.METRIC_INNER_PRODUCT,
    )
    index.train(embeddings.astype(np.float32))
    index.add(embeddings.astype(np.float32))
    index.nprobe = 64  # Number of cells to search (quality/speed trade-off)
    return index

Int8 Quantization: 4× Compression

Int8 quantization maps float32 values to 8-bit integers in the range $[-128, 127]$. This achieves 4× storage compression with roughly 1-2% quality loss on retrieval tasks.

The quantization scheme

Linear quantization maps a float32 range $[\min, \max]$ to $[-128, 127]$:

$$q = \text{round}\left(\frac{v - \text{zero\_point}}{\text{scale}}\right)$$

$$v = q \times \text{scale} + \text{zero\_point}$$

For embeddings, the standard approach uses the empirical range of the embedding values:

import numpy as np
from dataclasses import dataclass

@dataclass
class QuantizationParams:
    scale: float
    zero_point: float

def quantize_embeddings_int8(
    embeddings: np.ndarray,         # (n_docs, dim) float32
    percentile_clip: float = 99.5,  # Clip outliers before quantizing
) -> tuple[np.ndarray, QuantizationParams]:
    """
    Quantize float32 embeddings to int8.
    Returns quantized embeddings and parameters needed for dequantization.
    """
    # Clip extreme values (outliers can hurt quantization quality)
    max_val = np.percentile(np.abs(embeddings), percentile_clip)
    clipped = np.clip(embeddings, -max_val, max_val)

    # Symmetric scheme: map [-max_val, max_val] to [-127, 127] (-128 is unused)
    scale = max_val / 127.0
    zero_point = 0.0

    # Quantize
    quantized = np.round(clipped / scale).astype(np.int8)

    params = QuantizationParams(scale=scale, zero_point=zero_point)
    return quantized, params


def dequantize_embeddings(
    quantized: np.ndarray,  # (n_docs, dim) int8
    params: QuantizationParams,
) -> np.ndarray:
    """Convert int8 back to float32."""
    return quantized.astype(np.float32) * params.scale + params.zero_point


def int8_similarity_search(
    query_f32: np.ndarray,    # (dim,) float32 query
    corpus_int8: np.ndarray,  # (n_docs, dim) int8 corpus
    params: QuantizationParams,
    k: int = 10,
) -> tuple[np.ndarray, np.ndarray]:
    """
    Similarity search using int8 embeddings.
    Dequantize to float32 for dot product computation.
    """
    # Dequantize and compute dot products
    corpus_f32 = dequantize_embeddings(corpus_int8, params)
    sims = corpus_f32 @ query_f32
    top_k_indices = np.argsort(-sims)[:k]
    return top_k_indices, sims[top_k_indices]


# Storage comparison
n_docs = 1_000_000
dim = 1536
bytes_f32 = n_docs * dim * 4   # 4 bytes per float32
bytes_int8 = n_docs * dim * 1  # 1 byte per int8

print("1M documents, 1536 dims:")
print(f"  float32: {bytes_f32 / 1e9:.2f} GB")
print(f"  int8: {bytes_int8 / 1e9:.2f} GB (4× reduction)")
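To see where the quality loss comes from, it helps to look at the round-trip error of the linear scheme. The sketch below is a minimal illustration, assuming synthetic Gaussian data and no percentile clipping (so every value lies inside the quantization range). With `zero_point = 0`, rounding to the nearest integer step means no value moves by more than half a step, that is `scale / 2`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumption: synthetic data standing in for real embeddings.
vectors = rng.standard_normal((1000, 64)).astype(np.float32)

# Symmetric quantization without clipping: the scale comes from the largest magnitude.
scale = float(np.max(np.abs(vectors))) / 127.0
quantized = np.round(vectors / scale).astype(np.int8)

# Dequantize (zero_point is 0 in the symmetric scheme).
restored = quantized.astype(np.float32) * scale

max_abs_error = float(np.max(np.abs(vectors - restored)))
print(f"scale = {scale:.5f}")
print(f"max round-trip error = {max_abs_error:.5f} (bound: scale / 2 = {scale / 2:.5f})")
```

When outliers are clipped first, as in `quantize_embeddings_int8`, the step size shrinks and typical values are represented more finely, at the price of a larger error on the few clipped values.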

Int8 with FAISS and Qdrant

import qdrant_client
from qdrant_client.http import models as qdrant_models

# Qdrant with int8 quantization
def setup_qdrant_int8(
    collection_name: str = "embeddings",
    dim: int = 1536,
):
    """Configure Qdrant with int8 scalar quantization."""
    client = qdrant_client.QdrantClient("localhost", port=6333)

    client.create_collection(
        collection_name=collection_name,
        vectors_config=qdrant_models.VectorParams(
            size=dim,
            distance=qdrant_models.Distance.COSINE,
        ),
        quantization_config=qdrant_models.ScalarQuantization(
            scalar=qdrant_models.ScalarQuantizationConfig(
                type=qdrant_models.ScalarType.INT8,
                quantile=0.99,    # Clip 1% of outliers
                always_ram=True,  # Keep quantized vectors in RAM
            )
        ),
    )

    return client

Binary Quantization: 32× Compression

Binary quantization is the most aggressive common quantization strategy: convert each float32 value to a single bit based on sign.

$$b_i = \begin{cases} 1 & \text{if } v_i \geq 0 \\ 0 & \text{if } v_i < 0 \end{cases}$$

A 1536-dimensional float32 embedding (6,144 bytes) becomes a 1536-bit binary vector (192 bytes) - 32× compression.

The Hamming distance trick

With binary embeddings, similarity is measured using Hamming distance: the number of bit positions where two vectors differ.

$$d_H(a, b) = \text{popcount}(a \oplus b)$$

where $\oplus$ is XOR and popcount counts set bits. Hamming distance is the complement of dot product in binary space: fewer differing bits means more matching bits and therefore higher similarity.

POPCOUNT is a hardware instruction on modern CPUs and GPUs - extremely fast. This makes binary embedding search 25-40× faster than float32 dot product search.
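The "complement of dot product" remark can be made concrete. If each bit is read as +1 (bit set) or -1 (bit clear), matching positions contribute +1 to the dot product and differing positions contribute -1, so the dot product equals `dim - 2 * hamming`. The short check below is a sketch on random data; the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64
a = rng.standard_normal(dim)
b = rng.standard_normal(dim)

# Bit view: 1 where the value is non-negative, packed 8 bits per byte.
a_bits = np.packbits(a >= 0)
b_bits = np.packbits(b >= 0)
hamming = int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())

# Sign view: the same information expressed as +1 / -1 values.
a_sign = np.where(a >= 0, 1, -1)
b_sign = np.where(b >= 0, 1, -1)
dot = int(a_sign @ b_sign)

# Matching positions add +1 and differing positions add -1,
# so dot = (dim - hamming) - hamming.
print(f"hamming = {hamming}, dot = {dot}, dim - 2*hamming = {dim - 2 * hamming}")
```

Because the two quantities are related by a decreasing linear map, ranking by smallest Hamming distance gives the same order as ranking by largest sign dot product.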

import faiss
import numpy as np

def binarize_embeddings(embeddings: np.ndarray) -> np.ndarray:
    """
    Binarize float32 embeddings by taking the sign.
    Returns a packed uint8 array of shape (n_docs, ceil(dim/8)),
    where a set bit means the original value was positive.
    """
    # Sign-based binarization: positive → 1, non-positive → 0
    binary = (embeddings > 0)
    # Pack bits: 8 booleans → 1 uint8
    packed = np.packbits(binary, axis=-1)  # (n_docs, ceil(dim/8))
    return packed


def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """
    Compute Hamming distance between two packed binary vectors.
    Uses XOR + popcount (bit counting).
    """
    xor = np.bitwise_xor(a, b)
    # Count set bits by expanding each byte back into 8 bits
    bit_count = np.unpackbits(xor).sum()
    return int(bit_count)


def batch_hamming_similarity(
    query_packed: np.ndarray,   # (n_bytes,) packed query
    corpus_packed: np.ndarray,  # (n_docs, n_bytes) packed corpus
) -> np.ndarray:
    """
    Compute Hamming similarity (1 - normalized hamming distance)
    between a query and all corpus vectors.
    """
    # XOR query with each corpus vector
    xor = np.bitwise_xor(corpus_packed, query_packed[np.newaxis, :])
    # Count set bits (Hamming distance)
    hamming_dists = np.unpackbits(xor, axis=-1).sum(axis=-1)
    dim = corpus_packed.shape[1] * 8
    # Convert to similarity: 1 - (hamming_dist / dim)
    return 1.0 - hamming_dists / dim


# FAISS binary index
def build_binary_faiss_index(binary_packed: np.ndarray) -> faiss.IndexBinaryFlat:
    """
    Build a FAISS binary index for Hamming distance search.
    """
    dim_bits = binary_packed.shape[1] * 8  # Total bits
    index = faiss.IndexBinaryFlat(dim_bits)
    index.add(binary_packed)
    return index


def binary_search(
    query_binary: np.ndarray,
    index: faiss.IndexBinaryFlat,
    k: int,
) -> tuple[np.ndarray, np.ndarray]:
    """
    Search binary index. Returns indices and Hamming distances.
    """
    # FAISS returns (distances, indices), in that order
    hamming_dists, indices = index.search(
        query_binary.reshape(1, -1), k
    )
    return indices[0], hamming_dists[0]
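The NumPy helpers above count bits with `np.unpackbits`, which expands every byte into eight array elements before summing. One common alternative, sketched here as an illustration rather than a tuned implementation, is a 256-entry lookup table that maps each byte value to its bit count. The names `POPCOUNT_TABLE` and `hamming_distances_lut` are invented for this example.

```python
import numpy as np

# POPCOUNT_TABLE[v] = number of set bits in the byte value v (0..255).
POPCOUNT_TABLE = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)

def hamming_distances_lut(query_packed: np.ndarray, corpus_packed: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed query to every packed corpus row."""
    xor = np.bitwise_xor(corpus_packed, query_packed[np.newaxis, :])
    # Look up the bit count of every byte, then sum per row.
    return POPCOUNT_TABLE[xor].sum(axis=-1, dtype=np.int64)

rng = np.random.default_rng(0)
# Assumption: random sign bits standing in for binarized embeddings.
corpus = np.packbits(rng.standard_normal((1000, 128)) >= 0, axis=-1)  # (1000, 16) uint8
query = corpus[42]

lut = hamming_distances_lut(query, corpus)
reference = np.unpackbits(np.bitwise_xor(corpus, query), axis=-1).sum(axis=-1)

print("matches unpackbits result:", bool(np.array_equal(lut, reference)))
print("distance of row 42 to itself:", int(lut[42]))
```

The table avoids the eightfold intermediate array, but it is still pure NumPy; libraries such as FAISS go further and use the hardware popcount instruction directly.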

Why binary works: the sign preserves direction

Binary quantization loses almost all magnitude information. How can it possibly work for retrieval?

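The key insight: for normalized embeddings (L2-normalized to the unit sphere), cosine similarity depends only on angle, not magnitude. Sign binarization approximately preserves that angle. Reading the bits as $\pm 1$ values, the cosine between two binary vectors is fixed by their Hamming distance $d_H$ over $d$ dimensions. Under the idealized assumption that each sign bit differs with probability equal to the original angle divided by $\pi$, the binary cosine tracks the original angle linearly:

$$\cos(\angle_{\text{binary}}(a, b)) = 1 - \frac{2\, d_H(a, b)}{d} \approx 1 - \frac{2}{\pi} \angle_{f32}(a, b)$$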

This relationship holds reasonably well for the types of embeddings produced by modern contrastive learning - the sign carries most of the directional information.

However, "approximately preserved" doesn't mean "perfectly preserved." Binary Hamming similarity has ~30-40% lower absolute accuracy than float32 cosine similarity on standard benchmarks. This is where rescoring saves us.
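The angle-preservation claim can be sanity-checked numerically. For two Gaussian vectors whose coordinates have correlation equal to the target cosine, a standard result says each pair of sign bits differs with probability θ/π, where θ is the angle between the vectors. The sketch below assumes i.i.d. Gaussian coordinates, which real embeddings only approximate, and compares the measured fraction of differing bits with that prediction. The function name is made up for this example.

```python
import numpy as np

def differing_bit_fraction(cos_target: float, dim: int = 4096, seed: int = 0) -> float:
    """Fraction of sign bits that differ between two Gaussian vectors with a given correlation."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(dim)
    noise = rng.standard_normal(dim)
    # Each coordinate pair (a_i, b_i) has correlation cos_target.
    b = cos_target * a + np.sqrt(1.0 - cos_target**2) * noise
    return float(np.mean((a >= 0) != (b >= 0)))

for cos_target in (0.9, 0.5, 0.0):
    predicted = float(np.arccos(cos_target) / np.pi)
    measured = differing_bit_fraction(cos_target)
    print(f"cos = {cos_target:.1f}  predicted = {predicted:.3f}  measured = {measured:.3f}")
```

The measured fraction fluctuates around the prediction with a spread that shrinks as the dimension grows, which is one reason binary quantization tends to work better on high-dimensional embeddings. The remaining noise is what makes a rescoring stage worthwhile.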


The Rescoring Pattern: Binary First Pass + Float32 Reranking

The production pattern for binary quantization:

  1. Binary first pass: Search the binary index for top-1000 candidates using Hamming distance. This is 25-40× faster than float32 search.
  2. Float32 rescore: Load the float32 embeddings for these 1000 candidates and compute exact cosine similarity with the float32 query embedding.
  3. Return top-k: Return the top-k from the rescored candidates.

import faiss
import numpy as np

class BinaryRescoreIndex:
    """
    Two-stage retrieval: binary Hamming search → float32 reranking.
    Achieves near float32 quality at ~5-10% of the search time.
    """

    def __init__(self, first_stage_k: int = 1000):
        self.first_stage_k = first_stage_k
        self.binary_index = None
        self.float32_embeddings = None

    def build(self, embeddings_f32: np.ndarray):
        """
        Index embeddings in both binary and float32 format.
        """
        # Normalize so the stage-2 dot product equals cosine similarity.
        # (Positive rescaling does not change the sign bits themselves.)
        norms = np.linalg.norm(embeddings_f32, axis=1, keepdims=True)
        embeddings_normalized = embeddings_f32 / norms.clip(min=1e-10)

        # Binarize
        binary = (embeddings_normalized > 0).astype(np.uint8)
        binary_packed = np.packbits(binary, axis=-1)  # (n_docs, dim/8)

        # Store both
        dim_bits = binary_packed.shape[1] * 8
        self.binary_index = faiss.IndexBinaryFlat(dim_bits)
        self.binary_index.add(binary_packed)

        # Store normalized float32 for rescoring
        self.float32_embeddings = embeddings_normalized

        print(f"Indexed {len(embeddings_f32)} documents")
        print(f"Binary index: {binary_packed.nbytes / 1e9:.3f} GB")
        print(f"Float32 backup: {embeddings_normalized.nbytes / 1e9:.3f} GB")
        print(f"Total: {(binary_packed.nbytes + embeddings_normalized.nbytes) / 1e9:.3f} GB")
        print(f"vs Float32-only: {embeddings_f32.nbytes / 1e9:.3f} GB")

    def search(
        self,
        query_f32: np.ndarray,  # (dim,) float32 query
        k: int = 10,
    ) -> tuple[np.ndarray, np.ndarray]:
        """
        Two-stage search: binary first pass → float32 reranking.
        """
        # Normalize query
        query_normalized = query_f32 / (np.linalg.norm(query_f32) + 1e-10)

        # Stage 1: Binary search
        query_binary = (query_normalized > 0).astype(np.uint8)
        query_packed = np.packbits(query_binary).reshape(1, -1)
        # FAISS returns (distances, indices), in that order
        _, candidate_indices = self.binary_index.search(
            query_packed, self.first_stage_k
        )
        candidates = candidate_indices[0]
        # FAISS pads with -1 when the index holds fewer than first_stage_k vectors
        candidates = candidates[candidates >= 0]

        # Stage 2: Float32 rescore on candidates
        candidate_embs = self.float32_embeddings[candidates]  # (n_candidates, dim)
        float32_sims = candidate_embs @ query_normalized      # (n_candidates,)

        # Sort candidates by float32 similarity
        rerank_order = np.argsort(-float32_sims)
        top_k_indices = candidates[rerank_order[:k]]
        top_k_sims = float32_sims[rerank_order[:k]]

        return top_k_indices, top_k_sims
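Before relying on the two-stage pattern, it is worth measuring how often it agrees with exact search. The sketch below is a NumPy-only stand-in for the FAISS-based class (so it runs without FAISS installed) and reports recall against brute-force float32 search. Everything about the data is an assumption: the corpus is random unit vectors, and each query is a lightly perturbed copy of a known document so that one correct answer is known in advance. `two_stage_search` and `POPCOUNT_TABLE` are names invented for this example.

```python
import numpy as np

POPCOUNT_TABLE = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)

def two_stage_search(query, docs, docs_packed, first_stage_k, k):
    """Hamming first pass over packed sign bits, then exact dot-product rescoring."""
    query_packed = np.packbits(query >= 0)
    xor = np.bitwise_xor(docs_packed, query_packed)
    hamming = POPCOUNT_TABLE[xor].sum(axis=-1, dtype=np.int64)
    candidates = np.argsort(hamming, kind="stable")[:first_stage_k]
    sims = docs[candidates] @ query
    return candidates[np.argsort(-sims)[:k]]

rng = np.random.default_rng(0)
n_docs, dim, k = 2000, 256, 10
docs = rng.standard_normal((n_docs, dim)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
docs_packed = np.packbits(docs >= 0, axis=-1)

# Synthetic queries: perturbed copies of known documents (an assumption made
# so that each query has one clearly correct nearest neighbour).
sources = rng.choice(n_docs, size=50, replace=False)
recalls, top1_hits = [], 0
for src in sources:
    query = docs[src] + 0.02 * rng.standard_normal(dim).astype(np.float32)
    query /= np.linalg.norm(query)
    exact = np.argsort(-(docs @ query))[:k]
    approx = two_stage_search(query, docs, docs_packed, first_stage_k=200, k=k)
    recalls.append(len(set(exact.tolist()) & set(approx.tolist())) / k)
    top1_hits += int(approx[0] == src)

mean_recall = float(np.mean(recalls))
print(f"recall@{k} vs exact search: {mean_recall:.2f}")
print(f"top-1 hits: {top1_hits}/{len(sources)}")
```

On real data, swap in your own embeddings and queries and sweep `first_stage_k`: recall rises as more candidates are rescored, at the cost of more float32 work per query.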

Storage Comparison: 1M Vectors at Different Quantizations

def storage_comparison_table():
    n_docs = 1_000_000
    dim = 1536

    configs = [
        ("float32", 4, "Full precision"),
        ("float16", 2, "Half precision, negligible quality loss"),
        ("int8", 1, "~1-2% quality loss"),
        # 1 bit per dimension = 1/8 byte, which is 1/32 the size of float32
        ("binary", 1/8, "~30% loss alone, ~2% with rescoring"),
    ]

    print(f"{'Dtype':<12} {'Bytes/dim':<12} {'Total GB':>10} {'Vs float32':>12} Notes")
    print("-" * 70)
    f32_size = n_docs * dim * 4
    for dtype, bytes_per_dim, notes in configs:
        total_bytes = n_docs * dim * bytes_per_dim
        ratio = f32_size / total_bytes
        print(f"{dtype:<12} {bytes_per_dim:<12} {total_bytes/1e9:>10.2f} "
              f"{ratio:>11.1f}× {notes}")

storage_comparison_table()

# Output:
# Dtype        Bytes/dim      Total GB   Vs float32 Notes
# ----------------------------------------------------------------------
# float32      4                  6.14         1.0× Full precision
# float16      2                  3.07         2.0× Half precision, negligible quality loss
# int8         1                  1.54         4.0× ~1-2% quality loss
# binary       0.125              0.19        32.0× ~30% loss alone, ~2% with rescoring

Qdrant Binary Quantization

Qdrant (a popular vector database) has native binary quantization support with automatic rescoring:

from qdrant_client import QdrantClient
from qdrant_client.http import models as qdrant_models

def setup_qdrant_binary_quantization(
    collection_name: str = "binary-embeddings",
    dim: int = 1536,
):
    """
    Configure Qdrant with binary quantization and automatic rescoring.
    """
    client = QdrantClient("localhost", port=6333)

    client.create_collection(
        collection_name=collection_name,
        vectors_config=qdrant_models.VectorParams(
            size=dim,
            distance=qdrant_models.Distance.COSINE,
        ),
        quantization_config=qdrant_models.BinaryQuantization(
            binary=qdrant_models.BinaryQuantizationConfig(
                always_ram=True,  # Keep binary index in RAM for fast search
            ),
        ),
    )

    return client


def search_qdrant_with_rescoring(
    client: QdrantClient,
    collection_name: str,
    query_vector: list[float],
    top_k: int = 10,
    oversampling_factor: int = 5,  # Retrieve 5× more candidates for rescoring
) -> list:
    """
    Search Qdrant collection with binary quantization and float32 rescoring.
    oversampling_factor: how many extra candidates to retrieve for rescoring.
    """
    results = client.search(
        collection_name=collection_name,
        query_vector=query_vector,
        limit=top_k,
        search_params=qdrant_models.SearchParams(
            quantization=qdrant_models.QuantizationSearchParams(
                ignore=False,   # Use binary quantization for search
                rescore=True,   # Rescore with float32 after binary search
                oversampling=oversampling_factor,  # Retrieve oversampling_factor × top_k candidates
            )
        )
    )
    return results

When to Use Each Quantization Level

float16: Default choice. Prefer it over float32 for storage. Quality loss is negligible, and support is widespread among vector databases and libraries.

int8: Good default for large-scale deployments. 4× savings, 1-2% quality loss is acceptable for most applications.

binary: For very large corpora (100M+) where search latency is critical. Always use with rescoring. The 2% quality loss with rescoring is acceptable for most applications; without rescoring (~30% loss), use only for early-stage filtering.


Common Mistakes

:::danger Using binary quantization without rescoring

Binary Hamming distance without float32 rescoring loses ~30-40% retrieval quality. This is rarely acceptable for production. Always implement the two-stage pattern: binary first pass to find 100-1000 candidates, float32 rescore to find the final top-k.

:::

:::danger Not normalizing before binarization

Sign binarization keeps only the direction of a vector, so the pipeline should treat every embedding as a direction: normalize to unit L2 length before indexing. Rescaling by a positive norm leaves the sign bits unchanged, but the float32 rescoring stage computes dot products, and a dot product equals cosine similarity only for unit vectors. With unnormalized embeddings, magnitudes leak into the rescored ranking and corrupt the direction signal.

:::

:::warning Quantizing without evaluating quality impact

The quality loss from quantization varies by embedding model, domain, and query type. Always measure the quality impact on your specific dataset before deploying quantization. Use your domain evaluation set (Lesson 06) to compare float32 vs quantized retrieval on your actual queries.

:::

:::tip Keep float32 embeddings alongside binary for rescoring

Don't discard your float32 embeddings after binarizing. The rescoring pattern requires float32 for high-quality final rankings. Store binary embeddings in fast RAM for first-pass search and float32 on cheaper storage for rescoring. The total storage cost (binary + float32) is 1.03× float32 alone - essentially free, because binary is so small.

:::


Interview Q&A

Q1: What is embedding quantization and why is it used?

Embedding quantization reduces the numerical precision of embedding vectors to decrease storage cost and search latency. float32 (4 bytes/dimension) is the standard; quantization reduces this to float16 (2 bytes), int8 (1 byte), or binary (1 bit per dimension, i.e., 1/8 byte, which is 1/32 the size of float32). At 1B documents with 1536-dim embeddings, the storage requirements are: float32 = 6.1 TB, float16 = 3.1 TB, int8 = 1.5 TB, binary = 192 GB. Search speed also improves with the quantization level - binary Hamming distance search (implemented as POPCOUNT) is 25-40× faster than float32 dot product.

Q2: How does binary quantization work and what is the Hamming distance trick?

Binary quantization converts each float32 dimension to a single bit based on sign: positive values → 1, negative → 0. A 1536-dim float32 vector (6,144 bytes) becomes 1536 bits (192 bytes). Similarity is then measured by Hamming distance: the count of bit positions where two vectors differ. This equals POPCOUNT(XOR(a, b)) - the number of set bits in the XOR of two binary vectors. Modern CPUs and GPUs implement POPCOUNT as a hardware instruction, making binary similarity search extremely fast. The reason this works for normalized embeddings: cosine similarity depends only on angle, and the sign of each dimension approximately captures the directional relationship between vectors.

Q3: Why is rescoring necessary for binary quantization?

Without rescoring, binary Hamming distance has only ~60-70% retrieval quality (nDCG@10) compared to float32 cosine similarity. This is because the sign quantization loses detailed information about the relative magnitude of each dimension, degrading the precision of similarity ranking. Rescoring adds a second stage: after retrieving the top-1000 candidates using fast binary search, load their float32 embeddings and compute exact cosine similarity. This restores quality to ~98-99% of full float32 search while retaining 90%+ of binary search's speed advantage.

Q4: When would you choose int8 over binary quantization?

Choose int8 over binary when: you want higher quality assurance (int8 loses 1-2% vs binary's 2% with rescoring - similar but int8 doesn't require the two-stage architecture), your corpus is smaller (binary's 32× compression is compelling for 100M+ documents; at 1M documents, int8's 4× compression may be sufficient), or you want simpler infrastructure (int8 is a straightforward compression without the two-stage search complexity of binary with rescoring).

Choose binary when: corpus size exceeds 100M documents and storage is the primary constraint, or search latency is critical and you need the speed of Hamming distance computation.


Summary

Embedding quantization reduces storage cost and search latency without model retraining:

  • float16: 2× compression, negligible quality loss. Prefer over float32 for storage.
  • int8: 4× compression, 1-2% quality loss. Good default for large deployments.
  • Binary: 32× compression, 2% quality loss with rescoring (30% without). Best for 100M+ document corpora with latency constraints.

The rescoring pattern is key to production binary quantization: use binary Hamming distance for fast first-pass search over the full corpus, then rescore the top-1000 candidates with float32 embeddings for near-full quality final results.

Supported natively by Qdrant (binary + int8), FAISS (binary via IndexBinaryFlat), and Cohere Embed API (returns all quantization levels in one call).
