OpenAI Embeddings and API-Based Embedding Services
Reading time: 20 min | Relevance: AI Engineer, ML Engineer
The $50,000 Embedding Bill
A startup runs a legal document retrieval system. They chose OpenAI text-embedding-ada-002 in early 2023, when it was the obvious choice. By 2024, they're embedding 500 billion tokens per month. At $0.10 per million tokens, that comes to $50,000/month - just for embeddings, before LLM API costs. Their engineers assumed "it'll be negligible at our scale." It was not.
This story is common. API-based embeddings cost next to nothing during prototyping and become material costs at production scale. Understanding the cost structure - and when to switch to self-hosted models or to newer, cheaper API alternatives - is essential engineering knowledge.
The good news: the embedding API landscape has improved dramatically since 2023. OpenAI's text-embedding-3-small is 5× cheaper than ada-002 for equivalent or better quality. Voyage AI often achieves higher retrieval quality than OpenAI at comparable cost. And self-hosted models like BGE-large are competitive with API models with no per-token fee (you pay for hardware and operations instead). This lesson walks through the options, the math, and the decision criteria.
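The arithmetic behind a bill like this is worth sanity-checking before committing to a model. The following is a minimal sketch using the per-million-token prices quoted in this lesson; the function name and the two example volumes are illustrative, and prices should be checked against the provider's current pricing page.

```python
def monthly_embedding_cost(tokens_per_month: int, price_per_million_usd: float) -> float:
    """Return the monthly API cost in USD for a given token volume."""
    return tokens_per_month / 1_000_000 * price_per_million_usd


# Prices as quoted in this lesson (USD per million tokens); assumed, not fetched live.
PRICES = {
    "text-embedding-ada-002": 0.10,
    "text-embedding-3-small": 0.02,
}

for volume in (500_000_000, 500_000_000_000):
    for model, price in PRICES.items():
        cost = monthly_embedding_cost(volume, price)
        print(f"{volume:>15,} tokens/month on {model}: ${cost:,.2f}")
```

Running the numbers for your own projected volume takes seconds and avoids the "negligible at our scale" assumption.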
OpenAI text-embedding-ada-002: The Old Standard
text-embedding-ada-002 was released in December 2022 and was the best widely-available embedding API for much of 2023. Key characteristics:
- 1536 dimensions
- Price: $0.10/million tokens
- Context window: 8,191 tokens
- Performance: Strong on general English text; significantly weaker on specialized domains
Ada-002 was trained on large amounts of web text with a contrastive objective. It was a significant improvement over earlier OpenAI embeddings. But by 2024, it had been surpassed on quality by the text-embedding-3 series and cost 5× more per token than text-embedding-3-small.
Migration: If you're still using ada-002, you should evaluate text-embedding-3-small. It outperforms ada-002 on MTEB while costing $0.02/million tokens - 5× cheaper. The trade-off: you'll need to re-embed your corpus (the embedding spaces are incompatible). For most applications the quality improvement and cost reduction justify the one-time re-embedding cost.
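Whether the one-time re-embedding cost is justified can be estimated as a payback period. This is a rough sketch under stated assumptions: it counts only API token charges (not engineering time or index rebuilds), and the corpus and traffic figures in the example are hypothetical.

```python
def migration_payback_months(
    corpus_tokens: int,
    monthly_tokens: int,
    old_price_per_million: float,
    new_price_per_million: float,
) -> float:
    """Months until per-token savings repay the one-time re-embedding cost."""
    reembed_cost = corpus_tokens / 1_000_000 * new_price_per_million
    monthly_savings = monthly_tokens / 1_000_000 * (
        old_price_per_million - new_price_per_million
    )
    if monthly_savings <= 0:
        return float("inf")  # The new model is not cheaper; no payback on cost alone
    return reembed_cost / monthly_savings


# Hypothetical workload: 2B-token corpus, 500M new tokens embedded per month.
months = migration_payback_months(
    corpus_tokens=2_000_000_000,
    monthly_tokens=500_000_000,
    old_price_per_million=0.10,  # ada-002, as quoted in this lesson
    new_price_per_million=0.02,  # text-embedding-3-small, as quoted in this lesson
)
print(f"Payback period: {months:.1f} months")
```

A short payback period supports migrating; a very long one suggests the decision should rest on quality rather than cost.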
text-embedding-3-small and text-embedding-3-large
OpenAI released the text-embedding-3 series in January 2024. Both models use Matryoshka Representation Learning (MRL) - covered in depth in Lesson 05 - which means you can truncate the embedding dimensions without significant quality loss.
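If you already have full-length vectors stored, the same effect can be approximated client-side by keeping the leading dimensions and re-normalizing to unit length. The sketch below uses plain Python lists for clarity; the function name is illustrative, and the toy vector stands in for a real embedding.

```python
import math


def truncate_and_normalize(embedding: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit L2 norm."""
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # Degenerate all-zero vector; nothing to rescale
    return [x / norm for x in head]


# Toy 3-dimensional "embedding" truncated to 2 dimensions.
short = truncate_and_normalize([3.0, 4.0, 12.0], dims=2)
print(short)
```

Re-normalizing matters because cosine and dot-product search assume unit-length vectors, and a truncated vector no longer has norm 1.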
text-embedding-3-small
- Default dimensions: 1536
- Price: $0.02/million tokens (5× cheaper than ada-002)
- Minimum dimensions: 256 (via MRL truncation)
- Context window: 8,191 tokens
- Performance: Better than ada-002 on most tasks; appropriate for most production RAG applications
text-embedding-3-large
- Default dimensions: 3072
- Price: $0.13/million tokens
- Minimum dimensions: 256 (via MRL truncation)
- Context window: 8,191 tokens
- Performance: State-of-the-art for OpenAI; per OpenAI's launch figures, the MTEB average rises from 61.0 (ada-002) to 64.6, and the multilingual MIRACL score from 31.4 to 54.9
The dimensions parameter
The OpenAI API accepts a dimensions parameter for text-embedding-3 models, enabling you to reduce embedding dimensions with little quality loss:
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def embed_with_reduced_dimensions(
    texts: list[str],
    model: str = "text-embedding-3-small",
    dimensions: int = 512,  # Reduce from 1536 to 512
) -> list[list[float]]:
    """
    Embed texts using OpenAI API with optional dimension reduction.
    MRL training allows dimension reduction without significant quality loss.
    """
    response = client.embeddings.create(
        model=model,
        input=texts,
        dimensions=dimensions,  # Only works for text-embedding-3 models
    )
    return [item.embedding for item in response.data]

# Cost comparison: 1M docs at 256 avg tokens per doc
def cost_analysis():
    token_count = 1_000_000 * 256  # 256M tokens
    costs = {
        "ada-002 (1536 dims)": token_count / 1_000_000 * 0.10,
        "embedding-3-small (1536 dims)": token_count / 1_000_000 * 0.02,
        "embedding-3-small (512 dims)": token_count / 1_000_000 * 0.02,  # Same price, fewer dims
        "embedding-3-large (3072 dims)": token_count / 1_000_000 * 0.13,
        "embedding-3-large (512 dims)": token_count / 1_000_000 * 0.13,  # Same price, fewer dims
    }
    print("Cost for 1M documents at 256 tokens each:")
    for model, cost in costs.items():
        print(f"  {model}: ${cost:.2f}")

cost_analysis()
# Cost for 1M documents at 256 tokens each:
#   ada-002 (1536 dims): $25.60
#   embedding-3-small (1536 dims): $5.12
#   embedding-3-small (512 dims): $5.12   ← Same cost, 3× less storage!
#   embedding-3-large (3072 dims): $33.28
#   embedding-3-large (512 dims): $33.28  ← Same cost, 6× less storage!
Key insight: With text-embedding-3, you pay the same per-token price regardless of dimensions - dimension reduction via MRL is free. Use reduced dimensions to save storage and speed up search without paying more.
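The storage side of that trade-off is easy to quantify. The sketch below estimates the raw size of float32 vectors only; it assumes 4 bytes per value and ignores index structures and metadata, which add overhead that varies by vector database.

```python
def raw_vector_storage_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in decimal gigabytes (no index overhead)."""
    return n_vectors * dims * bytes_per_value / 1e9


for dims in (3072, 1536, 512, 256):
    size = raw_vector_storage_gb(1_000_000, dims)
    print(f"1M vectors at {dims:>4} dims: {size:.3f} GB")
```

Since the per-token price is unchanged, any reduction in dimensions shows up purely as lower storage and faster search.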
Voyage AI
Voyage AI was founded by former Stanford NLP researchers specifically to build high-quality embedding models. As of early 2025, Voyage's published evaluations show voyage-3 and voyage-3-lite outperforming OpenAI text-embedding-3-large on retrieval benchmarks.
Voyage AI model lineup
voyage-3:
- 1024 dimensions
- Context: 32,000 tokens (vs OpenAI's 8,191)
- Price: ~$0.06/million tokens
- MTEB Retrieval: Typically 1-3 points higher than OpenAI text-embedding-3-large
- Particularly strong on long document retrieval (benefiting from 32k context)
voyage-3-lite:
- 512 dimensions
- Context: 32,000 tokens
- Price: ~$0.02/million tokens
- Designed for cost-sensitive applications where quality-cost trade-off favors smaller embeddings
voyage-finance-2 and voyage-law-2:
- Domain-specific models for financial and legal text
- Significantly better than general models on domain-specific retrieval benchmarks
- Priced separately from voyage-3; check Voyage's current pricing
import voyageai

client = voyageai.Client()  # Reads VOYAGE_API_KEY from the environment

def embed_with_voyage(
    texts: list[str],
    input_type: str = "document",  # "document" or "query"
    model: str = "voyage-3",
) -> list[list[float]]:
    """
    Embed texts using Voyage AI.
    Voyage uses separate query/document encoding (asymmetric retrieval).
    """
    result = client.embed(
        texts=texts,
        model=model,
        input_type=input_type,  # Key: specify "query" for queries, "document" for passages
    )
    return result.embeddings

# Usage in a RAG pipeline
def embed_query(query: str) -> list[float]:
    return embed_with_voyage([query], input_type="query")[0]

def embed_documents(documents: list[str]) -> list[list[float]]:
    return embed_with_voyage(documents, input_type="document")

# Voyage's long context advantage
# A 10,000-token document:
#   OpenAI: must be split into at least 2 chunks (8,191-token limit)
#   Voyage: can be embedded as a single 10k-token document
# This can improve retrieval quality for long documents significantly
When Voyage beats OpenAI
Voyage AI consistently outperforms OpenAI on:
- Long document retrieval: 32k context vs 8,191 context makes a significant difference for documents longer than ~2,000 words
- Domain-specific retrieval: Voyage's finance and law models significantly outperform general models
- MTEB retrieval tasks: voyage-3 typically scores 1-3 points higher on the retrieval subset
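To see how context length translates into chunk counts, the following sketch computes the minimum number of chunks needed for a document of a given token length. It assumes fixed-size chunks with an optional token overlap; the function name and the overlap parameter are illustrative rather than part of any provider's API.

```python
import math


def chunks_needed(doc_tokens: int, context_limit: int, overlap: int = 0) -> int:
    """Minimum number of fixed-size chunks that cover a document."""
    if overlap >= context_limit:
        raise ValueError("overlap must be smaller than the context limit")
    if doc_tokens <= context_limit:
        return 1
    stride = context_limit - overlap  # New tokens covered by each extra chunk
    return 1 + math.ceil((doc_tokens - context_limit) / stride)


doc_tokens = 10_000
print("8,191-token limit:", chunks_needed(doc_tokens, 8_191), "chunks")
print("32,000-token limit:", chunks_needed(doc_tokens, 32_000), "chunk")
```

Fewer chunks per document means fewer vectors to store and less risk of splitting a relevant passage across a chunk boundary.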
Voyage is essentially equivalent or slightly worse on:
- Clustering tasks (OpenAI is competitive)
- Semantic textual similarity (STS) benchmarks
- Non-English text (OpenAI has broader multilingual training)
Cohere Embed v3
Cohere Embed v3 (released November 2023) is notable for its multimodal support, added in a later update: it can embed both text and images in the same vector space.
Key characteristics
embed-english-v3.0:
- 1024 dimensions
- Context: 512 tokens
- Price: $0.10/million tokens
- Strong on English retrieval
embed-multilingual-v3.0:
- 1024 dimensions
- Context: 512 tokens
- Price: $0.10/million tokens
- Supports 100+ languages
int8 and binary embedding support: Cohere Embed v3 natively supports int8 and binary quantization via API parameters - a significant differentiator. Binary embeddings from Cohere (with rescoring) lose only ~2% retrieval quality versus full float32, while reducing storage by 32×.
from collections.abc import Sequence

import cohere

co = cohere.Client()  # Reads CO_API_KEY from the environment

def embed_with_cohere(
    texts: list[str],
    input_type: str = "search_document",  # "search_document" or "search_query"
    embedding_types: Sequence[str] = ("float",),  # Can request "int8" or "binary"
    model: str = "embed-english-v3.0",
) -> dict:
    """
    Embed texts using Cohere Embed v3.
    Returns multiple quantization levels if requested.
    """
    response = co.embed(
        texts=texts,
        model=model,
        input_type=input_type,
        embedding_types=list(embedding_types),
    )
    # Attribute names follow the original lesson; some SDK versions expose
    # the float field as `float_`, so check your installed version.
    result = {}
    if "float" in embedding_types:
        result["float"] = response.embeddings.float
    if "int8" in embedding_types:
        result["int8"] = response.embeddings.int8
    if "binary" in embedding_types:
        result["binary"] = response.embeddings.binary
    return result

# Get binary + float embeddings in one API call for rescoring
embeddings = embed_with_cohere(
    texts=["The treatment for Type 2 diabetes..."],
    embedding_types=["float", "binary"],
)
# Use binary for fast first-pass search
# Use float for reranking the top-k results from binary search
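The two-stage pattern in those comments can be illustrated without any API calls. The sketch below is a generic pure-Python version, not Cohere's implementation: it derives sign bits locally from float vectors, shortlists by Hamming distance, then rescores the shortlist with a float dot product. All function names and the toy vectors are illustrative.

```python
def to_binary(vec: list[float]) -> int:
    """Pack the sign of each dimension into one integer (1 bit per dimension)."""
    bits = 0
    for value in vec:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed vectors."""
    return bin(a ^ b).count("1")


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, docs, shortlist=2, top_k=1) -> list[int]:
    """Shortlist by Hamming distance on sign bits, then rescore with float vectors."""
    q_bits = to_binary(query)
    doc_bits = [to_binary(d) for d in docs]  # Precomputed and stored in practice
    candidates = sorted(range(len(docs)), key=lambda i: hamming(q_bits, doc_bits[i]))
    candidates = candidates[:shortlist]
    reranked = sorted(candidates, key=lambda i: dot(query, docs[i]), reverse=True)
    return reranked[:top_k]


docs = [
    [0.9, 0.1, -0.2, 0.3],
    [-0.5, 0.4, 0.1, -0.7],
    [0.8, 0.2, -0.1, 0.1],
]
query = [1.0, 0.1, -0.3, 0.2]
print(two_stage_search(query, docs))
```

The first pass touches only one bit per dimension, so it is cheap; the float vectors are needed only for the small shortlist.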
Rate Limits and Batch Processing
All embedding APIs have rate limits. At scale, you need batching and retry logic.
import asyncio
from typing import Generator

from openai import AsyncOpenAI, RateLimitError

async_client = AsyncOpenAI()

def chunk_list(lst: list, chunk_size: int) -> Generator:
    """Split a list into chunks of specified size."""
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]

async def embed_batch_with_retry(
    texts: list[str],
    model: str = "text-embedding-3-small",
    max_batch_size: int = 2048,  # OpenAI max items per request
    max_tokens_per_batch: int = 300_000,  # Conservative token limit (noted for reference; not enforced here)
    max_retries: int = 5,
) -> list[list[float]]:
    """
    Embed a large list of texts with automatic batching and retry logic.
    """
    all_embeddings = []
    for batch in chunk_list(texts, max_batch_size):
        retries = 0
        while retries < max_retries:
            try:
                response = await async_client.embeddings.create(
                    model=model,
                    input=batch,
                )
                batch_embeddings = [item.embedding for item in response.data]
                all_embeddings.extend(batch_embeddings)
                break
            except RateLimitError:
                wait_time = 2 ** retries  # Exponential backoff: 1, 2, 4, 8, 16 seconds
                print(f"Rate limited. Waiting {wait_time}s before retry {retries + 1}")
                await asyncio.sleep(wait_time)
                retries += 1
            except Exception as e:
                # Other errors are retried after a short fixed delay
                print(f"Error: {e}")
                retries += 1
                await asyncio.sleep(2)
        if retries == max_retries:
            raise RuntimeError(f"Failed to embed batch after {max_retries} retries")
    return all_embeddings

async def embed_corpus_parallel(
    texts: list[str],
    model: str = "text-embedding-3-small",
    n_concurrent: int = 10,  # Number of concurrent API requests
) -> list[list[float]]:
    """
    Embed a large corpus with controlled parallelism.
    """
    batch_size = 100  # Smaller batches for parallelism
    batches = list(chunk_list(texts, batch_size))
    semaphore = asyncio.Semaphore(n_concurrent)

    async def embed_with_semaphore(batch):
        async with semaphore:
            return await embed_batch_with_retry(batch, model=model)

    tasks = [embed_with_semaphore(batch) for batch in batches]
    batch_results = await asyncio.gather(*tasks)  # Results keep the order of `batches`
    # Flatten results
    return [emb for batch in batch_results for emb in batch]

# Usage
texts = ["document 1...", "document 2..."]  # Potentially millions of texts
embeddings = asyncio.run(embed_corpus_parallel(texts))
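Item-count batching alone does not guard against per-request token limits. The following sketch groups texts by both an item cap and a token budget. It is a simplified illustration: `approx_tokens` is a crude stand-in (roughly 4 characters per token) that should be replaced with a real tokenizer, and the limits are the ones quoted in this lesson rather than values read from the API.

```python
from typing import Callable, Iterable, Iterator


def approx_tokens(text: str) -> int:
    """Crude token estimate (~4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def batch_by_budget(
    texts: Iterable[str],
    max_items: int = 2048,
    max_tokens: int = 300_000,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> Iterator[list[str]]:
    """Yield batches that respect both an item cap and a token budget."""
    batch: list[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        # Close the current batch if adding this text would exceed either limit
        if batch and (len(batch) >= max_items or used + n > max_tokens):
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += n
    if batch:
        yield batch


sample = ["a" * 40] * 5  # Five texts of ~10 estimated tokens each
print([len(b) for b in batch_by_budget(sample, max_tokens=25)])
```

Note that a single text larger than the budget still ends up in a batch of its own; such texts need to be chunked or truncated before embedding.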
Cost Analysis at Scale
Understanding when to switch from API to self-hosted requires accurate cost modeling:
def total_cost_analysis(
    monthly_tokens: int,
    n_months: int = 12,
    gpu_cost_per_hour: float = 2.5,  # A100 80GB on-demand price
    gpu_throughput_tokens_per_second: float = 100_000,  # BGE-large on A100 (assumed)
):
    """
    Compare total cost of API vs self-hosted embedding.
    """
    # API costs (OpenAI text-embedding-3-small)
    api_cost_per_token = 0.02 / 1_000_000
    api_monthly_cost = monthly_tokens * api_cost_per_token
    api_total = api_monthly_cost * n_months

    # Self-hosted costs
    # GPU hours needed per month (pay only for hours actually used)
    gpu_seconds_per_month = monthly_tokens / gpu_throughput_tokens_per_second
    gpu_hours_per_month = gpu_seconds_per_month / 3600
    gpu_monthly_cost = gpu_hours_per_month * gpu_cost_per_hour

    # Engineering cost: ~10 hrs/month to maintain infrastructure
    eng_hourly_cost = 150  # $150/hr loaded cost
    eng_monthly_hours = 10  # Ongoing maintenance
    eng_monthly_cost = eng_monthly_hours * eng_hourly_cost

    # One-time setup cost
    setup_cost = 40 * eng_hourly_cost  # 40 hours initial setup

    self_hosted_monthly = gpu_monthly_cost + eng_monthly_cost
    self_hosted_total = setup_cost + (self_hosted_monthly * n_months)

    print(f"Monthly token volume: {monthly_tokens:,}")
    print(f"Over {n_months} months:")
    print(f"\nAPI (text-embedding-3-small):")
    print(f"  Monthly cost: ${api_monthly_cost:.2f}")
    print(f"  Total ({n_months}mo): ${api_total:.2f}")
    print(f"\nSelf-hosted (BGE-large on A100):")
    print(f"  GPU hours/month: {gpu_hours_per_month:.1f}")
    print(f"  Monthly cost: ${self_hosted_monthly:.2f}")
    print(f"  Setup cost: ${setup_cost:.2f}")
    print(f"  Total ({n_months}mo): ${self_hosted_total:.2f}")
    print(f"\nCheaper option: "
          f"{'API' if api_total < self_hosted_total else 'Self-hosted'}")
    print(f"Difference: ${abs(api_total - self_hosted_total):.2f}")

# Low volume: 50M tokens/month
total_cost_analysis(50_000_000)
# API: $1/month, $12/year
# Self-hosted: ~$1,500/month (almost entirely engineer time) - API wins decisively

# High volume: 5B tokens/month
total_cost_analysis(5_000_000_000)
# API: $100/month, $1,200/year
# Self-hosted: ~$1,535/month (~$35 GPU + $1,500 engineer time) - API is still cheaper
# under these assumptions; self-hosted pulls ahead only at higher volume, with a
# pricier API model, or with less maintenance time
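Rule of thumb: API wins below ~500M tokens/month. Self-hosted can win above ~2B tokens/month, but only when fixed engineering overhead is small relative to the per-token savings; with the $1,500/month maintenance assumption above, text-embedding-3-small is still cheaper at 5B tokens/month. The crossover depends on GPU costs, engineering productivity, and latency requirements.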
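Because the crossover is so sensitive to assumptions, it helps to solve for it directly instead of relying on a fixed threshold. This sketch assumes a simple linear model: the API charges per token, while self-hosting costs a fixed monthly amount (such as maintenance time) plus GPU time billed per hour used. The function name is illustrative and one-time setup cost is ignored.

```python
def breakeven_tokens_per_month(
    api_price_per_million: float,
    gpu_cost_per_hour: float,
    gpu_tokens_per_second: float,
    fixed_monthly_cost: float,
) -> float:
    """Monthly token volume at which self-hosted and API costs are equal."""
    gpu_price_per_million = gpu_cost_per_hour / 3600 / gpu_tokens_per_second * 1_000_000
    margin = api_price_per_million - gpu_price_per_million  # USD saved per million tokens
    if margin <= 0:
        return float("inf")  # GPU time alone costs as much as the API; never breaks even
    return fixed_monthly_cost / margin * 1_000_000


# Same assumptions as the cost model in this lesson: $0.02/M tokens API price,
# $2.50/hr GPU, 100,000 tokens/sec, 10 hrs/month of maintenance at $150/hr.
volume = breakeven_tokens_per_month(0.02, 2.5, 100_000, fixed_monthly_cost=10 * 150)
print(f"Break-even volume: {volume / 1e9:.1f}B tokens/month")
```

With those inputs the break-even lands at roughly 115 billion tokens per month, which shows how strongly the maintenance-time assumption drives the result. Re-derive the threshold with your own API price, GPU throughput, and staffing numbers.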
Production Patterns
Caching embeddings
API calls are expensive. Cache embeddings for any text you'll embed more than once:
import hashlib
from pathlib import Path

import numpy as np

class EmbeddingCache:
    """
    Disk-backed cache for embeddings.
    Keyed by content hash to handle text updates correctly.
    """
    def __init__(self, cache_dir: str = ".embedding_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _key(self, text: str, model: str) -> str:
        # Include the model name so vectors from different models never collide
        content = f"{model}:{text}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, text: str, model: str) -> np.ndarray | None:
        key = self._key(text, model)
        cache_file = self.cache_dir / f"{key}.npy"
        if cache_file.exists():
            return np.load(cache_file)
        return None

    def set(self, text: str, model: str, embedding: np.ndarray) -> None:
        key = self._key(text, model)
        cache_file = self.cache_dir / f"{key}.npy"
        np.save(cache_file, embedding)

    def embed_with_cache(
        self,
        texts: list[str],
        embed_func,  # Callable that embeds uncached texts
        model: str,
    ) -> list[np.ndarray]:
        results = []
        uncached_texts = []
        uncached_indices = []
        for i, text in enumerate(texts):
            cached = self.get(text, model)
            if cached is not None:
                results.append((i, cached))
            else:
                uncached_texts.append(text)
                uncached_indices.append(i)
        if uncached_texts:
            new_embeddings = embed_func(uncached_texts)
            for idx, text, emb in zip(uncached_indices, uncached_texts, new_embeddings):
                emb_array = np.array(emb)
                self.set(text, model, emb_array)
                results.append((idx, emb_array))
        # Restore the original input order
        results.sort(key=lambda x: x[0])
        return [emb for _, emb in results]
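The same content-hash idea also removes duplicates within a single batch before any API call is made. The sketch below is a standalone illustration using only the standard library; the helper names are hypothetical and it mirrors the `model:text` key scheme used above.

```python
import hashlib


def cache_key(text: str, model: str) -> str:
    """Content hash that changes when either the text or the model changes."""
    return hashlib.sha256(f"{model}:{text}".encode("utf-8")).hexdigest()


def dedupe_for_embedding(texts: list[str], model: str) -> tuple[list[str], dict[str, str]]:
    """Return per-input keys plus the unique texts that actually need embedding."""
    keys: list[str] = []
    unique: dict[str, str] = {}
    for text in texts:
        key = cache_key(text, model)
        keys.append(key)
        unique.setdefault(key, text)  # Keep the first occurrence only
    return keys, unique


keys, unique = dedupe_for_embedding(
    ["refund policy", "shipping times", "refund policy"],
    model="text-embedding-3-small",
)
print(f"{len(keys)} inputs, {len(unique)} unique texts to embed")
```

After embedding the unique texts once, the `keys` list maps each original input back to its vector.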
Common Mistakes
:::danger Using ada-002 in 2024 and beyond
ada-002 is 5× more expensive than text-embedding-3-small for equivalent or worse quality. If you're still using it, migrate. The one-time cost of re-embedding your corpus is worth it at any meaningful scale.
:::
:::danger Not specifying input_type for Voyage and Cohere
Voyage AI and Cohere Embed use asymmetric encoding - queries and documents are encoded differently. With Voyage, pass input_type="query" for queries and input_type="document" for passages; with Cohere, the equivalent values are "search_query" and "search_document". Omitting or mismatching the input type can cost 10-20% of retrieval quality. Always specify the correct input type.
:::
:::warning Treating all API embeddings as interchangeable
Embeddings from different models (and even different versions of the same model) are not comparable. You cannot mix text-embedding-3-small embeddings with voyage-3 embeddings in the same index. If you switch models, you must re-embed your entire corpus. Plan model migration carefully and budget for re-embedding time and cost.
:::
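One way to catch accidental mixing early is to tag each index with the model that produced its vectors and reject anything else at write time. This is a minimal in-memory sketch; `TaggedIndex` is a hypothetical class for illustration, not part of any vector database client.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedIndex:
    """Toy in-memory index that only accepts vectors from one model."""
    model: str
    dims: int
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, vector: list[float], model: str) -> None:
        if model != self.model:
            raise ValueError(f"index holds {self.model} vectors, got {model}")
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dims, got {len(vector)}")
        self.vectors.append(vector)


index = TaggedIndex(model="text-embedding-3-small", dims=3)
index.add([0.1, 0.2, 0.3], model="text-embedding-3-small")
try:
    index.add([0.1, 0.2, 0.3], model="voyage-3")
except ValueError as err:
    print(f"Rejected: {err}")
```

A dimension check alone is not enough, since two different models can share the same dimensionality; storing the model name alongside the index is what prevents silent mixing.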
:::warning Not rate-limiting concurrent API requests
Sending too many concurrent embedding requests will trigger rate limiting, causing requests to fail or be delayed. Always implement a semaphore or request queue to control concurrency. For OpenAI, stay below your TPM (tokens per minute) and RPM (requests per minute) limits. Implement exponential backoff for rate limit errors.
:::
:::tip Request multiple quantization levels in a single Cohere API call
Cohere Embed v3 supports returning float32, int8, and binary embeddings in a single API call. Request all three at embedding time and store the float32 for high-quality reranking and the binary for fast first-pass search. You pay for one API call but get the flexibility to use different quantization levels for different stages of the retrieval pipeline.
:::
Interview Q&A
Q1: What is the difference between text-embedding-ada-002 and text-embedding-3?
Ada-002 is OpenAI's previous-generation embedding model (1536 dimensions, $0.10/million tokens). The text-embedding-3 series comes in two variants: text-embedding-3-small (1536 dims, $0.02/million tokens, better quality than ada-002) and text-embedding-3-large (3072 dims, $0.13/million tokens, state-of-the-art quality). The critical difference beyond quality: text-embedding-3 models are trained with Matryoshka Representation Learning, allowing dimension reduction via the dimensions API parameter with little quality loss. Ada-002 has no such capability. At equal or better quality than ada-002, text-embedding-3-small is 5× cheaper - the migration is almost always worth it.
Q2: When would you choose Voyage AI over OpenAI for embeddings?
Four situations favor Voyage: (1) Your documents are longer than ~2,000 words - Voyage has a 32k context window vs OpenAI's 8,191, enabling full-document embedding without chunking for most documents. (2) You need higher retrieval quality - voyage-3 typically scores 1-3 MTEB retrieval points higher than text-embedding-3-large. (3) You're in the financial or legal domain - Voyage's domain-specific models significantly outperform general models. (4) Cost is secondary to quality - voyage-3 at ~$0.06/million tokens is more expensive than text-embedding-3-small at $0.02/million tokens.
Q3: What batch processing patterns are important for embedding at scale?
Three patterns: (1) Batching - always send multiple texts per API request (up to 2048 for OpenAI) to maximize throughput and minimize per-call overhead. (2) Controlled concurrency - use an asyncio Semaphore to limit concurrent requests and avoid rate limiting. Start with 5-10 concurrent requests and increase based on your tier's rate limits. (3) Exponential backoff - implement automatic retry with exponential backoff on rate limit errors. Don't retry immediately; wait 2^n seconds where n is the retry count. This is the standard pattern for all rate-limited APIs.
Q4: When does self-hosted embedding become cost-effective vs API embedding?
The crossover depends heavily on your GPU costs and engineering resources; this lesson's rule of thumb puts it at roughly 500M–2B tokens per month. Below 500M tokens/month, API is almost always cheaper when you account for engineering time to maintain self-hosted infrastructure. Above 2B tokens/month, self-hosted can be cheaper if maintenance overhead stays low: at the throughput assumed in the cost model above (~100,000 tokens/second, about 360M tokens/hour on a single A100) and $2.50/hour, GPU time costs roughly $0.007/million tokens, about 3× cheaper per token than text-embedding-3-small at $0.02/million tokens. The complete cost model must include: GPU cost (hardware or cloud), engineering maintenance time (10-40 hours/month), initial setup cost (40-80 hours), and reliability infrastructure (monitoring, failover).
Summary
The API embedding landscape offers clear choices for different needs:
- text-embedding-3-small: Best price/quality for most production applications. $0.02/million tokens, 1536 dims, Matryoshka-trained for dimension reduction.
- text-embedding-3-large: When you need the best quality OpenAI offers. $0.13/million tokens, 3072 dims.
- voyage-3: Best retrieval quality among commercial APIs. 32k context window. Preferred for long documents and domain-specific use cases.
- Cohere Embed v3: Best for multilingual applications and when you want native int8/binary quantization support.
Self-hosted can become cost-effective somewhere above ~500M–2B tokens/month, depending on GPU pricing and maintenance overhead. Below that, the API is cheaper and operationally simpler. Always implement proper batching, rate limiting, and exponential backoff for production usage.
