OpenAI Embeddings and API-Based Embedding Services

Reading time: 20 min | Relevance: AI Engineer, ML Engineer

The $50,000 Embedding Bill

A startup runs a legal document retrieval system. They chose OpenAI text-embedding-ada-002 in early 2023, when it was the obvious choice. By 2024, they're embedding 500 billion tokens per month. At $0.10/million tokens for ada-002, that's $50,000/month - just for embeddings, before LLM API costs. Their engineers assumed "it'll be negligible at our scale." It was not.

This story is common. API-based embeddings start free for prototyping and become material costs at production scale. Understanding the cost structure - and when to switch to self-hosted models or to newer, cheaper API alternatives - is essential engineering knowledge.

The good news: the embedding API landscape has improved dramatically since 2023. OpenAI's text-embedding-3-small is 5× cheaper than ada-002 for equivalent or better quality. Voyage AI often achieves higher retrieval quality than OpenAI at comparable cost. And self-hosted models like BGE-large are competitive with API models while charging no per-token fee (you pay for hardware instead). This lesson walks through the options, the math, and the decision criteria.

OpenAI text-embedding-ada-002: The Old Standard

text-embedding-ada-002 was released in December 2022 and was the best widely-available embedding API for much of 2023. Key characteristics:

1536 dimensions
Price: $0.10/million tokens
Context window: 8,191 tokens
Performance: Strong on general English text; significantly weaker on specialized domains

Ada-002 was trained on large amounts of web text with a contrastive objective. It was a significant improvement over earlier OpenAI embeddings. But by 2024, it had been surpassed on quality by the text-embedding-3 series and cost 5× more per token than text-embedding-3-small.

Migration: If you're still using ada-002, you should evaluate text-embedding-3-small. It outperforms ada-002 on MTEB while costing $0.02/million tokens - 5× cheaper. The trade-off: you'll need to re-embed your corpus (the embedding spaces are incompatible). For most applications the quality improvement and cost reduction justify the one-time re-embedding cost.
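
To judge whether migration pays off, it helps to put the one-time re-embedding cost next to the monthly savings. The sketch below is a back-of-the-envelope calculation using the prices quoted in this lesson. The function name `reembedding_payback` and the example workload are hypothetical, chosen only for illustration.

```python
def reembedding_payback(
    corpus_tokens: int,
    monthly_new_tokens: int,
    old_price_per_million: float = 0.10,  # ada-002 price quoted in this lesson
    new_price_per_million: float = 0.02,  # text-embedding-3-small price quoted in this lesson
) -> dict[str, float]:
    """Estimate the one-time re-embedding cost and the ongoing monthly savings."""
    # One-time cost: the whole existing corpus is embedded again with the new model.
    one_time_cost = corpus_tokens / 1_000_000 * new_price_per_million
    # Ongoing savings: every newly embedded token is billed at the lower price.
    monthly_savings = monthly_new_tokens / 1_000_000 * (
        old_price_per_million - new_price_per_million
    )
    # Months until the one-time cost is recovered (infinite if nothing is saved).
    payback_months = (
        one_time_cost / monthly_savings if monthly_savings > 0 else float("inf")
    )
    return {
        "one_time_cost": one_time_cost,
        "monthly_savings": monthly_savings,
        "payback_months": payback_months,
    }


# Hypothetical workload: 2B-token existing corpus, 500M new tokens per month.
result = reembedding_payback(2_000_000_000, 500_000_000)
print(result)
```

Under these assumed volumes the re-embedding cost is recovered in about one month. A corpus that is large relative to its monthly growth takes proportionally longer.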

text-embedding-3-small and text-embedding-3-large

OpenAI released the text-embedding-3 series in January 2024. Both models use Matryoshka Representation Learning (MRL) - covered in depth in Lesson 05 - which means you can truncate the embedding dimensions without significant quality loss.

text-embedding-3-small

Default dimensions: 1536
Price: $0.02/million tokens (5× cheaper than ada-002)
Minimum dimensions: 256 (via MRL truncation)
Context window: 8,191 tokens
Performance: Better than ada-002 on most tasks; appropriate for most production RAG applications

text-embedding-3-large

Default dimensions: 3072
Price: $0.13/million tokens
Minimum dimensions: 256 (via MRL truncation)
Context window: 8,191 tokens
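Performance: State-of-the-art for OpenAI; about 3.6 points better than ada-002 on MTEB (64.6 vs 61.0, per OpenAI's launch announcement), with a much larger gap on multilingual retrieval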

The dimensions parameter

The OpenAI API accepts a dimensions parameter for text-embedding-3 models, letting you shorten the returned embeddings with little quality loss:

from openai import OpenAI

client = OpenAI()

def embed_with_reduced_dimensions(
    texts: list[str],
    model: str = "text-embedding-3-small",
    dimensions: int = 512,  # Reduce from 1536 to 512
) -> list[list[float]]:
    """
    Embed texts using OpenAI API with optional dimension reduction.
    MRL training allows dimension reduction without significant quality loss.
    """
    response = client.embeddings.create(
        model=model,
        input=texts,
        dimensions=dimensions,  # Only works for text-embedding-3 models
    )
    return [item.embedding for item in response.data]


# Cost comparison: 1M docs at 256 avg tokens per doc

def cost_analysis():
    token_count = 1_000_000 * 256  # 256M tokens

    costs = {
        "ada-002 (1536 dims)": token_count / 1_000_000 * 0.10,
        "embedding-3-small (1536 dims)": token_count / 1_000_000 * 0.02,
        "embedding-3-small (512 dims)": token_count / 1_000_000 * 0.02,  # Same price, fewer dims
        "embedding-3-large (3072 dims)": token_count / 1_000_000 * 0.13,
        "embedding-3-large (512 dims)": token_count / 1_000_000 * 0.13,  # Same price, fewer dims
    }

    print("Cost for 1M documents at 256 tokens each:")
    for model, cost in costs.items():
        print(f"  {model}: ${cost:.2f}")

# Cost for 1M documents at 256 tokens each:
#   ada-002 (1536 dims): $25.60
#   embedding-3-small (1536 dims): $5.12
#   embedding-3-small (512 dims): $5.12  ← Same cost, 3× less storage!
#   embedding-3-large (3072 dims): $33.28
#   embedding-3-large (512 dims): $33.28 ← Same cost, 6× less storage!

Key insight: With text-embedding-3, you pay the same per-token price regardless of dimensions - dimension reduction via MRL is free. Use reduced dimensions to save storage and speed up search without paying more.
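
If you already store full-length text-embedding-3 vectors, a commonly described alternative to calling the API again is to keep the first k components and re-normalize the result to unit length. The sketch below shows that step, plus the storage arithmetic behind the "3× less storage" figure. The helper names are hypothetical, the toy vector stands in for a real embedding, and float32 storage (4 bytes per value) is an assumption.

```python
import math


def truncate_and_normalize(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` components and rescale to unit L2 norm."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    head = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # An all-zero prefix cannot be normalized; return it unchanged.
    return [x / norm for x in head]


def storage_gb(n_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB, assuming float32 values by default."""
    return n_vectors * dimensions * bytes_per_value / 1e9


toy = [3.0, 4.0, 12.0, 0.5]            # Stand-in for a real embedding
short = truncate_and_normalize(toy, 2)  # Unit-length 2-dimensional prefix
print(short)
print(storage_gb(1_000_000, 1536), storage_gb(1_000_000, 512))
```

Re-normalizing matters because most vector indexes assume unit-length vectors when they use dot product as a stand-in for cosine similarity.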

Voyage AI

Voyage AI was founded by former Stanford NLP researchers specifically to build high-quality embedding models. As of early 2025, voyage-3 and voyage-3-lite consistently outperform OpenAI text-embedding-3-large on MTEB retrieval benchmarks.

Voyage AI model lineup

voyage-3:

1024 dimensions
Context: 32,000 tokens (vs OpenAI's 8,191)
Price: ~$0.06/million tokens
MTEB Retrieval: Typically 1-3 points higher than OpenAI text-embedding-3-large
Particularly strong on long document retrieval (benefiting from 32k context)

voyage-3-lite:

512 dimensions
Context: 32,000 tokens
Price: ~$0.02/million tokens
Designed for cost-sensitive applications where quality-cost trade-off favors smaller embeddings

voyage-finance-2 and voyage-law-2:

Domain-specific models for financial and legal text
Significantly better than general models on domain-specific retrieval benchmarks
Pricing may differ from voyage-3; check Voyage's current pricing page

import voyageai

client = voyageai.Client()

def embed_with_voyage(
    texts: list[str],
    input_type: str = "document",  # "document" or "query"
    model: str = "voyage-3",
) -> list[list[float]]:
    """
    Embed texts using Voyage AI.
    Voyage uses separate query/document encoding (asymmetric retrieval).
    """
    result = client.embed(
        texts=texts,
        model=model,
        input_type=input_type,  # Key: specify "query" for queries, "document" for passages
    )
    return result.embeddings


# Usage in a RAG pipeline
def embed_query(query: str) -> list[float]:
    return embed_with_voyage([query], input_type="query")[0]

def embed_documents(documents: list[str]) -> list[list[float]]:
    return embed_with_voyage(documents, input_type="document")


# Voyage's long context advantage
# A 10,000-token document:
# OpenAI: must be split into at least 2 chunks (8191 token limit)
# Voyage: can be embedded as a single 10k-token document
# This improves retrieval quality for long documents significantly
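
The chunk arithmetic in the comment above can be made explicit. The sketch below counts how many windows a document needs for a given context limit and splits a list of token IDs accordingly. The function names and the optional overlap parameter are hypothetical, and the token IDs are placeholders rather than output from a real tokenizer.

```python
import math


def chunks_needed(n_tokens: int, context_limit: int, overlap: int = 0) -> int:
    """Number of windows required to cover n_tokens with a sliding window."""
    if not 0 <= overlap < context_limit:
        raise ValueError("overlap must be >= 0 and smaller than context_limit")
    if n_tokens <= 0:
        return 0
    if n_tokens <= context_limit:
        return 1
    stride = context_limit - overlap
    return 1 + math.ceil((n_tokens - context_limit) / stride)


def split_token_ids(
    token_ids: list[int], context_limit: int, overlap: int = 0
) -> list[list[int]]:
    """Split token IDs into windows of at most context_limit tokens."""
    if not 0 <= overlap < context_limit:
        raise ValueError("overlap must be >= 0 and smaller than context_limit")
    stride = context_limit - overlap
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + context_limit])
        if start + context_limit >= len(token_ids):
            break  # The window just emitted already reaches the end.
        start += stride
    return chunks


doc = list(range(10_000))  # Placeholder token IDs for a 10,000-token document
print(chunks_needed(len(doc), 8_191))    # 8,191-token window
print(chunks_needed(len(doc), 32_000))   # 32,000-token window
print([len(c) for c in split_token_ids(doc, 8_191)])
```

With an 8,191-token window the document needs two chunks, and the second is much shorter than the first. That uneven split is one reason a single 32k-token embedding can represent the document more faithfully.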

When Voyage beats OpenAI

Voyage AI consistently outperforms OpenAI on:

Long document retrieval: 32k context vs 8,191 context makes a significant difference for documents that exceed OpenAI's window (roughly 6,000 English words or more)
Domain-specific retrieval: Voyage's finance and law models significantly outperform general models
MTEB retrieval tasks: voyage-3 typically scores 1-3 points higher on the retrieval subset

Voyage is essentially equivalent or slightly worse on:

Clustering tasks (OpenAI is competitive)
Semantic textual similarity (STS) benchmarks
Non-English text (OpenAI has broader multilingual training)

Cohere Embed v3

Cohere Embed v3 was released in November 2023 as a text embedding model. A later update added multimodal support, so the same model family can embed both text and images in one vector space.

Key characteristics

embed-english-v3.0:

1024 dimensions
Context: 512 tokens
Price: $0.10/million tokens
Strong on English retrieval

embed-multilingual-v3.0:

1024 dimensions
Context: 512 tokens
Price: $0.10/million tokens
Supports 100+ languages

int8 and binary embedding support: Cohere Embed v3 natively supports int8 and binary quantization via API parameters - a significant differentiator. Binary embeddings from Cohere (with rescoring) lose only ~2% retrieval quality versus full float32, while reducing storage by 32×.

import cohere

co = cohere.Client()

def embed_with_cohere(
    texts: list[str],
    input_type: str = "search_document",  # "search_document" or "search_query"
    embedding_types: list[str] = ["float"],  # Can request "int8" or "binary"
    model: str = "embed-english-v3.0",
) -> dict:
    """
    Embed texts using Cohere Embed v3.
    Returns multiple quantization levels if requested.
    """
    response = co.embed(
        texts=texts,
        model=model,
        input_type=input_type,
        embedding_types=embedding_types,
    )

    result = {}
    if "float" in embedding_types:
        result["float"] = response.embeddings.float
    if "int8" in embedding_types:
        result["int8"] = response.embeddings.int8
    if "binary" in embedding_types:
        result["binary"] = response.embeddings.binary

    return result


# Get binary + float embeddings in one API call for rescoring
embeddings = embed_with_cohere(
    texts=["The treatment for Type 2 diabetes..."],
    embedding_types=["float", "binary"],
)
# Use binary for fast first-pass search
# Use float for reranking the top-k results from binary search
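
The two-stage pattern in the comments above (binary first pass, float rescoring) can be sketched without any API call. The example below binarizes toy float vectors by sign, ranks candidates by Hamming distance, then reranks the shortlist by float dot product. Sign thresholding is an illustrative assumption here, not a statement of how Cohere produces its binary embeddings. The vectors and function names are made up for the example.

```python
def to_binary(vec: list[float]) -> int:
    """Pack the sign of each component into one integer (1 bit per dimension)."""
    bits = 0
    for value in vec:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed binary vectors."""
    return bin(a ^ b).count("1")


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search_with_rescoring(
    query: list[float],
    docs: list[list[float]],
    first_pass_k: int,
    final_k: int,
) -> list[int]:
    """Return document indices: binary shortlist first, float rerank second."""
    query_bits = to_binary(query)
    doc_bits = [to_binary(d) for d in docs]
    # Stage 1: cheap shortlist by Hamming distance on 1-bit-per-dimension codes.
    shortlist = sorted(
        range(len(docs)), key=lambda i: hamming(query_bits, doc_bits[i])
    )[:first_pass_k]
    # Stage 2: precise rerank of the shortlist using the float vectors.
    reranked = sorted(shortlist, key=lambda i: dot(query, docs[i]), reverse=True)
    return reranked[:final_k]


query = [0.9, 0.1, -0.3, 0.2]
docs = [
    [0.8, 0.2, -0.1, 0.1],
    [0.1, 0.9, -0.2, 0.3],
    [-0.7, -0.2, 0.4, -0.1],
    [0.5, -0.4, -0.6, 0.2],
]
print(search_with_rescoring(query, docs, first_pass_k=3, final_k=2))  # -> [0, 3]
```

Going from 32 bits per float32 value to 1 bit per dimension is where the 32× storage figure comes from. The float vectors are only needed for the small shortlist, so they can live in slower storage.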

Rate Limits and Batch Processing

All embedding APIs have rate limits. At scale, you need batching and retry logic.

import asyncio
from typing import Generator
from openai import AsyncOpenAI, RateLimitError

async_client = AsyncOpenAI()

def chunk_list(lst: list, chunk_size: int) -> Generator:
    """Split a list into chunks of specified size."""
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]

async def embed_batch_with_retry(
    texts: list[str],
    model: str = "text-embedding-3-small",
    max_batch_size: int = 2048,  # OpenAI max items per request
    max_tokens_per_batch: int = 300_000,  # Per-request token cap; not enforced in this sketch
    max_retries: int = 5,
) -> list[list[float]]:
    """
    Embed a large list of texts with automatic batching and retry logic.
    """
    all_embeddings = []

    for batch in chunk_list(texts, max_batch_size):
        retries = 0
        while retries < max_retries:
            try:
                response = await async_client.embeddings.create(
                    model=model,
                    input=batch,
                )
                batch_embeddings = [item.embedding for item in response.data]
                all_embeddings.extend(batch_embeddings)
                break

            except RateLimitError:
                wait_time = 2 ** retries  # Exponential backoff: 1, 2, 4, 8, 16 seconds
                print(f"Rate limited. Waiting {wait_time}s before retry {retries + 1}")
                await asyncio.sleep(wait_time)
                retries += 1
            except Exception as e:
                print(f"Error: {e}")
                retries += 1
                await asyncio.sleep(2)

        if retries == max_retries:
            raise RuntimeError(f"Failed to embed batch after {max_retries} retries")

    return all_embeddings


async def embed_corpus_parallel(
    texts: list[str],
    model: str = "text-embedding-3-small",
    n_concurrent: int = 10,  # Number of concurrent API requests
) -> list[list[float]]:
    """
    Embed a large corpus with controlled parallelism.
    """
    batch_size = 100  # Smaller batches for parallelism
    batches = list(chunk_list(texts, batch_size))

    semaphore = asyncio.Semaphore(n_concurrent)

    async def embed_with_semaphore(batch):
        async with semaphore:
            return await embed_batch_with_retry(batch, model=model)

    tasks = [embed_with_semaphore(batch) for batch in batches]
    batch_results = await asyncio.gather(*tasks)

    # Flatten results
    return [emb for batch in batch_results for emb in batch]


# Usage
texts = ["document 1...", "document 2..."]  # In practice, potentially millions of texts
embeddings = asyncio.run(embed_corpus_parallel(texts))
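
The batching code above splits by item count only, so a batch of very long texts could still exceed a per-request token cap. A minimal sketch of token-aware batching follows. The `estimate_tokens` heuristic (about 4 characters per token) is a rough assumption for English text, and both function names are hypothetical. For exact counts you would swap in a real tokenizer such as tiktoken.

```python
from typing import Callable, Iterable, Iterator


def estimate_tokens(text: str) -> int:
    """Rough placeholder: assume ~4 characters per token (not exact)."""
    return max(1, len(text) // 4)


def batch_by_token_budget(
    texts: Iterable[str],
    max_items: int = 2048,
    max_tokens: int = 300_000,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> Iterator[list[str]]:
    """Yield batches that respect both an item limit and a token budget."""
    batch: list[str] = []
    batch_tokens = 0
    for text in texts:
        n = count_tokens(text)
        if n > max_tokens:
            raise ValueError("a single text exceeds the per-request token budget")
        # Close the current batch if adding this text would break either limit.
        if batch and (len(batch) >= max_items or batch_tokens + n > max_tokens):
            yield batch
            batch, batch_tokens = [], 0
        batch.append(text)
        batch_tokens += n
    if batch:
        yield batch


# Five texts of ~10 estimated tokens each, with a deliberately tiny budget.
sample = ["a" * 40] * 5
print([len(b) for b in batch_by_token_budget(sample, max_items=10, max_tokens=25)])
```

Each yielded batch can then be passed to a function like `embed_batch_with_retry`. Order is preserved, so the returned embeddings still line up with the input texts.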

Cost Analysis at Scale

Understanding when to switch from API to self-hosted requires accurate cost modeling:

def total_cost_analysis(
    monthly_tokens: int,
    n_months: int = 12,
    gpu_cost_per_hour: float = 2.5,  # A100 80GB on-demand price
    gpu_throughput_tokens_per_second: float = 100_000,  # BGE-large on A100
):
    """
    Compare total cost of API vs self-hosted embedding.
    """
    # API costs (OpenAI text-embedding-3-small)
    api_cost_per_token = 0.02 / 1_000_000
    api_monthly_cost = monthly_tokens * api_cost_per_token
    api_total = api_monthly_cost * n_months

    # Self-hosted costs
    # GPU hours needed per month
    gpu_seconds_per_month = monthly_tokens / gpu_throughput_tokens_per_second
    gpu_hours_per_month = gpu_seconds_per_month / 3600
    gpu_monthly_cost = gpu_hours_per_month * gpu_cost_per_hour
    # Engineering cost: ongoing maintenance of the serving infrastructure
    eng_hourly_cost = 150  # $150/hr loaded cost
    eng_monthly_hours = 10  # Ongoing maintenance (low end of the 10-40 hrs/month range)
    eng_monthly_cost = eng_monthly_hours * eng_hourly_cost
    # One-time setup cost
    setup_cost = 40 * eng_hourly_cost  # 40 hours initial setup

    self_hosted_monthly = gpu_monthly_cost + eng_monthly_cost
    self_hosted_total = setup_cost + (self_hosted_monthly * n_months)

    print(f"Monthly token volume: {monthly_tokens:,}")
    print(f"Over {n_months} months:")
    print(f"\nAPI (text-embedding-3-small):")
    print(f"  Monthly cost: ${api_monthly_cost:.2f}")
    print(f"  Total ({n_months}mo): ${api_total:.2f}")
    print(f"\nSelf-hosted (BGE-large on A100):")
    print(f"  GPU hours/month: {gpu_hours_per_month:.1f}")
    print(f"  Monthly cost: ${self_hosted_monthly:.2f}")
    print(f"  Setup cost: ${setup_cost:.2f}")
    print(f"  Total ({n_months}mo): ${self_hosted_total:.2f}")
    print(f"\nCheaper over {n_months} months: "
          f"{'API wins' if api_total < self_hosted_total else 'Self-hosted wins'}")
    print(f"Savings: ${abs(api_total - self_hosted_total):.2f}")

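# Low volume: 50M tokens/month
total_cost_analysis(50_000_000)
# API: $1/month, $12/year
# Self-hosted: ~$1,500/month (almost all engineer time) plus $6,000 setup - API wins decisively

# High volume: 5B tokens/month
total_cost_analysis(5_000_000_000)
# API: $100/month, $1,200/year
# Self-hosted: ~$1,535/month (13.9 GPU hours + engineer time) plus $6,000 setup - API still wins on cost

Rule of thumb: fixed engineering cost, not GPU time, dominates the comparison. With the defaults above ($1,500/month of maintenance and about $0.007 of GPU time per million tokens), self-hosting only undercuts text-embedding-3-small above roughly 115B tokens/month, and undercuts text-embedding-3-large ($0.13/million tokens) above roughly 12B tokens/month. The crossover depends on GPU costs, engineering productivity, and latency requirements.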
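
The crossover volume can be solved for directly instead of being read off two sample runs. The sketch below reuses the default assumptions of `total_cost_analysis` (a $2.50/hour GPU, 100,000 tokens/second, 10 maintenance hours at $150/hour) and ignores the one-time setup cost. The function name is hypothetical, and the result is only as good as those assumptions.

```python
def breakeven_tokens_per_month(
    api_price_per_million: float,
    gpu_cost_per_hour: float = 2.5,
    gpu_throughput_tokens_per_second: float = 100_000,
    fixed_monthly_cost: float = 10 * 150,  # Maintenance hours x hourly rate
) -> float:
    """Monthly token volume at which self-hosted and API costs are equal."""
    tokens_per_gpu_hour = gpu_throughput_tokens_per_second * 3600
    gpu_price_per_million = gpu_cost_per_hour / tokens_per_gpu_hour * 1_000_000
    # Amount saved per million tokens by moving off the API.
    margin_per_million = api_price_per_million - gpu_price_per_million
    if margin_per_million <= 0:
        return float("inf")  # The GPU is not cheaper per token; no crossover.
    return fixed_monthly_cost / margin_per_million * 1_000_000


small = breakeven_tokens_per_month(0.02)  # vs text-embedding-3-small
large = breakeven_tokens_per_month(0.13)  # vs text-embedding-3-large
print(f"vs 3-small: {small / 1e9:.0f}B tokens/month")
print(f"vs 3-large: {large / 1e9:.0f}B tokens/month")
```

The result is most sensitive to `fixed_monthly_cost`: halving the maintenance estimate halves the breakeven volume, while GPU price changes move it far less.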

Production Patterns

Caching embeddings

API calls are expensive. Cache embeddings for any text you'll embed more than once:

import hashlib
from pathlib import Path
import numpy as np

class EmbeddingCache:
    """
    Disk-backed cache for embeddings.
    Keyed by content hash to handle text updates correctly.
    """

    def __init__(self, cache_dir: str = ".embedding_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def _key(self, text: str, model: str) -> str:
        content = f"{model}:{text}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, text: str, model: str) -> np.ndarray | None:
        key = self._key(text, model)
        cache_file = self.cache_dir / f"{key}.npy"
        if cache_file.exists():
            return np.load(cache_file)
        return None

    def set(self, text: str, model: str, embedding: np.ndarray) -> None:
        key = self._key(text, model)
        cache_file = self.cache_dir / f"{key}.npy"
        np.save(cache_file, embedding)

    def embed_with_cache(
        self,
        texts: list[str],
        embed_func,  # Callable that embeds uncached texts
        model: str,
    ) -> list[np.ndarray]:
        results = []
        uncached_texts = []
        uncached_indices = []

        for i, text in enumerate(texts):
            cached = self.get(text, model)
            if cached is not None:
                results.append((i, cached))
            else:
                uncached_texts.append(text)
                uncached_indices.append(i)

        if uncached_texts:
            new_embeddings = embed_func(uncached_texts)
            for idx, text, emb in zip(uncached_indices, uncached_texts, new_embeddings):
                emb_array = np.array(emb)
                self.set(text, model, emb_array)
                results.append((idx, emb_array))

        results.sort(key=lambda x: x[0])
        return [emb for _, emb in results]
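
One detail worth checking when combining this cache with the dimensions parameter: a key built from model and text alone would return a 1536-dimension vector to a caller that asked for 512. The sketch below is a hypothetical in-memory variant (not the class above) that includes the requested dimensions in the key and embeds each distinct uncached text only once per call. `fake_embed` is a stand-in for a real embedding function.

```python
import hashlib
from typing import Callable, Optional


def cache_key(text: str, model: str, dimensions: Optional[int] = None) -> str:
    """Hash of model, requested dimensions, and text."""
    payload = f"{model}:{dimensions}:{text}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InMemoryEmbeddingCache:
    """Minimal dict-backed cache, for illustration only."""

    def __init__(self) -> None:
        self._store: dict[str, list[float]] = {}

    def get_or_embed(
        self,
        texts: list[str],
        embed_func: Callable[[list[str]], list[list[float]]],
        model: str,
        dimensions: Optional[int] = None,
    ) -> list[list[float]]:
        keys = [cache_key(t, model, dimensions) for t in texts]
        # Collect each uncached text once, even if it appears several times.
        pending: dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self._store:
                pending.setdefault(key, text)
        if pending:
            vectors = embed_func(list(pending.values()))
            for key, vector in zip(pending, vectors):
                self._store[key] = vector
        return [self._store[key] for key in keys]


calls: list[list[str]] = []


def fake_embed(batch: list[str]) -> list[list[float]]:
    """Stand-in embedder: records the batch and returns 1-dimensional vectors."""
    calls.append(list(batch))
    return [[float(len(t))] for t in batch]


cache = InMemoryEmbeddingCache()
print(cache.get_or_embed(["a", "bb", "a"], fake_embed, model="demo-model"))
print(calls)  # The duplicate "a" was embedded only once
```

The same key change applies to the disk-backed version: adding the dimensions (and any other parameter that changes the output) to the hashed string is enough.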

Common Mistakes

:::danger Using ada-002 in 2024 and beyond

ada-002 is 5× more expensive than text-embedding-3-small for equivalent or worse quality. If you're still using it, migrate. The one-time cost of re-embedding your corpus is worth it at any meaningful scale.

:::

:::danger Not specifying input_type for Voyage and Cohere

Voyage AI and Cohere Embed use asymmetric encoding - queries and documents are encoded differently. Failing to specify the query type for queries (input_type="query" for Voyage, input_type="search_query" for Cohere) and the document type for passages (input_type="document" for Voyage, input_type="search_document" for Cohere) loses 10-20% retrieval quality. Always specify the correct input type.

:::

:::warning Treating all API embeddings as interchangeable

Embeddings from different models (and even different versions of the same model) are not comparable. You cannot mix text-embedding-3-small embeddings with voyage-3 embeddings in the same index. If you switch models, you must re-embed your entire corpus. Plan model migration carefully and budget for re-embedding time and cost.

:::
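
One way to make this mistake hard to commit is to store the model name next to every vector and refuse to compare across models. The sketch below is a hypothetical illustration of that guard; the `StoredEmbedding` type and the toy vectors are not part of any library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredEmbedding:
    """A vector tagged with the model that produced it."""
    vector: tuple[float, ...]
    model: str

    @property
    def dimensions(self) -> int:
        return len(self.vector)


def cosine_similarity(a: StoredEmbedding, b: StoredEmbedding) -> float:
    """Cosine similarity that refuses to compare different embedding spaces."""
    if a.model != b.model or a.dimensions != b.dimensions:
        raise ValueError(
            f"incompatible embeddings: {a.model}/{a.dimensions} "
            f"vs {b.model}/{b.dimensions}"
        )
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm = math.sqrt(sum(x * x for x in a.vector)) * math.sqrt(
        sum(y * y for y in b.vector)
    )
    return dot / norm if norm else 0.0


doc = StoredEmbedding((1.0, 0.0), model="text-embedding-3-small")
query = StoredEmbedding((1.0, 0.0), model="text-embedding-3-small")
other = StoredEmbedding((1.0, 0.0), model="voyage-3")

print(cosine_similarity(doc, query))
try:
    cosine_similarity(doc, other)
except ValueError as err:
    print(f"rejected: {err}")
```

In a vector database the same idea usually takes the form of one index or collection per model, with the model name recorded in the index metadata.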

:::warning Not rate-limiting concurrent API requests

Sending too many concurrent embedding requests will trigger rate limiting, causing requests to fail or be delayed. Always implement a semaphore or request queue to control concurrency. For OpenAI, stay below your TPM (tokens per minute) and RPM (requests per minute) limits. Implement exponential backoff for rate limit errors.

:::
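
Two small helpers make the advice above concrete: a backoff delay with random jitter, so that concurrent workers do not all retry at the same instant, and a pacing interval derived from TPM and RPM limits. Both are sketches. The function names are hypothetical, and the limits in the usage lines are made-up numbers rather than any provider's actual quota.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def min_seconds_between_requests(
    tokens_per_request: int, tpm_limit: int, rpm_limit: int
) -> float:
    """Smallest average gap between requests that respects both limits."""
    by_tokens = tokens_per_request / tpm_limit * 60.0
    by_requests = 60.0 / rpm_limit
    return max(by_tokens, by_requests)


rng = random.Random(0)  # Seeded only to make the demo repeatable
print([round(backoff_delay(n, rng=rng), 2) for n in range(5)])

# Made-up limits: 1M tokens/minute, 3,000 requests/minute, 50k tokens per request.
print(min_seconds_between_requests(50_000, 1_000_000, 3_000))
```

With large batches the token limit is usually the binding one, so the sustainable request rate is well below what the RPM limit alone would suggest.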

:::tip Request multiple quantization levels in a single Cohere API call

Cohere Embed v3 supports returning float32, int8, and binary embeddings in a single API call. Request all three at embedding time and store the float32 for high-quality reranking and the binary for fast first-pass search. You pay for one API call but get the flexibility to use different quantization levels for different stages of the retrieval pipeline.

:::

Interview Q&A

Q1: What is the difference between text-embedding-ada-002 and text-embedding-3?

Ada-002 is OpenAI's previous generation embedding model (1536 dimensions, $0.10/million tokens). text-embedding-3 is the current generation with two variants: text-embedding-3-small (1536 dims, $0.02/million tokens, better quality than ada-002) and text-embedding-3-large (3072 dims, $0.13/million tokens, state-of-the-art quality). The critical difference beyond quality: text-embedding-3 models are trained with Matryoshka Representation Learning, allowing dimension reduction via the dimensions API parameter with little quality loss. Ada-002 has no such capability. At quality equal to or better than ada-002, text-embedding-3-small is 5× cheaper - the migration is almost always worth it.

Q2: When would you choose Voyage AI over OpenAI for embeddings?

Four situations favor Voyage: (1) Your documents exceed OpenAI's 8,191-token window (roughly 6,000 English words) - Voyage's 32k context window enables full-document embedding without chunking for most documents. (2) You need higher retrieval quality - voyage-3 consistently scores 1-3 MTEB retrieval points higher than text-embedding-3-large. (3) You're in the financial or legal domain - Voyage's domain-specific models significantly outperform general models. (4) Cost is secondary to quality - voyage-3 at $0.06/million tokens offers better quality than text-embedding-3-small at $0.02/million tokens.

Q3: What batch processing patterns are important for embedding at scale?

Three patterns: (1) Batching - always send multiple texts per API request (up to 2048 for OpenAI) to maximize throughput and minimize per-call overhead. (2) Controlled concurrency - use an asyncio Semaphore to limit concurrent requests and avoid rate limiting. Start with 5-10 concurrent requests and increase based on your tier's rate limits. (3) Exponential backoff - implement automatic retry with exponential backoff on rate limit errors. Don't retry immediately; wait 2^n seconds where n is the retry count. This is the standard pattern for all rate-limited APIs.

Q4: When does self-hosted embedding become cost-effective vs API embedding?

The crossover depends mostly on fixed engineering cost and on which API model you are replacing. On raw compute, self-hosting is cheap: an A100 sustaining ~100,000 tokens/second embeds ~360M tokens/hour, and at $2.50/hour that's about $0.007/million tokens - roughly 3× cheaper than text-embedding-3-small and roughly 19× cheaper than text-embedding-3-large. But maintenance is a fixed cost: at 10 hours/month and $150/hour ($1,500/month), self-hosting beats text-embedding-3-small only above roughly 115B tokens/month, and text-embedding-3-large above roughly 12B tokens/month. Below those volumes, API is almost always cheaper once you account for engineering time. The complete cost model must include: GPU cost (hardware or cloud), engineering maintenance time (10-40 hours/month), initial setup cost (40-80 hours), and reliability infrastructure (monitoring, failover).

Summary

The API embedding landscape offers clear choices for different needs:

text-embedding-3-small: Best price/quality for most production applications. $0.02/million tokens, 1536 dims, Matryoshka-trained for dimension reduction.
text-embedding-3-large: When you need the best quality OpenAI offers. $0.13/million tokens, 3072 dims.
voyage-3: Best retrieval quality among commercial APIs. 32k context window. Preferred for long documents and domain-specific use cases.
Cohere Embed v3: Best for multilingual applications and when you want native int8/binary quantization support.

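Self-hosted becomes cost-effective only at very high volume: under the cost model in this lesson, tens of billions of tokens per month or more, depending on the API model being replaced and your engineering overhead. Below that, API is faster to adopt, cheaper, and operationally simpler. Always implement proper batching, rate limiting, and exponential backoff for production usage.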
