What is build vs buy ML?

A rigorous financial and risk framework for deciding when to build ML infrastructure in-house vs use managed services - applied to feature stores, vector DBs, LLMs, and more.


Build vs Buy Analysis

The $2M Llama Decision

The startup had 8 engineers and was spending $45,000/month on the OpenAI API. At that rate, they'd burn $540,000 on API calls next year - more than the fully-loaded cost of two senior engineers. The CEO's question was simple: "Could we just run Llama ourselves?"

The answer, like all real engineering answers, was: "It depends." And the decision framework to answer "it depends" properly turned out to be more important than the answer itself.

The engineering team spent two weeks doing a rigorous build vs buy analysis. They modeled compute costs for self-hosted Llama 3 70B, engineering time to deploy and maintain, infrastructure complexity, latency improvements, and the risk profile of each option. They ran benchmark tests comparing output quality on their specific task distribution. They talked to two other startups who had made the same decision and charted different paths.

The recommendation they brought back was nuanced: keep OpenAI for complex reasoning tasks (20% of volume), switch to a self-hosted Llama 3 70B for standard generation (80% of volume), and accept 6 weeks of migration risk. The financial case was clear: $45K/month reduced to $12K/month after an initial $30K infrastructure investment, with a 6-month payback. They executed the plan. It worked.

The lesson isn't about the specific conclusion. It's about the process - the structured analysis that turned a vague "should we build or buy?" into a defensible engineering decision backed by real numbers.

Why Build vs Buy Is Always Wrong When Done Wrong

Most teams approach build vs buy with hidden biases that corrupt the analysis before it starts:

The engineer's bias: Engineers like to build things. Building is interesting, buying feels like giving up. This bias systematically underestimates build cost and overestimates vendor risk.

The manager's bias: Managers like to buy things. Buying is fast, buying has a clear price, buying transfers responsibility. This bias underestimates the lock-in risk and technical debt of poor vendor fit.

The sunk cost bias: If you've already started building, the investment feels too large to abandon and buying feels like waste. But engineering hours already spent are gone either way, and they should not influence the forward-looking decision.

A rigorous build vs buy analysis requires: (1) explicit cost modeling for both paths over 3 years, (2) honest assessment of quality differential, (3) quantified risk analysis, and (4) clear decision criteria established before you look at the numbers.

The Build vs Buy Decision Framework

The Five Dimensions

Every build vs buy analysis should assess these five dimensions:

Dimension	Build	Buy
Cost	No premium; pay compute + eng time	Predictable pricing; no ops cost
Quality	Customize to exact needs	Vendor has years of optimization
Speed	Slower (months to build)	Faster (days to integrate)
Control	Full control, zero lock-in	Vendor controls roadmap and uptime
Risk	Build failure risk, maintenance burden	Vendor risk, price changes, API changes
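
One way to make the five dimensions comparable is a weighted score. The sketch below is a minimal illustration, not a standard method: the weights, the 1-5 scale, and the example scores are all hypothetical values a team would agree on before looking at vendor quotes.

```python
# Hypothetical weights: agree on these before scoring; they must sum to 1.0.
WEIGHTS = {"cost": 0.30, "quality": 0.25, "speed": 0.15, "control": 0.15, "risk": 0.15}


def weighted_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Combine per-dimension scores (1 = poor, 5 = excellent) into one number."""
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the weighted dimensions")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[d] * scores[d] for d in weights)


# Illustrative scores only, not measurements.
build = {"cost": 4, "quality": 3, "speed": 2, "control": 5, "risk": 2}
buy = {"cost": 2, "quality": 4, "speed": 5, "control": 2, "risk": 3}

build_score = weighted_score(build)
buy_score = weighted_score(buy)
print(f"build={build_score:.2f} buy={buy_score:.2f}")
```

A score like this is a tie-breaker and a discussion aid. It does not replace the cost model, because a small change in weights can flip the result.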

Case Study 1: OpenAI API vs Self-Hosted Llama

This is the highest-stakes build vs buy decision for most AI teams in 2024. Here is the full analysis framework:

Step 1: Quality Benchmark

Quality is non-negotiable - if self-hosted can't match API quality on your task, everything else is moot. Run a proper benchmark:

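from typing import Callable

# model_a_fn / model_b_fn are supplied by the caller (e.g., thin wrappers around
# an API client and a self-hosted endpoint), so no vendor SDK is imported here.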

def benchmark_models(
    test_cases: list[dict],
    model_a_fn: Callable,     # e.g., OpenAI GPT-4
    model_b_fn: Callable,     # e.g., self-hosted Llama 3 70B
    judge_fn: Callable,       # automated quality judge (also an LLM)
) -> dict:
    """
    Compare two models on a representative test set.
    Uses LLM-as-judge for quality scoring.
    """
    results = {"a_wins": 0, "b_wins": 0, "ties": 0, "details": []}

    for test in test_cases:
        response_a = model_a_fn(test["prompt"])
        response_b = model_b_fn(test["prompt"])

        # LLM judge rates both responses
        judgment = judge_fn(
            prompt=test["prompt"],
            response_a=response_a,
            response_b=response_b,
            rubric=test.get("rubric", "accuracy, completeness, clarity"),
        )

        winner = judgment["winner"]  # "a", "b", or "tie"
        results[f"{winner}_wins" if winner != "tie" else "ties"] += 1
        results["details"].append({
            "prompt": test["prompt"],
            "winner": winner,
            "score_a": judgment["score_a"],
            "score_b": judgment["score_b"],
        })

    total = len(test_cases)
    results["model_a_win_rate"] = results["a_wins"] / total
    results["model_b_win_rate"] = results["b_wins"] / total
    results["quality_delta"] = results["model_a_win_rate"] - results["model_b_win_rate"]
    return results

# Decision rule: if quality_delta < 0.10 (model B's win rate is within
# 10 percentage points of model A's), proceed to cost analysis.
# Otherwise, evaluate fine-tuning model B.
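
LLM judges can favor whichever response they see first. A possible mitigation, sketched here under assumptions, is to run the judge twice with the responses swapped and count a win only when both passes agree. The wrapper name `order_swapped_judge` and the stub judge are hypothetical; the return shape mirrors the `judge_fn` contract used above.

```python
from typing import Callable


def order_swapped_judge(judge_fn: Callable[..., dict]) -> Callable[..., dict]:
    """Wrap a pairwise judge so a win only counts if it survives swapping A and B."""

    def wrapped(prompt: str, response_a: str, response_b: str, rubric: str) -> dict:
        first = judge_fn(prompt=prompt, response_a=response_a,
                         response_b=response_b, rubric=rubric)
        # Second pass with positions swapped; map its verdict back to the original labels.
        second = judge_fn(prompt=prompt, response_a=response_b,
                          response_b=response_a, rubric=rubric)
        flip = {"a": "b", "b": "a", "tie": "tie"}
        second_winner = flip[second["winner"]]
        winner = first["winner"] if first["winner"] == second_winner else "tie"
        return {
            "winner": winner,
            # Average the two passes, un-swapping the second pass's scores.
            "score_a": (first["score_a"] + second["score_b"]) / 2,
            "score_b": (first["score_b"] + second["score_a"]) / 2,
        }

    return wrapped


def always_first(prompt, response_a, response_b, rubric):
    """Stub judge with maximal position bias: always prefers the first response."""
    return {"winner": "a", "score_a": 5, "score_b": 3}


verdict = order_swapped_judge(always_first)("p", "x", "y", "accuracy")
print(verdict["winner"])
```

The wrapper doubles judge cost, so it may be worth applying only to a sample of the test set.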

Step 2: TCO Model

from dataclasses import dataclass

@dataclass
class LLMDeploymentCost:
    """Full TCO model for LLM API vs self-hosted comparison."""

    # API option
    monthly_requests: int
    avg_input_tokens: int
    avg_output_tokens: int
    api_input_price_per_1k: float    # e.g., $0.005 for GPT-4o (2024 list price)
    api_output_price_per_1k: float   # e.g., $0.015 for GPT-4o (2024 list price)

    # Self-hosted option
    gpu_instance_hourly_cost: float  # e.g., $3.06 for A100
    gpu_instances_needed: int
    gpu_utilization: float           # 0.0-1.0, typically 0.6-0.8; context only, not used in the cost formulas
    infra_overhead_pct: float        # monitoring, storage, etc. typically 0.15

    # Engineering costs
    engineer_hourly_rate: float      # fully-loaded, e.g., $115/hr ($240K/yr)
    initial_build_hours: float       # deployment + testing, e.g., 400 hours
    monthly_maintenance_hours: float # ongoing ops, e.g., 20 hours/month

    def api_monthly_cost(self) -> float:
        input_cost = self.monthly_requests * self.avg_input_tokens * self.api_input_price_per_1k / 1000
        output_cost = self.monthly_requests * self.avg_output_tokens * self.api_output_price_per_1k / 1000
        return input_cost + output_cost

    def self_hosted_monthly_cost(self) -> float:
        compute = self.gpu_instance_hourly_cost * self.gpu_instances_needed * 730
        overhead = compute * self.infra_overhead_pct
        maintenance = self.monthly_maintenance_hours * self.engineer_hourly_rate
        return compute + overhead + maintenance

    def self_hosted_initial_cost(self) -> float:
        return self.initial_build_hours * self.engineer_hourly_rate

    def payback_months(self) -> float:
        """Months until self-hosted TCO breaks even vs API."""
        monthly_savings = self.api_monthly_cost() - self.self_hosted_monthly_cost()
        if monthly_savings <= 0:
            return float('inf')  # never pays back
        return self.self_hosted_initial_cost() / monthly_savings

    def three_year_comparison(self) -> dict:
        api_3yr = self.api_monthly_cost() * 36
        self_hosted_3yr = (
            self.self_hosted_initial_cost()
            + self.self_hosted_monthly_cost() * 36
        )
        return {
            "api_3yr_cost": api_3yr,
            "self_hosted_3yr_cost": self_hosted_3yr,
            "savings": api_3yr - self_hosted_3yr,
            "payback_months": self.payback_months(),
            "monthly_api": self.api_monthly_cost(),
            "monthly_self_hosted": self.self_hosted_monthly_cost(),
        }


# Scenario: startup with 500K requests/month, avg 2K input, 500 output tokens
analysis = LLMDeploymentCost(
    monthly_requests=500_000,
    avg_input_tokens=2000,
    avg_output_tokens=500,
    api_input_price_per_1k=0.005,      # GPT-4o
    api_output_price_per_1k=0.015,
    gpu_instance_hourly_cost=3.06,
    gpu_instances_needed=2,             # 2x A100 for 70B at this volume
    gpu_utilization=0.65,
    infra_overhead_pct=0.15,
    engineer_hourly_rate=115,
    initial_build_hours=400,
    monthly_maintenance_hours=20,
)

result = analysis.three_year_comparison()
print(f"Monthly API cost:          ${result['monthly_api']:,.0f}")
print(f"Monthly self-hosted cost:  ${result['monthly_self_hosted']:,.0f}")
print(f"Payback period:            {result['payback_months']:.1f} months")
print(f"3-year savings:            ${result['savings']:,.0f}")
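
The same inputs can be turned around to ask at what request volume the API bill equals the self-hosted run cost. The helper below is a rough sketch: `breakeven_monthly_requests` is a hypothetical name, and it assumes the GPU count stays fixed as volume grows and ignores the one-time build cost.

```python
def breakeven_monthly_requests(
    avg_input_tokens: int,
    avg_output_tokens: int,
    api_input_price_per_1k: float,
    api_output_price_per_1k: float,
    self_hosted_monthly_cost: float,
) -> float:
    """Request volume at which API spend equals a fixed self-hosted monthly cost."""
    cost_per_request = (
        avg_input_tokens * api_input_price_per_1k
        + avg_output_tokens * api_output_price_per_1k
    ) / 1000
    return self_hosted_monthly_cost / cost_per_request


# Same inputs as the scenario above: 2 GPUs at $3.06/hr, 15% overhead,
# 20 maintenance hours/month at $115/hr.
self_hosted = 3.06 * 2 * 730 * 1.15 + 20 * 115
volume = breakeven_monthly_requests(2000, 500, 0.005, 0.015, self_hosted)
print(f"Break-even volume: {volume:,.0f} requests/month")
```

Under these assumptions the scenario's 500K requests/month sits only modestly above break-even, which is why the payback period is so sensitive to the volume forecast.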

Step 3: Risk Analysis

Risk	API (Buy)	Self-Hosted (Build)
Price increases	HIGH - vendor sets prices unilaterally and can change them on short notice	LOW - commodity hardware cost declining
Vendor outage	MEDIUM - 99.9% SLA = about 8.8 hrs/yr downtime	LOW - you control the infra
API changes/deprecation	HIGH - model versions are deprecated on the vendor's schedule	NONE
Security/data privacy	MEDIUM - data sent to third party	LOW - data stays in your infra
Build failure	NONE	MEDIUM - 6-week delay risk
Maintenance burden	NONE	HIGH - requires ML infra expertise
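
The downtime figure in the table follows directly from the SLA percentage. A small helper (hypothetical name, assuming a 365-day year) makes it easy to compare SLA tiers:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def allowed_downtime_hours(sla_percent: float) -> float:
    """Maximum yearly downtime a provider can have while still meeting its SLA."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% -> {allowed_downtime_hours(sla):.2f} hrs/yr")
```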

Decision rule: If payback is under 12 months AND quality delta is under 10% AND you have ML infrastructure expertise in-house, self-hosting is almost always the right call at this volume. If payback is over 24 months or you lack the expertise, stay with the API.
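
Writing the rule down as code before running the numbers helps keep the criteria fixed. This is a minimal sketch: the function name and the wording of the "gray zone" outcome are assumptions, since the rule above does not say what to do between 12 and 24 months.

```python
def self_host_decision(
    payback_months: float,
    quality_delta: float,
    has_ml_infra_expertise: bool,
) -> str:
    """Apply the thresholds from the decision rule; the middle band is a judgment call."""
    if payback_months > 24 or not has_ml_infra_expertise:
        return "stay on API"
    if payback_months < 12 and quality_delta < 0.10:
        return "self-host"
    return "gray zone: re-check at higher volume or after fine-tuning"


print(self_host_decision(payback_months=6, quality_delta=0.05, has_ml_infra_expertise=True))
```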

Case Study 2: MLflow vs Weights & Biases

This is a common decision for growing ML teams. Both track experiments. The question is which one fits your context.

def evaluate_experiment_tracking_options(
    team_size: int,
    monthly_experiments: int,
    requires_sso: bool,
    requires_self_hosted: bool,
    annual_ml_engineer_cost: float,
) -> dict:
    """
    Compare MLflow (self-managed) vs Weights & Biases (managed).
    """
    # MLflow self-hosted costs
    mlflow_infra_monthly = 200  # small EC2 + RDS for the tracking server
    mlflow_setup_hours = 40     # initial setup + CI integration
    mlflow_maintenance_hrs_mo = 4  # updates, backups, user management
    hourly_eng = annual_ml_engineer_cost / (52 * 40)

    mlflow_initial = mlflow_setup_hours * hourly_eng
    mlflow_monthly = mlflow_infra_monthly + mlflow_maintenance_hrs_mo * hourly_eng

    # Weights & Biases costs
    # Pricing tiers (2024): Free (limited), Team $50/seat/mo, Enterprise custom
    if requires_sso or team_size > 50:
        wandb_monthly = team_size * 80  # Enterprise estimate
    else:
        wandb_monthly = team_size * 50  # Team tier

    wandb_setup_hours = 8  # just API key and integration

    return {
        "mlflow": {
            "initial_cost": mlflow_initial,
            "monthly_cost": mlflow_monthly,
            "annual_cost": mlflow_initial + mlflow_monthly * 12,
            "pros": ["Free", "Full control", "Self-hosted for compliance"],
            "cons": ["Requires maintenance", "Less polished UX", "No hosted sweeps"],
        },
        "wandb": {
            "initial_cost": wandb_setup_hours * hourly_eng,
            "monthly_cost": wandb_monthly,
            "annual_cost": wandb_setup_hours * hourly_eng + wandb_monthly * 12,
            "pros": ["Best-in-class UX", "Sweeps built-in", "Zero maintenance"],
            "cons": ["Data leaves your infra", "Scales expensive at 50+ seats"],
        },
        "recommendation": _recommend_tracking(
            mlflow_monthly, wandb_monthly, team_size, requires_self_hosted
        ),
    }


def _recommend_tracking(
    mlflow_monthly: float,
    wandb_monthly: float,
    team_size: int,
    requires_self_hosted: bool,
) -> str:
    if requires_self_hosted:
        return "MLflow - compliance requires self-hosted data"
    if team_size <= 5:
        return "W&B - low team cost, high productivity gain"
    if team_size >= 30:
        return "MLflow - W&B Enterprise costs exceed MLflow TCO"
    cost_diff = wandb_monthly - mlflow_monthly
    if cost_diff < 500:
        return "W&B - small cost premium worth the UX improvement"
    return "MLflow - W&B premium too high at this team size"

General guidance:

1–10 engineers: W&B is almost always worth it. $500/month for significant productivity gain.
10–30 engineers: Depends on budget. W&B at $50/seat/month may still win on productivity.
30+ engineers: MLflow self-hosted is usually cheaper. W&B Enterprise pricing gets high.
Compliance/regulated industries: MLflow always (data stays in your infra).
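
On pure run cost, the crossover team size can be estimated from the same figures used in the function above. This sketch assumes the $50/seat price and ignores both setup cost and any productivity difference, so treat the result as a lower bound on where W&B stops being the cheaper option; `wandb_crossover_seats` is a hypothetical helper.

```python
def wandb_crossover_seats(
    annual_ml_engineer_cost: float,
    mlflow_infra_monthly: float = 200.0,      # figure used in the function above
    mlflow_maintenance_hrs_mo: float = 4.0,   # figure used in the function above
    wandb_seat_price: float = 50.0,           # assumed Team-tier price per seat
) -> float:
    """Team size above which per-seat W&B spend exceeds MLflow's monthly run cost."""
    hourly_eng = annual_ml_engineer_cost / (52 * 40)
    mlflow_monthly = mlflow_infra_monthly + mlflow_maintenance_hrs_mo * hourly_eng
    return mlflow_monthly / wandb_seat_price


seats = wandb_crossover_seats(annual_ml_engineer_cost=200_000)
print(f"Crossover at about {seats:.1f} seats")
```

The gap between this raw-cost crossover and the 30-seat guidance above is the value being placed on W&B's productivity gain.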

Case Study 3: Vector Database - Self-Hosted Qdrant vs Pinecone

def vector_db_tco(
    vectors_count: int,           # total vectors stored
    daily_query_volume: int,      # queries per day
    annual_ml_engineer_cost: float,
) -> dict:
    """Compare Qdrant self-hosted vs Pinecone managed."""

    hourly_eng = annual_ml_engineer_cost / (52 * 40)
    vectors_in_millions = vectors_count / 1_000_000

    # Pinecone pricing (2024 approximate; confirm against the current price list)
    # Serverless: ~$0.096/million reads, $0.2/million writes
    # Storage: $0.00000025/vector/month
    pinecone_monthly_storage = vectors_count * 0.00000025  # rate is already per month
    pinecone_monthly_queries = daily_query_volume * 30 * 0.096 / 1_000_000
    pinecone_monthly = pinecone_monthly_storage + pinecone_monthly_queries

    # Qdrant self-hosted (AWS)
    # Memory requirement: 1536 dims x 4 bytes = ~6.1 GB per million float32 vectors
    memory_gb_needed = vectors_in_millions * 6.1 * 1.5  # 50% headroom
    # r6g.2xlarge: 64 GB RAM, $0.4032/hr
    instances_needed = max(1, int(memory_gb_needed / 60) + 1)
    qdrant_compute_monthly = instances_needed * 0.40 * 730
    qdrant_setup_hours = 60  # initial deployment, HA config, backup setup
    qdrant_maintenance_monthly = 8 * hourly_eng  # 8 hrs/month

    qdrant_initial = qdrant_setup_hours * hourly_eng
    qdrant_monthly = qdrant_compute_monthly + qdrant_maintenance_monthly

    # Self-hosting only pays back if it is cheaper per month than Pinecone.
    monthly_savings = pinecone_monthly - qdrant_monthly

    return {
        "pinecone": {
            "initial_cost": 0,
            "monthly_cost": pinecone_monthly,
            "annual_cost": pinecone_monthly * 12,
        },
        "qdrant_self_hosted": {
            "initial_cost": qdrant_initial,
            "monthly_cost": qdrant_monthly,
            "annual_cost": qdrant_initial + qdrant_monthly * 12,
        },
        "payback_months": (
            qdrant_initial / monthly_savings if monthly_savings > 0 else float("inf")
        ),
    }

# Example: 10M vectors, 50K queries/day
result = vector_db_tco(
    vectors_count=10_000_000,
    daily_query_volume=50_000,
    annual_ml_engineer_cost=200_000,
)
# Compare result["pinecone"]["monthly_cost"] with
# result["qdrant_self_hosted"]["monthly_cost"] rather than relying on a fixed figure.
# When the two are close, lock-in risk tips the decision toward Qdrant.
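
Memory sizing drives the self-hosted instance count, so it is worth deriving rather than guessing. The helper below is a sketch with a hypothetical name; it covers only the raw vector payload, and index structures and metadata add more on top.

```python
def raw_vector_memory_gb(
    vectors_count: int,
    dimensions: int = 1536,
    bytes_per_component: int = 4,  # float32; use 1 for int8 scalar quantization
) -> float:
    """Raw vector payload in decimal GB, excluding index structures and metadata."""
    return vectors_count * dimensions * bytes_per_component / 1e9


float32_gb = raw_vector_memory_gb(1_000_000)
int8_gb = raw_vector_memory_gb(1_000_000, bytes_per_component=1)
print(f"float32: {float32_gb:.2f} GB, int8: {int8_gb:.2f} GB per million vectors")
```

One million 1536-dimensional float32 vectors need about 6.1 GB; roughly 1.5 GB per million is only reachable with int8 quantization.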

The Switching Cost Problem

Build vs buy analyses often ignore switching costs - the cost of migrating from one option to another if the first choice doesn't work out.

def calculate_switching_cost(
    current_vendor_monthly_spend: float,
    migration_engineer_weeks: float,
    weekly_engineer_cost: float,
    integration_count: int,          # number of internal systems to update
    integration_cost_per_system: float,  # engineering cost in dollars per integration update
) -> dict:
    """Estimate cost of switching vendors."""

    migration_labor = migration_engineer_weeks * weekly_engineer_cost
    integration_updates = integration_count * integration_cost_per_system
    testing_and_validation = migration_labor * 0.3  # 30% overhead for testing
    risk_buffer = (migration_labor + integration_updates) * 0.2  # 20% risk buffer

    total_switching_cost = (
        migration_labor
        + integration_updates
        + testing_and_validation
        + risk_buffer
    )

    return {
        "migration_labor": migration_labor,
        "integration_updates": integration_updates,
        "testing_overhead": testing_and_validation,
        "risk_buffer": risk_buffer,
        "total_switching_cost": total_switching_cost,
        "months_of_spend": total_switching_cost / current_vendor_monthly_spend,
    }

The abstraction layer rule: Any vendor integration should be wrapped in an abstraction layer that separates your business logic from the vendor API. This is not over-engineering - it is the difference between a 2-week migration and a 6-month migration when you need to switch.

# BAD: direct vendor dependency throughout codebase
from pinecone import Pinecone  # current client style; older releases used pinecone.init(...)
pc = Pinecone(api_key="...")
index = pc.Index("my-index")
index.upsert(vectors=[...])     # Pinecone-specific API everywhere

# GOOD: abstracted behind an interface
from abc import ABC, abstractmethod

class VectorStore(ABC):
    @abstractmethod
    def upsert(self, vectors: list[dict]) -> None: ...

    @abstractmethod
    def query(self, vector: list[float], top_k: int) -> list[dict]: ...

class PineconeStore(VectorStore):
    def upsert(self, vectors): ...   # Pinecone-specific impl
    def query(self, vector, top_k): ...

class QdrantStore(VectorStore):
    def upsert(self, vectors): ...   # Qdrant-specific impl
    def query(self, vector, top_k): ...

# Business logic uses VectorStore interface only
# Switching vendors = swap implementation class, change 1 line
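
To show the interface doing its job, here is a runnable sketch with a third backend. `InMemoryStore` and `find_similar` are hypothetical, and the `{"id": ..., "values": ...}` record shape is an assumption, not any vendor's schema. The brute-force cosine search is only suitable as a test double.

```python
import math
from abc import ABC, abstractmethod


class VectorStore(ABC):
    @abstractmethod
    def upsert(self, vectors: list[dict]) -> None: ...

    @abstractmethod
    def query(self, vector: list[float], top_k: int) -> list[dict]: ...


class InMemoryStore(VectorStore):
    """Hypothetical brute-force backend, useful as a test double."""

    def __init__(self) -> None:
        self._rows: dict[str, list[float]] = {}

    def upsert(self, vectors: list[dict]) -> None:
        for v in vectors:
            self._rows[v["id"]] = v["values"]

    def query(self, vector: list[float], top_k: int) -> list[dict]:
        def cosine(a: list[float], b: list[float]) -> float:
            norm = math.hypot(*a) * math.hypot(*b)
            return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

        scored = [{"id": i, "score": cosine(vector, vals)} for i, vals in self._rows.items()]
        return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]


def find_similar(store: VectorStore, embedding: list[float]) -> list[str]:
    """Business logic depends only on the VectorStore interface."""
    return [hit["id"] for hit in store.query(embedding, top_k=2)]


store = InMemoryStore()
store.upsert([
    {"id": "a", "values": [1.0, 0.0]},
    {"id": "b", "values": [0.0, 1.0]},
    {"id": "c", "values": [0.9, 0.1]},
])
print(find_similar(store, [1.0, 0.0]))
```

Because `find_similar` never names a vendor, the same call works unchanged against a Pinecone- or Qdrant-backed implementation.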

Production Engineering Notes

Build vs Buy by Component

General guidance for common ML infrastructure components:

Component	Recommendation	Rationale
Experiment tracking	Buy (W&B) up to 20 seats	High dev productivity value
Model serving	Buy (SageMaker/Vertex) at low volume	Buy when ops overhead exceeds savings
Feature store	Buy (Tecton/Feast hosted)	Very complex to build correctly
Vector DB	Buy (Pinecone) at low volume, self-host at high	Pinecone expensive at scale
LLM inference	Buy (OpenAI) under 1M req/mo	Self-host only at high volume
Monitoring	Buy (Arize, WhyLabs)	Data science-specific features
Data pipelines	Build (Airflow/Dagster)	Standard OSS is good enough
Training infra	Build (K8s + Kueue) at scale	Vendor premium high at scale

Common Mistakes

:::danger Making build vs buy a one-time decision
The optimal answer changes as your scale changes. A startup at 10K requests/month should buy everything. The same company at 10M requests/month should build several things. Re-evaluate major build vs buy decisions annually or when your volume changes by 10×.
:::

:::danger Underestimating the "last mile" of vendor integration
Vendors advertise time-to-value as hours. The real integration - production-grade error handling, authentication, monitoring, retry logic, rate limit handling, and testing - takes 2–4 weeks for any non-trivial system. Always add this to your buy cost estimate.
:::
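
As a taste of that last-mile work, here is a minimal retry sketch with exponential backoff and jitter. `RateLimitError` is a hypothetical stand-in for whatever exception a given vendor SDK raises, and the `sleep` parameter exists only so the example can run without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a vendor SDK's rate-limit exception."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry on RateLimitError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable when max_attempts >= 1")


calls = {"n": 0}


def flaky() -> str:
    """Fails twice with a rate limit, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("slow down")
    return "ok"


delays: list[float] = []
result = call_with_retry(flaky, sleep=delays.append)
print(result, len(delays))
```

A production version would also need timeouts, logging, and handling for non-retryable errors, which is where the weeks go.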

:::warning Ignoring vendor concentration risk
If your product's critical path runs through a single vendor's API, that vendor has leverage over your business. Vendor downtime becomes your downtime. Price increases are your cost increases. For critical paths, either maintain a backup option or build self-hosted redundancy. Dependency on OpenAI for 100% of inference is a business risk, not just a technical one.
:::
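
A backup option can be as simple as an ordered list of providers. The sketch below is illustrative only: `complete_with_fallback` and both provider functions are hypothetical, and real code should catch the SDK's specific exceptions rather than `Exception`.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) provider in order; return (name, text) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in production, catch the SDK's specific errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary(prompt: str) -> str:
    """Simulates the main vendor being down."""
    raise TimeoutError("simulated vendor outage")


def backup(prompt: str) -> str:
    """Simulates a self-hosted or second-vendor model."""
    return f"[backup] {prompt}"


used, text = complete_with_fallback("hello", [("primary", primary), ("backup", backup)])
print(used, text)
```

The fallback is only useful if the backup model has passed the same quality benchmark on your task, so it belongs in the evaluation set from the start.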

Interview Q&A

Q: How would you evaluate whether to self-host an LLM vs use an API?

A: Three-step process. First, quality: benchmark the candidate self-hosted model against the API on your actual task distribution - not generic benchmarks. Use LLM-as-judge for scalable comparison. If quality delta is under 10%, proceed. Second, cost: model the full TCO - compute, engineering time to deploy and maintain, infrastructure overhead - vs API cost at your current and projected volume. The breakeven is usually around 1M requests/month for GPT-3.5 class models. Third, risk: assess vendor concentration risk, data privacy requirements, and your team's infrastructure expertise. If all three are favorable, self-host. If quality gap is large or expertise is lacking, stay with the API but abstract behind an interface so you can migrate later.

Q: What should always be wrapped behind an abstraction layer in ML systems?

A: Any external vendor dependency that touches your critical path. Specifically: LLM API calls, vector database operations, feature store access, and experiment tracking logging. The abstraction doesn't have to be complex - a simple wrapper class with a defined interface is sufficient. The goal is that switching vendors requires changing one implementation class, not rewriting every callsite in the codebase. I've seen a team take 4 months to migrate from one vector DB to another because they had 50+ direct references scattered across 15 services. With a proper abstraction, the same migration would have taken 2 weeks.

Q: What factors make self-hosting ML infrastructure NOT worth it?

A: Four main factors. First, small team: if your ML team is under 5 engineers, you don't have the capacity to properly maintain self-hosted infrastructure - it will become a distraction from model development. Second, small volume: below the crossover point (varies by component), vendor premium is less than engineering savings. Third, lack of expertise: Kubernetes, CUDA drivers, distributed systems debugging - self-hosting requires specialists. The cost of hiring or training is often hidden in these analyses. Fourth, pace of iteration: if you're still finding product-market fit, operational complexity slows you down when speed matters most. Buy everything until you're at scale.

Q: How do you handle a situation where you've built something that should have been bought?

A: Acknowledge it early and evaluate the migration cost honestly. Sunk costs are irrelevant - the question is whether the ongoing cost of maintaining the build is higher than the cost of migrating to a buy solution plus the switching cost. Build a transparent TCO comparison showing current cost (including all engineering time spent on maintenance) vs buy + migration cost. Most engineers resist this because it means admitting the build was a mistake. The right framing: you built it because you had to, you learned from it, and now the context has changed. I've led three such migrations - in all cases, the teams that did it quickly (rip-the-bandaid) saved more than teams that delayed because of emotional attachment to what they'd built.

Q: When is it worth paying a 40% premium for a managed service?

A: When the 40% premium is less than the engineering cost of the operational work you're offloading. For a $10K/month compute bill, 40% premium = $4K/month = $48K/year. If the managed service saves 4 hours/week of a $200K/year engineer's time: 4 × 52 × $96 = $19,968/year. In this case, build it yourself - the premium exceeds the savings. If it saves 10 hours/week: $49,920/year - the managed service just barely pays for itself. At 15 hours/week: $74,880 saved, premium costs $48K - clear win for the managed service. The key insight: compute premium is fixed, engineering savings scale with operational complexity. More complex infrastructure = higher relative value of managed services.
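
The break-even point in that answer can be computed directly. This sketch uses a hypothetical helper and the same 52 × 40 hour working year; it uses the unrounded hourly rate, so it lands slightly below the figure implied by the rounded $96/hour.

```python
def breakeven_hours_per_week(
    monthly_compute_cost: float,
    premium_rate: float,
    annual_engineer_cost: float,
) -> float:
    """Weekly engineering hours a managed service must save to cover its premium."""
    annual_premium = monthly_compute_cost * premium_rate * 12
    hourly_eng = annual_engineer_cost / (52 * 40)
    return annual_premium / (52 * hourly_eng)


hours = breakeven_hours_per_week(10_000, 0.40, 200_000)
print(f"Break-even: {hours:.1f} hours/week")
```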
