Build vs Buy Analysis
The $2M Llama Decision
The startup had 8 engineers and was on track to spend $540,000 on cloud costs next year - more than the fully loaded cost of two senior engineers. The CEO's question was simple: "Could we just run Llama ourselves?"
The answer, like all real engineering answers, was: "It depends." And the decision framework to answer "it depends" properly turned out to be more important than the answer itself.
The engineering team spent two weeks doing a rigorous build vs buy analysis. They modeled compute costs for self-hosted Llama 3 70B, engineering time to deploy and maintain, infrastructure complexity, latency improvements, and the risk profile of each option. They ran benchmark tests comparing output quality on their specific task distribution. They talked to two other startups who had made the same decision and charted different paths.
The recommendation they brought back was nuanced: keep OpenAI for complex reasoning tasks (20% of volume), switch to a self-hosted Llama 3 70B for standard generation (80% of volume), and accept 6 weeks of migration risk. The financial case was clear: $12K/month after an initial $30K infrastructure investment, with a 6-month payback. They executed the plan. It worked.
The lesson isn't about the specific conclusion. It's about the process - the structured analysis that turned a vague "should we build or buy?" into a defensible engineering decision backed by real numbers.
Why Build vs Buy Is Always Wrong When Done Wrong
Most teams approach build vs buy with hidden biases that corrupt the analysis before it starts:
The engineer's bias: Engineers like to build things. Building is interesting, buying feels like giving up. This bias systematically underestimates build cost and overestimates vendor risk.
The manager's bias: Managers like to buy things. Buying is fast, buying has a clear price, buying transfers responsibility. This bias underestimates the lock-in risk and technical debt of poor vendor fit.
The sunk cost bias: If you've already started building, abandoning that work in favor of buying feels like waste. But engineering hours already spent are sunk costs and should not influence the forward-looking decision.
A rigorous build vs buy analysis requires: (1) explicit cost modeling for both paths over 3 years, (2) honest assessment of quality differential, (3) quantified risk analysis, and (4) clear decision criteria established before you look at the numbers.
The Build vs Buy Decision Framework
The Five Dimensions
Every build vs buy analysis should assess these five dimensions:
| Dimension | Build | Buy |
|---|---|---|
| Cost | No premium; pay compute + eng time | Predictable pricing; no ops cost |
| Quality | Customize to exact needs | Vendor has years of optimization |
| Speed | Slower (months to build) | Faster (days to integrate) |
| Control | Full control, zero lock-in | Vendor controls roadmap and uptime |
| Risk | Build failure risk, maintenance burden | Vendor risk, price changes, API changes |
Case Study 1: OpenAI API vs Self-Hosted Llama
This is the highest-stakes build vs buy decision for most AI teams in 2024. Here is the full analysis framework:
Step 1: Quality Benchmark
Quality is non-negotiable - if self-hosted can't match API quality on your task, everything else is moot. Run a proper benchmark:
from typing import Callable


def benchmark_models(
    test_cases: list[dict],
    model_a_fn: Callable,  # e.g., OpenAI GPT-4
    model_b_fn: Callable,  # e.g., self-hosted Llama 3 70B
    judge_fn: Callable,  # automated quality judge (also an LLM)
) -> dict:
    """
    Compare two models on a representative test set.
    Uses LLM-as-judge for quality scoring.
    """
    if not test_cases:
        raise ValueError("test_cases must not be empty")

    results = {"a_wins": 0, "b_wins": 0, "ties": 0, "details": []}

    for test in test_cases:
        response_a = model_a_fn(test["prompt"])
        response_b = model_b_fn(test["prompt"])

        # LLM judge rates both responses
        judgment = judge_fn(
            prompt=test["prompt"],
            response_a=response_a,
            response_b=response_b,
            rubric=test.get("rubric", "accuracy, completeness, clarity"),
        )

        winner = judgment["winner"]  # "a", "b", or "tie"
        results[f"{winner}_wins" if winner != "tie" else "ties"] += 1
        results["details"].append({
            "prompt": test["prompt"],
            "winner": winner,
            "score_a": judgment["score_a"],
            "score_b": judgment["score_b"],
        })

    total = len(test_cases)
    results["model_a_win_rate"] = results["a_wins"] / total
    results["model_b_win_rate"] = results["b_wins"] / total
    # Difference in win rates, as a fraction (0.10 = 10 percentage points)
    results["quality_delta"] = results["model_a_win_rate"] - results["model_b_win_rate"]
    return results


# Decision rule: if quality_delta < 0.10 (model B's win rate is within
# 10 percentage points of model A's), proceed to cost analysis.
# Otherwise, evaluate fine-tuning model B.
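LLM judges are known to favor a response partly because of where it appears in the prompt. One common mitigation, which is not part of the benchmark above, is to ask the judge twice with the responses swapped and count a win only when both verdicts agree. The sketch below assumes the same `judge_fn` interface (keyword arguments `prompt`, `response_a`, `response_b`, `rubric`; a result with `winner`, `score_a`, `score_b`). The wrapper name and the two stub judges are hypothetical.

```python
from typing import Callable


def make_order_robust_judge(judge_fn: Callable[..., dict]) -> Callable[..., dict]:
    """Wrap a pairwise judge so a win only counts if it survives swapping A and B."""

    def robust_judge(prompt: str, response_a: str, response_b: str, rubric: str) -> dict:
        first = judge_fn(prompt=prompt, response_a=response_a,
                         response_b=response_b, rubric=rubric)
        # Second pass: present the same two responses in the opposite order.
        swapped = judge_fn(prompt=prompt, response_a=response_b,
                           response_b=response_a, rubric=rubric)
        # Map the swapped verdict back onto the original labels.
        flip = {"a": "b", "b": "a", "tie": "tie"}
        second_winner = flip[swapped["winner"]]
        winner = first["winner"] if first["winner"] == second_winner else "tie"
        return {
            "winner": winner,
            "score_a": (first["score_a"] + swapped["score_b"]) / 2,
            "score_b": (first["score_b"] + swapped["score_a"]) / 2,
        }

    return robust_judge


# Hypothetical stub judges for demonstration; a real judge would call an LLM.
def first_position_judge(prompt, response_a, response_b, rubric):
    # Always prefers whatever is shown first (pure position bias).
    return {"winner": "a", "score_a": 8, "score_b": 6}


def longer_is_better_judge(prompt, response_a, response_b, rubric):
    # Prefers the longer response regardless of position.
    len_a, len_b = len(response_a), len(response_b)
    winner = "a" if len_a > len_b else "b" if len_b > len_a else "tie"
    return {"winner": winner, "score_a": len_a, "score_b": len_b}


biased = make_order_robust_judge(first_position_judge)
consistent = make_order_robust_judge(longer_is_better_judge)
args = dict(prompt="q", response_a="short", response_b="a longer answer", rubric="clarity")
print(biased(**args)["winner"])      # position-only preference collapses to a tie
print(consistent(**args)["winner"])  # content-based preference survives the swap
```

The wrapped judge can be passed as `judge_fn` unchanged. The trade-off is that it doubles the number of judge calls.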
Step 2: TCO Model
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)


@dataclass
class LLMDeploymentCost:
    """Full TCO model for LLM API vs self-hosted comparison."""

    # API option
    monthly_requests: int
    avg_input_tokens: int
    avg_output_tokens: int
    api_input_price_per_1k: float  # e.g., $0.005 for GPT-4o
    api_output_price_per_1k: float  # e.g., $0.015 for GPT-4o

    # Self-hosted option
    gpu_instance_hourly_cost: float  # e.g., $3.06 for A100
    gpu_instances_needed: int
    gpu_utilization: float  # 0.0-1.0, typically 0.6-0.8; informational, not used in the formulas below
    infra_overhead_pct: float  # monitoring, storage, etc.; typically 0.15

    # Engineering costs
    engineer_hourly_rate: float  # fully loaded, e.g., $115/hr ($240K/yr)
    initial_build_hours: float  # deployment + testing, e.g., 400 hours
    monthly_maintenance_hours: float  # ongoing ops, e.g., 20 hours/month

    def api_monthly_cost(self) -> float:
        input_cost = self.monthly_requests * self.avg_input_tokens * self.api_input_price_per_1k / 1000
        output_cost = self.monthly_requests * self.avg_output_tokens * self.api_output_price_per_1k / 1000
        return input_cost + output_cost

    def self_hosted_monthly_cost(self) -> float:
        # Instances are billed around the clock, regardless of utilization.
        compute = self.gpu_instance_hourly_cost * self.gpu_instances_needed * HOURS_PER_MONTH
        overhead = compute * self.infra_overhead_pct
        maintenance = self.monthly_maintenance_hours * self.engineer_hourly_rate
        return compute + overhead + maintenance

    def self_hosted_initial_cost(self) -> float:
        return self.initial_build_hours * self.engineer_hourly_rate

    def payback_months(self) -> float:
        """Months until self-hosted TCO breaks even vs API."""
        monthly_savings = self.api_monthly_cost() - self.self_hosted_monthly_cost()
        if monthly_savings <= 0:
            return float("inf")  # never pays back
        return self.self_hosted_initial_cost() / monthly_savings

    def three_year_comparison(self) -> dict:
        api_3yr = self.api_monthly_cost() * 36
        self_hosted_3yr = (
            self.self_hosted_initial_cost()
            + self.self_hosted_monthly_cost() * 36
        )
        return {
            "api_3yr_cost": api_3yr,
            "self_hosted_3yr_cost": self_hosted_3yr,
            "savings": api_3yr - self_hosted_3yr,
            "payback_months": self.payback_months(),
            "monthly_api": self.api_monthly_cost(),
            "monthly_self_hosted": self.self_hosted_monthly_cost(),
        }


# Scenario: startup with 500K requests/month, avg 2K input, 500 output tokens
analysis = LLMDeploymentCost(
    monthly_requests=500_000,
    avg_input_tokens=2000,
    avg_output_tokens=500,
    api_input_price_per_1k=0.005,  # GPT-4o
    api_output_price_per_1k=0.015,
    gpu_instance_hourly_cost=3.06,
    gpu_instances_needed=2,  # 2x A100 for 70B at this volume
    gpu_utilization=0.65,
    infra_overhead_pct=0.15,
    engineer_hourly_rate=115,
    initial_build_hours=400,
    monthly_maintenance_hours=20,
)

result = analysis.three_year_comparison()
print(f"Monthly API cost: ${result['monthly_api']:,.0f}")
print(f"Monthly self-hosted cost: ${result['monthly_self_hosted']:,.0f}")
print(f"Payback period: {result['payback_months']:.1f} months")
print(f"3-year savings: ${result['savings']:,.0f}")
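Because API cost scales with volume while the self-hosted cost in this model is flat, it can help to solve for the request volume at which the two are equal. The following is a rough sketch using the scenario's inputs. The function name is hypothetical, and it assumes the GPU count stays fixed as volume changes, which only holds within the capacity of the current cluster.

```python
def breakeven_monthly_requests(
    api_cost_per_request: float,
    self_hosted_monthly_cost: float,
    initial_cost: float = 0.0,
    amortize_months: int = 36,
) -> float:
    """Monthly request volume at which API spend equals self-hosted spend.

    Assumes the self-hosted cost does not change with volume.
    """
    amortized_initial = initial_cost / amortize_months
    return (self_hosted_monthly_cost + amortized_initial) / api_cost_per_request


# Inputs taken from the scenario above.
per_request = 2000 * 0.005 / 1000 + 500 * 0.015 / 1000  # input + output tokens
compute = 3.06 * 2 * 730                                 # two GPU instances, 730 h/month
self_hosted = compute * 1.15 + 20 * 115                  # + 15% overhead + maintenance
initial = 400 * 115                                      # one-time build effort

ignoring_build = breakeven_monthly_requests(per_request, self_hosted)
with_build = breakeven_monthly_requests(per_request, self_hosted, initial)
print(f"Ignoring build cost:           {ignoring_build:,.0f} requests/month")
print(f"Amortizing build over 3 years: {with_build:,.0f} requests/month")
```

With these inputs the break-even lands at roughly 425K requests/month before build cost and roughly 498K once the build is amortized over 36 months. The 500K requests/month scenario therefore sits almost exactly at the 3-year break-even, and small changes in volume, GPU price, or maintenance hours can flip the result.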
Step 3: Risk Analysis
| Risk | API (Buy) | Self-Hosted (Build) |
|---|---|---|
| Price increases | HIGH - the vendor sets pricing unilaterally and can change it at any time | LOW - commodity hardware cost declining |
| Vendor outage | MEDIUM - 99.9% SLA = 8.7 hrs/yr downtime | LOW - you control the infra |
| API changes/deprecation | HIGH - GPT-3.5 access changed multiple times | NONE |
| Security/data privacy | MEDIUM - data sent to third party | LOW - data stays in your infra |
| Build failure | NONE | MEDIUM - 6-week delay risk |
| Maintenance burden | NONE | HIGH - requires ML infra expertise |
Decision rule: If payback is under 12 months AND quality delta is under 10% AND you have ML infrastructure expertise in-house, self-hosting is almost always the right call at this volume. If payback is over 24 months or you lack the expertise, stay with the API.
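The rule can be written as a small function so it is applied the same way to every scenario. This is a minimal sketch. The function name is hypothetical, and the "judgment call" outcome for cases the rule does not cover (for example, payback between 12 and 24 months) is an assumption rather than part of the rule.

```python
def llm_hosting_decision(
    payback_months: float,
    quality_delta: float,
    has_ml_infra_expertise: bool,
) -> str:
    """Apply the payback / quality / expertise rule to one scenario."""
    # Stay with the API if payback is too slow or nobody can run the cluster.
    if payback_months > 24 or not has_ml_infra_expertise:
        return "stay with API"
    # All three conditions are favorable: self-host.
    if payback_months < 12 and quality_delta < 0.10:
        return "self-host"
    # Assumption: anything in between needs a human decision.
    return "judgment call"


print(llm_hosting_decision(6, 0.05, True))
print(llm_hosting_decision(30, 0.05, True))
print(llm_hosting_decision(18, 0.05, True))
```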
Case Study 2: MLflow vs Weights & Biases
This is a common decision for growing ML teams. Both track experiments. The question is which one fits your context.
def evaluate_experiment_tracking_options(
    team_size: int,
    monthly_experiments: int,  # not used in the cost model below; kept for context
    requires_sso: bool,
    requires_self_hosted: bool,
    annual_ml_engineer_cost: float,
) -> dict:
    """
    Compare MLflow (self-managed) vs Weights & Biases (managed).
    """
    # MLflow self-hosted costs
    mlflow_infra_monthly = 200  # small EC2 + RDS for the tracking server
    mlflow_setup_hours = 40  # initial setup + CI integration
    mlflow_maintenance_hrs_mo = 4  # updates, backups, user management
    hourly_eng = annual_ml_engineer_cost / (52 * 40)
    mlflow_initial = mlflow_setup_hours * hourly_eng
    mlflow_monthly = mlflow_infra_monthly + mlflow_maintenance_hrs_mo * hourly_eng

    # Weights & Biases costs
    # Pricing tiers (2024): Free (limited), Team $50/seat/mo, Enterprise custom
    if requires_sso or team_size > 50:
        wandb_monthly = team_size * 80  # Enterprise estimate
    else:
        wandb_monthly = team_size * 50  # Team tier
    wandb_setup_hours = 8  # just API key and integration
    wandb_initial = wandb_setup_hours * hourly_eng

    return {
        "mlflow": {
            "initial_cost": mlflow_initial,
            "monthly_cost": mlflow_monthly,
            "annual_cost": mlflow_initial + mlflow_monthly * 12,
            "pros": ["No license fee", "Full control", "Self-hosted for compliance"],
            "cons": ["Requires maintenance", "Less polished UX", "No hosted sweeps"],
        },
        "wandb": {
            "initial_cost": wandb_initial,
            "monthly_cost": wandb_monthly,
            "annual_cost": wandb_initial + wandb_monthly * 12,
            "pros": ["Best-in-class UX", "Sweeps built-in", "Zero maintenance"],
            "cons": ["Data leaves your infra", "Scales expensive at 50+ seats"],
        },
        "recommendation": _recommend_tracking(
            mlflow_monthly, wandb_monthly, team_size, requires_self_hosted
        ),
    }


def _recommend_tracking(
    mlflow_monthly: float,
    wandb_monthly: float,
    team_size: int,
    requires_self_hosted: bool,
) -> str:
    if requires_self_hosted:
        return "MLflow - compliance requires self-hosted data"
    if team_size <= 5:
        return "W&B - low team cost, high productivity gain"
    if team_size >= 30:
        return "MLflow - W&B Enterprise costs exceed MLflow TCO"
    cost_diff = wandb_monthly - mlflow_monthly
    if cost_diff < 500:
        return "W&B - small cost premium worth the UX improvement"
    return "MLflow - W&B premium too high at this team size"
General guidance:
- 1–10 engineers: W&B is almost always worth it. $500/month for significant productivity gain.
- 10–30 engineers: Depends on budget. W&B at $50/seat/month may still win on productivity.
- 30+ engineers: MLflow self-hosted is usually cheaper. W&B Enterprise pricing gets high.
- Compliance/regulated industries: MLflow always (data stays in your infra).
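One way to see where pure cost starts to favor MLflow is to compare per-seat pricing against MLflow's flat monthly cost. The sketch below reuses the $50/seat, $200/month infrastructure, and 4 hours/month maintenance figures from the function above. The function name is hypothetical, and it deliberately ignores productivity differences and one-time setup cost, which is why the guidance above places the practical crossover at a larger team size.

```python
import math


def wandb_seat_crossover(
    annual_ml_engineer_cost: float,
    seat_price: float = 50.0,
    mlflow_infra_monthly: float = 200.0,
    mlflow_maintenance_hrs_mo: float = 4.0,
) -> int:
    """Smallest team size at which W&B seat fees exceed MLflow's monthly cost."""
    hourly_eng = annual_ml_engineer_cost / (52 * 40)
    mlflow_monthly = mlflow_infra_monthly + mlflow_maintenance_hrs_mo * hourly_eng
    # floor + 1 gives the first whole seat count strictly above the break-even.
    return math.floor(mlflow_monthly / seat_price) + 1


crossover = wandb_seat_crossover(annual_ml_engineer_cost=200_000)
print(f"W&B seat fees exceed MLflow running cost from {crossover} seats")
```

On raw running cost alone the lines cross early. The case for W&B beyond that point rests on the productivity gain, so that gain is worth estimating explicitly rather than assuming.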
Case Study 3: Vector Database - Self-Hosted Qdrant vs Pinecone
def vector_db_tco(
    vectors_count: int,  # total vectors stored
    daily_query_volume: int,  # queries per day
    annual_ml_engineer_cost: float,
) -> dict:
    """Compare Qdrant self-hosted vs Pinecone managed."""
    hourly_eng = annual_ml_engineer_cost / (52 * 40)
    vectors_in_millions = vectors_count / 1_000_000

    # Pinecone pricing (2024 approximate; placeholder rates, check current pricing)
    # Serverless: ~$0.096/million reads, $0.2/million writes
    # Storage: $0.00000025/vector/day, multiplied by 30 for a monthly figure
    pinecone_monthly_storage = vectors_count * 0.00000025 * 30
    pinecone_monthly_queries = daily_query_volume * 30 * 0.096 / 1_000_000
    pinecone_monthly = pinecone_monthly_storage + pinecone_monthly_queries

    # Qdrant self-hosted (AWS)
    # Memory requirement: ~1.5 GB per million 1536-dim vectors at 1 byte per
    # component (int8 quantization); raw float32 needs ~6.1 GB per million.
    memory_gb_needed = vectors_in_millions * 1.5 * 1.5  # 50% headroom
    # r6g.2xlarge: 64 GB RAM, $0.4032/hr; plan for ~60 GB usable per instance
    instances_needed = max(1, int(memory_gb_needed / 60) + 1)
    qdrant_compute_monthly = instances_needed * 0.40 * 730
    qdrant_setup_hours = 60  # initial deployment, HA config, backup setup
    qdrant_maintenance_monthly = 8 * hourly_eng  # 8 hrs/month
    qdrant_initial = qdrant_setup_hours * hourly_eng
    qdrant_monthly = qdrant_compute_monthly + qdrant_maintenance_monthly

    # Self-hosting only pays back if it is cheaper per month than Pinecone.
    monthly_savings = pinecone_monthly - qdrant_monthly
    payback_months = qdrant_initial / monthly_savings if monthly_savings > 0 else float("inf")

    return {
        "pinecone": {
            "initial_cost": 0,
            "monthly_cost": pinecone_monthly,
            "annual_cost": pinecone_monthly * 12,
        },
        "qdrant_self_hosted": {
            "initial_cost": qdrant_initial,
            "monthly_cost": qdrant_monthly,
            "annual_cost": qdrant_initial + qdrant_monthly * 12,
        },
        "payback_months": payback_months,
    }


# Example: 10M vectors, 50K queries/day
result = vector_db_tco(
    vectors_count=10_000_000,
    daily_query_volume=50_000,
    annual_ml_engineer_cost=200_000,
)
# With the placeholder rates above, this returns roughly $75/month for Pinecone and
# $1,061/month for Qdrant ($292 compute + $769 ops), so the managed service is cheaper
# at this scale. The gap is dominated by the 8 hrs/month of ops time. When the two
# totals come out close, lock-in risk tips the decision toward Qdrant.
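The memory estimate drives the instance count, so it is worth checking from first principles. The sketch below computes raw vector storage only. The function name is hypothetical, and index structures, payloads, and replication add overhead on top that is not modeled here.

```python
def raw_vector_memory_gb(num_vectors: int, dim: int, bytes_per_component: int) -> float:
    """Raw storage for the vectors alone, in decimal GB (no index or payload overhead)."""
    return num_vectors * dim * bytes_per_component / 1e9


float32_gb = raw_vector_memory_gb(1_000_000, dim=1536, bytes_per_component=4)
int8_gb = raw_vector_memory_gb(1_000_000, dim=1536, bytes_per_component=1)
print(f"float32: {float32_gb:.2f} GB per million vectors")
print(f"int8:    {int8_gb:.2f} GB per million vectors")
```

A figure of about 1.5 GB per million 1536-dimensional vectors corresponds to one byte per component. Unquantized float32 vectors need about four times that, which changes how many instances the self-hosted option requires.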
The Switching Cost Problem
Build vs buy analyses often ignore switching costs - the cost of migrating from one option to another if the first choice doesn't work out.
def calculate_switching_cost(
    current_vendor_monthly_spend: float,
    migration_engineer_weeks: float,
    weekly_engineer_cost: float,
    integration_count: int,  # number of internal systems to update
    integration_cost_per_system: float,  # engineering cost in dollars per integration update
) -> dict:
    """Estimate cost of switching vendors."""
    migration_labor = migration_engineer_weeks * weekly_engineer_cost
    integration_updates = integration_count * integration_cost_per_system
    testing_and_validation = migration_labor * 0.3  # 30% overhead for testing
    risk_buffer = (migration_labor + integration_updates) * 0.2  # 20% risk buffer

    total_switching_cost = (
        migration_labor
        + integration_updates
        + testing_and_validation
        + risk_buffer
    )

    # Express the switching cost in months of current vendor spend.
    if current_vendor_monthly_spend > 0:
        months_of_spend = total_switching_cost / current_vendor_monthly_spend
    else:
        months_of_spend = float("inf")

    return {
        "migration_labor": migration_labor,
        "integration_updates": integration_updates,
        "testing_overhead": testing_and_validation,
        "risk_buffer": risk_buffer,
        "total_switching_cost": total_switching_cost,
        "months_of_spend": months_of_spend,
    }
The abstraction layer rule: Any vendor integration should be wrapped in an abstraction layer that separates your business logic from the vendor API. This is not over-engineering - it is the difference between a 2-week migration and a 6-month migration when you need to switch.
# BAD: direct vendor dependency throughout codebase
from pinecone import Pinecone  # client v3+ style; older releases used pinecone.init(...)

pc = Pinecone(api_key="...")
index = pc.Index("my-index")
index.upsert(vectors=[...])  # Pinecone-specific API everywhere


# GOOD: abstracted behind an interface
from abc import ABC, abstractmethod


class VectorStore(ABC):
    @abstractmethod
    def upsert(self, vectors: list[dict]) -> None: ...

    @abstractmethod
    def query(self, vector: list[float], top_k: int) -> list[dict]: ...


class PineconeStore(VectorStore):
    def upsert(self, vectors): ...  # Pinecone-specific impl
    def query(self, vector, top_k): ...


class QdrantStore(VectorStore):
    def upsert(self, vectors): ...  # Qdrant-specific impl
    def query(self, vector, top_k): ...


# Business logic uses VectorStore interface only
# Switching vendors = swap implementation class, change 1 line
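To make the "change 1 line" claim concrete, here is a runnable sketch with an in-memory backend standing in for a real vendor. `InMemoryStore`, `find_similar_docs`, and the `id`/`values` record shape are hypothetical choices for illustration. A real `PineconeStore` or `QdrantStore` would translate the same two methods into vendor calls.

```python
from abc import ABC, abstractmethod


class VectorStore(ABC):
    @abstractmethod
    def upsert(self, vectors: list[dict]) -> None: ...

    @abstractmethod
    def query(self, vector: list[float], top_k: int) -> list[dict]: ...


class InMemoryStore(VectorStore):
    """Hypothetical stand-in backend; handy for tests and local development."""

    def __init__(self) -> None:
        self._rows: dict[str, list[float]] = {}

    def upsert(self, vectors: list[dict]) -> None:
        for record in vectors:
            self._rows[record["id"]] = record["values"]

    def query(self, vector: list[float], top_k: int) -> list[dict]:
        # Score every stored vector by dot product and return the best top_k.
        scored = [
            {"id": row_id, "score": sum(x * y for x, y in zip(vector, values))}
            for row_id, values in self._rows.items()
        ]
        return sorted(scored, key=lambda hit: hit["score"], reverse=True)[:top_k]


def find_similar_docs(store: VectorStore, embedding: list[float], k: int = 2) -> list[str]:
    """Business logic depends only on the VectorStore interface."""
    return [hit["id"] for hit in store.query(embedding, top_k=k)]


store: VectorStore = InMemoryStore()  # the one line that changes when switching backends
store.upsert([
    {"id": "doc-1", "values": [1.0, 0.0]},
    {"id": "doc-2", "values": [0.0, 1.0]},
    {"id": "doc-3", "values": [0.9, 0.1]},
])
similar = find_similar_docs(store, [1.0, 0.0])
print(similar)
```

A side benefit of this design is that the in-memory backend lets business logic be unit-tested without network access or vendor credentials.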
Production Engineering Notes
Build vs Buy by Component
General guidance for common ML infrastructure components:
| Component | Recommendation | Rationale |
|---|---|---|
| Experiment tracking | Buy (W&B) up to 20 seats | High dev productivity value |
| Model serving | Buy (SageMaker/Vertex) at low volume | Buy when ops overhead exceeds savings |
| Feature store | Buy (Tecton/Feast hosted) | Very complex to build correctly |
| Vector DB | Buy (Pinecone) at low volume, self-host at high | Pinecone expensive at scale |
| LLM inference | Buy (OpenAI) under 1M req/mo | Self-host only at high volume |
| Monitoring | Buy (Arize, WhyLabs) | Data science-specific features |
| Data pipelines | Build (Airflow/Dagster) | Standard OSS is good enough |
| Training infra | Build (K8s + Kueue) at scale | Vendor premium high at scale |
Common Mistakes
:::danger Making build vs buy a one-time decision
The optimal answer changes as your scale changes. A startup at 10K requests/month should buy everything. The same company at 10M requests/month should build several things. Re-evaluate major build vs buy decisions annually or when your volume changes by 10×.
:::
:::danger Underestimating the "last mile" of vendor integration
Vendors advertise time-to-value as hours. The real integration - production-grade error handling, authentication, monitoring, retry logic, rate limit handling, and testing - takes 2–4 weeks for any non-trivial system. Always add this to your buy cost estimate.
:::
:::warning Ignoring vendor concentration risk
If your product's critical path runs through a single vendor's API, that vendor has leverage over your business. Vendor downtime becomes your downtime. Price increases are your cost increases. For critical paths, either maintain a backup option or build self-hosted redundancy. Dependency on OpenAI for 100% of inference is a business risk, not just a technical one.
:::
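A backup option for a critical path can be as simple as an ordered list of providers that is tried in turn. The sketch below is illustrative only. The function name and both stub providers are hypothetical, and a production version would catch the client library's specific error types and add timeouts and retries.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, completion) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stubs standing in for a vendor API client and a self-hosted model.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary API timed out")


def self_hosted_backup(prompt: str) -> str:
    return f"[backup] answer to: {prompt}"


used, text = complete_with_fallback(
    "Summarize the contract",
    [("primary_api", flaky_primary), ("self_hosted", self_hosted_backup)],
)
print(used, "->", text)
```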
Interview Q&A
Q: How would you evaluate whether to self-host an LLM vs use an API?
A: Three-step process. First, quality: benchmark the candidate self-hosted model against the API on your actual task distribution - not generic benchmarks. Use LLM-as-judge for scalable comparison. If quality delta is under 10%, proceed. Second, cost: model the full TCO - compute, engineering time to deploy and maintain, infrastructure overhead - vs API cost at your current and projected volume. The breakeven is usually around 1M requests/month for GPT-3.5 class models. Third, risk: assess vendor concentration risk, data privacy requirements, and your team's infrastructure expertise. If all three are favorable, self-host. If quality gap is large or expertise is lacking, stay with the API but abstract behind an interface so you can migrate later.
Q: What should always be wrapped behind an abstraction layer in ML systems?
A: Any external vendor dependency that touches your critical path. Specifically: LLM API calls, vector database operations, feature store access, and experiment tracking logging. The abstraction doesn't have to be complex - a simple wrapper class with a defined interface is sufficient. The goal is that switching vendors requires changing one implementation class, not rewriting every callsite in the codebase. I've seen a team take 4 months to migrate from one vector DB to another because they had 50+ direct references scattered across 15 services. With a proper abstraction, the same migration would have taken 2 weeks.
Q: What factors make self-hosting ML infrastructure NOT worth it?
A: Four main factors. First, small team: if your ML team is under 5 engineers, you don't have the capacity to properly maintain self-hosted infrastructure - it will become a distraction from model development. Second, small volume: below the crossover point (varies by component), vendor premium is less than engineering savings. Third, lack of expertise: Kubernetes, CUDA drivers, distributed systems debugging - self-hosting requires specialists. The cost of hiring or training is often hidden in these analyses. Fourth, pace of iteration: if you're still finding product-market fit, operational complexity slows you down when speed matters most. Buy everything until you're at scale.
Q: How do you handle a situation where you've built something that should have been bought?
A: Acknowledge it early and evaluate the migration cost honestly. Sunk costs are irrelevant - the question is whether the ongoing cost of maintaining the build is higher than the cost of migrating to a buy solution plus the switching cost. Build a transparent TCO comparison showing current cost (including all engineering time spent on maintenance) vs buy + migration cost. Most engineers resist this because it means admitting the build was a mistake. The right framing: you built it because you had to, you learned from it, and now the context has changed. I've led three such migrations - in all cases, the teams that did it quickly (rip-the-bandaid) saved more than teams that delayed because of emotional attachment to what they'd built.
Q: When is it worth paying a 40% premium for a managed service?
A: When the 40% premium is less than the engineering cost of the operational work you're offloading. Suppose the premium comes to $4K/month ($48K/year) and the engineer whose time it frees costs $200K/year, or about $96/hour. If the service saves 4 hours/week: 4 × 52 × $96 = $19,968/year. In this case, build it yourself - the premium exceeds the savings. If it saves 15 hours/week: 15 × 52 × $96 = $74,880 saved against a $48K premium - a clear win for the managed service. The key insight: the compute premium is fixed, while engineering savings scale with operational complexity. More complex infrastructure = higher relative value of managed services.
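The same comparison can be turned around to ask how many hours per week a managed service must save before its premium pays for itself. This is a small sketch using the $4K/month premium and $200K/year engineer from the answer. The function names are hypothetical, and the code uses the unrounded hourly rate, so totals differ slightly from hand arithmetic with $96/hour.

```python
def annual_net_benefit(
    premium_per_month: float,
    hours_saved_per_week: float,
    annual_engineer_cost: float,
    work_hours_per_year: int = 52 * 40,
) -> float:
    """Engineering time saved per year minus the annual premium (positive favors buying)."""
    hourly_rate = annual_engineer_cost / work_hours_per_year
    return hours_saved_per_week * 52 * hourly_rate - premium_per_month * 12


def breakeven_hours_per_week(
    premium_per_month: float,
    annual_engineer_cost: float,
    work_hours_per_year: int = 52 * 40,
) -> float:
    """Weekly hours the service must save for the premium to pay for itself."""
    hourly_rate = annual_engineer_cost / work_hours_per_year
    return premium_per_month * 12 / (hourly_rate * 52)


breakeven = breakeven_hours_per_week(4_000, 200_000)
print(f"Break-even: {breakeven:.1f} hours/week")
print(f"Saving 4 h/week:  {annual_net_benefit(4_000, 4, 200_000):+,.0f} per year")
print(f"Saving 15 h/week: {annual_net_benefit(4_000, 15, 200_000):+,.0f} per year")
```

With these inputs the break-even is a little under 10 hours per week, so a service that saves about a quarter of one engineer's time is roughly cost-neutral at this premium.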
