Build vs. Buy Economics for ML Tools
W&B at $30K or Self-Hosted MLflow: The Full Financial Case
A Series B startup's ML team had been using the free tier of Weights & Biases (W&B) for experiment tracking for 18 months. The team had grown from 3 to 14 data scientists, and W&B's growth plan was priced at $30,000 per year for a team of that size. The engineering lead had a competing proposal: self-host MLflow on the existing Kubernetes cluster. MLflow is open source, so hosting it in-house should, the argument went, cost almost nothing.
The startup's CTO asked for a rigorous financial analysis before committing to either option.
The engineering lead built a total cost of ownership (TCO) model. The analysis took two weeks and produced a result that surprised the team: self-hosted MLflow was not cheaper. Over a 3-year horizon, self-hosted MLflow was estimated to cost roughly $127,000 in engineering time, infrastructure, and opportunity cost, compared to $90,000 for W&B over the same period (3 × $30,000).
The self-hosting estimate included: initial setup and configuration (40 hours of senior engineer time at $150/hour = $6,000), ongoing maintenance (4 hours/month × 36 months × $150/hour = $21,600), additional infrastructure (database, object storage, Kubernetes resources = $18,000 over 3 years), and one major version migration (estimated 80 hours = $12,000). The opportunity cost, meaning features that W&B offers (dashboards, model registry, hyperparameter optimization, integrations with most major frameworks) that MLflow lacks or covers less fully and that would need to be built or worked around, was estimated at 60 engineer-days of equivalent work, or $72,000.
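As a quick check, the line items quoted above can be tallied in a few lines. This is a minimal sketch that assumes the $150/hour fully loaded rate used elsewhere in this lesson; the dictionary keys are illustrative names, not part of the original analysis.

```python
# Tally of the case-study line items (assumed rate: $150/hour, fully loaded).
RATE = 150  # USD per senior-engineer hour (assumption, matches the lesson's model)

self_host_items = {
    "initial_setup": 40 * RATE,          # 40 hours
    "maintenance": 4 * 36 * RATE,        # 4 hours/month for 36 months
    "infrastructure": 18_000,            # database, object storage, Kubernetes
    "version_migration": 80 * RATE,      # one major migration, 80 hours
    "missing_features": 60 * 8 * RATE,   # 60 engineer-days at 8 hours/day
}

self_host_total = sum(self_host_items.values())
wb_total = 3 * 30_000  # three years of the W&B growth plan

for name, cost in self_host_items.items():
    print(f"{name:<20} ${cost:>8,}")
print(f"{'self-host total':<20} ${self_host_total:>8,}")
print(f"{'W&B total':<20} ${wb_total:>8,}")
```

The itemized figures sum to $129,600, in the same range as the roughly $127,000 headline. Either way, the opportunity cost of missing features accounts for more than half of the self-hosting total.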
The CTO approved W&B.
This lesson covers the framework that makes this analysis systematic and defensible.
Why This Exists: The False Promise of "It's Open Source, It's Free"
Open source software is free to download. It is not free to operate. Every self-hosted tool requires:
- Initial setup and configuration
- Security hardening and access control
- Monitoring and alerting
- Backup and disaster recovery
- Version upgrades and migration
- Integration maintenance with other tools
- On-call support when it fails in production at 2 AM
These costs are real, ongoing, and paid in the most expensive currency in any tech company: senior engineering time.
The "build vs. buy" framing is misleading because it hides a third option. The actual choices are:
- Buy: Pay a vendor for a managed service. The cost is explicit and predictable.
- Self-host open source: Pay the vendor nothing; pay your engineering team an implicit tax for ongoing maintenance.
- Build custom: Design, implement, maintain, and evolve a proprietary solution.
Engineering teams systematically undercost the second option, self-hosting, because maintenance costs are invisible: they appear as "time spent on platform tasks" rather than as a line item on a budget. The build vs. buy analysis exists to make these costs explicit and comparable.
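One way to make the implicit tax visible is to convert recurring maintenance hours into an annual dollar figure, and to ask how many engineering hours a SaaS license is worth. The sketch below assumes a $150/hour fully loaded rate; both function names are hypothetical.

```python
def implicit_annual_cost(hours_per_month: float, hourly_rate: float = 150.0) -> float:
    """Annual dollar cost of recurring platform work that never appears on a budget."""
    return hours_per_month * 12 * hourly_rate


def break_even_hours_per_month(annual_license: float, hourly_rate: float = 150.0) -> float:
    """Monthly engineering hours at which self-hosting labor alone equals the license fee."""
    return annual_license / hourly_rate / 12


print(f"4 h/month of maintenance = ${implicit_annual_cost(4):,.0f}/year")
print(f"A $30,000 license = {break_even_hours_per_month(30_000):.1f} engineer-hours/month")
```

Infrastructure and opportunity costs come on top of the labor figure, so the real break-even point is lower than the hours-only number suggests.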
Historical Context
The make-vs-buy decision has been a core operations management question since the early days of manufacturing. The academic framework traces to Williamson (1975) and transaction cost theory: firms should internalize activities where the coordination cost of market transactions exceeds the cost of internal production.
In software, the build vs. buy question became ubiquitous with the rise of SaaS products in the 2000s. For ML tooling specifically, the calculus became complex around 2019–2022 as the MLOps ecosystem exploded: dozens of vendors and open-source projects appeared (MLflow, Weights & Biases, Comet, Neptune, DVC, ClearML, Tecton, Feast, Arize), and much of the functionality sold as a managed service was also available in self-hostable open-source form.
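The key insight that changed many organizations' analysis: engineering talent is the dominant cost in software organizations. At the fully loaded senior-engineer rate of $150/hour used in this lesson, a few hundred hours of maintenance per year costs as much as a SaaS contract in the $10K–$100K range.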
Core Concepts
The TCO Framework for ML Tools
Total Cost of Ownership (TCO) includes all costs over the analysis horizon - not just the licensing or infrastructure cost.
```python
from dataclasses import dataclass


@dataclass
class MLToolTCO:
    """TCO model for an ML tooling decision."""
    tool_name: str
    analysis_years: int = 3

    # Direct costs
    license_cost_per_year: float = 0  # SaaS annual cost
    infra_cost_per_year: float = 0    # Self-hosted compute/storage

    # Engineering time costs (hours × hourly rate)
    initial_setup_hours: float = 0
    ongoing_maintenance_hours_per_month: float = 0
    major_version_upgrade_hours_per_year: float = 0
    integration_maintenance_hours_per_year: float = 0
    oncall_incidents_hours_per_year: float = 0

    # Opportunity costs
    missing_features_engineer_days: float = 0  # engineering days to replicate missing features
    migration_risk_buffer_pct: float = 10      # % buffer for underestimated costs

    # Engineering rate
    senior_engineer_hourly_rate: float = 150   # fully-loaded


def compute_tco(tool: MLToolTCO) -> dict:
    """Compute full TCO over the tool's analysis horizon (3 years by default)."""
    N = tool.analysis_years
    rate = tool.senior_engineer_hourly_rate

    # Direct costs over N years
    license_total = tool.license_cost_per_year * N
    infra_total = tool.infra_cost_per_year * N

    # Engineering time over N years
    setup_cost = tool.initial_setup_hours * rate
    maintenance_cost = tool.ongoing_maintenance_hours_per_month * 12 * N * rate
    upgrade_cost = tool.major_version_upgrade_hours_per_year * N * rate
    integration_cost = tool.integration_maintenance_hours_per_year * N * rate
    oncall_cost = tool.oncall_incidents_hours_per_year * N * rate

    # Opportunity cost: build missing features (8-hour engineer-days)
    opportunity_cost = tool.missing_features_engineer_days * 8 * rate

    subtotal = (
        license_total + infra_total + setup_cost +
        maintenance_cost + upgrade_cost +
        integration_cost + oncall_cost + opportunity_cost
    )

    # Risk buffer
    risk_buffer = subtotal * tool.migration_risk_buffer_pct / 100
    total_tco = subtotal + risk_buffer

    # Note: the "total_3yr_tco" key holds the N-year total; the name assumes N = 3.
    return {
        "tool_name": tool.tool_name,
        "analysis_years": N,
        "license_total": round(license_total, 0),
        "infra_total": round(infra_total, 0),
        "setup_cost": round(setup_cost, 0),
        "maintenance_cost": round(maintenance_cost, 0),
        "upgrade_cost": round(upgrade_cost, 0),
        "integration_cost": round(integration_cost, 0),
        "oncall_cost": round(oncall_cost, 0),
        "opportunity_cost": round(opportunity_cost, 0),
        "risk_buffer": round(risk_buffer, 0),
        "total_3yr_tco": round(total_tco, 0),
        "avg_annual_tco": round(total_tco / N, 0),
    }


# The W&B vs. MLflow analysis
wb_tco = MLToolTCO(
    tool_name="Weights & Biases (managed)",
    analysis_years=3,
    license_cost_per_year=30_000,
    # Minimal setup - vendor manages infrastructure
    initial_setup_hours=20,
    ongoing_maintenance_hours_per_month=1,      # admin: user management, config
    major_version_upgrade_hours_per_year=4,     # automatic, minimal effort
    integration_maintenance_hours_per_year=8,
    oncall_incidents_hours_per_year=4,          # vendor SLA handles most incidents
    missing_features_engineer_days=0,
    migration_risk_buffer_pct=5,
)

mlflow_tco = MLToolTCO(
    tool_name="MLflow (self-hosted on Kubernetes)",
    analysis_years=3,
    license_cost_per_year=0,
    infra_cost_per_year=6_000,                  # Postgres, S3, Kubernetes resources
    initial_setup_hours=40,                     # setup, security, auth integration
    ongoing_maintenance_hours_per_month=4,      # patching, monitoring, user issues
    major_version_upgrade_hours_per_year=30,    # schema migrations, testing
    integration_maintenance_hours_per_year=20,
    oncall_incidents_hours_per_year=12,         # self-managed incidents
    missing_features_engineer_days=60,          # W&B dashboards, model registry, sweeps
    migration_risk_buffer_pct=15,               # self-hosted complexity buffer
)

wb_result = compute_tco(wb_tco)
mlflow_result = compute_tco(mlflow_tco)

# round(x, 0) returns a float, so format with .0f to avoid a trailing ".0"
difference = mlflow_result["total_3yr_tco"] - wb_result["total_3yr_tco"]
print("\n=== W&B vs. Self-Hosted MLflow TCO ===")
print(f"\nWeights & Biases (3-year TCO): ${wb_result['total_3yr_tco']:,.0f}")
print(f"Self-Hosted MLflow (3-year TCO): ${mlflow_result['total_3yr_tco']:,.0f}")
print(f"\nMLflow is ${difference:,.0f} MORE expensive over 3 years")
```
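Because the opportunity cost is the largest single input to the self-hosted estimate, it is worth asking how sensitive the conclusion is to it. The sketch below is a rough sensitivity check, not part of the original analysis: it re-derives the subtotals from the same inputs as the model above (rate, hours, and buffers are copied from that example) and solves for the number of missing-feature engineer-days at which the two options cost the same.

```python
RATE = 150   # USD/hour, fully loaded (same assumption as the model above)
YEARS = 3

# W&B: license plus the small amount of engineering time from the model inputs
wb_hours = 20 + 1 * 12 * YEARS + (4 + 8 + 4) * YEARS
wb_total = (30_000 * YEARS + wb_hours * RATE) * 1.05     # 5% risk buffer

# MLflow: everything except the missing-feature opportunity cost
mlflow_hours = 40 + 4 * 12 * YEARS + (30 + 20 + 12) * YEARS
mlflow_base = 6_000 * YEARS + mlflow_hours * RATE
MLFLOW_BUFFER = 1.15                                      # 15% risk buffer


def mlflow_total(missing_feature_days: float) -> float:
    """Self-hosted TCO as a function of missing-feature engineer-days."""
    return (mlflow_base + missing_feature_days * 8 * RATE) * MLFLOW_BUFFER


# Solve mlflow_total(d) == wb_total for d
break_even_days = (wb_total / MLFLOW_BUFFER - mlflow_base) / (8 * RATE)

print(f"W&B 3-year TCO:            ${wb_total:,.0f}")
print(f"MLflow 3-year TCO (60 d):  ${mlflow_total(60):,.0f}")
print(f"Break-even at {break_even_days:.1f} missing-feature engineer-days")
```

Under these inputs the two options break even at roughly 19 engineer-days of missing-feature work, so the conclusion depends heavily on whether the team would really need to rebuild or work around those features.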
Build vs. Buy Decision Matrix
Hidden Costs of Self-Hosting
The most commonly underestimated costs in self-hosted ML tools:
Security and compliance: Self-hosted tools require configuring authentication (SSO integration), authorization (role-based access control), network security (private VPC, TLS, firewall rules), and audit logging. This is typically 40–80 hours of senior engineering work per tool, more when a compliance framework applies, plus ongoing maintenance as security requirements evolve.
Backup and disaster recovery: A self-hosted experiment tracking system that loses data in a database failure causes significant scientific reproducibility problems. Implementing and testing backup and recovery procedures for each self-hosted tool adds 8–16 hours of setup plus monthly testing overhead.
Multi-tenancy: A shared experiment tracking server serving 14 data scientists needs isolation between teams' experiments, quota management (one user can't consume all storage), and user lifecycle management (onboarding, offboarding). These features are included in SaaS products and often absent in the base OSS version.
Support: When a managed service has an incident, the vendor's engineering team investigates. When your self-hosted instance has an incident, your engineering team investigates. At 2 AM. The implicit on-call cost of self-hosted infrastructure is real and should be included in TCO.
```python
from typing import Optional


def estimate_security_compliance_cost(
    tool_name: str,
    requires_sso: bool = True,
    requires_rbac: bool = True,
    requires_audit_logging: bool = True,
    requires_private_networking: bool = True,
    compliance_frameworks: Optional[list[str]] = None,  # e.g., ["SOC2", "HIPAA"]
    senior_engineer_rate: float = 150,
) -> dict:
    """
    Estimate the security and compliance setup cost for a self-hosted ML tool.
    """
    setup_hours = 0
    annual_maintenance_hours = 0

    if requires_sso:
        setup_hours += 16              # SAML/OAuth integration
        annual_maintenance_hours += 4  # updates, new user issues
    if requires_rbac:
        setup_hours += 12
        annual_maintenance_hours += 3
    if requires_audit_logging:
        setup_hours += 8
        annual_maintenance_hours += 2
    if requires_private_networking:
        setup_hours += 16              # VPC, security groups, TLS config
        annual_maintenance_hours += 4

    compliance_multiplier = 1.0
    if compliance_frameworks:
        # Each compliance framework adds audit and documentation overhead
        compliance_multiplier = 1 + 0.3 * len(compliance_frameworks)
        annual_maintenance_hours += 20 * len(compliance_frameworks)

    total_setup_hours = setup_hours * compliance_multiplier
    total_annual_hours = annual_maintenance_hours * compliance_multiplier

    return {
        "tool": tool_name,
        "setup_cost_usd": round(total_setup_hours * senior_engineer_rate, 0),
        "annual_maintenance_cost_usd": round(total_annual_hours * senior_engineer_rate, 0),
        "3yr_total_cost_usd": round(
            (total_setup_hours + total_annual_hours * 3) * senior_engineer_rate, 0
        ),
        "compliance_frameworks": compliance_frameworks or [],
    }


security_cost = estimate_security_compliance_cost(
    tool_name="Self-hosted MLflow",
    requires_sso=True,
    requires_rbac=True,
    requires_audit_logging=True,
    requires_private_networking=True,
    compliance_frameworks=["SOC2"],
)

# Values are floats after round(x, 0); format with .0f for clean output
print(f"Security/compliance setup cost: ${security_cost['setup_cost_usd']:,.0f}")
print(f"Annual maintenance cost: ${security_cost['annual_maintenance_cost_usd']:,.0f}/year")
print(f"3-year total: ${security_cost['3yr_total_cost_usd']:,.0f}")
```
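The backup and disaster-recovery item can be costed in the same way. The sketch below uses the 8–16 hour setup range quoted above; the one hour per month for restore testing is an assumed placeholder, since the text gives no figure for it, and the function name is hypothetical.

```python
def estimate_backup_dr_cost(
    setup_hours: float,
    monthly_test_hours: float = 1.0,   # assumption: not specified in the text
    years: int = 3,
    senior_engineer_rate: float = 150.0,
) -> float:
    """Setup plus recurring restore-test cost over the analysis horizon."""
    total_hours = setup_hours + monthly_test_hours * 12 * years
    return total_hours * senior_engineer_rate


low = estimate_backup_dr_cost(setup_hours=8)
high = estimate_backup_dr_cost(setup_hours=16)
print(f"Backup/DR over 3 years: ${low:,.0f} to ${high:,.0f}")
```

Even with a modest testing assumption, the recurring drills cost more than the initial setup over three years, which is why the monthly overhead belongs in the TCO model.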
Vendor Lock-In Risk Assessment
One legitimate concern with managed SaaS tools is vendor lock-in: dependence on a vendor whose pricing may change, who may be acquired, or whose service may be discontinued.
```python
from typing import Optional


def assess_vendor_lockin_risk(
    vendor_name: str,
    years_in_market: int,
    annual_revenue_m_usd: Optional[float] = None,
    num_enterprise_customers: Optional[int] = None,  # accepted but not scored here
    data_exportability: str = "full",    # "full", "partial", "locked"
    api_standard: str = "proprietary",   # "open", "proprietary"
    open_source_core: bool = False,
    switching_cost_engineer_weeks: float = 8.0,
    senior_engineer_rate: float = 150,
) -> dict:
    """
    Assess the vendor lock-in risk for a managed ML tool.
    """
    risk_score = 0  # 0 = low risk, 100 = high risk

    # Company stability factors
    if years_in_market < 2:
        risk_score += 25  # startup risk
    elif years_in_market < 5:
        risk_score += 10
    if annual_revenue_m_usd is not None and annual_revenue_m_usd < 20:
        risk_score += 20  # may not reach sustainable scale

    # Data portability
    data_risk = {"full": 0, "partial": 20, "locked": 40}
    risk_score += data_risk.get(data_exportability, 20)

    # API lock-in
    if api_standard == "proprietary":
        risk_score += 15
    if open_source_core:
        risk_score -= 20  # can self-host if vendor disappears

    # Keep the score inside the documented 0-100 range
    risk_score = max(0, min(100, risk_score))

    switching_cost_usd = (
        switching_cost_engineer_weeks * 40 * senior_engineer_rate
    )

    risk_level = (
        "LOW" if risk_score < 25 else
        "MEDIUM" if risk_score < 50 else
        "HIGH"
    )

    # Collect only the recommendations that apply (no None placeholders)
    recommendations = []
    if data_exportability != "full":
        recommendations.append("Ensure contractual data export rights")
    if api_standard == "proprietary":
        recommendations.append("Build abstraction layer over vendor API")
    if risk_level == "HIGH":
        recommendations.append(
            "Maintain ability to self-host (evaluate OSS alternative quarterly)"
        )

    return {
        "vendor": vendor_name,
        "lock_in_risk_score": risk_score,
        "risk_level": risk_level,
        "switching_cost_usd": switching_cost_usd,
        "data_exportability": data_exportability,
        "mitigation_recommendations": recommendations,
    }
```
Contract Negotiation Principles
Knowing your alternatives and what they really cost gives you leverage in negotiation:
```python
def estimate_negotiation_leverage(
    annual_contract_value: float,
    competing_vendor_price: float,
    self_host_tco_per_year: float,
    team_is_design_partner: bool = False,
    multi_year_commitment: bool = False,
) -> dict:
    """
    Estimate negotiation leverage and expected discount range.
    """
    # Discount needed for the vendor to match the competing quote
    competing_discount = (
        (annual_contract_value - competing_vendor_price) /
        annual_contract_value * 100
    )
    # Discount needed to match self-hosting; a negative value means self-hosting
    # already costs more than list price and is not a credible lever
    self_host_discount_needed = (
        (annual_contract_value - self_host_tco_per_year) /
        annual_contract_value * 100
    )

    estimated_discount_floor = 10.0  # assumed minimum for an enterprise negotiation
    estimated_discount_ceiling = min(40.0, max(competing_discount - 5, 20))
    if team_is_design_partner:
        estimated_discount_ceiling += 10
    if multi_year_commitment:
        estimated_discount_ceiling += 10

    midpoint_discount = (estimated_discount_floor + estimated_discount_ceiling) / 2

    return {
        "list_price_annual": annual_contract_value,
        "competing_alternative_price": competing_vendor_price,
        "self_host_annual_tco": self_host_tco_per_year,
        "discount_to_match_self_host_pct": round(self_host_discount_needed, 1),
        "estimated_discount_range": f"{estimated_discount_floor:.0f}–{estimated_discount_ceiling:.0f}%",
        "best_case_price": round(annual_contract_value * (1 - estimated_discount_ceiling / 100), 0),
        "expected_price": round(annual_contract_value * (1 - midpoint_discount / 100), 0),
        "negotiation_levers": [
            f"Competing alternative at ${competing_vendor_price:,.0f}/year",
            f"Self-host alternative at ${self_host_tco_per_year:,.0f}/year",
            "3-year commit for larger discount" if not multi_year_commitment else "Already committing multi-year",
            "Design partner arrangement for beta features" if not team_is_design_partner else "Already design partner",
        ],
    }


leverage = estimate_negotiation_leverage(
    annual_contract_value=30_000,
    competing_vendor_price=24_000,   # competitor's price
    self_host_tco_per_year=42_333,   # $127K / 3 years
    multi_year_commitment=True,
)

print(f"Estimated achievable price: ${leverage['expected_price']:,.0f}/year")
print(f"Best case: ${leverage['best_case_price']:,.0f}/year")
```
Production Engineering Notes
Re-evaluate annually: The build vs. buy decision is not permanent. The vendor market matures; new entrants offer better pricing; your team grows or shrinks. Review major tooling decisions annually.
Pilot before committing: For any tool costing more than $20K/year, run a 60-day pilot before signing a multi-year contract. Identify integration issues, measure actual usage, and validate that the tool solves the problem it was purchased to solve.
Negotiate multi-year discounts last: Sign one year first, validate fit, then negotiate a 3-year deal with volume and multi-year discounts. Don't commit to 3 years on an unproven vendor.
Common Mistakes
:::danger Comparing SaaS list price to OSS zero-dollar license cost
This is the most common error in build vs. buy analysis. The correct comparison is SaaS total cost vs. total cost of self-hosting, including all engineering time. The self-host option is almost never free once engineering time is correctly accounted for.
:::
:::danger Not including security and compliance costs in self-host TCO
A self-hosted ML platform in an enterprise that requires SOC2 compliance must have SSO integration, audit logging, access controls, and a security review process. These requirements add weeks of engineering time per tool. Teams that omit this from their TCO analysis systematically underestimate self-host costs.
:::
:::warning Choosing self-hosted to "avoid vendor lock-in" without assessing the actual switching cost
The switching cost from a well-designed SaaS tool with data export is typically 2–8 weeks of engineering time. This is often less than the ongoing maintenance cost of self-hosting for 1 year. Lock-in is a legitimate concern, but it should be quantified and compared to the maintenance cost, not used as a qualitative trump card.
:::
:::tip Build abstraction layers over vendor APIs
If you decide to use a SaaS tool, wrap vendor-specific APIs in an internal abstraction layer (e.g., a ModelTracker interface). This makes switching vendors a matter of updating the implementation behind the interface rather than updating every experiment in the codebase. Add this to the setup cost of any new SaaS tool.
:::
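A minimal sketch of such an abstraction layer follows. The `ModelTracker` name comes from the tip above; the method names and the `InMemoryTracker` class are hypothetical and do not correspond to any vendor's API. A real adapter would implement the same three methods by calling the vendor SDK.

```python
from typing import Protocol


class ModelTracker(Protocol):
    """Internal tracking interface; experiment code depends only on this."""

    def start_run(self, name: str) -> str: ...
    def log_params(self, run_id: str, params: dict[str, object]) -> None: ...
    def log_metric(self, run_id: str, key: str, value: float, step: int = 0) -> None: ...


class InMemoryTracker:
    """Stand-in backend for tests; a vendor adapter would expose the same methods."""

    def __init__(self) -> None:
        self.runs: dict[str, dict] = {}

    def start_run(self, name: str) -> str:
        run_id = f"{name}-{len(self.runs) + 1}"
        self.runs[run_id] = {"params": {}, "metrics": []}
        return run_id

    def log_params(self, run_id: str, params: dict[str, object]) -> None:
        self.runs[run_id]["params"].update(params)

    def log_metric(self, run_id: str, key: str, value: float, step: int = 0) -> None:
        self.runs[run_id]["metrics"].append((key, value, step))


def train(tracker: ModelTracker) -> str:
    """Experiment code sees only the interface, never a vendor SDK."""
    run_id = tracker.start_run("baseline")
    tracker.log_params(run_id, {"lr": 0.01, "epochs": 2})
    for epoch in range(2):
        tracker.log_metric(run_id, "loss", 1.0 / (epoch + 1), step=epoch)
    return run_id


tracker = InMemoryTracker()
run_id = train(tracker)
print(run_id, tracker.runs[run_id]["params"])
```

Switching vendors then means writing one new class that satisfies `ModelTracker`; `train` and every other caller stay unchanged.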
Interview Q&A
Q: How would you build a TCO model to compare self-hosted MLflow vs. managed W&B?
A: The model needs four cost categories. Direct costs: W&B annual license (known) vs. MLflow infrastructure (Postgres database, S3 artifact store, Kubernetes compute; estimate up to about $700/month, with the model in this lesson using $500/month). Engineering time: initial setup (40 hours for MLflow vs. 20 hours for W&B), ongoing maintenance (4 hours/month for MLflow vs. 1 hour/month for W&B), major version upgrades (30 hours/year for MLflow vs. 4 hours/year for W&B), and on-call incidents. Multiply all engineering hours by the senior engineer fully-loaded rate. Opportunity cost: identify features W&B has that MLflow lacks or covers less fully (richer dashboards, built-in hyperparameter optimization sweeps, model registry workflows). Estimate the engineering days needed to replicate them or accept the productivity loss of their absence. Risk buffer: add 10–15% to the self-host estimate for underestimated complexity. Sum everything over the analysis horizon (3 years) and compare.
Q: When is self-hosting an ML tool genuinely the right choice?
A: Self-hosting makes sense when: the tool is a core competitive capability (your own model registry with custom IP protection), when the SaaS cost exceeds the self-host TCO over a 3-year horizon (rare, but possible at large scale), when compliance requirements cannot be met by the vendor's architecture (e.g., air-gapped environments, national security data), or when the vendor market is immature and no vendor meets your requirements so you self-host temporarily while the market matures. Self-hosting is the right choice for the capability itself, not to avoid the sticker price of SaaS. If the honest TCO comparison favors SaaS but the team still wants to self-host, that's a preference, not a financial decision.
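The four conditions in this answer can be written down as a simple checklist. This is an illustrative sketch with hypothetical class and field names; it encodes the rule of thumb stated above, not a formal decision procedure.

```python
from dataclasses import dataclass


@dataclass
class SelfHostCase:
    core_competitive_capability: bool
    saas_3yr_cost: float
    self_host_3yr_tco: float
    vendor_cannot_meet_compliance: bool   # e.g., air-gapped environment
    no_vendor_meets_requirements: bool    # immature vendor market


def self_host_reasons(case: SelfHostCase) -> list[str]:
    """Return the capability or financial reasons that justify self-hosting."""
    reasons = []
    if case.core_competitive_capability:
        reasons.append("core competitive capability")
    if case.saas_3yr_cost > case.self_host_3yr_tco:
        reasons.append("SaaS cost exceeds self-host TCO")
    if case.vendor_cannot_meet_compliance:
        reasons.append("compliance cannot be met by vendor")
    if case.no_vendor_meets_requirements:
        reasons.append("no vendor meets requirements yet")
    return reasons


# The case-study startup: none of the four conditions hold
startup = SelfHostCase(False, 90_000, 127_000, False, False)
print(self_host_reasons(startup) or "No justification: self-hosting would be a preference")
```

An empty list corresponds to the last sentence of the answer: if no condition holds, wanting to self-host is a preference rather than a financial decision.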
Q: What are the hidden costs of self-hosting ML infrastructure that teams most commonly underestimate?
A: Security and compliance setup (SSO integration, RBAC, audit logging - often 40–80 hours of senior engineer time), backup and disaster recovery implementation and testing (another 16–24 hours plus monthly drills), major version upgrades (ML tools often ship breaking schema migrations; MLflow's tracking database schema, for example, changes between releases and has to be migrated with `mlflow db upgrade`, which calls for a backup and a tested migration plan), multi-tenancy features (the base OSS version rarely has the user isolation, quota management, and access controls needed for a 20-person team), and on-call burden (when the self-hosted tool fails at 2 AM, your engineers are on the hook, not a vendor SLA). Teams that include all of these in their TCO model often find that the engineering cost exceeds the SaaS licensing cost.
Q: How do you evaluate vendor lock-in risk for an ML tooling decision?
A: Assess four dimensions. Data portability: can you export all your experiment data, model artifacts, and metadata in standard formats (JSON, Parquet, MLflow format)? If the answer is no or "call us," lock-in risk is high. API abstraction: are you writing vendor-specific API calls throughout your codebase, or have you wrapped them in an internal abstraction? The former makes switching expensive; the latter makes it manageable. Vendor stability: how long has the company been around, what is their revenue profile, do they have a large enterprise customer base? OSS alternative: does an open-source equivalent exist that you could self-host if the vendor's pricing becomes unacceptable? If yes, your lock-in risk is bounded by the cost of migrating to self-hosted. Finally, quantify the switching cost in engineering weeks and compare it with your price exposure: if switching takes 8 weeks of engineering time ($48,000 at $150/hour) and the vendor raises prices by $20K/year, the switch pays for itself in about 2.4 years, so it is worthwhile only if you expect to keep using the tool well beyond that or the increase is larger.
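The payback period depends on the assumed engineering rate, so it is worth computing explicitly. The sketch below assumes a 40-hour week and the $150/hour fully loaded rate used in this lesson; the function name is hypothetical.

```python
def switching_payback_years(
    switching_weeks: float,
    annual_price_increase: float,
    hourly_rate: float = 150.0,
    hours_per_week: float = 40.0,
) -> float:
    """Years of avoided price increase needed to repay a one-time migration."""
    switching_cost = switching_weeks * hours_per_week * hourly_rate
    return switching_cost / annual_price_increase


years = switching_payback_years(switching_weeks=8, annual_price_increase=20_000)
print(f"Switching cost repaid after {years:.1f} years")
```

With 8 weeks at $150/hour the switch costs $48,000 and takes 2.4 years to repay against a $20K/year increase. At that rate, a payback of under one year requires a migration of about three weeks or less.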
