
MLflow Deep Dive

500 Experiments a Week and Complete Chaos

It is Monday morning at a Series B company with a 20-person data science team. The team runs a recommender system that drives 40% of revenue. Everyone is training models - some people are tuning the embedding dimensions, others are experimenting with different loss functions, a few are running architecture searches. The Slack channel #ml-experiments has become an unreadable stream of "I got 0.847 AUC!" and "what hyperparams did you use?" messages.

The team lead opens the shared experiment spreadsheet and finds 847 rows, roughly half of which are incomplete. The "model path" column points to directories on engineers' laptops. Three of the models marked "best" are from engineers who left the company. One row says "see Jupyter notebook" with no further information.

The team needs to pick a model to promote this week. They have no systematic way to compare the 500 runs from last week. They have no way to reproduce the "best" run from two months ago. They have no idea which training data version corresponds to which run. The model that goes to production will be chosen by whoever argues most convincingly in the Monday meeting.

This team needs MLflow. Not as a nice-to-have, but as the foundation of their entire ML engineering practice.



Why MLflow

MLflow was created at Databricks in 2018 and released as open source. The core insight of its designers was that ML practitioners needed four capabilities that no single tool provided: tracking experiments, packaging projects, packaging models in a standard format, and managing model versions in a registry. MLflow calls these its four components: Tracking, Projects, Models, and Model Registry.

Its strength is being fully open-source, self-hostable, framework-agnostic, and widely adopted. It works with scikit-learn, PyTorch, TensorFlow, XGBoost, LightGBM, and any other framework through its generic logging API. As of 2024 it is one of the most widely used experiment tracking systems in production ML.


MLflow Architecture

The tracking server stores metadata (parameters, metrics, tags) in a relational database and artifacts (model files, large binaries) in an object store. This separation matters: metadata queries are fast SQL queries, while in the S3-backed setup described below artifact transfers go directly between the client and S3, bypassing the server.


Setting Up a Production MLflow Server

Option 1: Local Development

# Install
pip install mlflow

# Start with local SQLite + local artifact store
mlflow server \
--backend-store-uri sqlite:///mlflow.db \
--default-artifact-root ./mlruns \
--host 0.0.0.0 \
--port 5000

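This is fine for a single engineer, or for a few people sharing one workstation. It is not suitable for a distributed team: SQLite handles concurrent writers poorly, and the artifacts live on one machine's local disk.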

Option 2: Team Server with PostgreSQL + S3

# Install with extras
pip install "mlflow[extras]" psycopg2-binary boto3

# Start the server
mlflow server \
--backend-store-uri postgresql://mlflow_user:password@db-host:5432/mlflow \
--default-artifact-root s3://your-bucket/mlflow-artifacts \
--host 0.0.0.0 \
--port 5000 \
--workers 4

The PostgreSQL backend supports concurrent writes from many training jobs without corruption. The S3 artifact store means model files are stored durably and accessible from any machine.

Docker Compose for Team Setup

# docker-compose.yml
version: "3.8"
services:
  mlflow-db:
    image: postgres:16
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: ${MLFLOW_DB_PASSWORD}
    volumes:
      - mlflow-db-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 10s
      timeout: 5s
      retries: 5

  mlflow-server:
    # Note: this base image may not bundle psycopg2 or boto3. If the server
    # fails to import them, build a small derived image that pip-installs both.
    image: ghcr.io/mlflow/mlflow:v2.9.2
    ports:
      - "5000:5000"
    depends_on:
      mlflow-db:
        condition: service_healthy
    environment:
      MLFLOW_BACKEND_STORE_URI: postgresql://mlflow:${MLFLOW_DB_PASSWORD}@mlflow-db:5432/mlflow
      MLFLOW_DEFAULT_ARTIFACT_ROOT: s3://${ARTIFACT_BUCKET}/mlflow
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:${MLFLOW_DB_PASSWORD}@mlflow-db:5432/mlflow
      --default-artifact-root s3://${ARTIFACT_BUCKET}/mlflow
      --host 0.0.0.0
      --port 5000
      --workers 4

volumes:
  mlflow-db-data:

Environment Configuration

In every training script or training container, set:

export MLFLOW_TRACKING_URI=http://mlflow.internal:5000
export MLFLOW_EXPERIMENT_NAME=q4_ctr_improvement

Or in Python:

import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("q4_ctr_improvement")

The Four MLflow Components

Component 1: MLflow Tracking

The tracking API is the heart of MLflow. Runs are organized into Experiments. Each run has parameters, metrics, tags, and artifacts.

import inspect

import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# TransformerRanker, train_epoch, evaluate, train_loader, and val_loader
# are assumed to be defined elsewhere in the project.

mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("ctr_prediction_transformer")

with mlflow.start_run(run_name="transformer_lr1e4_bs64_v3"):
    # --- Log all hyperparameters before training ---
    params = {
        "model": "TransformerRanker",
        "learning_rate": 1e-4,
        "batch_size": 64,
        "max_epochs": 100,
        "d_model": 128,
        "num_heads": 8,
        "num_layers": 4,
        "dropout": 0.1,
        "optimizer": "AdamW",
        "weight_decay": 0.01,
        "scheduler": "cosine",
        "random_seed": 42,
        "dataset": "clickstream_2024q3_v2",
        "dataset_hash": "a3f9c1d2e5b8",
    }
    mlflow.log_params(params)

    # Pass only the hyperparameters that the model constructor accepts
    accepted = inspect.signature(TransformerRanker).parameters
    model = TransformerRanker(**{k: v for k, v in params.items() if k in accepted})
    optimizer = AdamW(model.parameters(), lr=params["learning_rate"],
                      weight_decay=params["weight_decay"])
    scheduler = CosineAnnealingLR(optimizer, T_max=params["max_epochs"])

    best_val_auc = 0.0

    for epoch in range(params["max_epochs"]):
        train_loss = train_epoch(model, train_loader, optimizer)
        val_metrics = evaluate(model, val_loader)

        # Log metrics with the epoch as step
        mlflow.log_metrics({
            "train/loss": train_loss,
            "val/loss": val_metrics["loss"],
            "val/auc": val_metrics["auc"],
            "val/ndcg_10": val_metrics["ndcg_10"],
            "learning_rate": scheduler.get_last_lr()[0],
        }, step=epoch)

        if val_metrics["auc"] > best_val_auc:
            best_val_auc = val_metrics["auc"]
            # Log the best model so far as an artifact
            mlflow.pytorch.log_model(model, "best_model")

        scheduler.step()

    # Log final summary metrics (easy to query without plotting curves)
    mlflow.log_metrics({
        "final/best_val_auc": best_val_auc,
        "final/train_epochs_completed": epoch + 1,
    })

    # Log any output files
    mlflow.log_artifact("evaluation_outputs/confusion_matrix.png")
    mlflow.log_artifact("configs/model_config.yaml")
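The `dataset_hash` parameter logged above has to come from somewhere. One way to produce it, sketched here under the assumption that the training data is a single file on disk, is to hash the file's bytes. The function name `dataset_fingerprint` and the 12-character truncation are illustrative choices, not part of MLflow.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(path: str, chunk_size: int = 1 << 20, length: int = 12) -> str:
    """Return a short, stable hex digest of a file's bytes (hypothetical helper)."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # Truncate for readability in the UI; keep the full digest if collisions matter.
    return digest.hexdigest()[:length]
```

Inside a run, the result could be logged with `mlflow.log_param("dataset_hash", dataset_fingerprint(train_path))`, where `train_path` is whatever file the job actually reads. For datasets spread over many files, hashing a sorted manifest of per-file digests is a common variation.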

Component 2: MLflow Projects

MLflow Projects package ML code so anyone can reproduce a training run with a single command. A project is a directory with an MLproject file.

# MLproject
name: ctr_prediction

conda_env: conda.yaml
# Or: docker_env:
#   image: python:3.11-slim

# Documented parameter types are string, float, path, and uri;
# values are substituted into the command as text.
entry_points:
  train:
    parameters:
      learning_rate: {type: float, default: 1e-4}
      batch_size: {type: float, default: 64}
      max_epochs: {type: float, default: 100}
      experiment_name: {type: string, default: "default"}
    command: >
      python train.py
      --learning_rate {learning_rate}
      --batch_size {batch_size}
      --max_epochs {max_epochs}
      --experiment_name {experiment_name}

  evaluate:
    parameters:
      run_id: {type: string}
    command: "python evaluate.py --run_id {run_id}"

Run it from anywhere:

# Run from a git repo with specific params
mlflow run git@<your-git-host>:yourorg/ctr-model.git \
-P learning_rate=5e-5 \
-P batch_size=128 \
--experiment-name "lr_sweep_oct"
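The `train` entry point passes its parameters to `train.py` as command-line flags, so the script needs a matching argument parser. The sketch below shows one plausible shape for that parser; the original does not show `train.py`, so the function name `parse_args` and the defaults (copied from the MLproject file) are assumptions.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the flags that the MLproject `train` entry point passes in."""
    parser = argparse.ArgumentParser(description="CTR training entry point")
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--max_epochs", type=int, default=100)
    parser.add_argument("--experiment_name", type=str, default="default")
    # argv=None makes argparse read sys.argv, which is what `mlflow run` relies on.
    return parser.parse_args(argv)
```

Because MLflow substitutes parameter values into the command as text, the type conversion happens here in the script, not in the MLproject file.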

Component 3: MLflow Models

The Models component defines a standard format for packaging trained models. The key concept is the flavor - a framework-specific way of loading a model. A logged PyTorch model can be loaded as a PyTorch model, or as a generic Python function, or as a REST endpoint, without changing the saved artifact.

import joblib
import mlflow.pyfunc
import pandas as pd
import torch

# Log a model with custom pre/post-processing
class CTRModelWrapper(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # context.artifacts maps the keys below to local paths resolved by MLflow
        self.model = torch.load(context.artifacts["model_path"])
        self.preprocessor = joblib.load(context.artifacts["preprocessor_path"])
        self.model.eval()

    def predict(self, context, model_input: pd.DataFrame) -> pd.Series:
        features = self.preprocessor.transform(model_input)
        with torch.no_grad():
            scores = self.model(torch.tensor(features, dtype=torch.float32))
        return pd.Series(scores.numpy().flatten())

# Log the wrapped model
artifacts = {
    "model_path": "models/best_model.pt",
    "preprocessor_path": "models/preprocessor.pkl",
}

mlflow.pyfunc.log_model(
    artifact_path="ctr_model",
    python_model=CTRModelWrapper(),
    artifacts=artifacts,
    pip_requirements=["torch==2.1.0", "scikit-learn==1.3.0", "pandas==2.0.3"],
)

Component 4: MLflow Model Registry

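The Model Registry is where trained models graduate from "experimental run" to "production candidate." It manages the promotion lifecycle: None → Staging → Production → Archived. Note that stages are deprecated as of MLflow 2.9 in favor of model version aliases and tags; the stage-based API shown here still works in MLflow 2.x.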

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model from a run
run_id = "3f4e5d6c7b8a..."
model_uri = f"runs:/{run_id}/best_model"

model_version = mlflow.register_model(
    model_uri=model_uri,
    name="ctr_ranker",
)

# Add description to this version
client.update_model_version(
    name="ctr_ranker",
    version=model_version.version,
    description=(
        "Transformer ranker trained on Q3 2024 clickstream. "
        "Val AUC: 0.891. Trained by: @sarah. "
        f"Run ID: {run_id}"
    ),
)

# Transition to Staging after automated tests pass
client.transition_model_version_stage(
    name="ctr_ranker",
    version=model_version.version,
    stage="Staging",
    archive_existing_versions=False,
)

# Transition to Production after manual approval
client.transition_model_version_stage(
    name="ctr_ranker",
    version=model_version.version,
    stage="Production",
    archive_existing_versions=True,  # archive the previous production model
)

Load the production model without knowing its run ID:

# In your serving code - always gets the current production model
model = mlflow.pyfunc.load_model("models:/ctr_ranker/Production")
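MLflow 2.9 deprecated stages in favor of model version aliases, so newer deployments may prefer an alias-based equivalent of the promotion above. The sketch below assumes an alias named `champion` (an arbitrary choice) and takes the `MlflowClient` as an argument; the helper names are illustrative.

```python
def model_alias_uri(name: str, alias: str) -> str:
    """Build the models:/ URI that resolves to whichever version holds the alias."""
    return f"models:/{name}@{alias}"


def promote_with_alias(client, name: str, version: str, alias: str = "champion") -> str:
    """Point `alias` at `version` and return the URI serving code should load.

    `client` is expected to be an mlflow.tracking.MlflowClient.
    """
    # Reassigning an alias moves it off whichever version held it before.
    client.set_registered_model_alias(name, alias, version)
    return model_alias_uri(name, alias)
```

Serving code would then call `mlflow.pyfunc.load_model("models:/ctr_ranker@champion")` in place of the stage-based URI. Rolling back means pointing the alias at the earlier version again.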

Autologging

MLflow autologging captures common hyperparameters and metrics automatically for supported frameworks. One line of code replaces dozens of manual log_param calls.

import mlflow

# Enable autologging for all supported frameworks
mlflow.autolog(
    log_input_examples=True,    # log a sample of input data
    log_model_signatures=True,  # log input/output schema
    log_models=True,            # log the trained model
    disable=False,
    exclusive=False,
    disable_for_unsupported_versions=False,
    silent=False,
)

# Or enable for specific frameworks only
mlflow.sklearn.autolog()     # scikit-learn
mlflow.pytorch.autolog()     # PyTorch Lightning
mlflow.xgboost.autolog()     # XGBoost
mlflow.lightgbm.autolog()    # LightGBM
mlflow.tensorflow.autolog()  # TensorFlow / tf.keras

# After enabling, just train normally - the framework-level details are logged
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# X and y are assumed to be an already-loaded feature matrix and label vector
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    # Autologging captures the estimator parameters (n_estimators, max_depth,
    # learning_rate, ...), training-set metrics, and the fitted model itself
    clf = GradientBoostingClassifier(n_estimators=200, max_depth=5, learning_rate=0.05)
    clf.fit(X_train, y_train)
    # No explicit log_param or log_metric calls needed

:::note Autologging is a starting point, not a complete solution. It captures the framework-level hyperparameters and standard metrics, but it does not know about your business metrics, your dataset version, or your custom preprocessing steps. Always supplement autologging with explicit manual logging for domain-specific information. :::


Nested Runs for Hyperparameter Optimization

When running HPO sweeps, you want a parent run that represents the entire sweep and child runs for each trial. This keeps the UI clean and lets you query "what was the best trial from sweep X?"

import optuna
import mlflow

# train_and_evaluate is assumed to be defined elsewhere and to return val AUC.

def objective(trial: optuna.Trial) -> float:
    """Each Optuna trial is a nested MLflow run."""
    with mlflow.start_run(nested=True,
                          run_name=f"trial_{trial.number:04d}"):
        # Suggest hyperparameters
        lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
        batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
        dropout = trial.suggest_float("dropout", 0.0, 0.5)
        d_model = trial.suggest_categorical("d_model", [64, 128, 256])
        num_layers = trial.suggest_int("num_layers", 2, 8)

        mlflow.log_params({
            "learning_rate": lr,
            "batch_size": batch_size,
            "dropout": dropout,
            "d_model": d_model,
            "num_layers": num_layers,
        })

        val_auc = train_and_evaluate(lr, batch_size, dropout, d_model, num_layers)
        mlflow.log_metric("val_auc", val_auc)

        return val_auc

# The parent run wraps the entire sweep
with mlflow.start_run(run_name="optuna_sweep_transformer_v2"):
    mlflow.log_params({
        "sweep_algorithm": "TPE",
        "n_trials": 100,
        "n_jobs": 1,
        "direction": "maximize",
    })

    study = optuna.create_study(
        direction="maximize",
        sampler=optuna.samplers.TPESampler(seed=42),
        pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
    )
    # Trials run sequentially here. With n_jobs > 1 Optuna runs trials in threads,
    # and MLflow's active-run context may not carry the parent run into those
    # threads, so nested=True can fail to link children. In that case, link them
    # explicitly (for example via the mlflow.parentRunId tag).
    study.optimize(objective, n_trials=100, n_jobs=1)

    best_trial = study.best_trial
    mlflow.log_metrics({
        "best_val_auc": best_trial.value,
        "best_trial_number": best_trial.number,
    })
    mlflow.log_params({f"best_{k}": v for k, v in best_trial.params.items()})
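To answer "what was the best trial from sweep X?" programmatically, the child runs can be selected through the `mlflow.parentRunId` tag that MLflow sets on nested runs. The sketch below takes an `MlflowClient` as an argument; the helper names are illustrative, and the metric name `val_auc` matches the one logged in the objective above.

```python
def child_runs_filter(parent_run_id: str) -> str:
    """Filter string that selects the nested runs under one parent run."""
    return f"tags.mlflow.parentRunId = '{parent_run_id}'"


def best_child_run(client, experiment_id: str, parent_run_id: str,
                   metric: str = "val_auc"):
    """Return the child run with the highest `metric`, or None if there are none.

    `client` is expected to be an mlflow.tracking.MlflowClient.
    """
    runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string=child_runs_filter(parent_run_id),
        order_by=[f"metrics.{metric} DESC"],
        max_results=1,
    )
    return runs[0] if runs else None
```

This gives the same answer as `study.best_trial`, but it works after the Optuna study object is gone, using only what is stored on the tracking server.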

MLflow UI: Key Workflows

Run Comparison

The comparison view is the most used feature in practice. Select multiple runs, click "Compare," and MLflow renders side-by-side parameter tables and overlaid metric curves.

Key workflow for a team: at the end of each experiment week, open the comparison view, filter runs by experiment name and status FINISHED, sort by val/auc descending, and screenshot the top 10 for the weekly sync.

Searching Runs Programmatically

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Find all finished runs with val AUC above 0.88 using the Q3 dataset
runs = client.search_runs(
    experiment_ids=["2"],
    filter_string=(
        "metrics.`val/auc` > 0.88 "
        "AND params.dataset = 'clickstream_2024q3_v2' "
        # Run status is a built-in attribute, not a tag
        "AND attributes.status = 'FINISHED'"
    ),
    order_by=["metrics.`val/auc` DESC"],
    max_results=20,
)

for run in runs:
    print(f"Run: {run.info.run_name}")
    print(f"  AUC: {run.data.metrics['val/auc']:.4f}")
    print(f"  LR: {run.data.params['learning_rate']}")
    print(f"  Run ID: {run.info.run_id}")

MLflow at Scale: 500+ Experiments Per Week

When the team grows to 20 engineers running 500 experiments per week, several operational issues emerge:

Issue 1: Tracking server becomes a bottleneck. Solution: increase --workers to match your core count, use connection pooling in front of PostgreSQL (PgBouncer), and confirm that artifact uploads go directly to S3 (not through the server).

Issue 2: S3 artifact costs balloon. Solution: implement lifecycle policies to move artifacts older than 90 days to S3 Glacier. Archive or delete runs from failed jobs automatically. Keep only the "best" model artifact per parent run.

Issue 3: Experiment names collide. Solution: enforce naming conventions via a wrapper around mlflow.start_run() that validates the naming schema before creating the run.

import os
import re
import socket
import sys
from contextlib import contextmanager

import mlflow
import torch

VALID_EXPERIMENT_PATTERN = re.compile(
    r"^[a-z][a-z0-9_]+/[a-z][a-z0-9_]+$"
)

@contextmanager
def tracked_run(experiment_name: str, run_name: str, **tags):
    """Enforces naming conventions and adds standard tags."""
    if not VALID_EXPERIMENT_PATTERN.match(experiment_name):
        raise ValueError(
            f"Experiment name '{experiment_name}' must match pattern "
            f"'team/project' with lowercase letters, digits, and underscores."
        )

    mlflow.set_experiment(experiment_name)

    with mlflow.start_run(run_name=run_name) as run:
        # Automatically add standard tags
        mlflow.set_tags({
            "mlflow.user": os.environ.get("USER", "unknown"),
            "hostname": socket.gethostname(),
            "job_id": os.environ.get("SLURM_JOB_ID", "local"),
            **tags,
        })

        # Automatically log environment
        mlflow.log_params({
            "python_version": sys.version.split()[0],
            "torch_version": torch.__version__,
        })

        yield run
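For Issue 2, the 90-day Glacier transition can be expressed as an S3 lifecycle configuration. The sketch below builds the configuration as a plain dictionary and applies it through a boto3 S3 client passed in by the caller. The prefix, rule ID, and function names are assumptions for illustration.

```python
def glacier_lifecycle_config(prefix: str = "mlflow-artifacts/", days: int = 90) -> dict:
    """Build a lifecycle configuration that moves old artifacts to Glacier."""
    return {
        "Rules": [
            {
                "ID": "mlflow-artifacts-to-glacier",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


def apply_lifecycle(s3_client, bucket: str, config: dict) -> None:
    """Apply the configuration; `s3_client` is expected to be boto3.client("s3")."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=config,
    )
```

One caveat with this design: objects in Glacier must be restored before they can be read, so models that may still need to be loaded on short notice should be kept outside the archived prefix.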

Common Mistakes

:::danger Running Without a Tracking URI Set If MLFLOW_TRACKING_URI is not set, MLflow writes to a local mlruns/ directory. Every engineer's local mlruns/ is an island - no sharing, no comparison. Always set the tracking URI at the top of every training script or in environment variables. :::
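One way to make this mistake impossible is to fail fast when the variable is missing, so a job never falls back silently to a local mlruns/ directory. A minimal sketch follows; the helper name `require_tracking_uri` is hypothetical.

```python
import os


def require_tracking_uri(env=None) -> str:
    """Return MLFLOW_TRACKING_URI, or raise if it is unset or empty."""
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI", "").strip()
    if not uri:
        raise RuntimeError(
            "MLFLOW_TRACKING_URI is not set; refusing to log to a local mlruns/ directory."
        )
    return uri
```

Calling `mlflow.set_tracking_uri(require_tracking_uri())` at the top of a training script turns a silent misconfiguration into an immediate, readable error.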

:::danger Logging Model Weights on Every Epoch Logging a 500MB model checkpoint every epoch generates 50GB over a 100-epoch run. Log only the best model (tracked by a running best metric), not every epoch. Log conditionally, in a callback or directly in the training loop:

if val_auc > best_val_auc:
    best_val_auc = val_auc
    mlflow.pytorch.log_model(model, "best_model")

:::

:::warning Not Using the Registry for Promotion Engineers sometimes just copy model files to a "production" directory instead of using the model registry. This bypasses lineage tracking and makes it impossible to answer "which run produced the current production model?" Enforce the use of the registry via team norms or automated checks in your deployment pipeline. :::

:::warning Mixing Experiments and Runs Semantics An Experiment in MLflow groups related runs testing one hypothesis or project. A Run is a single training job. A common mistake is creating one experiment for the entire team and dumping all runs into it. This makes filtering and comparison nearly impossible after 100 runs. Create one experiment per project or per major hypothesis. :::


Interview Q&A

Q: Explain the difference between MLflow's backend store and artifact store.

A: The backend store holds structured metadata: run parameters, metrics, tags, and run status. It is a relational database (SQLite, PostgreSQL, MySQL). The artifact store holds binary files: model weights, datasets, images, and other large files. It is typically an object store (S3, GCS, Azure Blob). The tracking server reads from and writes to the backend store. When the artifact root is an object-store URI such as s3://..., artifact uploads go directly from the client to the artifact store, bypassing the tracking server; the exception is proxied artifact access (the --artifacts-destination option), where the server relays them. This separation is important for performance - metadata queries are fast SQL operations, while artifact transfers can be gigabytes and are best kept off the server.

Q: How would you implement blue-green model deployment using the MLflow Model Registry?

A: Use the staging/production transition as the deployment gate. When a new model passes automated evaluation, transition it to Staging. Your deployment pipeline polls for the current Staging model, runs integration tests against it, then transitions it to Production (archiving the previous production version). The serving infrastructure always loads models:/model_name/Production. Since the registry is the source of truth, rolling back is just transitioning the previous version back to Production and archiving the new one. No file copying needed.
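The rollback step in this answer can be sketched with the same stage-based API used in the registry section. The helper names are illustrative, and `client` is expected to be an `MlflowClient` supplied by the caller.

```python
from typing import Iterable, Optional


def previous_version(versions: Iterable[int], current: int) -> Optional[int]:
    """Pick the newest version number below `current`, if any."""
    older = [v for v in versions if v < current]
    return max(older) if older else None


def roll_back(client, name: str, to_version: int) -> None:
    """Move `to_version` back to Production and archive whatever is there now."""
    client.transition_model_version_stage(
        name=name,
        version=str(to_version),
        stage="Production",
        archive_existing_versions=True,
    )
```

Picking "the newest older version" is a simplification. A real pipeline would more likely record which version was in Production before the promotion and roll back to exactly that one.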

Q: How do you handle MLflow in a multi-cloud or hybrid environment where some training runs happen on-premises and others in the cloud?

A: Use a single centralized MLflow tracking server accessible from all environments (on-prem and cloud). Set MLFLOW_TRACKING_URI to the server's URL in all training environments. For artifact storage, use a cloud object store (S3 or GCS) accessible from all environments - this is the simplest approach. If data sovereignty requires keeping artifacts on-prem, configure per-experiment artifact locations using mlflow.create_experiment() with a custom artifact location pointing to an on-prem MinIO or NFS store.
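The per-experiment artifact location can be sketched as a small helper that chooses the artifact root by site. The site names and bucket URIs below are placeholders, and `client` is expected to be an `MlflowClient` supplied by the caller.

```python
# Hypothetical mapping from site to artifact root; replace with real locations.
ARTIFACT_ROOTS = {
    "cloud": "s3://your-bucket/mlflow-artifacts",
    "onprem": "s3://onprem-mlflow/artifacts",  # e.g. a bucket served by MinIO
}


def create_experiment_at(client, name: str, site: str) -> str:
    """Create an experiment whose artifacts live under the chosen site's root."""
    if site not in ARTIFACT_ROOTS:
        raise ValueError(f"Unknown site '{site}'; expected one of {sorted(ARTIFACT_ROOTS)}")
    location = f"{ARTIFACT_ROOTS[site]}/{name}"
    # The artifact location is fixed at creation time for every run in the experiment.
    return client.create_experiment(name, artifact_location=location)
```

For an S3-compatible store such as MinIO, training clients also need `MLFLOW_S3_ENDPOINT_URL` set so that uploads reach the on-prem endpoint and not AWS.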

Q: What is the purpose of MLflow's model signature and input example?

A: A model signature defines the expected input schema (column names and types) and output schema. An input example is a sample input record. Together they serve as a contract between the model and its consumers. When you load a model and pass it an input with wrong column names or types, MLflow raises an error before the model even runs - catching integration bugs early. The input example also documents what a valid request looks like for anyone building REST or batch inference around the model. Build the signature with mlflow.models.infer_signature(X_train[:5], model.predict(X_train[:5])), then pass it as signature=, along with input_example=, when calling log_model.

Q: How would you debug a situation where two runs with identical hyperparameters produce different validation metrics?

A: Check in order: (1) random seed - are all library seeds (Python, NumPy, PyTorch) set and logged? (2) dataset version - do both runs use exactly the same data, confirmed by hash? (3) environment - are library versions identical, confirmed by pip freeze? (4) hardware - are both runs on the same GPU model with the same CUDA version? (5) data loading order - if using multi-worker DataLoaders with shuffle, is the shuffle seed fixed? (6) floating-point non-determinism - CUDA operations are not deterministic by default; setting torch.backends.cudnn.deterministic=True (and, more strictly, torch.use_deterministic_algorithms(True)) trades performance for reproducibility.
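Check (1) is easier to pass when every script seeds all libraries through a single helper and logs the seed it used. The sketch below seeds NumPy and PyTorch only if they are installed; the function name `seed_everything` is a common convention, not an MLflow API.

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Seed Python, and NumPy / PyTorch when available, for repeatable runs."""
    # PYTHONHASHSEED only affects interpreters started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    # Prefer deterministic cuDNN kernels over autotuned ones.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Calling this once at the start of a run, alongside `mlflow.log_param("random_seed", seed)`, covers checks (1) and most of (6). DataLoader worker seeding from check (5) still needs its own generator or `worker_init_fn`.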

© 2026 EngineersOfAI. All rights reserved.