
Data Collection Strategy - Building the Moat Before Training the Model

Reading time: ~35 minutes | Level: ML System Design | Role: MLE, Data Scientist, AI Engineer


The Dataset That Determined Everything

In 2018, a startup raised $2M to build a medical imaging AI for detecting diabetic retinopathy - one of the leading causes of preventable blindness. The founding team was technically exceptional. They trained ResNet50 on 5,000 labeled fundus images from a partner hospital in San Francisco. On their held-out test set: 91% accuracy. Investors were impressed. The demo was clean.

Six months after deployment in a rural clinic in southern India, the accuracy had dropped to 67%. That gap - 91% to 67% - was not a model failure. The model was doing exactly what it had been trained to do. It had learned to predict diabetic retinopathy from a specific distribution of fundus images: a particular brand of camera used at one San Francisco hospital, a patient demographic skewed toward insured adults, images taken in controlled lighting conditions, images pre-screened by an ophthalmologist who had already filtered out the blurriest captures.

The rural Indian clinic used a different camera. Different demographics. Different lighting. Different image quality distribution. Different pre-processing pipeline. The model had learned "retinopathy as seen by one Topcon camera in San Francisco" - not "retinopathy as a biological phenomenon." The training data strategy had determined the model's fate before a single line of code was written.

This is not a rare story. It is the modal story of ML projects that fail after deployment. And the solution is not a better model architecture. It is a better understanding of where data comes from, what distribution it represents, and how that distribution relates to the deployment context.


Why Data Strategy Comes Before Model Design

The machine learning community spent much of the deep learning era optimizing model architectures, loss functions, and training procedures. Andrew Ng's 2021 framing of "data-centric AI" made explicit what practitioners had long known privately: for most real-world ML problems, the quality of your training data has more impact on model performance than your model architecture choices.

This does not mean "just collect more data." It means:

  1. The right data - data from the distribution you will actually serve
  2. With the right labels - labels that accurately reflect the concept you are trying to learn
  3. Of sufficient quality - low noise, consistent annotation, no leakage
  4. Versioned and reproducible - so you can trace model behavior back to specific training data

Getting this wrong is expensive in two ways. First, the direct cost: you train an expensive model on bad data and it fails in production. Second, the hidden cost: you debug the model architecture for weeks looking for the problem that actually lives in the data pipeline.

The framework in this lesson gives you a systematic way to design data collection strategies before training begins - the way senior ML engineers at Google, Meta, and Amazon approach it.


Where Data Comes From

Every training dataset is sourced from one or more of four categories. Understanding the tradeoffs between them is the first design decision.

First-Party Data: Your Product Logs

This is the most valuable category for most ML applications. It is data generated by users interacting with your product: clicks, purchases, searches, session events, API calls, sensor readings. It is high volume, continuously generated, and directly represents your serving distribution - because it comes from the same users in the same contexts you will serve.

Advantages:

  • Perfectly aligned with your deployment distribution (no distribution shift on day one)
  • High volume at scale - billions of events per day for large platforms
  • Continuously refreshed - new data arrives with every user interaction
  • Captures the full context of each decision (user state, item state, ranking position)

Disadvantages:

  • Biased by your current system - you only observe outcomes for items your system chose to show. Items the system ranked low never appear in logs. This is exposure bias.
  • Implicit labels only - you observe behavior (clicks, scrolls, time spent) not intent or satisfaction
  • Dependent on existing user base - if you have no users, you have no logs

When to use: any application where your product is already generating interaction data. Search, recommendations, ad systems, fraud detection, content moderation.

Second-Party Data: Partner Data

Data licensed or shared from a business partner. A hospital consortium sharing anonymized patient records. A financial institution sharing transaction data for fraud model training. A retailer sharing purchase histories for recommendation system development.

Advantages:

  • Domain-specific and often higher quality than generic third-party data
  • May come with expert labels (medical data labeled by specialists)
  • Can cover distributions your own product cannot yet reach (important for cold start)

Disadvantages:

  • Access-controlled and legally complex (HIPAA, data sharing agreements)
  • May have usage restrictions that limit how you train or serve
  • Partner's data collection process may not match your deployment context (see the retinopathy story above)

Third-Party Data: Purchased Datasets

Commercial data providers sell datasets that can augment model training.

Advantages:

  • Immediately available without needing users or partners
  • Can provide signals you cannot observe directly
  • Good for cold start - use to bootstrap before first-party data accumulates

Disadvantages:

  • Expensive and often generic - not tailored to your specific task
  • Quality is opaque and variable
  • Privacy and regulatory risk - GDPR/CCPA restrictions on using third-party personal data

Synthetic Data: Generated Data

Data generated programmatically or by a generative model. Simulation-generated data for robotics, LLM-generated examples for NLP fine-tuning, augmented images for computer vision.

Advantages:

  • Fully controllable distribution - you can oversample rare cases (rare disease examples, edge-case fraud patterns)
  • Lower privacy exposure - no real individuals appear directly in the training data, although a generator trained on real records can still memorize and leak them
  • Can generate arbitrarily large datasets cheaply

Disadvantages:

  • Domain gap - synthetic data may not generalize to real-world inputs
  • Fine-tuning on LLM-generated text can cause model collapse (a progressive loss of diversity) if the model trains primarily on its own outputs

When to use: to augment real data for rare classes, to pretrain before real data is available, or for applications where simulation is high-fidelity (game environments, physical simulations).


The Data Flywheel

The data flywheel is the most powerful competitive moat in ML. It is a self-reinforcing cycle: more users generate more data, which trains better models, which create better products, which attract more users.

Amazon's flywheel: every purchase generates a data point (user, item, price, context, outcome). Amazon's recommendation model trains on billions of these data points. Better recommendations increase sales. More sales generate more data. This is why Amazon's recommendations improve every year without fundamental architecture changes.

Duolingo's flywheel: every lesson attempt generates data about which exercises cause learners to fail and which cause them to succeed. Duolingo's adaptive learning model trains on this to serve easier or harder exercises. Better calibration increases lesson completion. More completed lessons generate more data.

Google Maps' flywheel: every navigated trip generates data about actual travel times vs predicted travel times. This data trains better ETA models. Better ETAs make navigation more reliable. More reliable navigation means more users navigate with Maps.

Bootstrapping from Zero

The data flywheel requires data to spin. How do you start?

Strategy 1: Manual curation. Before you have any users, curate a small, high-quality dataset by hand. 1,000 labeled examples, carefully selected to cover the distribution you expect to serve, are more valuable than 100,000 random examples. Use this to train a V1 model good enough to attract initial users.

Strategy 2: Use a related existing model. If you are building a sentiment classifier for product reviews, use an open-source model (BERT fine-tuned on Amazon reviews) to pseudo-label your data. Then train on pseudo-labels. Then use your product's actual data to fine-tune further.

Strategy 3: Heuristics and rules. Before any ML, deploy a rule-based system. Rules generate predictions. Predictions generate user interactions. Interactions generate data. After 3-6 months, you have enough data to train a model that beats the rules. This is how Google's spam filter started.

Strategy 4: Third-party and transfer. Buy or use public datasets. Train a model. Deploy it. Collect first-party data. Retrain on first-party data. Phase out third-party data.


Labeling Strategies

Labels are the supervision signal for your model. The quality of your labels is a hard ceiling on your model's performance - you cannot train a model to be more accurate than your labels. Choosing the right labeling strategy is one of the highest-leverage decisions in ML system design.

Human Annotation

A team of human annotators labels individual examples. This is the highest-quality approach but the most expensive and least scalable.

Crowdsourcing (MTurk, Scale AI, Labelbox): large pools of annotators label high volumes of data cheaply. Best for tasks that require no domain expertise: image classification, transcription, basic sentiment analysis. Quality control relies on majority vote across multiple annotators and on inter-annotator agreement metrics such as Cohen's kappa, which is defined for a pair of annotators:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where $P_o$ is observed agreement and $P_e$ is expected agreement by chance. $\kappa > 0.8$ indicates strong agreement; $\kappa < 0.6$ indicates the task is ambiguous or annotators need better guidelines.
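To make the kappa formula concrete, here is a minimal sketch that computes it for two annotators from raw label lists. The function name `cohen_kappa` and the toy spam/ham annotations are illustrative assumptions, not part of the original lesson.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement P_o: fraction of examples with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement P_e: sum over classes of the product of marginal rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical class throughout
    return (p_o - p_e) / (1 - p_e)

# Toy annotations (hypothetical): the annotators agree on 8 of 10 examples.
annotator_1 = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham", "spam", "ham"]
annotator_2 = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here raw agreement is 80%, but because chance agreement is 52%, kappa comes out near 0.58, just under the 0.6 threshold. This is why kappa is preferred over raw agreement when classes are imbalanced.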

Expert annotation: tasks that require specialized knowledge (medical imaging, legal document review, scientific claims) need domain experts. Expensive - a radiologist's time costs 100x a crowdworker's - but necessary when the signal requires expertise to extract.

Active disagreement resolution: when annotators disagree, do not just take majority vote. Disagreement is information. Cases where annotators disagree are often the hard cases - precisely the examples your model needs most. Flag high-disagreement cases and resolve them with a senior annotator or subject matter expert.

Weak Supervision: Programmatic Labeling with Snorkel

Instead of labeling individual examples, write functions that programmatically label large datasets based on heuristics, patterns, or distant supervision.

The Snorkel framework: define labeling functions $\lambda_i: \mathcal{X} \to \{-1, 0, 1\}$ where $-1$ is negative, $0$ is abstain, and $1$ is positive. This is the convention of the original data programming papers; the open-source `snorkel` library instead encodes abstain as $-1$ and classes as $0, \ldots, K-1$. Each labeling function covers some subset of examples and has some accuracy. A label model learns the accuracy and coverage of each labeling function and produces a probabilistic label:

$$P(y \mid \vec{x}) \approx \text{LabelModel}\left(\lambda_1(\vec{x}), \ldots, \lambda_m(\vec{x})\right)$$

Example labeling functions for spam detection:

from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

# The snorkel library's label convention: -1 abstains, classes are 0..K-1
ABSTAIN = -1
NOT_SPAM = 0
SPAM = 1

# Hypothetical contact list; in practice, load this per user
user_contacts = {"alice@example.com", "bob@example.com"}

@labeling_function()
def lf_contains_link(email):
    """Emails with links are more likely spam."""
    if "http://" in email.text or "https://" in email.text:
        return SPAM
    return ABSTAIN

@labeling_function()
def lf_known_spam_domain(email):
    """Emails from known spam domains are spam."""
    spam_domains = {"spammer.ru", "offers-4u.biz", "clicknow.xyz"}
    if any(domain in email.sender for domain in spam_domains):
        return SPAM
    return ABSTAIN

@labeling_function()
def lf_reply_to_mismatch(email):
    """Reply-to header different from sender is a red flag."""
    # isinstance guards against missing values (NaN) in the DataFrame
    if isinstance(email.reply_to, str) and email.reply_to != email.sender:
        return SPAM
    return ABSTAIN

@labeling_function()
def lf_sender_in_contacts(email):
    """Emails from known contacts are probably not spam."""
    if email.sender in user_contacts:
        return NOT_SPAM
    return ABSTAIN

# Apply labeling functions to the unlabeled dataset
# (df_unlabeled: a DataFrame with text, sender, and reply_to columns)
applier = PandasLFApplier(
    lfs=[lf_contains_link, lf_known_spam_domain,
         lf_reply_to_mismatch, lf_sender_in_contacts]
)
L_train = applier.apply(df_unlabeled)

# Snorkel's LabelModel estimates each function's accuracy from how the
# functions agree and conflict with each other
label_model = LabelModel(cardinality=2)
label_model.fit(L_train, n_epochs=500, lr=0.001)
probabilistic_labels = label_model.predict_proba(L_train)

When to use weak supervision: when you have a large unlabeled dataset and the labeling task can be partially encoded in rules, patterns, or existing knowledge bases. Particularly powerful in NLP (regex patterns, knowledge graph lookups, keyword matching) and structured data (business rule encoding). Weak supervision typically achieves 80-90% of human annotation quality at 1-5% of the cost.
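Before fitting a label model, it can help to see the simplest possible aggregator: a majority vote over labeling-function outputs. The sketch below is a library-free baseline, not Snorkel's implementation. It assumes the convention that `-1` means abstain and classes are `0..K-1`, and the helper names `majority_vote` and `coverage` are hypothetical.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: -1 abstains, classes are 0..K-1

def majority_vote(lf_votes):
    """Combine one example's labeling-function votes; abstain on ties or no votes."""
    counts = Counter(v for v in lf_votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting functions with equal support
    return ranked[0][0]

def coverage(label_matrix, lf_index):
    """Fraction of examples on which one labeling function does not abstain."""
    fired = sum(row[lf_index] != ABSTAIN for row in label_matrix)
    return fired / len(label_matrix)

# Rows are examples, columns are four hypothetical labeling functions.
L = [
    [1, 1, -1, -1],    # two spam votes
    [1, -1, -1, 0],    # one spam vote, one not-spam vote: tie
    [-1, -1, -1, 0],   # single not-spam vote
    [-1, -1, -1, -1],  # nothing fired
]
labels = [majority_vote(row) for row in L]
print(labels, coverage(L, 0))
```

Majority vote treats every function as equally reliable. A label model improves on this by estimating per-function accuracy, so a precise function can outvote several noisy ones.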

Semi-Supervised Learning

Use a small labeled dataset to train an initial model. Use that model to pseudo-label a large unlabeled dataset. Retrain on the combined labeled and pseudo-labeled data.

Self-training loop:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def self_training_loop(X_labeled, y_labeled, X_unlabeled,
                       n_iterations=5, confidence_threshold=0.95):
    model = GradientBoostingClassifier(n_estimators=100)

    for iteration in range(n_iterations):
        # Train on current labeled set
        model.fit(X_labeled, y_labeled)

        if len(X_unlabeled) == 0:
            break

        # Predict on unlabeled data
        probs = model.predict_proba(X_unlabeled)
        confidence = probs.max(axis=1)

        # Select high-confidence predictions as pseudo-labels
        high_conf_mask = confidence >= confidence_threshold
        if high_conf_mask.sum() == 0:
            break

        X_pseudo = X_unlabeled[high_conf_mask]
        # Map argmax column indices back to the actual class labels
        y_pseudo = model.classes_[probs[high_conf_mask].argmax(axis=1)]

        # Update labeled set
        X_labeled = np.vstack([X_labeled, X_pseudo])
        y_labeled = np.concatenate([y_labeled, y_pseudo])
        X_unlabeled = X_unlabeled[~high_conf_mask]

        print(f"Iteration {iteration+1}: added {high_conf_mask.sum()} "
              f"pseudo-labels, {len(X_unlabeled)} unlabeled remaining")

    # Final fit so pseudo-labels added in the last iteration are used
    model.fit(X_labeled, y_labeled)
    return model

Risk: if the initial model is wrong about high-confidence predictions, pseudo-labels contaminate training. Mitigate with high confidence thresholds (0.95+) and ensemble uncertainty estimation.

Active Learning: Query the Most Informative Examples

Instead of labeling randomly, use your model to identify which unlabeled examples, if labeled, would most improve model performance. Then label those examples.

The intuition: if the model is already 99% confident about a class of examples, labeling more of them provides little information gain. Label the examples where the model is most uncertain - these are the cases closest to the decision boundary.

Uncertainty sampling: query the examples with the highest prediction entropy:

$$x^* = \arg\max_{x \in \mathcal{U}} H(y \mid x) = \arg\max_{x \in \mathcal{U}} -\sum_c P(y=c \mid x) \log P(y=c \mid x)$$

For binary classification this simplifies to: query examples where $P(y=1 \mid x)$ is closest to 0.5.
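The binary simplification can be checked directly. In this small sketch, ranking by distance from 0.5 selects the same examples as ranking by entropy. The helper names and the toy probabilities are illustrative assumptions.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli distribution with P(y=1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def closest_to_half(positive_probs, n_query):
    """Indices of the n_query examples whose P(y=1|x) is nearest 0.5."""
    order = sorted(range(len(positive_probs)),
                   key=lambda i: abs(positive_probs[i] - 0.5))
    return order[:n_query]

# Hypothetical model outputs P(y=1|x) for six unlabeled examples.
probs = [0.02, 0.48, 0.91, 0.55, 0.30, 0.99]
picked = closest_to_half(probs, n_query=2)
print(picked, [round(binary_entropy(probs[i]), 3) for i in picked])
```

Binary entropy is symmetric around 0.5 and decreases monotonically away from it, so the two rankings agree.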

Active learning loop:

import numpy as np
from scipy.stats import entropy
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(model, X_unlabeled, n_query=100):
    """Select the n_query most uncertain examples."""
    probs = model.predict_proba(X_unlabeled)
    uncertainties = entropy(probs, axis=1)  # prediction entropy per example
    query_indices = np.argsort(uncertainties)[-n_query:]
    return query_indices

def active_learning_loop(X, y_true, initial_labeled_mask,
                         n_iterations=10, n_query=100):
    labeled_mask = initial_labeled_mask.copy()
    model = LogisticRegression(max_iter=1000)

    for iteration in range(n_iterations):
        X_labeled = X[labeled_mask]
        y_labeled = y_true[labeled_mask]
        X_unlabeled = X[~labeled_mask]
        unlabeled_indices = np.where(~labeled_mask)[0]

        # Train on current labeled set and report accuracy on that same set
        model.fit(X_labeled, y_labeled)
        acc = model.score(X_labeled, y_labeled)
        print(f"Iteration {iteration+1}: {labeled_mask.sum()} labeled, "
              f"train acc {acc:.3f}")

        if len(unlabeled_indices) == 0:
            break  # nothing left to query

        # Select most informative unlabeled examples
        query_idx = uncertainty_sampling(model, X_unlabeled, n_query)
        global_query_idx = unlabeled_indices[query_idx]

        # Simulate oracle annotation (in practice: send to human annotators)
        labeled_mask[global_query_idx] = True

    # Refit so the final batch of queried labels is used
    model.fit(X[labeled_mask], y_true[labeled_mask])
    return model

Active learning typically achieves equivalent performance with 30-70% fewer labeled examples than random sampling.


Data Quality vs Quantity

More data does not always mean better models. Label noise above a critical rate actively degrades performance.

Label smoothing to handle suspected noise - instead of hard labels $y \in \{0, 1\}$, use soft labels:

$$\tilde{y} = (1 - \varepsilon) \cdot y + \frac{\varepsilon}{K}$$

where $\varepsilon$ is typically 0.1 and $K$ is the number of classes. This prevents the model from becoming overconfident about potentially incorrect labels.
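As a quick worked instance of the smoothing formula, the sketch below builds the soft target for one example with $\varepsilon = 0.1$ and $K = 4$. The function name `smooth_one_hot` is a hypothetical helper.

```python
def smooth_one_hot(class_index, num_classes, epsilon=0.1):
    """Return the smoothed target (1 - eps) * y + eps / K for one example."""
    uniform = epsilon / num_classes
    target = [uniform] * num_classes      # every class receives eps / K
    target[class_index] += 1.0 - epsilon  # the labeled class also keeps 1 - eps
    return target

target = smooth_one_hot(class_index=2, num_classes=4, epsilon=0.1)
print([round(t, 3) for t in target])
```

The labeled class ends up at 0.925 and each other class at 0.025, so the target still sums to 1 while leaving some probability mass for the possibility that the label is wrong.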

Focal loss for highly imbalanced datasets:

$$FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$

The $(1-p_t)^\gamma$ factor down-weights easy examples (including the many easy negatives in imbalanced datasets) and focuses training capacity on hard examples. With $\gamma = 2$, an example classified with 90% confidence has its loss scaled to 1% of its cross-entropy value, while a 50%-confidence example keeps 25%. Be cautious on noisy datasets: mislabeled examples look like hard examples, so focal loss can concentrate on them rather than suppress them.
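To make the down-weighting concrete, this sketch evaluates the modulating factor $(1-p_t)^\gamma$ and the full focal loss at a few confidence levels with $\gamma = 2$. It is plain arithmetic on the formula, with $\alpha_t = 1$ assumed for simplicity.

```python
import math

def cross_entropy(p_t):
    """Standard cross-entropy for one example with true-class probability p_t."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss for one example; alpha_t = 1 is an assumption of this sketch."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

for p_t in (0.5, 0.9, 0.99):
    factor = (1.0 - p_t) ** 2
    print(f"p_t={p_t:.2f}  CE={cross_entropy(p_t):.4f}  "
          f"factor={factor:.4f}  FL={focal_loss(p_t):.5f}")
```

At $p_t = 0.9$ the factor is 0.01, so the example keeps 1% of its cross-entropy loss; at $p_t = 0.5$ the factor is 0.25. Because the cross-entropy term itself is also smaller for confident examples, the gap in absolute focal loss between the two is even larger than the ratio of the factors.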


Distribution Concerns

The most expensive ML mistakes happen when the training distribution differs from the serving distribution.

Covariate Shift

$P_{\text{train}}(X) \neq P_{\text{serve}}(X)$ - the input distribution changes. The model learned decision boundaries in one input space and is applied in another.

Detection using Population Stability Index:

$$\text{PSI} = \sum_i \left(P_{\text{serve},i} - P_{\text{train},i}\right) \ln \frac{P_{\text{serve},i}}{P_{\text{train},i}}$$

Values below 0.1: no significant shift. 0.1-0.2: moderate shift, investigate. Above 0.2: significant shift, retrain required.
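A minimal, dependency-free sketch of the PSI calculation for a single numeric feature follows. The binning scheme (deciles of the training sample) and the small `floor` that avoids `log(0)` on empty bins are assumptions of this sketch. The PSI formula itself does not prescribe them.

```python
import bisect
import math
import random

def psi(train_values, serve_values, n_bins=10, floor=1e-6):
    """Population Stability Index of one numeric feature.

    Bin edges are quantiles of the training sample; `floor` replaces empty-bin
    fractions to keep the logarithm finite (a choice made for this sketch).
    """
    ordered = sorted(train_values)
    # Interior edges at the 1/n_bins, 2/n_bins, ... quantiles of training data.
    edges = [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        return [max(c / len(values), floor) for c in counts]

    p_train = bin_fractions(train_values)
    p_serve = bin_fractions(serve_values)
    return sum((s - t) * math.log(s / t) for s, t in zip(p_serve, p_train))

# Synthetic data: one serving sample from the same distribution, one shifted.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]
psi_same = psi(train, same)
psi_shifted = psi(train, shifted)
print(f"same distribution: {psi_same:.4f}, shifted mean: {psi_shifted:.4f}")
```

In practice PSI is computed per feature and tracked over time. Categorical features can use their categories as bins directly.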

Detection using t-SNE visualization:

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# X_train and X_serve are assumed to be existing 2-D NumPy feature arrays
# Sample from training and serving sets (without replacement, to avoid duplicates)
rng = np.random.default_rng(42)
n_sample = min(2000, len(X_train), len(X_serve))
X_train_sample = X_train[rng.choice(len(X_train), n_sample, replace=False)]
X_serve_sample = X_serve[rng.choice(len(X_serve), n_sample, replace=False)]

combined = np.vstack([X_train_sample, X_serve_sample])
source = ['train'] * n_sample + ['serve'] * n_sample

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
embedding = tsne.fit_transform(combined)

colors = ['#3b82f6' if s == 'train' else '#ef4444' for s in source]
plt.figure(figsize=(10, 8))
plt.scatter(embedding[:, 0], embedding[:, 1],
            c=colors, alpha=0.4, s=8)
plt.title("Training vs Serving Distribution (t-SNE)\nBlue=Train, Red=Serve")
plt.savefig("distribution_shift_audit.png", dpi=150)
# Well-overlapping clusters → low shift. Distinct clusters → severe shift.

Label Shift and Concept Drift

Label shift: $P_{\text{train}}(y) \neq P_{\text{serve}}(y)$ - class priors change. A fraud model trained at 0.5% fraud rate miscalibrates when fraud jumps to 2%. Fix: monitor predicted score distributions; recalibrate with Platt scaling on recent data.

Concept drift: $P_{\text{train}}(y \mid X) \neq P_{\text{serve}}(y \mid X)$ - the feature-to-label relationship changes. Hardest to detect because feature distributions may look stable. Fix: monitor model performance on a held-out recent window; trigger retraining when performance degrades beyond a threshold.
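The retraining trigger for concept drift can be sketched as a simple comparison between a baseline metric and accuracy on a recent labeled window. The function name `should_retrain` and the defaults for `max_drop` and `min_samples` are illustrative assumptions, not recommended values.

```python
def should_retrain(baseline_accuracy, recent_outcomes, max_drop=0.05, min_samples=200):
    """Flag retraining when accuracy on a recent labeled window falls too far.

    recent_outcomes: list of (predicted_label, true_label) pairs from the window.
    """
    if len(recent_outcomes) < min_samples:
        return False  # too few labeled examples to trust the estimate
    correct = sum(pred == true for pred, true in recent_outcomes)
    recent_accuracy = correct / len(recent_outcomes)
    return baseline_accuracy - recent_accuracy > max_drop

# Hypothetical windows of 200 labeled predictions each.
stable_window = [(1, 1)] * 180 + [(1, 0)] * 20     # 90% accurate
degraded_window = [(1, 1)] * 160 + [(1, 0)] * 40   # 80% accurate
print(should_retrain(0.91, stable_window), should_retrain(0.91, degraded_window))
```

The minimum-sample guard matters because labels often arrive late, and a small window can show a large drop purely by chance.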


Logging for Future Training

Every production ML system should log its decisions in a way that enables future training. Design this first, not last.

Schema for joinable training logs:

# At prediction time - log everything needed to reconstruct the training example
prediction_log = {
    "request_id": "req_abc123",          # join key
    "timestamp": "2024-03-15T10:23:45Z",
    "user_id": "user_xyz",
    "item_id": "item_456",
    "raw_features": {                    # raw values, not preprocessed
        "user_history_7d": 23,
        "item_category": "electronics",
        "hour_of_day": 10,
        "user_device": "mobile"
    },
    "model_version": "v2.3.1",
    "model_score": 0.73,                 # raw probability
    "model_decision": "show",            # action taken
    "exploration_flag": False            # was this an exploration sample?
}

# At outcome time (hours or days later) - log the label
outcome_log = {
    "request_id": "req_abc123",          # join key
    "outcome_timestamp": "2024-03-15T10:24:12Z",
    "outcome": "click",
    "dwell_seconds": 142,                # downstream engagement
    "downstream_purchase": False
}

# JOIN: request_id links features + model decision + outcome
# This gives you (features, label) pairs for retraining
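The join itself can be sketched in a few lines. The function below pairs each prediction with its outcomes by `request_id`. Treating predictions with no outcome as negatives is an assumption that only holds once the attribution window has closed. The function name and the choice of `"click"` as the positive outcome are hypothetical.

```python
def build_training_rows(prediction_logs, outcome_logs, positive_outcomes=("click",)):
    """Join prediction and outcome logs on request_id into (features, label) rows."""
    outcomes_by_request = {}
    for outcome in outcome_logs:
        outcomes_by_request.setdefault(outcome["request_id"], []).append(outcome)

    rows = []
    for pred in prediction_logs:
        matched = outcomes_by_request.get(pred["request_id"], [])
        # No matching outcome is treated as a negative (assumes the window closed).
        label = int(any(o["outcome"] in positive_outcomes for o in matched))
        rows.append({
            "request_id": pred["request_id"],
            "features": pred["raw_features"],
            "model_version": pred["model_version"],
            "exploration_flag": pred["exploration_flag"],
            "label": label,
        })
    return rows

predictions = [
    {"request_id": "req_1", "raw_features": {"hour_of_day": 10},
     "model_version": "v2.3.1", "exploration_flag": False},
    {"request_id": "req_2", "raw_features": {"hour_of_day": 22},
     "model_version": "v2.3.1", "exploration_flag": True},
]
outcomes = [{"request_id": "req_1", "outcome": "click"}]
rows = build_training_rows(predictions, outcomes)
print([(r["request_id"], r["label"]) for r in rows])
```

Carrying `model_version` and `exploration_flag` through the join keeps it possible to filter or reweight training data by the policy that produced it.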

Data versioning with DVC:

# Initialize DVC in your ML project
dvc init
git add .dvc .dvcignore
git commit -m "initialize DVC"

# Track the training dataset (DVC writes a .gitignore next to the data file)
dvc add data/training/features_v3.parquet
git add data/training/features_v3.parquet.dvc data/training/.gitignore
git commit -m "track training dataset v3"

# Push data to remote storage (S3, GCS, Azure)
dvc remote add -d s3remote s3://my-ml-datasets/
git add .dvc/config
git commit -m "configure DVC remote"
dvc push

# Later: reproduce exact training dataset for model v2.3.1
git checkout model-v2.3.1 # assumes a Git tag or branch with this name
dvc checkout # restores the dataset version used for that model

Full Data Pipeline

[Figure: full data pipeline diagram]


Common Mistakes

:::danger Test set contamination
Training examples leaking into your test set. Always split by time for temporal data - use events before date T for training, events after T for evaluation. Never use random shuffling for time-series data. A model that achieves great offline metrics on a contaminated test set will collapse in production.
:::

:::danger Label leakage
Including features computed using information not available at prediction time. In churn prediction, including the "cancellation confirmation email sent" feature (only sent after churn). In fraud detection, including the chargeback flag (only known after the transaction is reviewed). Audit every feature: was this value available at the moment the prediction was made?
:::

:::danger Survivorship bias from your own model
Your model only shows items it predicts will be clicked. Your logs contain no information about items ranked below the fold. Training on this data reinforces existing biases. Fix: inject exploration - show random items or items chosen by an alternative policy to collect counterfactual data.
:::

:::warning Not versioning training data
If you cannot reproduce the training dataset used to produce a given model version, you cannot audit that model, debug its failures, or satisfy regulatory requirements. Data versioning is a prerequisite for operating at scale, not a nice-to-have.
:::

:::warning Confusing inter-annotator disagreement with label noise
When annotators disagree, the standard response is majority vote and discard minority opinions. This discards information. Disagreement indicates ambiguity - and ambiguous examples are where your model is most likely to fail. Log disagreement rates and route high-disagreement cases to expert resolution.
:::
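The time-based split recommended for temporal data above can be sketched as follows. The event records and the `temporal_split` helper are hypothetical; the only requirement is that every event carries a timestamp.

```python
from datetime import datetime

def temporal_split(events, cutoff):
    """Split events so everything before `cutoff` trains and the rest evaluates."""
    train = [e for e in events if e["timestamp"] < cutoff]
    test = [e for e in events if e["timestamp"] >= cutoff]
    return train, test

events = [
    {"id": 1, "timestamp": datetime(2024, 1, 5)},
    {"id": 2, "timestamp": datetime(2024, 2, 20)},
    {"id": 3, "timestamp": datetime(2024, 3, 1)},
    {"id": 4, "timestamp": datetime(2024, 3, 18)},
]
cutoff = datetime(2024, 3, 1)
train, test = temporal_split(events, cutoff)
print([e["id"] for e in train], [e["id"] for e in test])
```

The same cutoff should also bound feature computation: any aggregate used as a training feature must be built only from events before the timestamp of the example it describes.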


Video Resources

| Resource | Creator | What It Covers |
| --- | --- | --- |
| Data-Centric AI | Andrew Ng | Data quality over model architecture |
| Active Learning Explained | Mutual Information | How active learning works |
| Weak Supervision with Snorkel | Stanford DAWN | Programmatic data labeling |
| Dealing with Distribution Shift | ICML Tutorial | Covariate shift, label shift |

Interview Q&A

Q1: How do you design a data flywheel for a new recommendation system?

A data flywheel cannot start spinning on its own - you need to bootstrap it. The design has three phases.

Phase 1: Cold start. No user data, no model. Use editorial picks (human experts select high-quality items), popularity signals (show trending items by aggregate sales and views), and any available demographic or context signals (time of day, geographic region, device type). Deploy a simple rule-based ranker.

Phase 2: First-party data accumulation. Every user interaction is logged: clicks, time-spent, purchases, explicit ratings when available. After 4-8 weeks of logging, you have enough data to train a first-generation model. Train on interaction logs. Deploy. Continue logging.

Phase 3: Flywheel engagement. The model improves recommendations. Better recommendations increase engagement. Increased engagement generates more data. At this point, the key design decisions become: (a) how to maintain exploration so the model does not collapse to a filter bubble, (b) how to handle cold start for new users with no interaction history, and (c) how to retrain frequently enough to track seasonal trends and new content.

The critical insight: the flywheel is not automatic. You need to design the logging infrastructure, data pipeline, and retraining schedule from day one. Companies that designed logging as an afterthought spent years retrofitting it.

Q2: How do you handle data collection for medical imaging where labels require expert annotation?

Medical imaging annotation is expensive ($50-500 per labeled image depending on modality and task) and slow (radiologists are scarce). The design must minimize expert labels needed while maximizing model quality.

Strategy: layer transfer learning, weak supervision, and active learning, in that order.

First, use transfer learning from a pre-trained model (ImageNet for general features, or a publicly available medical imaging model like CheXNet for chest X-rays). Pre-trained features can reduce the labeled examples needed, in some cases by an order of magnitude.

Second, use weak supervision with programmatic labeling functions based on radiology report text (if available), ICD codes, and structured metadata. Many medical imaging datasets have associated reports - text-mining these reports provides noisy labels at scale.

Third, use active learning for the remaining expert annotation budget. Start with a small randomly labeled seed set (200 images). Train. Identify the most uncertain examples. Route to expert annotators. Repeat. Active learning for medical imaging typically reduces required expert annotations by 40-60% vs random sampling to reach equivalent model performance.

Finally, use data augmentation aggressively: random rotation, flipping, brightness and contrast variation, random crop. Many medical images tolerate these transforms, but verify that each one preserves the label - a horizontal flip, for example, swaps left and right anatomy.

Q3: How do you handle label noise in a large crowdsourced dataset?

Start by measuring the noise rate before deciding how to handle it. Compute inter-annotator agreement (Cohen's kappa) on a subset where each example has multiple independent annotations. If kappa is above 0.8, noise is low and majority vote labels work. If kappa is below 0.6, noise is significant and requires active intervention.

For significant noise, use Confident Learning (Cleanlab). The approach: train a model on the noisy labels and obtain out-of-sample predicted probabilities, for example through cross-validation. For each example, compare the probability of the given label against each alternative label. Flag examples where the given label is unlikely. Route flagged examples to senior annotators for re-annotation.
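The flagging step can be illustrated with a simplified sketch in the spirit of Confident Learning. It is not Cleanlab's implementation, which includes additional calibration and pruning options. It assumes `pred_probs` are out-of-sample probabilities, and the function name `flag_likely_label_errors` is hypothetical.

```python
def flag_likely_label_errors(given_labels, pred_probs):
    """Flag examples whose given label disagrees with a confident prediction.

    pred_probs[i][k] is the out-of-sample probability that example i is class k.
    """
    num_classes = len(pred_probs[0])
    # Per-class threshold: mean predicted probability of class k over the
    # examples that were labeled k.
    thresholds = []
    for k in range(num_classes):
        own = [p[k] for y, p in zip(given_labels, pred_probs) if y == k]
        thresholds.append(sum(own) / len(own) if own else 1.0)

    flagged = []
    for i, (y, p) in enumerate(zip(given_labels, pred_probs)):
        # Classes the model is confident about for this example.
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            if best != y:
                flagged.append(i)  # confident in a class other than the label
    return flagged

# Toy data: example 2 is labeled 0 but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
flagged = flag_likely_label_errors(labels, probs)
print(flagged)
```

Per-class thresholds, rather than a single global cutoff, keep the method from over-flagging classes on which the model is systematically less confident.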

If re-annotation is infeasible, use noise-robust loss functions. Label smoothing is the simplest: following $\tilde{y} = (1 - \varepsilon) \cdot y + \varepsilon / K$, replace one-hot targets with $1 - \varepsilon + \varepsilon/K$ for the given class and $\varepsilon / K$ for each other class. This prevents the model from fitting noisy labels too confidently.

For structured annotation tasks, use a quality model: train a model to predict annotation quality using a small gold-standard set, and use quality scores to weight training examples.

Q4: Design the logging strategy for a recommendation system to support future model training.

The goal is to enable point-in-time correct feature reconstruction and outcome attribution for any model decision.

Prediction log - capture at request time: user ID, item IDs shown, rank position (affects click probability), timestamp, all features used for ranking with exact values, model version, predicted scores, and a request ID as the join key.

Outcome log - capture asynchronously: request ID, outcome type (click, purchase, share, report), outcome timestamp, and downstream engagement (time on page, return visit within 24h). Multiple outcomes can attach to one request ID.

Feature log - for complex features computed at serving time (user embeddings, session context), log the raw feature vector alongside the prediction. This prevents training-serving skew in future retraining.

Exploration logs - mark whether a recommendation was made by the production model or an exploration policy. This lets you train on counterfactual data separately from on-policy data.

Schema evolution - version your feature schema. When a new feature is added, backfill it where possible. When a feature is removed, keep the column in the log as null to avoid breaking the schema.

Q5: How do you bootstrap a training dataset when you have zero users and zero data?

Zero users does not mean zero data - it means no first-party data. Use a combination of strategies.

Public datasets: find the closest public dataset to your task. ImageNet for image classification, Amazon reviews for sentiment analysis, Common Crawl for language modeling. Fine-tune from public data, then refine with first-party data as it accumulates.

Programmatic data generation: for structured prediction tasks, write programs that generate training examples with known correct labels. A math tutor app can generate arithmetic problems programmatically with exact correct answers. A code assistant can generate programming tasks and verified solutions.
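As an illustration of programmatic generation, the sketch below produces arithmetic question-answer pairs whose labels are correct by construction. The prompt template, operand range, and function name are arbitrary choices made for this example.

```python
import operator
import random

OPERATIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def generate_arithmetic_examples(n, seed=0, max_operand=20):
    """Generate n arithmetic problems with exact answers computed in code."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    examples = []
    for _ in range(n):
        a, b = rng.randint(0, max_operand), rng.randint(0, max_operand)
        symbol = rng.choice(sorted(OPERATIONS))
        examples.append({
            "prompt": f"What is {a} {symbol} {b}?",
            "answer": OPERATIONS[symbol](a, b),
        })
    return examples

dataset = generate_arithmetic_examples(1000)
print(dataset[0])
```

Because the generator controls the distribution, it is worth checking that operand ranges and operation mix resemble what real users will ask, otherwise the model inherits the generator's blind spots.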

Synthetic data augmentation: use a generative model to generate synthetic examples. Use these as training data with the caveat that the synthetic distribution may not perfectly match real inputs.

Human-generated seed data: hire domain experts to generate examples. Have a content team write 500 examples of your target task done well. 500 carefully crafted examples often outperform 50,000 noisy ones.

Alpha and beta users: recruit 100-500 beta users who use your product intensively in exchange for early access. Their interactions are your seed data. Design the beta program to cover the full distribution of your eventual user base - if your product will serve both technical and non-technical users, recruit both types.

The common thread: get any signal, regardless of volume, that is aligned with your deployment distribution. A small clean dataset aligned with your serving context is worth more than a large misaligned dataset.


Data Collection Anti-Patterns at Scale

Understanding what goes wrong - not just what to do right - is how you build the intuition to catch data problems before they become production incidents.

Anti-Pattern 1: The Balanced Dataset Trap

A common mistake when dealing with class imbalance (rare fraud, rare disease, rare churn) is to create a perfectly balanced training dataset by oversampling the minority class or undersampling the majority class until the ratio is 50-50.

The problem: your model learns the wrong prior. In a real deployment environment where fraud rate is 0.1%, a model that was trained on 50-50 balanced data will output probabilities calibrated to the training distribution - it will predict 50% fraud probability for many transactions that actually have 0.1% fraud probability. The model is miscalibrated.

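Fix: keep the training set at its natural class ratio where you can, and handle rarity through the decision threshold rather than by changing the prior. Class weights in the loss function are a common alternative to resampling, but they do not preserve calibration either: at a fraud rate of 0.1%, weighting fraud examples 999 to 1 shifts the effective prior to roughly 50-50, just as oversampling does. If you reweight or resample to help optimization, treat the raw scores as rankings and recalibrate them against data at the true class ratio before using them as probabilities.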

If you do need to oversample (e.g., for SMOTE-based augmentation), apply it only to the training fold and not to the validation fold. Then calibrate model probabilities after training using Platt scaling on a held-out calibration set.
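When the only difference between training and serving is the class ratio, scores can also be mapped back to the real prior with Bayes' rule. The sketch below shows that correction. It assumes class-conditional feature distributions are unchanged, and the function name `correct_for_prior` is hypothetical. It complements, rather than replaces, calibration on held-out data.

```python
def correct_for_prior(score, train_positive_rate, true_positive_rate):
    """Map a probability learned under one class prior to another prior.

    Assumes only the class ratio differs between training and serving.
    """
    pos = score * true_positive_rate / train_positive_rate
    neg = (1.0 - score) * (1.0 - true_positive_rate) / (1.0 - train_positive_rate)
    return pos / (pos + neg)

# Model trained on 50-50 balanced data, deployed where fraud is 0.1%.
for raw in (0.5, 0.9, 0.99):
    adjusted = correct_for_prior(raw, train_positive_rate=0.5, true_positive_rate=0.001)
    print(f"raw score {raw:.2f} -> corrected probability {adjusted:.4f}")
```

A raw score of 0.5 from the balanced model corresponds to the base rate of 0.1% after correction, and even a raw 0.99 maps to roughly 9%. This shows how far balanced-training scores sit from real-world probabilities.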

Anti-Pattern 2: Collecting Data First, Defining the Task Second

Teams often collect all available data, aggregate it into a massive table, and then decide what ML problem to solve. This seems pragmatic - you have the data, now use it - but it systematically produces bad training datasets.

The issue: the data that is available is shaped by what was worth collecting for operational purposes, not what is informative for your specific ML task. A customer support database has tickets, categories, and resolutions - all collected for ticket routing, not for churn prediction. Using it for churn prediction introduces selection bias (only customers who reached out to support are represented) and label noise (ticket categories are imprecise proxies for customer health).

Fix: define the ML task first. Then ask: what data would I need to train this model well? Then assess what data exists and what must be collected. The gap between what you need and what you have defines your data collection roadmap.

Anti-Pattern 3: Ignoring the Feedback Loop Between Model and Data

When a model is deployed, it shapes the data that is collected. A recommendation model decides which items users see. Users can only click on items they can see. The next training dataset reflects only the items the model showed - creating a filter bubble in the training data.

Over multiple retraining cycles, the model reinforces its own biases. Popular items get more clicks. More clicks generate more training data. More training data makes the model recommend popular items more. Niche items that would have been relevant for some users never get enough exposure to accumulate training examples. The model becomes increasingly homogeneous.

Fix: design explicit exploration into the serving policy. Reserve 1-5% of recommendations for randomly selected items (epsilon-greedy) or items selected by an alternative policy (Thompson sampling). Log outcomes for explored items separately. Use these exploration logs to train a de-biased version of the model. The exploration budget is the price you pay for unbiased training data.
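
As a rough illustration of the epsilon-greedy variant, the sketch below serves the model's top item most of the time and a uniformly random catalog item otherwise, logging whether each impression was exploratory and the probability with which it was served. The names (`ServedItem`, `serve_with_exploration`) are hypothetical, and it assumes the top item is itself part of the catalog.

```python
import random
from dataclasses import dataclass


@dataclass
class ServedItem:
    item_id: str
    explored: bool      # True if chosen by the random exploration branch
    propensity: float   # probability this item was served under the policy


def serve_with_exploration(top_item, catalog, epsilon, rng):
    """Serve top_item, or a uniformly random catalog item with probability epsilon."""
    explored = rng.random() < epsilon
    item = rng.choice(catalog) if explored else top_item
    # The exploit branch only ever serves top_item; the explore branch
    # serves each catalog item with probability 1 / len(catalog)
    propensity = epsilon / len(catalog)
    if item == top_item:
        propensity += 1.0 - epsilon
    return ServedItem(item_id=item, explored=explored, propensity=propensity)


rng = random.Random(0)
catalog = ["a", "b", "c", "d"]
log = [serve_with_exploration("a", catalog, epsilon=0.05, rng=rng) for _ in range(1000)]
# Exploration impressions are kept separately for de-biased training
exploration_log = [event for event in log if event.explored]
```

Logging the propensity alongside the outcome is what later makes it possible to reweight logged interactions (for example with inverse propensity weighting) rather than only training on the exploration slice.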

Anti-Pattern 4: Treating All Negative Examples as Equal

In recommendation and search systems, a negative example is typically defined as "item was shown, user did not click." But not all negatives are equally informative.

A user who scrolled past an item in 0.5 seconds probably did not read the title. Their non-click is weak evidence that the item was irrelevant. A user who read the title and description carefully and then did not click is a much stronger negative signal. And a user who actively hid or reported the item is an explicit negative signal.

Training with all negatives equally weighted teaches the model that "scrolled past quickly" and "carefully considered and rejected" are equivalent negative signals. They are not.

Fix: weight negative examples by signal strength. Assign weights based on dwell time on the negative item (longer dwell = stronger negative signal), explicit negative actions (hide, report = strongest), and scroll speed (fast scroll = weak negative, slow scroll = stronger). This requires logging scroll speed and viewport time in addition to clicks - which must be designed into the logging infrastructure from day one.
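
One simple way to express this is a function that maps the logged signals to a sample weight for each shown-but-not-clicked item. The thresholds and weight values below are placeholders chosen for illustration, not recommended settings; in practice they would be tuned against validation metrics.

```python
def negative_weight(dwell_seconds: float, explicit_negative: bool = False) -> float:
    """Return a training weight in (0, 1] for a shown-but-not-clicked item."""
    if explicit_negative:          # hide / report: strongest signal
        return 1.0
    if dwell_seconds < 1.0:        # scrolled past, probably never read
        return 0.1
    if dwell_seconds < 5.0:        # glanced at the title
        return 0.4
    return 0.8                     # spent time on it and still declined


# Hypothetical logged negatives: (item_id, dwell_seconds, explicit_negative)
negatives = [("item-1", 0.5, False), ("item-2", 12.0, False), ("item-3", 0.8, True)]
weights = {item_id: negative_weight(dwell, explicit) for item_id, dwell, explicit in negatives}
```

The resulting weights can be passed as per-example sample weights to most training APIs.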


Data Strategy for Different ML Domains

The data collection challenges vary significantly across ML domains. A data strategy that works for recommendation systems does not work for medical AI.

Recommendation Systems

Data abundance, label quality problem. Recommendation systems typically have vast interaction data (billions of events per day at scale), but the labels are implicit (clicks) rather than explicit (ratings). The key challenges are: position bias (items shown at rank 1 attract more clicks than their relevance alone would explain), feedback loop bias (the model shapes its own future training data), and the cold start problem (new users and new items have no interaction history).

Data strategy:

  • Log impressions with rank position and dwell time, not just clicks
  • Implement exploration (epsilon-greedy or bandit-based) to collect off-policy data
  • Use a "random baseline" model that serves random items to a small fraction of users - this provides unbiased interaction data
  • For cold start: collect user context features (device, location, time, referral source) at registration; collect item features at upload time; use content-based signals to bridge the cold start gap

Natural Language Processing

Moderate data, annotation cost problem. NLP tasks often have moderate amounts of relevant text but require expensive human annotation (especially for tasks like information extraction, relation classification, or clinical NLP). Pre-trained language models (BERT, GPT, LLaMA) have dramatically changed the data economics - you need far less labeled data when you start from a pre-trained model.

Data strategy:

  • Use pre-training on domain-relevant unlabeled text before fine-tuning (domain-adaptive pre-training)
  • Use instruction-tuning data to align a general model to your specific task format
  • Use GPT-4 or Claude to generate initial pseudo-labels for a subset; human annotators verify and correct rather than labeling from scratch
  • Active learning within the fine-tuning budget: label examples that are most uncertain for the current model
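
The active learning step in the list above is often implemented as uncertainty sampling. A minimal sketch, assuming the current model returns a class-probability vector per unlabeled example and using predictive entropy as the uncertainty score (one of several possible choices); the function names and example ids are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` examples with the highest predictive entropy.

    `predictions` maps example id -> list of class probabilities from the current model.
    """
    ranked = sorted(predictions, key=lambda ex_id: entropy(predictions[ex_id]), reverse=True)
    return ranked[:budget]


predictions = {
    "doc-1": [0.98, 0.01, 0.01],   # confident
    "doc-2": [0.40, 0.35, 0.25],   # uncertain
    "doc-3": [0.70, 0.20, 0.10],
    "doc-4": [0.34, 0.33, 0.33],   # most uncertain
}
to_label = select_for_labeling(predictions, budget=2)
```

With these inputs the two near-uniform predictions ("doc-4" and "doc-2") are sent to annotators first.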

Computer Vision

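Data augmentation is almost always worth it. For image tasks, geometric augmentation (random horizontal flip, rotation, crop) and color augmentation (brightness, contrast, saturation) often improve accuracy by a few percentage points (2-5% is a common rule of thumb) at no labeling cost, provided the transformations preserve the label: a horizontal flip is harmless for most object classes but not for text, or for anatomy where left and right matter.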

Data strategy:

  • Standard augmentation: horizontal flip, random crop, color jitter for most classification tasks
  • Domain-specific augmentation: for medical imaging, simulate different scanner types; for satellite imagery, simulate different lighting conditions; for retail products, simulate different backgrounds
  • For rare event detection (defects, anomalies): collect data specifically in the rare event conditions. A model trained predominantly on normal examples will be poorly calibrated on the rare events it is supposed to detect.
  • Synthetic data for rare classes: use diffusion models to generate additional training examples for rare classes, then filter generated examples with a quality classifier

Time-Series and Tabular Data

Feature drift and temporal validity are the core challenges. Features that are predictive at training time may stop being predictive as the world changes. A credit risk model trained before a recession has a different relationship between features and defaults than after the recession begins.

Data strategy:

  • Always split validation sets by time - use the most recent data as validation, not a random sample
  • Track feature importance over time. If a feature that was highly predictive last quarter is no longer predictive this quarter, it may signal concept drift
  • Design retraining triggers based on both schedule (retrain monthly) and performance (retrain if AUC drops below 0.XX on the most recent week's data)
  • Maintain a long tail of historical data for rare patterns (economic downturns, pandemic-level disruptions) that occur infrequently but matter greatly when they do
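
The first and third points above can be sketched in a few lines. This is a simplified illustration with hypothetical field names (`event_date`) and caller-supplied thresholds, not a full pipeline.

```python
from datetime import date


def temporal_split(rows, cutoff):
    """Split rows into (train, validation); validation holds rows on or after cutoff."""
    train = [row for row in rows if row["event_date"] < cutoff]
    validation = [row for row in rows if row["event_date"] >= cutoff]
    return train, validation


def should_retrain(days_since_last_train, recent_auc, max_age_days, min_auc):
    """Trigger on schedule OR on degraded performance on the most recent data."""
    return days_since_last_train >= max_age_days or recent_auc < min_auc


rows = [
    {"event_date": date(2024, 1, 5), "label": 0},
    {"event_date": date(2024, 2, 10), "label": 1},
    {"event_date": date(2024, 3, 2), "label": 0},
]
train, validation = temporal_split(rows, cutoff=date(2024, 3, 1))
```

Every validation row is later than every training row, which is the property a random split cannot guarantee. The `max_age_days` and `min_auc` arguments are left to the caller because the right values depend on how quickly the domain drifts.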

The Data Audit: A Pre-Training Checklist

Before training any model, run through this audit. These are the checks that catch the problems most likely to cause production failures.

DATA AUDIT CHECKLIST

□ Split correctness
  □ Is the test set temporally after the training set (for time-series data)?
  □ Is there any user or item overlap between train and test that could inflate metrics?
  □ Is the test set representative of the serving distribution?

□ Label correctness
  □ Are labels computed without future information?
  □ Is the label window appropriate for the task latency?
  □ Are there any features that encode the label (leakage)?

□ Distribution alignment
  □ Have you compared training vs serving feature distributions (PSI, t-SNE)?
  □ Is the serving data collected from the same context as training data?
  □ Is there any known distribution shift between training collection period and deployment?

□ Label quality
  □ What is the inter-annotator agreement (Cohen's kappa) on a sample?
  □ What is the estimated label noise rate?
  □ Are ambiguous examples flagged for expert review?

□ Bias and coverage
  □ Is the training data representative of all user segments that will be served?
  □ Are rare but important cases (rare diseases, rare fraud patterns) covered?
  □ Is there exploration data to cover items the current model does not recommend?

□ Versioning and reproducibility
  □ Is the training dataset versioned (DVC, MLflow, or equivalent)?
  □ Given the model version, can you reproduce the exact training dataset?
  □ Are data transformations code-reviewed and tested?

Running this checklist before training saves weeks of debugging after deployment. Every item on this list corresponds to a class of production failure that is well-documented in the ML engineering literature.
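
For the distribution alignment check, the Population Stability Index (PSI) named in the checklist can be computed per feature without any dependencies. The sketch below is one common formulation, binning by quantiles of the training sample; the bin count and the `eps` floor for empty bins are assumptions, and the sample data is synthetic.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training sample and a serving sample."""
    ordered = sorted(expected)
    # Bin edges at the quantiles of the training (expected) sample
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for value in values:
            # Bin index = number of edges at or below the value
            idx = sum(1 for edge in edges if value >= edge)
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(count / len(values), eps) for count in counts]

    exp_frac = bin_fractions(expected)
    act_frac = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))


train_sample = [i / 100 for i in range(1000)]          # evenly spread over [0, 10)
serving_same = [i / 100 for i in range(1000)]
serving_shifted = [3 + i / 100 for i in range(1000)]   # same shape, shifted by 3

psi_same = psi(train_sample, serving_same)
psi_shifted = psi(train_sample, serving_shifted)
```

Identical samples give a PSI of zero, while the shifted sample scores far higher. A commonly cited rule of thumb treats values below 0.1 as stable and values above 0.25 as a significant shift, though those cutoffs are conventions rather than guarantees.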


Data Governance and Compliance

Data collection strategy is not just a technical problem - it is a legal and compliance problem. Ignoring data governance is how ML teams create regulatory liability for their companies.

GDPR and Data Minimization

The General Data Protection Regulation (GDPR) in the EU imposes the principle of data minimization: collect only the data you need for the stated purpose. In practice, this means:

  • You cannot collect all possible signals "just in case" they might be useful for future models
  • You must declare the purpose of data collection and can only use data for that stated purpose (purpose limitation)
  • Users have the right to erasure - if a user requests deletion, their data must be removed from training datasets, and you need a plan for model artifacts that may have memorized it (retraining or otherwise remediating the model)

Practical implications for data collection design:

  • Document why each signal is collected before collecting it
  • Implement data deletion pipelines that can remove a user's data from training datasets, not just from databases
  • Design for differential privacy if you will train on personal data: clipping each example's gradient and adding calibrated noise during training (DP-SGD) provides formal privacy guarantees, usually at some cost in accuracy
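
To make the DP-SGD idea concrete, here is a dependency-free sketch of the gradient aggregation step: clip each per-example gradient to a maximum L2 norm, sum, add Gaussian noise scaled to that norm, and average. The function name and parameters are illustrative. It does not track the privacy budget, which a real implementation would do with a privacy accountant.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Return a privatized average gradient from a list of per-example gradients."""
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Gaussian noise with std = noise_multiplier * clip_norm, added to the clipped sum
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [value / len(per_example_grads) for value in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]   # L2 norms 5.0 and 0.5
update = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=random.Random(0))
```

Clipping bounds how much any single person's example can move the model, and the noise masks what remains. That combination is what the formal guarantee rests on.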

Data Retention Policies

Training datasets cannot be stored indefinitely. Most companies have data retention policies (6 months, 1 year, 7 years for financial data) that also apply to ML training data. This means:

  • Old training data is automatically deleted after the retention period
  • If you want to retain historical training data for longer (to train on long-horizon patterns), you need explicit policy approval
  • Model artifacts trained on deleted data can be retained - the model weights themselves do not typically contain individually identifiable information (with exceptions for memorization in large language models)

Consent

For consumer-facing ML applications, users typically consent to data collection through Terms of Service. But consent is specific - users who consented to data collection for product improvement may not have consented to that data being used for training a model that could make consequential decisions about them (loan approval, insurance pricing, hiring).

The safe approach: treat training data for consequential models as requiring explicit, informed consent. This is more restrictive than what is strictly legally required in most jurisdictions, but it reduces regulatory and reputational risk.


Handling Data at Scale: Practical Systems

For large-scale ML systems, the data pipeline must be engineered for throughput, reliability, and correctness simultaneously. Here is how the major components work together.

Batch Data Pipelines (Apache Spark)

For training data preparation over large historical datasets, Spark is the standard tool. It distributes computation across a cluster, handles petabyte-scale datasets, and integrates with cloud storage (S3, GCS).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("TrainingDataPipeline").getOrCreate()

# Feature cutoff: features may only use events strictly before this point
cutoff = F.to_timestamp(F.lit("2024-03-01"))

# Load raw event logs from cloud storage
# (assumes event_timestamp is stored as a timestamp column)
events = spark.read.parquet("s3://data-lake/events/year=2024/month=*/")

# CRITICAL: keep only the 30 days before the cutoff, so the *_30d features
# match their names and no post-cutoff events leak into training
events_30d = events.filter(
    (F.col("event_timestamp") >= cutoff - F.expr("INTERVAL 30 DAYS"))
    & (F.col("event_timestamp") < cutoff)
)

# Compute per-user activity features over that window
user_features = events_30d.groupBy("user_id").agg(
    F.count("*").alias("total_events_30d"),
    F.sum(
        F.when(F.col("event_type") == "purchase", F.col("purchase_value")).otherwise(0)
    ).alias("total_spend_30d"),
    F.countDistinct("item_id").alias("unique_items_seen_30d"),
    F.max("event_timestamp").alias("last_activity_timestamp"),
)

# Compute recency in days, relative to the cutoff
user_features = user_features.withColumn(
    "days_since_last_activity",
    (F.unix_timestamp(cutoff) - F.unix_timestamp("last_activity_timestamp")) / 86400,
)

# Join with labels (users with no events in the window get null features)
labels = spark.read.parquet("s3://data-lake/labels/churn_labels_2024_q1.parquet")
training_data = labels.join(user_features, on="user_id", how="left")

# Write the versioned training dataset
training_data.write.mode("overwrite").parquet(
    "s3://ml-training/churn_model/v5/training_data/"
)

Streaming Data Pipelines (Apache Flink)

For near-real-time features (user's activity in the last 5 minutes, session-level signals), streaming pipelines process events as they arrive. Flink is a common choice for stateful stream processing at scale.

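The key concept: exactly-once semantics - each event affects the pipeline's state exactly once, even when network failures or machine crashes force events to be replayed. This keeps streaming feature counts from being inflated by retries, provided the sink that receives the results is also transactional or idempotent.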

For ML purposes, the streaming pipeline computes features and writes them to an online feature store (Redis, DynamoDB). The online feature store is the bridge between the streaming pipeline and the serving layer: it stores the latest value of each feature for each entity (user ID, item ID) so that the model can look it up with sub-10ms latency at serving time.
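
The effect these guarantees protect can be shown with a small in-memory stand-in. This is not Flink code. It is a plain Python sketch of a "events in the last 5 minutes" feature that ignores redelivered events by id, assuming events arrive in timestamp order per user; the class and method names are hypothetical.

```python
from collections import defaultdict, deque


class WindowedEventCounter:
    """Per-user event count over a sliding time window, with duplicate suppression."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)   # user_id -> deque of event timestamps
        # Ids already applied to state; a production system would expire
        # these (for example with a TTL) to bound memory
        self.seen_ids = set()

    def add(self, user_id, event_id, timestamp):
        # A redelivered event (same id) must not be counted twice
        if event_id in self.seen_ids:
            return
        self.seen_ids.add(event_id)
        self.events[user_id].append(timestamp)

    def count(self, user_id, now):
        window = self.events[user_id]
        # Evict events that have fallen out of the window
        while window and window[0] <= now - self.window_seconds:
            window.popleft()
        return len(window)


counter = WindowedEventCounter(window_seconds=300)
counter.add("u1", "e1", timestamp=1000)
counter.add("u1", "e2", timestamp=1100)
counter.add("u1", "e2", timestamp=1100)   # redelivery after a retry
recent = counter.count("u1", now=1200)
```

The retried event is counted once, so `recent` is 2. The value returned by `count` is what a streaming job would write to the online feature store for lookup at serving time.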

Data Quality Monitoring with Great Expectations

Great Expectations is a widely used library for data quality testing in Python ML pipelines. It lets you define expectations about your data (e.g., "this column should never be null", "this value should be between 0 and 1", "this column should have fewer than 100 unique values") and validates them automatically as part of your data pipeline.

import great_expectations as gx

# NOTE: this uses the pre-1.0 fluent API (context.sources); 1.x releases
# changed these entry points, so match it to your installed version.
# training_data_df is assumed to be a pandas DataFrame of the training set.

# Initialize context
context = gx.get_context()

# Register the in-memory DataFrame as a data asset
datasource = context.sources.add_pandas("training_data")
data_asset = datasource.add_dataframe_asset("user_features")
batch_request = data_asset.build_batch_request(dataframe=training_data_df)

# Create expectation suite
context.add_expectation_suite("training_data_validation")

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="training_data_validation",
)

# Set data quality expectations
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_not_be_null("label")
validator.expect_column_values_to_be_between(
    "days_since_last_activity", min_value=0, max_value=365
)
validator.expect_column_proportion_of_unique_values_to_be_between(
    "user_id", min_value=0.99  # user_id should be nearly all unique
)
validator.expect_column_mean_to_be_between(
    "label", min_value=0.01, max_value=0.30  # 1-30% churn rate expected
)

# Run validation and fail the pipeline on any unmet expectation
validation_results = validator.validate()
if not validation_results.success:
    raise ValueError(f"Training data failed quality checks: {validation_results}")

Integrating data quality checks into the training pipeline catches data issues before they produce a bad model. A data quality failure that triggers an alert is far cheaper than a model quality failure that triggers a production incident.


Key Takeaways

Data collection strategy is not a preprocessing step - it is the most consequential architectural decision in an ML system. The retinopathy story illustrates this: 24 months of engineering work failed because the data collection strategy was wrong.

The data flywheel is the most durable competitive moat in ML. Designing for it from day one - with logging infrastructure, data pipelines, and retraining schedules - determines whether your model gets better over time or stays frozen.

Labeling strategy is a cost optimization problem under a quality constraint. Weak supervision and active learning together can achieve 80-90% of human annotation quality at 5-10% of the cost. For most production applications, that tradeoff is excellent.

Distribution auditing - comparing training and serving distributions - should be a standard part of your pre-deployment checklist. A model that passes offline evaluation but fails in production has almost always been bitten by distribution shift that could have been detected before deployment.

The data anti-patterns in this lesson - balanced dataset trap, feedback loop bias, treating all negatives as equal - are not theoretical. They appear in production ML systems at every major technology company. The engineers who catch them early are the ones who have internalized data strategy as a first-class engineering discipline, not an afterthought.

Data governance - GDPR compliance, purpose limitation, data retention, consent - is part of data collection strategy for any ML system that processes personal information. Treating it as a legal department problem rather than an engineering problem is a mistake. The engineers who design the data pipeline are the ones who determine whether it is compliant.

© 2026 EngineersOfAI. All rights reserved.