What is Machine Learning?
Reading time: ~20 minutes | Level: ML Foundations | Role: MLE, ML Engineer, Data Scientist, Research Engineer
A fintech company spent six months building a fraud detection system using hand-crafted rules: if transaction amount > $5,000 AND country != user's home country AND time since last transaction < 2 minutes, flag as suspicious. The rules team maintained a 3,000-line rule file. Every new fraud pattern required a meeting, a code review, and a deployment. The fraud ring adapted in days. The engineering team adapted in weeks. By the time a new rule deployed, the pattern had moved on.
They rewrote the system using gradient boosted trees trained on 50 million labeled transactions. The new system caught 3x more fraud at the same false positive rate, adapted to new patterns through weekly retraining, and the entire rule file was replaced by a 200-line training script.
This is not a story about machine learning being magic. It is a story about the precise problem that machine learning is designed to solve: learning a mapping from inputs to outputs from examples, without explicitly programming every case.
Understanding what ML is - precisely, not vaguely - is the difference between engineers who use it correctly and engineers who reach for it as a hammer.
What You Will Learn
- The three rigorous ways to think about what ML does
- The precise difference between ML and rules-based systems
- The fundamental abstraction: f(x) ≈ y
- When ML is the right engineering choice (and when it is not)
- The major divisions in the ML taxonomy
- Why most production ML failures are data problems, not model problems
Part 1 - Three Ways to Think About ML
Machine learning has been defined in many ways. The best ML engineers hold all three of these views simultaneously - each one is useful in different situations.
View 1: ML as Optimization
The optimization view is the most precise and the most useful for understanding why algorithms work.
Machine learning is the process of finding the parameters of a function that minimize a loss on training data.
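Formally: given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, a hypothesis class of functions $f_\theta$ parameterized by $\theta$, and a loss function $\mathcal{L}$, machine learning solves:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(f_\theta(x_i), y_i)$$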
This is Empirical Risk Minimization (ERM) - the formal foundation of supervised learning.
Most supervised algorithms you will use are variants of this, exactly or approximately. Linear regression minimizes squared error. Logistic regression minimizes cross-entropy. Neural networks minimize whatever loss you specify, usually via gradient descent. Decision trees and random forests greedily reduce an impurity criterion at each split. Support vector machines minimize a hinge loss plus a regularization term that enforces a wide margin.
Why this view matters for engineers: When a model fails, ask first - is the loss function the right one? Is the optimization converging? Is the hypothesis class expressive enough? These questions come directly from the optimization framing.
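As a minimal sketch of this optimization view, the snippet below runs plain gradient descent on the mean squared error of a linear model. The synthetic data, the learning rate, and the step count are illustrative assumptions rather than values from the fraud example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumed for illustration): y = X @ theta_true + noise
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

def empirical_risk(theta):
    """Average squared loss over the training set."""
    residual = X @ theta - y
    return float(np.mean(residual ** 2))

theta = np.zeros(d)      # starting point inside the hypothesis class
initial_risk = empirical_risk(theta)
learning_rate = 0.1      # hand-picked for this toy problem

for _ in range(500):
    # Gradient of the mean squared error with respect to theta
    gradient = (2.0 / n) * (X.T @ (X @ theta - y))
    theta -= learning_rate * gradient

final_risk = empirical_risk(theta)
print(f"risk before: {initial_risk:.4f}, after: {final_risk:.4f}")
print("learned theta:", np.round(theta, 2))
```

Swapping the loss or the model changes `empirical_risk` and the gradient, but the loop keeps the same shape.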
View 2: ML as Compression
The compression view explains why ML generalizes and why overfit models fail.
Machine learning is the process of finding a compact description (a model) that captures the structure in data - discarding noise and retaining signal.
A model with 1 million parameters trained on 10 million examples has compressed 10 million data points into 1 million numbers. If the model generalizes, those numbers captured the true underlying structure. If it overfits, those numbers memorized noise that does not appear in new data.
Minimum Description Length (MDL) theory formalizes this: the best model is the one that provides the shortest combined description of the model and the data given the model. Simpler models that still explain the data are preferred. This is the theoretical foundation for regularization.
Why this view matters for engineers: When you add L2 regularization, you are penalizing the model's "description length." When you use a smaller model than you could, you are choosing compression over memorization. When you do early stopping, you are stopping before the model starts fitting noise.
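One way to see the compression intuition is to compare weight norms with and without an L2 penalty. The sketch below assumes a small synthetic dataset and a hand-picked `alpha`; it only illustrates that the penalty yields smaller weights, not how to choose the penalty strength.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Small, noisy dataset (assumed for illustration): 40 examples, 20 features,
# and only the first 3 features carry signal
X = rng.normal(size=(40, 20))
true_w = np.zeros(20)
true_w[:3] = [1.5, -2.0, 1.0]
y = X @ true_w + rng.normal(scale=1.0, size=40)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha is a hand-picked penalty strength

ols_norm = float(np.linalg.norm(ols.coef_))
ridge_norm = float(np.linalg.norm(ridge.coef_))
print(f"unregularized weight norm:  {ols_norm:.3f}")
print(f"L2-regularized weight norm: {ridge_norm:.3f}")
```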
View 3: ML as Function Approximation
The approximation view is the most intuitive and closest to how practitioners talk.
Machine learning is the process of approximating an unknown function using examples of its input-output behavior.
There exists some true (unknown) function that maps customer features to fraud probability, or image pixels to object class, or text to sentiment. You never know this function. You only observe noisy samples from it. Machine learning uses those samples to approximate it as closely as possible.
This view makes explicit that:
- There is a true signal to find (the function really exists)
- You only see noisy, partial observations of it
- Your model is always an approximation - the question is how good
Why this view matters for engineers: This framing makes the irreducible error concept concrete. No matter how good your model or how much data you have, if the output has inherent randomness (noise), you cannot achieve zero error. The job is to separate signal from noise, not to eliminate all error.
import numpy as np
# Illustration: the true function vs. what ML approximates
np.random.seed(42)
# True function: f*(x) = sin(2πx) + 0.1*x²
# We observe it through noise
def true_function(x):
    return np.sin(2 * np.pi * x) + 0.1 * x**2
n_samples = 50
x = np.sort(np.random.uniform(0, 2, n_samples))
noise = np.random.normal(0, 0.2, n_samples) # irreducible error
y = true_function(x) + noise # observed data: noisy samples
# ML goal: from (x, y) pairs, recover an approximation of true_function
# without knowing true_function directly
print(f"Noise variance in this sample (irreducible): {np.var(noise):.4f}")
print(f"Approximate floor on expected MSE for new data: ~{np.var(noise):.4f}")
# No model can beat this floor on unseen data - it is the irreducible error.
# Training MSE can go lower, but only by memorizing noise.
Part 2 - ML vs. Rules-Based Systems
The choice between ML and hand-crafted rules is an engineering decision, not a philosophical one.
Rules-based systems
# Fraud detection: rules-based approach
def is_fraudulent_rules(transaction: dict) -> bool:
    """Hand-crafted fraud detection rules."""
    # Rule 1: Large transaction in unusual location
    if (transaction['amount'] > 5000 and
            transaction['country'] != transaction['user_home_country']):
        return True
    # Rule 2: Rapid-fire small transactions
    if (transaction['amount'] < 20 and
            transaction['transactions_last_hour'] > 15):
        return True
    # Rule 3: Known high-risk merchants with large amounts
    if transaction['merchant_category'] in {'gambling', 'crypto_exchange'}:
        if transaction['amount'] > 500:
            return True
    # ... 2,997 more rules ...
    return False
ML systems
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
# The "rules" are learned from 50M labeled transactions
# No human needs to enumerate patterns
class FraudDetector:
    def __init__(self):
        self.model = GradientBoostingClassifier(
            n_estimators=500,
            learning_rate=0.05,
            max_depth=6
        )
    def train(self, X_train: np.ndarray, y_train: np.ndarray) -> None:
        """Learn patterns from historical labeled data."""
        self.model.fit(X_train, y_train)
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Returns fraud probability - not a binary rule."""
        return self.model.predict_proba(X)[:, 1]
# The model discovers interactions far too complex for human-written rules:
# "If amount is 3.7 standard deviations above user mean AND
# merchant location entropy over last 30 days < 0.4 AND
# velocity in last 2 minutes > user's 95th percentile..."
The decision matrix
| Dimension | Rules-Based | ML |
|---|---|---|
| Pattern space | Small, enumerable | Large, complex |
| Labeled data needed | None | Yes (hundreds to millions) |
| Explainability | Perfect | Varies by model type |
| Adaptation to drift | Manual update | Retrain on new data |
| Maintenance | High (rule sprawl) | Medium (data pipeline, retraining) |
| Failure mode | Missing rule → miss pattern | Data shift → silent degradation |
| Regulatory compliance | Easy to audit | Requires LIME, SHAP, or simpler model |
Rules-based is right when: rules are known, stable, finite; decisions must be fully auditable; you have very little labeled data; failures are catastrophic.
ML is right when: pattern space is too large to enumerate; you have labeled data; patterns shift over time; you need personalization at scale; continuous outputs are needed.
Part 3 - The Fundamental ML Abstraction: f(x) ≈ y
Strip away every algorithm, every framework - the fundamental abstraction of supervised ML is:
$$\hat{y} = f_\theta(x) \approx y$$
Where:
- $x$ is the input (features, observations)
- $y$ is the output (label, target, prediction)
- $f$ is the model (the function being learned)
- $\theta$ are the parameters (learned from data)
- $\hat{y}$ is the prediction (the model's approximation of $y$)
This abstraction separates:
- What to learn: the mapping $x \mapsto y$
- How to learn it: the optimization algorithm
- What counts as good: the loss function
- How general the learning is: the hypothesis class
import numpy as np
from sklearn.linear_model import LogisticRegression
# The fundamental abstraction made concrete
# x: user-item features for a recommendation system
# y: did the user click? (binary)
x_example = np.array([
    0.85,  # user-item cosine similarity
    12.0,  # item popularity (log scale)
    0.3,   # user recency score
    0.72,  # content category match
    5.0,   # user session length (minutes)
    1.0,   # item is new release (binary)
])
# After training: f_θ maps feature vectors to click probabilities
# f_θ([0.85, 12.0, 0.3, 0.72, 5.0, 1.0]) ≈ 0.73
# → 73% probability of click
# The same abstraction works for every model:
# - Linear regression: f_θ(x) = θᵀx
# - Logistic regression: f_θ(x) = σ(θᵀx)
# - Neural network: f_θ(x) = deep_layers(x; θ)
# - Random forest: f_θ(x) = mean(tree_k(x; θ_k) for k in trees)
# The abstraction does not change. The function class changes.
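To make the abstraction concrete end to end, here is a sketch that learns θ for a logistic regression and evaluates f_θ on one input. The training data, `assumed_weights`, and the feature values in `x_new` are all hypothetical; a real recommender would learn from logged interactions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training set: 6 features per user-item pair, binary click label
n = 2000
X_train = rng.normal(size=(n, 6))
assumed_weights = np.array([1.5, 0.2, 0.5, 1.0, 0.1, 0.3])   # made up for this sketch
click_prob = 1.0 / (1.0 + np.exp(-(X_train @ assumed_weights)))
y_train = (rng.uniform(size=n) < click_prob).astype(int)

# Learn theta from data, then apply f_theta to a new input
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
x_new = np.array([[0.85, 1.2, 0.3, 0.72, 0.5, 1.0]])   # one feature vector (2-D for sklearn)
p_click = float(model.predict_proba(x_new)[0, 1])

print("theta (weights):", np.round(model.coef_[0], 2))
print(f"f_theta(x_new) = {p_click:.2f}")
```

Replacing `LogisticRegression` with another estimator changes the function class, while the fit-then-predict interface stays the same.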
The components of a learning system
┌─────────────────────────────────────────────────────────────┐
│                     ML Learning System                      │
│                                                             │
│  Training Data                Hypothesis Class              │
│  {(x₁,y₁),...,(xₙ,yₙ)}        f: X → Y parameterized by θ   │
│            │                         │                      │
│            └────────────┬────────────┘                      │
│                         ↓                                   │
│            Loss Function ℒ(f_θ(x), y)                       │
│                         ↓                                   │
│   Optimization Algorithm (SGD, Adam, LBFGS...)              │
│                         ↓                                   │
│               Learned Parameters θ̂                          │
│                         ↓                                   │
│          f_θ̂: X → Y (the trained model)                     │
│                         ↓                                   │
│         Evaluation on held-out test data                    │
└─────────────────────────────────────────────────────────────┘
Every system trained this way - from a logistic regression to a large language model such as GPT-4 - is an instantiation of this template. A GPT-style model has a different hypothesis class (transformer), a different loss (next-token cross-entropy), and typically a different optimizer (an Adam variant such as AdamW), but the abstraction is identical.
Part 4 - When ML Is the Right Tool (and When It Isn't)
Use ML when:
1. The pattern is complex and hard to enumerate by hand
Spam detection, image classification, speech recognition, fraud detection - the pattern space is so vast that no engineering team could enumerate the rules.
2. You have labeled data and the signal is stable
If you have 100,000+ labeled examples and the mapping does not change dramatically over time, ML is likely right.
3. The problem involves continuous prediction
Credit scoring, price prediction, demand forecasting - you need a real number, not a yes/no.
4. The task requires personalization at scale
Recommendation systems, dynamic pricing, personalized search - no team writes rules for every individual user.
Do NOT use ML when:
1. A formula or lookup table works
If the rule is tax = income * rate, this is arithmetic, not ML. Applying ML adds complexity and prediction error for no benefit.
2. You have no labeled data and labeling is not feasible
A model with 50 labeled examples and 50 features will likely perform worse than simple heuristics.
3. The problem requires hard guarantees
ML models can fail in unpredictable ways. Safety-critical systems (aircraft control, nuclear safety interlocks) often require formally verified, deterministic logic.
4. The decision needs to be rule-by-rule auditable
Some regulated domains require decisions traceable to explicit rules for compliance. Interpretable models help but sometimes cannot satisfy the requirement.
:::warning The most common mistake
Reaching for a neural network when a decision tree would work, or reaching for ML when a SQL query would work. Start simple. Escalate complexity only when simpler tools fail. This is engineering judgment, not timidity.
:::
Part 5 - The ML Taxonomy
Parametric vs. Nonparametric
| Aspect | Parametric | Nonparametric |
|---|---|---|
| Parameter count | Fixed at architecture design | Grows with training data size |
| Examples | Linear regression, logistic regression, neural networks | KNN, kernel SVM, Gaussian processes |
| Memory at inference | Constant | Scales with training data |
| Inference speed | Fast (compute θᵀx) | Slow (compare to training points) |
| Assumption strength | Assumes a functional form | Fewer shape assumptions |
| Best for | Large scale, latency requirements | Small data, unknown function shape |
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
import numpy as np
np.random.seed(42)
n = 1000
X = np.random.randn(n, 5)
y = X @ np.array([1.0, -0.5, 2.0, 0.3, -1.2]) + np.random.randn(n) * 0.5
# Parametric: Linear Regression
# Fixed number of parameters (5 weights + 1 bias) regardless of data size
lr = LinearRegression().fit(X, y)
# lr.coef_ has 5 values - always 5 values, even with 1M training points
print(f"LR params: {len(lr.coef_) + 1}") # 6 (5 weights + intercept)
# Nonparametric: KNN
# "Model" IS the training data - stores all 1000 points
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
# knn.n_samples_fit_ reports how many training points the fitted model keeps
# Inference: finds the 5 nearest neighbors in the training set - O(n) per query with brute-force search
print(f"KNN stores {knn.n_samples_fit_} training points") # 1000
Generative vs. Discriminative
Discriminative models learn $P(y \mid x)$ - how to discriminate between classes given features. They answer: "Given these features, what is the label?"
Generative models learn the joint distribution $P(x, y)$ or the marginal $P(x)$ - how data is generated. They can answer: "How likely is this data?" and generate new samples.
| Model Type | Learns | Can generate? | Examples |
|---|---|---|---|
| Discriminative | $P(y \mid x)$ | No | Logistic regression, SVM, neural classifiers, BERT fine-tuned |
| Generative | $P(x, y)$ or $P(x)$ | Yes | Naive Bayes, VAE, GAN, diffusion models, GPT |
# Generative vs. discriminative - conceptual illustration
# DISCRIMINATIVE: given features, what is P(spam | features)?
# Logistic regression: P(y=1 | x) = σ(θᵀx)
# Goal: learn the decision boundary between classes
# Cannot generate a "typical spam email"
# GENERATIVE: what does spam look like? Learn P(x | spam)
# Naive Bayes: P(word_i | spam) for each word in vocabulary
# Then classify via Bayes theorem: P(spam | x) ∝ P(x | spam) * P(spam)
# CAN generate a "typical spam email" by sampling from P(x | y=spam)
# In modern practice:
# - GPT-4 is GENERATIVE: learns P(next_token | context)
# - BERT for classification is DISCRIMINATIVE after fine-tuning
# - Stable Diffusion is GENERATIVE: learns P(image | text_caption)
When to prefer discriminative: Prediction is the only goal and sufficient labeled data exists. Given enough labeled data, discriminative models typically achieve better classification accuracy.
When to prefer generative: You need to generate data, detect anomalies, handle missing inputs, or use large unlabeled datasets. Modern pretraining follows a related pattern: models such as GPT and BERT are first trained with self-supervised objectives on unlabeled text (next-token prediction for GPT, masked-token prediction for BERT) and can then be fine-tuned discriminatively on labeled data.
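The contrast can be sketched on toy data. The example below fits a discriminative logistic regression and a hand-rolled Gaussian naive Bayes model (per-class, per-feature normal distributions, which is an assumption about the data) to two synthetic clusters, then samples new inputs from the generative model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two synthetic classes in 2-D (assumed data, not from a real task)
n_per_class = 500
X0 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(n_per_class, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(n_per_class, 2))
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Discriminative: model P(y | x) directly
disc = LogisticRegression().fit(X, y)
disc_acc = float(disc.score(X, y))

# Generative: model P(x | y) as independent Gaussians per feature, plus P(y)
means = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
stds = np.array([X[y == c].std(axis=0) for c in (0, 1)])
priors = np.array([np.mean(y == c) for c in (0, 1)])

def log_joint(x):
    """log P(x | y=c) + log P(y=c) for each class c, shape (n_samples, 2)."""
    z = (x[:, None, :] - means) / stds
    log_lik = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi * stds ** 2), axis=2)
    return log_lik + np.log(priors)

# Bayes rule: pick the class maximizing P(x | y) * P(y)
gen_pred = np.argmax(log_joint(X), axis=1)
gen_acc = float(np.mean(gen_pred == y))

# Only the generative model can sample new inputs for a chosen class
samples_class1 = rng.normal(loc=means[1], scale=stds[1], size=(5, 2))

print(f"discriminative accuracy (train): {disc_acc:.3f}")
print(f"generative accuracy (train):     {gen_acc:.3f}")
print("samples drawn from P(x | y=1):")
print(np.round(samples_class1, 2))
```

Both models classify these well-separated clusters accurately; the practical difference is that the generative model also exposes `means`, `stds`, and `priors`, which is what makes sampling and likelihood-based anomaly checks possible.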
Part 6 - Production Reality Check: Data Problems, Not Model Problems
Practitioner surveys and experience reports on production ML commonly point to the same pattern:
- Most ML engineering time (a commonly quoted figure is ~80%) is spent on data - collection, cleaning, labeling, pipelines
- Most model failures are data failures - drift, labeling errors, leakage, distribution mismatch
- Model complexity has diminishing returns - as a rough rule of thumb, going from logistic regression to a neural network gains a few points of accuracy (on the order of 1–3%), while fixing a serious data quality issue can gain far more (on the order of 10–20%)
The data failure taxonomy
| Failure Type | Description | How to Catch |
|---|---|---|
| Distribution shift | Train distribution ≠ deployment distribution | Monitor input feature statistics in production |
| Label noise | Labels in training set are wrong | Audit a random sample; check inter-annotator agreement |
| Data leakage | Future info leaks into training features | Review feature engineering; use strict temporal splits |
| Sampling bias | Training set not representative | Evaluate metrics sliced by demographic/region/time |
| Feature drift | Feature values change over time | Monitor feature distributions post-deployment |
| Missing data mismatch | Missing patterns differ train vs. serving | Compare missingness rates in training vs. production |
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Illustration: data quality > model complexity
np.random.seed(42)
n_samples = 5000
n_features = 20
# Generate clean dataset
X = np.random.randn(n_samples, n_features)
true_weights = np.random.randn(n_features)
y = (X @ true_weights + np.random.randn(n_samples) * 0.5 > 0).astype(int)
# Scenario B: 20% label noise (common in real annotation pipelines)
y_noisy = y.copy()
noise_idx = np.random.choice(n_samples, size=int(0.2 * n_samples), replace=False)
y_noisy[noise_idx] = 1 - y_noisy[noise_idx] # flip 20% of labels
X_train, X_test, y_clean_tr, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
_, _, y_noisy_tr, _ = train_test_split(
    X, y_noisy, test_size=0.2, random_state=42
)
lr_clean = LogisticRegression(max_iter=1000).fit(X_train, y_clean_tr)
lr_noisy = LogisticRegression(max_iter=1000).fit(X_train, y_noisy_tr)
print(f"Model on clean labels: {accuracy_score(y_test, lr_clean.predict(X_test)):.3f}")
print(f"Model on 20% noise: {accuracy_score(y_test, lr_noisy.predict(X_test)):.3f}")
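# Exact values depend on the NumPy and scikit-learn versions, so none are quoted here.
# Both models share the same features and the same clean test labels,
# so any accuracy gap between them comes from label noise alone.
# Randomly flipped labels are the mild case for a linear model; systematic labeling errors usually cost more.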
:::tip The ML engineer's first question
Before tuning hyperparameters, trying a new architecture, or adding features - ask: "Is my data right?" Check for leakage, labeling errors, distribution shift, and sampling bias first. Data fixes have higher ROI than model changes in the vast majority of cases.
:::
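A first-pass check for distribution shift can be as simple as comparing a feature's training and production distributions. The sketch below uses SciPy's two-sample Kolmogorov-Smirnov test on synthetic data; the window sizes and the `ALERT_P_VALUE` threshold are assumptions, and real monitoring usually tracks many features over rolling windows.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical feature values: training snapshot vs. two production windows
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)
prod_stable = rng.normal(loc=0.0, scale=1.0, size=2000)    # same distribution
prod_shifted = rng.normal(loc=0.5, scale=1.0, size=2000)   # mean has drifted

stable = ks_2samp(train_feature, prod_stable)
shifted = ks_2samp(train_feature, prod_shifted)

print(f"stable window:  KS={stable.statistic:.3f}, p={stable.pvalue:.3g}")
print(f"shifted window: KS={shifted.statistic:.3f}, p={shifted.pvalue:.3g}")

ALERT_P_VALUE = 0.01   # assumed alerting threshold; tune per feature and traffic volume
drift_detected = bool(shifted.pvalue < ALERT_P_VALUE)
print("drift alert on shifted window:", drift_detected)
```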
Part 7 - The System Around the Model
The model.fit(X, y) call is typically less than 5% of the code in a production ML system. The rest is:
Production ML System
├── Data ingestion (10–15%) ← schema validation, connectors
├── Feature engineering (20–30%) ← compute, store, version features
├── Training infrastructure (15%) ← experiment tracking, hyperparams
├── Evaluation framework (10–15%) ← sliced evaluation, metrics
├── Serving infrastructure (15–20%) ← model packaging, API, caching
└── Monitoring (15–20%) ← drift detection, alerting
Understanding what ML is - as optimization, compression, and approximation - gives you the mental model to reason about all of these components, not just the training step.
Interview Questions
Q1: What is Empirical Risk Minimization and why is it the foundation of supervised learning?
Empirical Risk Minimization (ERM) is the principle of finding model parameters that minimize the average loss on the training data:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(f_\theta(x_i), y_i)$$
It is the foundation of supervised learning because:
- It operationalizes learning: Instead of "learn from data" (vague), ERM gives a precise, computable objective.
- Most supervised algorithms are ERM: Linear regression minimizes MSE, logistic regression minimizes cross-entropy, SVMs minimize hinge loss with a margin. The loss and hypothesis class vary; the principle is the same.
- It has theoretical guarantees: Under statistical learning theory conditions (Lesson 12), minimizing empirical risk also minimizes expected risk with high probability given sufficient data.
- It separates concerns cleanly: The loss function encodes what "good" means. The optimizer finds the minimum. The hypothesis class defines learnable functions. These three components are largely independent.
Key limitation: minimizing empirical risk can overfit - fitting training data well without generalizing. This is why regularization (penalty on complexity) and held-out evaluation are essential.
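The overfitting limitation can be illustrated with a model flexible enough to memorize its training set. This sketch assumes synthetic one-dimensional data and uses scikit-learn decision trees; the specific depth limit is arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth function (assumed for illustration)."""
    x = rng.uniform(0, 2, size=(n, 1))
    y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y

X_train, y_train = make_data(200)
X_test, y_test = make_data(200)

# An unconstrained tree can drive empirical risk to zero by memorizing the training set
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
# Limiting depth restricts the hypothesis class, acting as a complexity penalty
shallow = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

for name, model in [("unconstrained", deep), ("max_depth=4", shallow)]:
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name:>13}: train MSE={train_mse:.4f}, held-out MSE={test_mse:.4f}")

deep_train_mse = mean_squared_error(y_train, deep.predict(X_train))
deep_test_mse = mean_squared_error(y_test, deep.predict(X_test))
```

The unconstrained tree reaches zero training error, so only the held-out numbers say anything about generalization.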
Q2: What is the difference between parametric and nonparametric models? When would you choose each in production?
A parametric model has a fixed number of parameters determined by its architecture, independent of training data size. Linear regression with d features always has d+1 parameters. A neural network with a fixed architecture has a fixed parameter count. After training, you discard the training data - the model is fully encoded in θ.
A nonparametric model has a parameter count that effectively grows with training data. KNN stores all n training points. A kernel SVM stores support vectors (a subset of training points). A Gaussian process has an n×n covariance matrix.
Choose parametric for production when:
- Inference latency is critical: parametric inference is O(1) in training set size
- You have large training sets: storing millions of examples is impractical
- The inductive bias fits the problem (e.g., data is roughly linear → linear model)
- Memory is constrained: the model must fit in a small serving container
Choose nonparametric when:
- Dataset is small and you cannot afford wrong assumptions about the function shape
- The problem has natural local structure (nearby points really do have similar labels)
- You are prototyping and want a reasonable baseline with minimal tuning
- The distribution is multimodal or non-convex in ways that parametric families cannot capture
Key tradeoff: nonparametric models require storing training data at inference time (or precomputed kernel matrices) and inference time scales with training set size. This is often prohibitive at production scale.
Q3: A colleague suggests replacing your rules-based content moderation system with an ML model. What questions do you ask before agreeing?
Before replacing a rules-based system with ML:
- What is the labeled data situation? Content moderation requires millions of high-quality labeled examples. Are labels consistent across annotators (check inter-annotator agreement)? Who labels edge cases?
- What are the false positive / false negative costs? Removing legitimate content (FP) causes user churn and possible legal liability. Missing violations (FN) causes regulatory fines. The asymmetry determines whether to optimize precision, recall, or a business-specific metric.
- How stable is the decision boundary? Content policies change. Adversarial actors adapt. If the boundary shifts quarterly, you need a retraining pipeline. What is the infrastructure cost and latency?
- What are the explainability requirements? Regulatory domains often require explaining to users why content was removed. Can the ML model provide auditable explanations? This may constrain model choice.
- What are the latency requirements? Rules execute in microseconds. An ML model adds inference latency. If moderation is inline (real-time posting), this constrains model size.
- What does the existing system achieve? If the rules-based system achieves 99% precision and 97% recall, the ML system needs to meaningfully exceed that to justify the engineering investment and operational complexity.
The answer might be "yes, use ML" - but these questions define the scope and success criteria of the project.
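The cost-asymmetry question can be turned into a quick calculation. The sketch below assumes hypothetical moderation scores and made-up cost ratios, and picks the flagging threshold that minimizes total cost on a sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical moderation scores: violating content tends to score higher
n = 5000
is_violation = rng.uniform(size=n) < 0.1
scores = np.where(
    is_violation,
    rng.beta(5, 2, size=n),   # scores for violating items
    rng.beta(2, 5, size=n),   # scores for legitimate items
)

def best_threshold(cost_fp, cost_fn):
    """Return the score threshold with the lowest total cost on this sample."""
    thresholds = np.linspace(0.0, 1.0, 101)
    costs = []
    for t in thresholds:
        flagged = scores >= t
        false_pos = np.sum(flagged & ~is_violation)
        false_neg = np.sum(~flagged & is_violation)
        costs.append(cost_fp * false_pos + cost_fn * false_neg)
    return float(thresholds[int(np.argmin(costs))])

# Assumed cost ratios, chosen only to show the effect of asymmetry
t_fn_costly = best_threshold(cost_fp=1.0, cost_fn=20.0)   # missing a violation is expensive
t_fp_costly = best_threshold(cost_fp=20.0, cost_fn=1.0)   # wrongly removing content is expensive

print(f"threshold when false negatives are costly: {t_fn_costly:.2f}")
print(f"threshold when false positives are costly: {t_fp_costly:.2f}")
```

When misses are expensive the chosen threshold moves down (flag more); when wrongful removals are expensive it moves up (flag less).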
Q4: Explain the difference between generative and discriminative models. Which achieves better classification accuracy, and why?
Discriminative models learn $P(y \mid x)$ - the conditional probability of the label given the input. They model the decision boundary directly. Examples: logistic regression, SVM, neural classifiers.
Generative models learn the joint distribution $P(x, y)$ or the marginal $P(x)$. To classify, they use Bayes' theorem: $P(y \mid x) \propto P(x \mid y)\,P(y)$. Examples: Naive Bayes, VAE, GAN, GPT.
Which achieves better accuracy? Discriminative models typically achieve lower asymptotic error given sufficient labeled data. The argument: if your goal is $P(y \mid x)$, why model $P(x)$ at all? Every parameter spent modeling how data is generated is a parameter not spent improving the decision boundary.
Ng and Jordan (2002) showed that a generative model (Naive Bayes) approaches its own asymptotic error with fewer labeled examples (faster convergence), while a discriminative model (logistic regression) reaches a lower asymptotic error given enough data.
When generative models win:
- Very limited labeled data (generative models converge faster)
- Generation is the goal (image synthesis, text generation)
- Semi-supervised learning (use unlabeled data for $P(x)$, labeled data for $P(y \mid x)$)
- Anomaly detection (model "normal" distribution, flag deviations)
- Missing features at inference (can marginalize over missing variables)
Q5: In practice, what are the most common reasons production ML models fail, and which failures are preventable at design time?
Failure modes in production ML systems break down roughly as follows (the percentages are order-of-magnitude estimates, not measurements from a single study):
Distribution shift (~30–40%): Training distribution differs from deployment distribution. Preventable by: using data that reflects actual deployment (not historical data for a future-deployed system), implementing input distribution monitoring, designing retraining pipelines with the right cadence.
Data quality issues (~20–30%): Label noise, missing value handling that differs between training and serving, feature pipelines that behave differently offline vs. online. Preventable by: rigorous data validation, consistency tests between offline feature computation and online serving, unit tests for feature pipelines.
Data leakage (~10–20%): Future information in training features yields falsely optimistic offline metrics. Preventable by: strict temporal splits, group-based splits, careful feature engineering review, and always asking "could this feature know the answer before the prediction point?"
Wrong metric/objective (~10%): The model is optimized for a metric that does not align with the business objective. Preventable by: aligning loss functions and evaluation metrics with the business objective at design time, not after.
Model architecture issues (~5–10%): Underfitting or overfitting. Gets the most textbook attention but is least common relative to data issues.
Key insight: Most preventable failures are design and process decisions, not model decisions. This is why ML engineering is fundamentally about data systems and process discipline.
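The strict temporal split mentioned under data leakage can be sketched with scikit-learn's `TimeSeriesSplit`, which assumes rows are ordered by time. The event log here is synthetic.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical event log: rows are already sorted by timestamp (oldest first)
n_events = 100
timestamps = np.arange(n_events)   # stand-in for real event times
X = np.random.default_rng(0).normal(size=(n_events, 4))

splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(X))

for fold, (train_idx, test_idx) in enumerate(folds):
    # Every training row precedes every evaluation row, so no future data leaks in
    print(
        f"fold {fold}: train up to t={timestamps[train_idx].max()}, "
        f"evaluate on t={timestamps[test_idx].min()}..{timestamps[test_idx].max()}"
    )
```

A random shuffle split would mix later events into training, which is exactly the pattern that inflates offline metrics.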
Key Takeaways
- Machine learning is simultaneously optimization (minimize a loss), compression (find a compact description), and function approximation (learn from noisy observations) - hold all three views
- The fundamental abstraction is $f_\theta(x) \approx y$, where $\theta$ are learned by minimizing a loss - this is Empirical Risk Minimization
- ML is the right tool when the pattern space is too large to enumerate, when you have labeled data, and when the mapping is stable enough to learn
- Parametric models have fixed parameter counts; nonparametric models grow with data - each is appropriate in different regimes
- Discriminative models learn $P(y \mid x)$ and typically achieve better accuracy; generative models learn $P(x, y)$ and can generate new data
- In production, most ML failures are data problems: distribution shift, label noise, leakage, sampling bias - not model architecture problems
Next: Lesson 02 - Supervised, Unsupervised, and Reinforcement Learning →
