Module 04: Neural Networks
Why Neural Networks Changed Everything
In 2012, a convolutional neural network called AlexNet won the ImageNet challenge with a top-5 error of about 15%, against roughly 26% for the runner-up. Nothing in classical ML - SVMs, random forests, gradient boosting - had produced a leap like that on a benchmark of this scale. The field pivoted within a few years, and by 2024 neural networks underpin nearly every frontier system: LLMs, image generators, protein structure prediction models, and recommendation engines serving billions of users.
But neural networks are not magic. They are differentiable function approximators - parameterized mathematical functions that learn by gradient descent. They are also infamously difficult to debug, sensitive to initialization, prone to training instability, and capable of spectacular failure modes that don't exist in classical ML. This module teaches you not just how they work, but what breaks in practice and how to fix it.
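To make "parameterized functions that learn by gradient descent" concrete, here is a minimal sketch in plain Python. It assumes the simplest possible model, a line `f(x; w, b) = w * x + b`, a mean squared error loss, and hand-derived gradients; the function name `fit_line`, the learning rate, and the step count are illustrative choices, not something prescribed by this module. A neural network follows the same loop with far more parameters and with gradients computed by backpropagation.

```python
# Fit f(x; w, b) = w * x + b to data generated from y = 2x + 1
# using plain gradient descent on mean squared error (MSE).

def fit_line(xs, ys, lr=0.05, steps=2000):
    w, b = 0.0, 0.0  # parameters start at an arbitrary point
    n = len(xs)
    for _ in range(steps):
        # Forward pass: residuals under the current parameters
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(err^2) with respect to w and b
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Update: step against the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))  # values close to 2 and 1
```

With a learning rate that is too large for this data the same loop diverges instead of converging, which is a small-scale version of the training instability mentioned above.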
:::note Engineering Mindset
Every lesson in this module approaches neural networks from an engineering perspective: here is the concept, here is the math, here is what goes wrong in production, and here is how to diagnose and fix it. Theory exists to explain failure modes, not as an end in itself.
:::
Module Map
How Neural Networks Differ from Classical ML
Classical ML models - linear regression, SVMs, random forests - require feature engineering. A human expert decides which features to compute, how to transform them, and how to combine them. The model learns only the final mapping.
Neural networks learn features automatically. Each layer learns a progressively more abstract representation of the input. A convolutional network doesn't need hand-crafted edge detectors - it learns them. A language model doesn't need POS tags - it learns syntactic structure implicitly.
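A small illustration of what "features built by a hidden layer" means, using XOR. No linear function of the raw inputs `x1` and `x2` computes XOR, so a classical linear model would need a hand-crafted feature such as `x1 * x2`. The sketch below uses a two-unit ReLU hidden layer instead. The weights are set by hand purely for illustration (they are an assumption of this example, not learned values); training would have to discover weights that play the same role.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    # Hidden layer: two intermediate features built from the raw inputs
    h1 = relu(x1 + x2)        # how many inputs are on
    h2 = relu(x1 + x2 - 1.0)  # nonzero only when both inputs are on
    # Output layer: a linear readout of the hidden features
    return h1 - 2.0 * h2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))
```

The output layer is still linear; the problem becomes solvable only because the hidden layer re-represents the input first.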
This creates a different set of engineering problems:
| Classical ML | Neural Networks |
|---|---|
| Feature engineering dominates | Architecture and training dominate |
| Training is usually fast and stable | Training is slow and can diverge |
| Interpretability is easier | Interpretability requires specialized tools |
| Hyperparameters are few | Hyperparameters are many |
| Overfitting is common but obvious | Overfitting patterns are more subtle |
| Debugging is relatively straightforward | Debugging requires gradient inspection |
Neither is universally better. For tabular data with thousands of rows, gradient-boosted trees often beat neural networks. For images, text, and audio at scale, neural networks dominate. Knowing when to use which is a key ML engineering skill.
The Neural Network Engineering Workflow
Every production neural network project follows approximately this sequence. Failures at any stage cascade into the next.
This module teaches each stage in depth, with the connections between them made explicit.
Lesson Breakdown
| # | Lesson | Core Concept | Production Impact |
|---|---|---|---|
| 01 | Perceptron and MLP | Neurons, layers, forward pass | Architecture design decisions |
| 02 | Backpropagation | Gradient computation, chain rule | Gradient bugs, in-place ops |
| 03 | Activation Functions | Non-linearity, saturation, GELU | Activation choice per architecture |
| 04 | Weight Initialization | Kaiming (He), Xavier (Glorot), symmetry breaking | Training stability from step 1 |
| 05 | Batch Normalization | Normalize + scale, train vs eval | Leaving BN in the wrong mode (train vs eval) is a classic bug |
| 06 | Dropout and Regularization | Inverted dropout, L2, label smoothing | Overfitting prevention strategy |
| 07 | Optimizers | Adam, AdamW, SGD+momentum | Choosing optimizer for the task |
| 08 | LR Scheduling | Warmup, cosine, OneCycleLR | LR is often the most impactful hyperparameter |
| 09 | Training Dynamics and Debugging | Loss curves, gradient flow, NaN | Production debugging toolkit |
| 10 | Universal Approximation | Depth vs width theory | Architecture design justification |
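As a preview of the train-versus-eval distinction listed for Lesson 05, here is a deliberately simplified sketch of batch normalization for a single scalar feature. The class `TinyBatchNorm` is hypothetical and written for this example only: it has no learned scale or shift, uses the biased batch variance throughout, and is not meant to mirror any library's implementation. In training mode it normalizes with the current batch's statistics and updates running averages; in eval mode it normalizes with the running averages.

```python
import math

class TinyBatchNorm:
    """Batch norm over a batch of scalars (one feature, no learned scale/shift)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            # Training mode: use this batch's own statistics
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Exponential moving average of the statistics, for use at eval time
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Eval mode: use the statistics accumulated during training
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]

bn = TinyBatchNorm()
for _ in range(200):
    bn([4.0, 6.0, 8.0])    # running stats converge toward mean 6, variance 8/3

bn.training = False
eval_out = bn([9.0])       # normalized against the running statistics

bn.training = True
train_out = bn([9.0])      # normalized against itself: the input is erased

print(eval_out, train_out)
```

In this sketch a single example pushed through in training mode is normalized against its own mean and comes out as zero, and the call also contaminates the running statistics. Both effects are why the mode a model is in at inference time matters.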
Prerequisites
Before this module, you should have solid foundations in:
- Linear algebra: matrix multiplication, eigenvalues, dot products
- Calculus: partial derivatives, chain rule
- Probability: basic distributions, expectations
- Python + NumPy: vectorized operations, broadcasting
- PyTorch basics: tensors, autograd, basic training loop (Modules 01–03 of this curriculum)
If any of these feel shaky, the Module 01 (ML Foundations) lessons on math prerequisites are worth reviewing first.
PyTorch Version and Setup
All code in this module uses:
```text
# Requirements
# torch >= 2.0
# torchvision >= 0.15 (for some examples)
# numpy >= 1.24
# matplotlib >= 3.7 (for visualization examples)
```
Install via:
```bash
pip install torch torchvision numpy matplotlib
```
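After installing, one way to confirm that the environment meets the minimum versions listed above is a short check built on the standard library's `importlib.metadata`. This is a sketch under assumptions: the helper names `parse_major_minor` and `check_environment` are made up for this example, and only the major and minor version numbers are compared.

```python
import re
from importlib import metadata

# Minimum (major, minor) versions taken from the requirements list above
MINIMUMS = {
    "torch": (2, 0),
    "torchvision": (0, 15),
    "numpy": (1, 24),
    "matplotlib": (3, 7),
}

def parse_major_minor(version):
    """Turn a string like '2.1.0+cu118' into (2, 1); unparseable strings give (0, 0)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return (0, 0)
    return int(match.group(1)), int(match.group(2))

def check_environment(minimums=MINIMUMS):
    """Return {package: status string} for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse_major_minor(installed) >= minimum
        report[name] = f"{installed} ({'ok' if ok else 'too old'})"
    return report

for package, status in check_environment().items():
    print(f"{package}: {status}")
```

Any package reported as `missing` or `too old` should be installed or upgraded before running the lesson code.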
Some examples run noticeably faster on a GPU, but all code also runs on CPU. Where a GPU matters for performance, it is noted explicitly.
How to Use This Module
If you are preparing for ML engineering interviews, start with Lessons 01, 02, and 07. These cover the highest-frequency interview topics. Then read 03, 04, and 05 for depth.
If you are debugging a failing training run, go directly to Lesson 09 (Training Dynamics and Debugging). It contains the production checklist.
If you are starting a new project, read Lessons 01 → 04 → 07 → 08 in sequence before writing any training code.
If you are architecting a system, Lesson 10 (Universal Approximation) provides theoretical grounding for architecture choices.
Every lesson ends with interview Q&A covering the questions most commonly asked in interviews for ML engineering roles at top-tier companies.
