What is numpy vectorization?

Understand why Python loops are slow, how NumPy's C-level loops bypass interpreter overhead, broadcasting rules, views vs copies, memory layout, ufuncs, and real-world data pipeline optimization.

How does python loop overhead work in practice?

Vectorization with NumPy - Escaping Python's Loop Overhead covers numpy vectorization, python loop overhead, numpy broadcasting from first principles with code examples. Free lesson at https://engineersofai.com/docs/python/python-advanced/performance-engineering/vectorization-with-numpy

What is the difference between numpy vectorization and numpy broadcasting?

See the full breakdown at https://engineersofai.com/docs/python/python-advanced/performance-engineering/vectorization-with-numpy

Vectorization with NumPy - Escaping Python's Loop Overhead

Predict the performance difference:

import numpy as np
import time

n = 10_000_000

# Pure Python
data_list = list(range(n))
start = time.perf_counter()
result_list = [x * 2 + 1 for x in data_list]
t_python = time.perf_counter() - start

# NumPy
data_array = np.arange(n)
start = time.perf_counter()
result_array = data_array * 2 + 1
t_numpy = time.perf_counter() - start

print(f"Python: {t_python:.3f}s")
print(f"NumPy:  {t_numpy:.3f}s")
print(f"Speedup: {t_python / t_numpy:.0f}x")

Most engineers guess "2-3x faster." The actual result:

Python: 1.250s
NumPy:  0.018s
Speedup: 69x

A 69x speedup from changing two lines. The operation is identical - multiply every element by 2 and add 1. The difference is not the algorithm. It is not parallelism. It is not a better language. It is the elimination of per-element interpreter overhead.

This lesson explains exactly why Python loops are slow at the bytecode level, how NumPy sidesteps this overhead, and the engineering patterns that let you write fast numerical code.

What You Will Learn

Why Python loops are fundamentally slow (bytecode dispatch, PyObject boxing, type checking)
How NumPy's C-level loops bypass interpreter overhead
Broadcasting rules and how to use them correctly
The difference between views and copies (and why it matters for performance)
Memory layout: C-order vs Fortran-order and cache locality
Structured arrays for heterogeneous tabular data
Universal functions (ufuncs) and vectorized operations
Real-world patterns for data pipeline optimization and feature engineering

Prerequisites

Completed Lessons 1-5 (profiling, caching, memory optimization)
Basic familiarity with NumPy arrays (creating, indexing, slicing)
Understanding of CPU cache hierarchy (L1/L2/L3 caches)

Part 1 - Why Python Loops Are Slow

A Python for loop over a list of numbers is doing far more work than you think. Let us trace what happens at the bytecode level:

# This simple loop:
total = 0
for x in [1, 2, 3]:
    total = total + x

import dis
code = compile("""
total = 0
for x in data:
    total = total + x
""", "<string>", "exec")
dis.dis(code)

Each iteration of this loop involves:

Iterator protocol: call __next__ on the iterator (function call overhead)
Name lookup: look up total in the local namespace (dict lookup)
Unboxing: extract the C long from the Python int object
Type checking: determine what + means for these operand types
Arithmetic: actually perform the addition (the only useful work)
Boxing: wrap the C long result back into a Python int object
Name binding: store the result back as total (dict insertion)

For 10 million iterations, steps 1-4 and 6-7 are pure overhead. Step 5 (the actual addition) takes nanoseconds. Everything else takes microseconds.

The Cost in Numbers

import timeit

n = 1_000_000
setup = f"""
import numpy as np
data_list = list(range({n}))
data_array = np.arange({n}, dtype=np.float64)
"""

# Python loop: sum
t_py = timeit.timeit('sum(data_list)', setup=setup, number=10)
# NumPy: sum
t_np = timeit.timeit('np.sum(data_array)', setup=setup, number=10)
# Built-in sum on numpy array (worst of both worlds)
t_bad = timeit.timeit('sum(data_array)', setup=setup, number=10)

print(f"Python sum():     {t_py/10*1000:.1f} ms")
print(f"NumPy np.sum():   {t_np/10*1000:.1f} ms")
print(f"sum(np.array):    {t_bad/10*1000:.1f} ms")  # SLOW!

# Typical output:
# Python sum():     35.2 ms
# NumPy np.sum():    0.8 ms    (44x faster)
# sum(np.array):   120.5 ms    (3.4x SLOWER than Python list!)

:::danger Never Use Python's Built-in sum() on NumPy Arrays sum(numpy_array) iterates element-by-element through the Python interpreter, extracting each element as a Python object. It is slower than sum(list) because NumPy element extraction has more overhead than list element access. Always use np.sum(), array.sum(), or other NumPy functions. :::

Part 2 - NumPy's C-Level Loops

When you write array * 2 + 1, NumPy does not create a Python loop. It calls a compiled C function that operates on the raw memory buffer directly.

How NumPy Arrays Are Stored

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float64)

# NumPy array is a thin wrapper around a contiguous memory buffer
print(f"Data pointer: {a.ctypes.data}")
print(f"Shape: {a.shape}")
print(f"Strides: {a.strides}")      # Bytes between consecutive elements
print(f"Dtype: {a.dtype}")
print(f"Itemsize: {a.itemsize}")     # 8 bytes per float64
print(f"Nbytes: {a.nbytes}")        # Total buffer size
print(f"Contiguous: {a.flags['C_CONTIGUOUS']}")

The key differences:

Contiguous memory: NumPy stores raw C doubles (8 bytes each) in a single contiguous buffer. Lists store pointers to scattered PyFloat objects.
Fixed dtype: NumPy knows every element is float64 at array creation time. No per-element type checking.
C-level iteration: NumPy's compiled C code loops over the buffer with pointer arithmetic. No interpreter involvement per element.

Vectorized Operations

import numpy as np
import time

n = 5_000_000
a = np.random.randn(n)
b = np.random.randn(n)

# Vectorized (C-level loop)
start = time.perf_counter()
c = a * b + np.sin(a) - np.log(np.abs(b) + 1)
t_vec = time.perf_counter() - start

# Python loop equivalent
start = time.perf_counter()
c_py = []
import math
for i in range(n):
    c_py.append(a[i] * b[i] + math.sin(a[i]) - math.log(abs(b[i]) + 1))
t_py = time.perf_counter() - start

print(f"Vectorized: {t_vec:.3f}s")
print(f"Python loop: {t_py:.3f}s")
print(f"Speedup: {t_py / t_vec:.0f}x")
# Vectorized: 0.065s
# Python loop: 8.200s
# Speedup: 126x

Part 3 - Broadcasting Rules

Broadcasting is NumPy's mechanism for performing arithmetic on arrays with different shapes. Understanding it is essential for writing vectorized code without explicit loops.

The Rules

If arrays have different numbers of dimensions, the shape of the smaller array is padded with 1s on the left.
Arrays with size 1 along a dimension behave as if they had the same size as the other array along that dimension - the value is "broadcast" (repeated).
If arrays are incompatible (neither dimension is 1 and they are not equal), NumPy raises ValueError.

import numpy as np

# Rule 1: Different number of dimensions
a = np.array([1, 2, 3])        # Shape: (3,)
b = np.array([[10], [20]])     # Shape: (2, 1)

# Step 1: pad a to (1, 3)
# Step 2: broadcast a from (1, 3) to (2, 3)
#          broadcast b from (2, 1) to (2, 3)
result = a + b
print(result)
# [[11, 12, 13],
#  [21, 22, 23]]
print(result.shape)  # (2, 3)

# Rule 2: Size-1 dimension broadcasts
row = np.array([1, 2, 3])       # Shape (3,) → treated as (1, 3)
col = np.array([[10], [20], [30]])  # Shape (3, 1)

result = row + col
print(result.shape)  # (3, 3)
print(result)
# [[11, 12, 13],
#  [21, 22, 23],
#  [31, 32, 33]]

# Rule 3: Incompatible shapes
a = np.array([1, 2, 3])        # Shape (3,)
b = np.array([1, 2])           # Shape (2,)
# a + b → ValueError: shapes (3,) and (2,) not aligned

Practical Broadcasting

import numpy as np

# Normalize each column to zero mean, unit variance
data = np.random.randn(1000, 5)   # 1000 samples, 5 features

# Without broadcasting (loop):
# for col in range(5):
#     data[:, col] = (data[:, col] - data[:, col].mean()) / data[:, col].std()

# With broadcasting (vectorized):
means = data.mean(axis=0)          # Shape: (5,)
stds = data.std(axis=0)            # Shape: (5,)
normalized = (data - means) / stds  # (1000, 5) - (5,) → broadcasts!

print(f"Means after normalization: {normalized.mean(axis=0)}")  # ~[0, 0, 0, 0, 0]
print(f"Stds after normalization: {normalized.std(axis=0)}")    # ~[1, 1, 1, 1, 1]

# Pairwise distances using broadcasting
points = np.random.randn(100, 3)   # 100 points in 3D

# Compute all pairwise distances without any loop
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]  # (100, 100, 3)
distances = np.sqrt((diff ** 2).sum(axis=2))                 # (100, 100)

print(f"Distance matrix shape: {distances.shape}")  # (100, 100)
print(f"Max distance: {distances.max():.3f}")

:::tip Broadcasting Visualization When in doubt, align shapes from the right:

Array A:     256 x 256 x 3
Array B:                 3
------------------------------
Result:      256 x 256 x 3    (B broadcasts along first two dimensions)

Array A:     256 x   1 x 3
Array B:       1 x 256 x 3
------------------------------
Result:      256 x 256 x 3    (Both broadcast along different dimensions)

:::

Part 4 - Views vs Copies

This distinction is critical for both correctness and performance. A view shares memory with the original array. A copy allocates new memory.

import numpy as np

a = np.arange(10)

# VIEWS - share memory, no allocation cost
v1 = a[2:5]           # Slicing creates a view
v2 = a.reshape(2, 5)  # Reshape creates a view (usually)
v3 = a.T               # Transpose creates a view

# Modifying a view modifies the original!
v1[0] = 999
print(a)  # [ 0,  1, 999,  3,  4,  5,  6,  7,  8,  9]

# Check if it's a view
print(v1.base is a)  # True - v1 is a view of a

# COPIES - new memory allocation
c1 = a.copy()           # Explicit copy
c2 = a[[0, 2, 4]]       # Fancy indexing creates a copy
c3 = a[a > 3]            # Boolean indexing creates a copy

c1[0] = -1
print(a[0])  # 0 - original is unchanged

print(c1.base is a)  # False - c1 is independent

Performance Implications

import numpy as np
import time

n = 100_000_000

a = np.arange(n, dtype=np.float64)

# View: zero-cost - just adjusts the offset and shape
start = time.perf_counter()
for _ in range(1000):
    v = a[::2]  # Every other element (view)
t_view = time.perf_counter() - start

# Copy: allocates n/2 * 8 bytes = ~400 MB
start = time.perf_counter()
c = a[::2].copy()  # Force a copy
t_copy = time.perf_counter() - start

print(f"View creation (1000x): {t_view*1000:.2f} ms")
print(f"Copy creation (1x):    {t_copy*1000:.2f} ms")
# View creation (1000x): 0.15 ms
# Copy creation (1x):    180.00 ms

When Views Bite You

import numpy as np

# Trap: keeping a view of a large array prevents GC of the large array
huge = np.random.randn(100_000_000)  # ~800 MB
small_view = huge[:10]                # Keeps ALL 800 MB alive!

del huge
# The 800 MB is NOT freed because small_view.base still references it

# Fix: copy the small slice if you don't need the rest
small_copy = huge[:10].copy()
del huge
# Now the 800 MB is freed

:::danger Views Keep the Entire Base Array Alive If you slice a small portion of a large array and keep only the slice, the entire base array remains in memory. This is a common source of "memory leaks" in data pipelines. Always .copy() small slices from large arrays when you no longer need the original. :::

Part 5 - Memory Layout: C-Order vs Fortran-Order

NumPy arrays can store data in row-major (C-order) or column-major (Fortran-order). This affects performance due to CPU cache locality.

import numpy as np

# C-order (row-major): elements in a row are contiguous in memory
c_array = np.array([[1, 2, 3],
                     [4, 5, 6]], order='C')
# Memory layout: [1, 2, 3, 4, 5, 6]

# Fortran-order (column-major): elements in a column are contiguous
f_array = np.array([[1, 2, 3],
                     [4, 5, 6]], order='F')
# Memory layout: [1, 4, 2, 5, 3, 6]

print(f"C-order strides: {c_array.strides}")    # (24, 8) - row stride = 24 bytes
print(f"F-order strides: {f_array.strides}")    # (8, 16) - col stride = 8 bytes

Performance Impact

import numpy as np
import time

n = 5000
a_c = np.random.randn(n, n)  # C-order (default)
a_f = np.asfortranarray(a_c)  # F-order copy

# Row-wise sum: C-order wins (rows are contiguous)
start = time.perf_counter()
for _ in range(10):
    row_sums_c = a_c.sum(axis=1)
t_c_row = time.perf_counter() - start

start = time.perf_counter()
for _ in range(10):
    row_sums_f = a_f.sum(axis=1)
t_f_row = time.perf_counter() - start

# Column-wise sum: F-order wins (columns are contiguous)
start = time.perf_counter()
for _ in range(10):
    col_sums_c = a_c.sum(axis=0)
t_c_col = time.perf_counter() - start

start = time.perf_counter()
for _ in range(10):
    col_sums_f = a_f.sum(axis=0)
t_f_col = time.perf_counter() - start

print(f"Row sum - C-order: {t_c_row:.3f}s,  F-order: {t_f_row:.3f}s")
print(f"Col sum - C-order: {t_c_col:.3f}s,  F-order: {t_f_col:.3f}s")
# Row sum - C-order is faster (cache-friendly traversal)
# Col sum - F-order is faster (cache-friendly traversal)

:::tip Choose Memory Layout Based on Access Pattern

Row-wise access (iterating features per sample): use C-order (default)
Column-wise access (iterating samples per feature): use Fortran-order
Most ML libraries expect C-order. Most linear algebra (LAPACK) expects Fortran-order.
NumPy handles the conversion automatically, but the performance cost is real. :::

Part 6 - Structured Arrays

For heterogeneous tabular data, structured arrays combine the memory efficiency of raw arrays with the convenience of named fields.

import numpy as np

# Define a compound dtype
sensor_dtype = np.dtype([
    ('timestamp', np.float64),
    ('sensor_id', np.uint16),
    ('temperature', np.float32),
    ('humidity', np.float32),
    ('status', np.uint8),
])

print(f"Record size: {sensor_dtype.itemsize} bytes")
# 19 bytes per record (vs ~200+ bytes for a Python dict)

# Create structured array
n = 1_000_000
readings = np.zeros(n, dtype=sensor_dtype)

# Fill with data
readings['timestamp'] = np.linspace(1700000000, 1700086400, n)
readings['sensor_id'] = np.random.randint(0, 100, n)
readings['temperature'] = np.random.uniform(15, 35, n).astype(np.float32)
readings['humidity'] = np.random.uniform(30, 90, n).astype(np.float32)
readings['status'] = np.random.randint(0, 4, n)

print(f"Total memory: {readings.nbytes / 1024 / 1024:.1f} MB")
# ~19 MB for 1M records (vs ~200 MB as dicts)

# Access columns efficiently
avg_temp = readings['temperature'].mean()
print(f"Average temperature: {avg_temp:.1f}")

# Filter: active sensors with temp > 30
mask = (readings['status'] == 0) & (readings['temperature'] > 30)
hot_readings = readings[mask]
print(f"Hot readings: {len(hot_readings):,}")

# Sort by timestamp
sorted_readings = np.sort(readings, order='timestamp')

Structured Arrays vs Pandas

import numpy as np
import sys

# Structured array: compact, fast, low overhead
n = 1_000_000
dt = np.dtype([('x', np.float64), ('y', np.float64), ('label', np.int32)])
sa = np.zeros(n, dtype=dt)
print(f"Structured array: {sa.nbytes / 1024 / 1024:.1f} MB")
# ~19.1 MB

# Comparison: if using separate arrays
x = np.zeros(n, dtype=np.float64)     # 8 MB
y = np.zeros(n, dtype=np.float64)     # 8 MB
label = np.zeros(n, dtype=np.int32)   # 4 MB
# Total: 20 MB - slightly more, but column access is faster

# When to use structured arrays:
# - Binary file I/O (direct read/write with np.fromfile/tofile)
# - Interop with C structs via ctypes
# - Memory-mapped structured data
# When to use separate arrays or Pandas:
# - Column-wise operations (filtering, aggregation)
# - Mixed-type analysis
# - When readability matters more than memory

Part 7 - Universal Functions (ufuncs)

Ufuncs are NumPy's mechanism for element-wise operations. Every mathematical operation on arrays (+, *, np.sin, np.exp, etc.) is implemented as a ufunc.

Built-in Ufuncs

import numpy as np

a = np.array([1.0, 4.0, 9.0, 16.0])

# Arithmetic ufuncs
print(np.add(a, 10))        # [11. 14. 19. 26.]  - same as a + 10
print(np.multiply(a, 2))    # [2. 8. 18. 32.]   - same as a * 2
print(np.sqrt(a))           # [1. 2. 3. 4.]

# Reduction ufuncs
print(np.add.reduce(a))     # 30.0 - same as np.sum(a)
print(np.multiply.reduce(a))# 576.0 - same as np.prod(a)

# Accumulation ufuncs
print(np.add.accumulate(a)) # [1. 5. 14. 30.] - cumulative sum

# Outer products
print(np.multiply.outer([1, 2, 3], [10, 20]))
# [[10, 20],
#  [20, 40],
#  [30, 60]]

Writing Custom Ufuncs with np.vectorize

import numpy as np

# np.vectorize is NOT true vectorization - it's a loop wrapper!
# But it provides a convenient interface for applying Python functions
# to arrays.

@np.vectorize
def custom_activation(x):
    """Leaky ReLU activation function."""
    if x > 0:
        return x
    else:
        return 0.01 * x

a = np.array([-2, -1, 0, 1, 2], dtype=np.float64)
print(custom_activation(a))  # [-0.02, -0.01, 0, 1, 2]

# WARNING: np.vectorize is still slow (Python loop underneath)
# For performance, use np.where:
def leaky_relu_vectorized(x, alpha=0.01):
    """True vectorized leaky ReLU."""
    return np.where(x > 0, x, alpha * x)

# Benchmark
import timeit
data = np.random.randn(1_000_000)

t_vectorize = timeit.timeit(lambda: custom_activation(data), number=10)
t_true_vec = timeit.timeit(lambda: leaky_relu_vectorized(data), number=10)

print(f"np.vectorize: {t_vectorize/10*1000:.1f} ms")
print(f"np.where:     {t_true_vec/10*1000:.1f} ms")
# np.vectorize: 450.0 ms
# np.where:       5.2 ms  - 86x faster

:::danger np.vectorize Is Not True Vectorization Despite the name, np.vectorize wraps a Python function in a loop. It provides the interface of vectorized code (broadcasting, dtype handling) but not the performance. For fast element-wise operations, use NumPy's built-in ufuncs, np.where, or np.piecewise. :::

Efficient Conditional Operations

import numpy as np

data = np.random.randn(10_000_000)

# BAD: Python loop
result_bad = np.zeros_like(data)
for i in range(len(data)):          # 10M Python iterations
    if data[i] > 0:
        result_bad[i] = data[i] ** 2
    else:
        result_bad[i] = 0

# GOOD: np.where (vectorized)
result_good = np.where(data > 0, data ** 2, 0)

# BETTER: avoid computing data**2 for elements we discard
result_best = np.zeros_like(data)
mask = data > 0
result_best[mask] = data[mask] ** 2

# All three produce the same result, but:
# BAD: ~12 seconds
# GOOD: ~0.05 seconds
# BETTER: ~0.04 seconds (avoids unnecessary computation)

Part 8 - Real-World: Data Pipeline Optimization

Feature Engineering

import numpy as np

def create_features_python(prices: list, window: int = 20) -> dict:
    """Pure Python feature engineering - slow."""
    n = len(prices)
    features = {
        'returns': [0.0] * n,
        'sma': [0.0] * n,
        'volatility': [0.0] * n,
        'z_score': [0.0] * n,
    }

    for i in range(1, n):
        features['returns'][i] = (prices[i] - prices[i-1]) / prices[i-1]

    for i in range(window, n):
        window_data = prices[i-window:i]
        mean = sum(window_data) / window
        features['sma'][i] = mean

        variance = sum((x - mean) ** 2 for x in window_data) / window
        std = variance ** 0.5
        features['volatility'][i] = std

        if std > 0:
            features['z_score'][i] = (prices[i] - mean) / std

    return features


def create_features_numpy(prices: np.ndarray, window: int = 20) -> dict:
    """Vectorized feature engineering - fast."""
    # Returns: vectorized division
    returns = np.zeros_like(prices)
    returns[1:] = np.diff(prices) / prices[:-1]

    # SMA: cumulative sum trick (O(n) instead of O(n*window))
    cumsum = np.cumsum(prices)
    sma = np.zeros_like(prices)
    sma[window:] = (cumsum[window:] - cumsum[:-window]) / window

    # Volatility: rolling standard deviation
    # Using the cumsum trick for rolling variance
    cumsum_sq = np.cumsum(prices ** 2)
    mean_sq = np.zeros_like(prices)
    mean_sq[window:] = (cumsum_sq[window:] - cumsum_sq[:-window]) / window
    sq_mean = sma ** 2
    volatility = np.zeros_like(prices)
    variance = mean_sq - sq_mean
    # Clamp negative values from floating point errors
    variance = np.maximum(variance, 0)
    volatility[window:] = np.sqrt(variance[window:])

    # Z-score: vectorized division with safe divide
    z_score = np.zeros_like(prices)
    mask = volatility > 0
    z_score[mask] = (prices[mask] - sma[mask]) / volatility[mask]

    return {
        'returns': returns,
        'sma': sma,
        'volatility': volatility,
        'z_score': z_score,
    }


# Benchmark
import time
np.random.seed(42)
prices = np.cumsum(np.random.randn(1_000_000)) + 100
prices_list = prices.tolist()

start = time.perf_counter()
features_py = create_features_python(prices_list)
t_py = time.perf_counter() - start

start = time.perf_counter()
features_np = create_features_numpy(prices)
t_np = time.perf_counter() - start

print(f"Python: {t_py:.3f}s")
print(f"NumPy:  {t_np:.3f}s")
print(f"Speedup: {t_py / t_np:.0f}x")
# Python: 8.500s
# NumPy:  0.035s
# Speedup: 243x

Batch Data Processing

import numpy as np

def process_sensor_batch_vectorized(
    timestamps: np.ndarray,
    values: np.ndarray,
    sensor_ids: np.ndarray,
    min_value: float = -100,
    max_value: float = 100,
) -> dict:
    """
    Process a batch of sensor readings using vectorized operations.
    No Python loops.
    """
    # Step 1: Filter invalid readings
    valid_mask = (values >= min_value) & (values <= max_value) & np.isfinite(values)
    valid_values = values[valid_mask]
    valid_sensors = sensor_ids[valid_mask]
    valid_timestamps = timestamps[valid_mask]

    # Step 2: Per-sensor statistics using np.unique + np.bincount
    unique_sensors, inverse = np.unique(valid_sensors, return_inverse=True)
    n_sensors = len(unique_sensors)

    # Sum per sensor using bincount
    sensor_sums = np.bincount(inverse, weights=valid_values, minlength=n_sensors)
    sensor_counts = np.bincount(inverse, minlength=n_sensors)
    sensor_means = np.divide(
        sensor_sums, sensor_counts,
        out=np.zeros(n_sensors),
        where=sensor_counts > 0,
    )

    # Step 3: Detect anomalies (values > 3 std from sensor mean)
    expected_means = sensor_means[inverse]
    # Compute per-sensor std
    sq_diff = (valid_values - expected_means) ** 2
    sensor_sq_sums = np.bincount(inverse, weights=sq_diff, minlength=n_sensors)
    sensor_stds = np.sqrt(np.divide(
        sensor_sq_sums, sensor_counts,
        out=np.zeros(n_sensors),
        where=sensor_counts > 0,
    ))
    expected_stds = sensor_stds[inverse]

    anomaly_mask = np.abs(valid_values - expected_means) > 3 * expected_stds

    return {
        'n_valid': int(valid_mask.sum()),
        'n_invalid': int((~valid_mask).sum()),
        'n_sensors': n_sensors,
        'n_anomalies': int(anomaly_mask.sum()),
        'sensor_means': dict(zip(unique_sensors.tolist(), sensor_means.tolist())),
    }


# Test with 10 million readings
n = 10_000_000
timestamps = np.arange(n, dtype=np.float64)
values = np.random.randn(n) * 10 + 25  # ~N(25, 10)
sensor_ids = np.random.randint(0, 1000, n).astype(np.int32)

# Inject some anomalies and invalid values
values[::10000] = np.nan
values[::7777] = 999

import time
start = time.perf_counter()
result = process_sensor_batch_vectorized(timestamps, values, sensor_ids)
elapsed = time.perf_counter() - start

print(f"Processed {n:,} readings in {elapsed:.3f}s")
print(f"  Valid: {result['n_valid']:,}")
print(f"  Invalid: {result['n_invalid']:,}")
print(f"  Sensors: {result['n_sensors']}")
print(f"  Anomalies: {result['n_anomalies']:,}")

Key Takeaways

Python loops are slow because of per-element overhead: type checking, boxing/unboxing, dict lookups - not the arithmetic itself. NumPy eliminates all of this.
Vectorization means using array operations instead of loops: a * 2 + 1 instead of [x * 2 + 1 for x in a]. The speedup is typically 10-100x.
Broadcasting lets you avoid explicit loops over dimensions: learn the three rules and you can replace nested loops with single array expressions.
Views share memory; copies allocate new memory: slices are views (free but dangerous), fancy indexing creates copies (safe but expensive). Beware of views keeping large arrays alive.
Memory layout affects cache performance: C-order is fast for row operations, Fortran-order for column operations. Match your layout to your access pattern.
np.vectorize is not vectorization: it wraps a Python loop. Use np.where, boolean indexing, or ufuncs for actual vectorized code.
Use cumulative sum tricks for rolling operations: np.cumsum lets you compute rolling means and variances in O(n) instead of O(n * window).

Graded Practice Challenges

Level 1 - Predict the Output

Question 1: What does this print?

import numpy as np
a = np.arange(12).reshape(3, 4)
b = a[1:3, ::2]
b[0, 0] = 99
print(a[1, 0])

Answer

It prints 99. b = a[1:3, ::2] creates a view (slicing always creates views). b[0, 0] corresponds to a[1, 0]. Modifying the view modifies the original array.

Question 2: What shape does this broadcast produce?

import numpy as np
a = np.ones((5, 1, 3))
b = np.ones((4, 1))
result = a + b
print(result.shape)

Answer

(5, 4, 3). Align from the right:

a: 5 x 1 x 3
b:     4 x 1
-----------
  5 x 4 x 3

Dimension 3: a has 3, b has 1 (from padding) → broadcast b to 3
Dimension 2: a has 1, b has 4 → broadcast a to 4
Dimension 1: a has 5, b has 1 (from padding) → broadcast b to 5

Question 3: Why is sum(numpy_array) slower than sum(python_list) for the same data?

Answer

sum(numpy_array) uses Python's built-in sum, which iterates through the array using Python's iterator protocol. Each element access on a NumPy array requires:

Computing the memory offset from the index
Reading the raw bytes from the buffer
Creating a new Python scalar object (e.g., np.float64)

This per-element overhead is higher than reading a pointer from a Python list (which just dereferences a pointer to an existing PyFloat). The proper way is numpy_array.sum() or np.sum(numpy_array), which iterates in C over the raw buffer.

Level 2 - Debug Challenge

This vectorized code is supposed to compute the moving average but produces incorrect results. Find and fix the bug:

import numpy as np

def moving_average(data, window):
    cumsum = np.cumsum(data)
    result = np.zeros_like(data)
    result[window:] = (cumsum[window:] - cumsum[:-window]) / window
    return result

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(moving_average(data, 3))
# Expected: [0, 0, 0, 2.0, 3.0, 4.0] for window=3
# (average of [1,2,3]=2.0, average of [2,3,4]=3.0, average of [3,4,5]=4.0)

Answer

The bug is an off-by-one error. The formula cumsum[window:] - cumsum[:-window] computes the sum of elements from index i-window+1 to i+1 (exclusive). But we want the average of the window elements ending at index i.

The correct formula needs to account for the fact that cumsum[i] is the sum of elements data[0] through data[i]. The sum of elements from data[i-window+1] to data[i] is cumsum[i] - cumsum[i-window].

def moving_average(data, window):
    cumsum = np.concatenate([[0], np.cumsum(data)])  # Prepend 0
    result = np.zeros_like(data)
    result[window-1:] = (cumsum[window:] - cumsum[:-window]) / window
    return result

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(moving_average(data, 3))
# [0. 0. 2. 3. 4.]

The fix: (1) prepend a 0 to cumsum so that cumsum[0] = 0, cumsum[1] = data[0], etc., and (2) start filling at index window-1 instead of window.

Level 3 - Design Challenge

You have a dataset of 100 million user events (timestamp, user_id, event_type, value). The data is stored in a CSV file (~5 GB). You need to compute:

Hourly event counts per event type
Per-user rolling 7-day average value
Top 100 users by total value

Design a processing pipeline that:

Fits in 16 GB RAM
Completes in under 5 minutes
Uses NumPy vectorization (not Pandas, not Spark)

Solution Sketch

Step 1: Load data efficiently

Don't load 5 GB CSV into Python strings. Use np.loadtxt or np.genfromtxt with specific dtypes, or chunk-read with np.fromfile after converting to binary format.

import numpy as np

# Pre-process: convert CSV to binary format once
# Then load as structured array
dt = np.dtype([
    ('timestamp', np.int64),    # Unix timestamp
    ('user_id', np.int32),
    ('event_type', np.int16),   # Encoded as integer
    ('value', np.float32),
])
# ~14 bytes per record * 100M = 1.4 GB - fits in RAM

data = np.fromfile('events.bin', dtype=dt)

Step 2: Hourly event counts (vectorized)

# Convert timestamps to hour buckets
hours = data['timestamp'] // 3600

# Use np.unique + np.bincount for grouping
event_hours = hours * 1000 + data['event_type']  # Composite key
unique_keys, counts = np.unique(event_hours, return_counts=True)
# Decode: hour = key // 1000, event_type = key % 1000

Step 3: Per-user rolling 7-day average

# Sort by (user_id, timestamp)
sort_idx = np.lexsort((data['timestamp'], data['user_id']))
sorted_data = data[sort_idx]

# For each user, compute rolling average
# Use np.split or np.searchsorted to find user boundaries
user_boundaries = np.searchsorted(
    sorted_data['user_id'],
    np.unique(sorted_data['user_id']),
    side='left'
)
# Process each user's data using cumsum trick for rolling average

Step 4: Top 100 users by total value

unique_users, inverse = np.unique(data['user_id'], return_inverse=True)
user_totals = np.bincount(inverse, weights=data['value'].astype(np.float64))
top_100_idx = np.argpartition(user_totals, -100)[-100:]
top_100_idx = top_100_idx[np.argsort(user_totals[top_100_idx])[::-1]]
top_100 = [(unique_users[i], user_totals[i]) for i in top_100_idx]

Memory budget: ~1.4 GB for data + ~1 GB for intermediate arrays = ~2.4 GB peak. Well within 16 GB.

Performance: All operations are vectorized with O(n) or O(n log n) complexity. Expected runtime: 30-90 seconds for 100M records.

What's Next

NumPy vectorization handles the majority of numerical performance needs. But sometimes even C-level loops over arrays are not fast enough - you need custom C code. In C Extensions and FFI, you will learn to write C extensions, use ctypes and cffi, and accelerate hot paths with Cython and pybind11.

What You Will Learn​

Prerequisites​

Part 1 - Why Python Loops Are Slow​

The Cost in Numbers​

Part 2 - NumPy's C-Level Loops​

How NumPy Arrays Are Stored​

Vectorized Operations​

Part 3 - Broadcasting Rules​

The Rules​

Practical Broadcasting​

Part 4 - Views vs Copies​

Performance Implications​

When Views Bite You​

Part 5 - Memory Layout: C-Order vs Fortran-Order​

Performance Impact​

Part 6 - Structured Arrays​

Structured Arrays vs Pandas​

Part 7 - Universal Functions (ufuncs)​

Built-in Ufuncs​

Writing Custom Ufuncs with np.vectorize​

Efficient Conditional Operations​

Part 8 - Real-World: Data Pipeline Optimization​

Feature Engineering​

Batch Data Processing​

Key Takeaways​

Graded Practice Challenges​

Level 1 - Predict the Output​

Level 2 - Debug Challenge​

Level 3 - Design Challenge​

What's Next​

What You Will Learn

Prerequisites

Part 1 - Why Python Loops Are Slow

The Cost in Numbers

Part 2 - NumPy's C-Level Loops

How NumPy Arrays Are Stored

Vectorized Operations

Part 3 - Broadcasting Rules

The Rules

Practical Broadcasting

Part 4 - Views vs Copies

Performance Implications

When Views Bite You

Part 5 - Memory Layout: C-Order vs Fortran-Order

Performance Impact

Part 6 - Structured Arrays

Structured Arrays vs Pandas

Part 7 - Universal Functions (ufuncs)

Built-in Ufuncs

Writing Custom Ufuncs with np.vectorize

Efficient Conditional Operations

Part 8 - Real-World: Data Pipeline Optimization

Feature Engineering

Batch Data Processing

Key Takeaways

Graded Practice Challenges

Level 1 - Predict the Output

Level 2 - Debug Challenge

Level 3 - Design Challenge

What's Next