Vectorization with NumPy - Escaping Python's Loop Overhead
Predict the performance difference:
import numpy as np
import time
n = 10_000_000
# Pure Python
data_list = list(range(n))
start = time.perf_counter()
result_list = [x * 2 + 1 for x in data_list]
t_python = time.perf_counter() - start
# NumPy
data_array = np.arange(n)
start = time.perf_counter()
result_array = data_array * 2 + 1
t_numpy = time.perf_counter() - start
print(f"Python: {t_python:.3f}s")
print(f"NumPy: {t_numpy:.3f}s")
print(f"Speedup: {t_python / t_numpy:.0f}x")
Most engineers guess "2-3x faster." The actual result:
Python: 1.250s
NumPy: 0.018s
Speedup: 69x
A 69x speedup from changing two lines. The operation is identical - multiply every element by 2 and add 1. The difference is not the algorithm. It is not parallelism. It is not a better language. It is the elimination of per-element interpreter overhead.
This lesson explains exactly why Python loops are slow at the bytecode level, how NumPy sidesteps this overhead, and the engineering patterns that let you write fast numerical code.
What You Will Learn
- Why Python loops are fundamentally slow (bytecode dispatch, PyObject boxing, type checking)
- How NumPy's C-level loops bypass interpreter overhead
- Broadcasting rules and how to use them correctly
- The difference between views and copies (and why it matters for performance)
- Memory layout: C-order vs Fortran-order and cache locality
- Structured arrays for heterogeneous tabular data
- Universal functions (ufuncs) and vectorized operations
- Real-world patterns for data pipeline optimization and feature engineering
Prerequisites
- Completed Lessons 1-5 (profiling, caching, memory optimization)
- Basic familiarity with NumPy arrays (creating, indexing, slicing)
- Understanding of CPU cache hierarchy (L1/L2/L3 caches)
Part 1 - Why Python Loops Are Slow
A Python for loop over a list of numbers is doing far more work than you think. Let us trace what happens at the bytecode level:
# This simple loop:
total = 0
for x in [1, 2, 3]:
total = total + x
import dis
code = compile("""
total = 0
for x in data:
total = total + x
""", "<string>", "exec")
dis.dis(code)
Each iteration of this loop involves:
- Iterator protocol: call
__next__on the iterator (function call overhead) - Name lookup: look up
totalin the local namespace (dict lookup) - Unboxing: extract the C
longfrom the Pythonintobject - Type checking: determine what
+means for these operand types - Arithmetic: actually perform the addition (the only useful work)
- Boxing: wrap the C
longresult back into a Pythonintobject - Name binding: store the result back as
total(dict insertion)
For 10 million iterations, steps 1-4 and 6-7 are pure overhead. Step 5 (the actual addition) takes nanoseconds. Everything else takes microseconds.
The Cost in Numbers
import timeit
n = 1_000_000
setup = f"""
import numpy as np
data_list = list(range({n}))
data_array = np.arange({n}, dtype=np.float64)
"""
# Python loop: sum
t_py = timeit.timeit('sum(data_list)', setup=setup, number=10)
# NumPy: sum
t_np = timeit.timeit('np.sum(data_array)', setup=setup, number=10)
# Built-in sum on numpy array (worst of both worlds)
t_bad = timeit.timeit('sum(data_array)', setup=setup, number=10)
print(f"Python sum(): {t_py/10*1000:.1f} ms")
print(f"NumPy np.sum(): {t_np/10*1000:.1f} ms")
print(f"sum(np.array): {t_bad/10*1000:.1f} ms") # SLOW!
# Typical output:
# Python sum(): 35.2 ms
# NumPy np.sum(): 0.8 ms (44x faster)
# sum(np.array): 120.5 ms (3.4x SLOWER than Python list!)
:::danger Never Use Python's Built-in sum() on NumPy Arrays
sum(numpy_array) iterates element-by-element through the Python interpreter, extracting each element as a Python object. It is slower than sum(list) because NumPy element extraction has more overhead than list element access. Always use np.sum(), array.sum(), or other NumPy functions.
:::
Part 2 - NumPy's C-Level Loops
When you write array * 2 + 1, NumPy does not create a Python loop. It calls a compiled C function that operates on the raw memory buffer directly.
How NumPy Arrays Are Stored
import numpy as np
a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float64)
# NumPy array is a thin wrapper around a contiguous memory buffer
print(f"Data pointer: {a.ctypes.data}")
print(f"Shape: {a.shape}")
print(f"Strides: {a.strides}") # Bytes between consecutive elements
print(f"Dtype: {a.dtype}")
print(f"Itemsize: {a.itemsize}") # 8 bytes per float64
print(f"Nbytes: {a.nbytes}") # Total buffer size
print(f"Contiguous: {a.flags['C_CONTIGUOUS']}")
The key differences:
- Contiguous memory: NumPy stores raw C doubles (8 bytes each) in a single contiguous buffer. Lists store pointers to scattered PyFloat objects.
- Fixed dtype: NumPy knows every element is
float64at array creation time. No per-element type checking. - C-level iteration: NumPy's compiled C code loops over the buffer with pointer arithmetic. No interpreter involvement per element.
Vectorized Operations
import numpy as np
import time
n = 5_000_000
a = np.random.randn(n)
b = np.random.randn(n)
# Vectorized (C-level loop)
start = time.perf_counter()
c = a * b + np.sin(a) - np.log(np.abs(b) + 1)
t_vec = time.perf_counter() - start
# Python loop equivalent
start = time.perf_counter()
c_py = []
import math
for i in range(n):
c_py.append(a[i] * b[i] + math.sin(a[i]) - math.log(abs(b[i]) + 1))
t_py = time.perf_counter() - start
print(f"Vectorized: {t_vec:.3f}s")
print(f"Python loop: {t_py:.3f}s")
print(f"Speedup: {t_py / t_vec:.0f}x")
# Vectorized: 0.065s
# Python loop: 8.200s
# Speedup: 126x
Part 3 - Broadcasting Rules
Broadcasting is NumPy's mechanism for performing arithmetic on arrays with different shapes. Understanding it is essential for writing vectorized code without explicit loops.
The Rules
- If arrays have different numbers of dimensions, the shape of the smaller array is padded with 1s on the left.
- Arrays with size 1 along a dimension behave as if they had the same size as the other array along that dimension - the value is "broadcast" (repeated).
- If arrays are incompatible (neither dimension is 1 and they are not equal), NumPy raises
ValueError.
import numpy as np
# Rule 1: Different number of dimensions
a = np.array([1, 2, 3]) # Shape: (3,)
b = np.array([[10], [20]]) # Shape: (2, 1)
# Step 1: pad a to (1, 3)
# Step 2: broadcast a from (1, 3) to (2, 3)
# broadcast b from (2, 1) to (2, 3)
result = a + b
print(result)
# [[11, 12, 13],
# [21, 22, 23]]
print(result.shape) # (2, 3)
# Rule 2: Size-1 dimension broadcasts
row = np.array([1, 2, 3]) # Shape (3,) → treated as (1, 3)
col = np.array([[10], [20], [30]]) # Shape (3, 1)
result = row + col
print(result.shape) # (3, 3)
print(result)
# [[11, 12, 13],
# [21, 22, 23],
# [31, 32, 33]]
# Rule 3: Incompatible shapes
a = np.array([1, 2, 3]) # Shape (3,)
b = np.array([1, 2]) # Shape (2,)
# a + b → ValueError: shapes (3,) and (2,) not aligned
Practical Broadcasting
import numpy as np
# Normalize each column to zero mean, unit variance
data = np.random.randn(1000, 5) # 1000 samples, 5 features
# Without broadcasting (loop):
# for col in range(5):
# data[:, col] = (data[:, col] - data[:, col].mean()) / data[:, col].std()
# With broadcasting (vectorized):
means = data.mean(axis=0) # Shape: (5,)
stds = data.std(axis=0) # Shape: (5,)
normalized = (data - means) / stds # (1000, 5) - (5,) → broadcasts!
print(f"Means after normalization: {normalized.mean(axis=0)}") # ~[0, 0, 0, 0, 0]
print(f"Stds after normalization: {normalized.std(axis=0)}") # ~[1, 1, 1, 1, 1]
# Pairwise distances using broadcasting
points = np.random.randn(100, 3) # 100 points in 3D
# Compute all pairwise distances without any loop
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :] # (100, 100, 3)
distances = np.sqrt((diff ** 2).sum(axis=2)) # (100, 100)
print(f"Distance matrix shape: {distances.shape}") # (100, 100)
print(f"Max distance: {distances.max():.3f}")
:::tip Broadcasting Visualization When in doubt, align shapes from the right:
Array A: 256 x 256 x 3
Array B: 3
------------------------------
Result: 256 x 256 x 3 (B broadcasts along first two dimensions)
Array A: 256 x 1 x 3
Array B: 1 x 256 x 3
------------------------------
Result: 256 x 256 x 3 (Both broadcast along different dimensions)
:::
Part 4 - Views vs Copies
This distinction is critical for both correctness and performance. A view shares memory with the original array. A copy allocates new memory.
import numpy as np
a = np.arange(10)
# VIEWS - share memory, no allocation cost
v1 = a[2:5] # Slicing creates a view
v2 = a.reshape(2, 5) # Reshape creates a view (usually)
v3 = a.T # Transpose creates a view
# Modifying a view modifies the original!
v1[0] = 999
print(a) # [ 0, 1, 999, 3, 4, 5, 6, 7, 8, 9]
# Check if it's a view
print(v1.base is a) # True - v1 is a view of a
# COPIES - new memory allocation
c1 = a.copy() # Explicit copy
c2 = a[[0, 2, 4]] # Fancy indexing creates a copy
c3 = a[a > 3] # Boolean indexing creates a copy
c1[0] = -1
print(a[0]) # 0 - original is unchanged
print(c1.base is a) # False - c1 is independent
Performance Implications
import numpy as np
import time
n = 100_000_000
a = np.arange(n, dtype=np.float64)
# View: zero-cost - just adjusts the offset and shape
start = time.perf_counter()
for _ in range(1000):
v = a[::2] # Every other element (view)
t_view = time.perf_counter() - start
# Copy: allocates n/2 * 8 bytes = ~400 MB
start = time.perf_counter()
c = a[::2].copy() # Force a copy
t_copy = time.perf_counter() - start
print(f"View creation (1000x): {t_view*1000:.2f} ms")
print(f"Copy creation (1x): {t_copy*1000:.2f} ms")
# View creation (1000x): 0.15 ms
# Copy creation (1x): 180.00 ms
When Views Bite You
import numpy as np
# Trap: keeping a view of a large array prevents GC of the large array
huge = np.random.randn(100_000_000) # ~800 MB
small_view = huge[:10] # Keeps ALL 800 MB alive!
del huge
# The 800 MB is NOT freed because small_view.base still references it
# Fix: copy the small slice if you don't need the rest
small_copy = huge[:10].copy()
del huge
# Now the 800 MB is freed
:::danger Views Keep the Entire Base Array Alive
If you slice a small portion of a large array and keep only the slice, the entire base array remains in memory. This is a common source of "memory leaks" in data pipelines. Always .copy() small slices from large arrays when you no longer need the original.
:::
Part 5 - Memory Layout: C-Order vs Fortran-Order
NumPy arrays can store data in row-major (C-order) or column-major (Fortran-order). This affects performance due to CPU cache locality.
import numpy as np
# C-order (row-major): elements in a row are contiguous in memory
c_array = np.array([[1, 2, 3],
[4, 5, 6]], order='C')
# Memory layout: [1, 2, 3, 4, 5, 6]
# Fortran-order (column-major): elements in a column are contiguous
f_array = np.array([[1, 2, 3],
[4, 5, 6]], order='F')
# Memory layout: [1, 4, 2, 5, 3, 6]
print(f"C-order strides: {c_array.strides}") # (24, 8) - row stride = 24 bytes
print(f"F-order strides: {f_array.strides}") # (8, 16) - col stride = 8 bytes
Performance Impact
import numpy as np
import time
n = 5000
a_c = np.random.randn(n, n) # C-order (default)
a_f = np.asfortranarray(a_c) # F-order copy
# Row-wise sum: C-order wins (rows are contiguous)
start = time.perf_counter()
for _ in range(10):
row_sums_c = a_c.sum(axis=1)
t_c_row = time.perf_counter() - start
start = time.perf_counter()
for _ in range(10):
row_sums_f = a_f.sum(axis=1)
t_f_row = time.perf_counter() - start
# Column-wise sum: F-order wins (columns are contiguous)
start = time.perf_counter()
for _ in range(10):
col_sums_c = a_c.sum(axis=0)
t_c_col = time.perf_counter() - start
start = time.perf_counter()
for _ in range(10):
col_sums_f = a_f.sum(axis=0)
t_f_col = time.perf_counter() - start
print(f"Row sum - C-order: {t_c_row:.3f}s, F-order: {t_f_row:.3f}s")
print(f"Col sum - C-order: {t_c_col:.3f}s, F-order: {t_f_col:.3f}s")
# Row sum - C-order is faster (cache-friendly traversal)
# Col sum - F-order is faster (cache-friendly traversal)
:::tip Choose Memory Layout Based on Access Pattern
- Row-wise access (iterating features per sample): use C-order (default)
- Column-wise access (iterating samples per feature): use Fortran-order
- Most ML libraries expect C-order. Most linear algebra (LAPACK) expects Fortran-order.
- NumPy handles the conversion automatically, but the performance cost is real. :::
Part 6 - Structured Arrays
For heterogeneous tabular data, structured arrays combine the memory efficiency of raw arrays with the convenience of named fields.
import numpy as np
# Define a compound dtype
sensor_dtype = np.dtype([
('timestamp', np.float64),
('sensor_id', np.uint16),
('temperature', np.float32),
('humidity', np.float32),
('status', np.uint8),
])
print(f"Record size: {sensor_dtype.itemsize} bytes")
# 19 bytes per record (vs ~200+ bytes for a Python dict)
# Create structured array
n = 1_000_000
readings = np.zeros(n, dtype=sensor_dtype)
# Fill with data
readings['timestamp'] = np.linspace(1700000000, 1700086400, n)
readings['sensor_id'] = np.random.randint(0, 100, n)
readings['temperature'] = np.random.uniform(15, 35, n).astype(np.float32)
readings['humidity'] = np.random.uniform(30, 90, n).astype(np.float32)
readings['status'] = np.random.randint(0, 4, n)
print(f"Total memory: {readings.nbytes / 1024 / 1024:.1f} MB")
# ~19 MB for 1M records (vs ~200 MB as dicts)
# Access columns efficiently
avg_temp = readings['temperature'].mean()
print(f"Average temperature: {avg_temp:.1f}")
# Filter: active sensors with temp > 30
mask = (readings['status'] == 0) & (readings['temperature'] > 30)
hot_readings = readings[mask]
print(f"Hot readings: {len(hot_readings):,}")
# Sort by timestamp
sorted_readings = np.sort(readings, order='timestamp')
Structured Arrays vs Pandas
import numpy as np
import sys
# Structured array: compact, fast, low overhead
n = 1_000_000
dt = np.dtype([('x', np.float64), ('y', np.float64), ('label', np.int32)])
sa = np.zeros(n, dtype=dt)
print(f"Structured array: {sa.nbytes / 1024 / 1024:.1f} MB")
# ~19.1 MB
# Comparison: if using separate arrays
x = np.zeros(n, dtype=np.float64) # 8 MB
y = np.zeros(n, dtype=np.float64) # 8 MB
label = np.zeros(n, dtype=np.int32) # 4 MB
# Total: 20 MB - slightly more, but column access is faster
# When to use structured arrays:
# - Binary file I/O (direct read/write with np.fromfile/tofile)
# - Interop with C structs via ctypes
# - Memory-mapped structured data
# When to use separate arrays or Pandas:
# - Column-wise operations (filtering, aggregation)
# - Mixed-type analysis
# - When readability matters more than memory
Part 7 - Universal Functions (ufuncs)
Ufuncs are NumPy's mechanism for element-wise operations. Every mathematical operation on arrays (+, *, np.sin, np.exp, etc.) is implemented as a ufunc.
Built-in Ufuncs
import numpy as np
a = np.array([1.0, 4.0, 9.0, 16.0])
# Arithmetic ufuncs
print(np.add(a, 10)) # [11. 14. 19. 26.] - same as a + 10
print(np.multiply(a, 2)) # [2. 8. 18. 32.] - same as a * 2
print(np.sqrt(a)) # [1. 2. 3. 4.]
# Reduction ufuncs
print(np.add.reduce(a)) # 30.0 - same as np.sum(a)
print(np.multiply.reduce(a))# 576.0 - same as np.prod(a)
# Accumulation ufuncs
print(np.add.accumulate(a)) # [1. 5. 14. 30.] - cumulative sum
# Outer products
print(np.multiply.outer([1, 2, 3], [10, 20]))
# [[10, 20],
# [20, 40],
# [30, 60]]
Writing Custom Ufuncs with np.vectorize
import numpy as np
# np.vectorize is NOT true vectorization - it's a loop wrapper!
# But it provides a convenient interface for applying Python functions
# to arrays.
@np.vectorize
def custom_activation(x):
"""Leaky ReLU activation function."""
if x > 0:
return x
else:
return 0.01 * x
a = np.array([-2, -1, 0, 1, 2], dtype=np.float64)
print(custom_activation(a)) # [-0.02, -0.01, 0, 1, 2]
# WARNING: np.vectorize is still slow (Python loop underneath)
# For performance, use np.where:
def leaky_relu_vectorized(x, alpha=0.01):
"""True vectorized leaky ReLU."""
return np.where(x > 0, x, alpha * x)
# Benchmark
import timeit
data = np.random.randn(1_000_000)
t_vectorize = timeit.timeit(lambda: custom_activation(data), number=10)
t_true_vec = timeit.timeit(lambda: leaky_relu_vectorized(data), number=10)
print(f"np.vectorize: {t_vectorize/10*1000:.1f} ms")
print(f"np.where: {t_true_vec/10*1000:.1f} ms")
# np.vectorize: 450.0 ms
# np.where: 5.2 ms - 86x faster
:::danger np.vectorize Is Not True Vectorization
Despite the name, np.vectorize wraps a Python function in a loop. It provides the interface of vectorized code (broadcasting, dtype handling) but not the performance. For fast element-wise operations, use NumPy's built-in ufuncs, np.where, or np.piecewise.
:::
Efficient Conditional Operations
import numpy as np
data = np.random.randn(10_000_000)
# BAD: Python loop
result_bad = np.zeros_like(data)
for i in range(len(data)): # 10M Python iterations
if data[i] > 0:
result_bad[i] = data[i] ** 2
else:
result_bad[i] = 0
# GOOD: np.where (vectorized)
result_good = np.where(data > 0, data ** 2, 0)
# BETTER: avoid computing data**2 for elements we discard
result_best = np.zeros_like(data)
mask = data > 0
result_best[mask] = data[mask] ** 2
# All three produce the same result, but:
# BAD: ~12 seconds
# GOOD: ~0.05 seconds
# BETTER: ~0.04 seconds (avoids unnecessary computation)
Part 8 - Real-World: Data Pipeline Optimization
Feature Engineering
import numpy as np
def create_features_python(prices: list, window: int = 20) -> dict:
"""Pure Python feature engineering - slow."""
n = len(prices)
features = {
'returns': [0.0] * n,
'sma': [0.0] * n,
'volatility': [0.0] * n,
'z_score': [0.0] * n,
}
for i in range(1, n):
features['returns'][i] = (prices[i] - prices[i-1]) / prices[i-1]
for i in range(window, n):
window_data = prices[i-window:i]
mean = sum(window_data) / window
features['sma'][i] = mean
variance = sum((x - mean) ** 2 for x in window_data) / window
std = variance ** 0.5
features['volatility'][i] = std
if std > 0:
features['z_score'][i] = (prices[i] - mean) / std
return features
def create_features_numpy(prices: np.ndarray, window: int = 20) -> dict:
"""Vectorized feature engineering - fast."""
# Returns: vectorized division
returns = np.zeros_like(prices)
returns[1:] = np.diff(prices) / prices[:-1]
# SMA: cumulative sum trick (O(n) instead of O(n*window))
cumsum = np.cumsum(prices)
sma = np.zeros_like(prices)
sma[window:] = (cumsum[window:] - cumsum[:-window]) / window
# Volatility: rolling standard deviation
# Using the cumsum trick for rolling variance
cumsum_sq = np.cumsum(prices ** 2)
mean_sq = np.zeros_like(prices)
mean_sq[window:] = (cumsum_sq[window:] - cumsum_sq[:-window]) / window
sq_mean = sma ** 2
volatility = np.zeros_like(prices)
variance = mean_sq - sq_mean
# Clamp negative values from floating point errors
variance = np.maximum(variance, 0)
volatility[window:] = np.sqrt(variance[window:])
# Z-score: vectorized division with safe divide
z_score = np.zeros_like(prices)
mask = volatility > 0
z_score[mask] = (prices[mask] - sma[mask]) / volatility[mask]
return {
'returns': returns,
'sma': sma,
'volatility': volatility,
'z_score': z_score,
}
# Benchmark
import time
np.random.seed(42)
prices = np.cumsum(np.random.randn(1_000_000)) + 100
prices_list = prices.tolist()
start = time.perf_counter()
features_py = create_features_python(prices_list)
t_py = time.perf_counter() - start
start = time.perf_counter()
features_np = create_features_numpy(prices)
t_np = time.perf_counter() - start
print(f"Python: {t_py:.3f}s")
print(f"NumPy: {t_np:.3f}s")
print(f"Speedup: {t_py / t_np:.0f}x")
# Python: 8.500s
# NumPy: 0.035s
# Speedup: 243x
Batch Data Processing
import numpy as np
def process_sensor_batch_vectorized(
timestamps: np.ndarray,
values: np.ndarray,
sensor_ids: np.ndarray,
min_value: float = -100,
max_value: float = 100,
) -> dict:
"""
Process a batch of sensor readings using vectorized operations.
No Python loops.
"""
# Step 1: Filter invalid readings
valid_mask = (values >= min_value) & (values <= max_value) & np.isfinite(values)
valid_values = values[valid_mask]
valid_sensors = sensor_ids[valid_mask]
valid_timestamps = timestamps[valid_mask]
# Step 2: Per-sensor statistics using np.unique + np.bincount
unique_sensors, inverse = np.unique(valid_sensors, return_inverse=True)
n_sensors = len(unique_sensors)
# Sum per sensor using bincount
sensor_sums = np.bincount(inverse, weights=valid_values, minlength=n_sensors)
sensor_counts = np.bincount(inverse, minlength=n_sensors)
sensor_means = np.divide(
sensor_sums, sensor_counts,
out=np.zeros(n_sensors),
where=sensor_counts > 0,
)
# Step 3: Detect anomalies (values > 3 std from sensor mean)
expected_means = sensor_means[inverse]
# Compute per-sensor std
sq_diff = (valid_values - expected_means) ** 2
sensor_sq_sums = np.bincount(inverse, weights=sq_diff, minlength=n_sensors)
sensor_stds = np.sqrt(np.divide(
sensor_sq_sums, sensor_counts,
out=np.zeros(n_sensors),
where=sensor_counts > 0,
))
expected_stds = sensor_stds[inverse]
anomaly_mask = np.abs(valid_values - expected_means) > 3 * expected_stds
return {
'n_valid': int(valid_mask.sum()),
'n_invalid': int((~valid_mask).sum()),
'n_sensors': n_sensors,
'n_anomalies': int(anomaly_mask.sum()),
'sensor_means': dict(zip(unique_sensors.tolist(), sensor_means.tolist())),
}
# Test with 10 million readings
n = 10_000_000
timestamps = np.arange(n, dtype=np.float64)
values = np.random.randn(n) * 10 + 25 # ~N(25, 10)
sensor_ids = np.random.randint(0, 1000, n).astype(np.int32)
# Inject some anomalies and invalid values
values[::10000] = np.nan
values[::7777] = 999
import time
start = time.perf_counter()
result = process_sensor_batch_vectorized(timestamps, values, sensor_ids)
elapsed = time.perf_counter() - start
print(f"Processed {n:,} readings in {elapsed:.3f}s")
print(f" Valid: {result['n_valid']:,}")
print(f" Invalid: {result['n_invalid']:,}")
print(f" Sensors: {result['n_sensors']}")
print(f" Anomalies: {result['n_anomalies']:,}")
Key Takeaways
- Python loops are slow because of per-element overhead: type checking, boxing/unboxing, dict lookups - not the arithmetic itself. NumPy eliminates all of this.
- Vectorization means using array operations instead of loops:
a * 2 + 1instead of[x * 2 + 1 for x in a]. The speedup is typically 10-100x. - Broadcasting lets you avoid explicit loops over dimensions: learn the three rules and you can replace nested loops with single array expressions.
- Views share memory; copies allocate new memory: slices are views (free but dangerous), fancy indexing creates copies (safe but expensive). Beware of views keeping large arrays alive.
- Memory layout affects cache performance: C-order is fast for row operations, Fortran-order for column operations. Match your layout to your access pattern.
- np.vectorize is not vectorization: it wraps a Python loop. Use np.where, boolean indexing, or ufuncs for actual vectorized code.
- Use cumulative sum tricks for rolling operations:
np.cumsumlets you compute rolling means and variances in O(n) instead of O(n * window).
Graded Practice Challenges
Level 1 - Predict the Output
Question 1: What does this print?
import numpy as np
a = np.arange(12).reshape(3, 4)
b = a[1:3, ::2]
b[0, 0] = 99
print(a[1, 0])
Answer
It prints 99. b = a[1:3, ::2] creates a view (slicing always creates views). b[0, 0] corresponds to a[1, 0]. Modifying the view modifies the original array.
Question 2: What shape does this broadcast produce?
import numpy as np
a = np.ones((5, 1, 3))
b = np.ones((4, 1))
result = a + b
print(result.shape)
Answer
(5, 4, 3). Align from the right:
a: 5 x 1 x 3
b: 4 x 1
-----------
5 x 4 x 3
- Dimension 3: a has 3, b has 1 (from padding) → broadcast b to 3
- Dimension 2: a has 1, b has 4 → broadcast a to 4
- Dimension 1: a has 5, b has 1 (from padding) → broadcast b to 5
Question 3: Why is sum(numpy_array) slower than sum(python_list) for the same data?
Answer
sum(numpy_array) uses Python's built-in sum, which iterates through the array using Python's iterator protocol. Each element access on a NumPy array requires:
- Computing the memory offset from the index
- Reading the raw bytes from the buffer
- Creating a new Python scalar object (e.g.,
np.float64)
This per-element overhead is higher than reading a pointer from a Python list (which just dereferences a pointer to an existing PyFloat). The proper way is numpy_array.sum() or np.sum(numpy_array), which iterates in C over the raw buffer.
Level 2 - Debug Challenge
This vectorized code is supposed to compute the moving average but produces incorrect results. Find and fix the bug:
import numpy as np
def moving_average(data, window):
cumsum = np.cumsum(data)
result = np.zeros_like(data)
result[window:] = (cumsum[window:] - cumsum[:-window]) / window
return result
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(moving_average(data, 3))
# Expected: [0, 0, 0, 2.0, 3.0, 4.0] for window=3
# (average of [1,2,3]=2.0, average of [2,3,4]=3.0, average of [3,4,5]=4.0)
Answer
The bug is an off-by-one error. The formula cumsum[window:] - cumsum[:-window] computes the sum of elements from index i-window+1 to i+1 (exclusive). But we want the average of the window elements ending at index i.
The correct formula needs to account for the fact that cumsum[i] is the sum of elements data[0] through data[i]. The sum of elements from data[i-window+1] to data[i] is cumsum[i] - cumsum[i-window].
def moving_average(data, window):
cumsum = np.concatenate([[0], np.cumsum(data)]) # Prepend 0
result = np.zeros_like(data)
result[window-1:] = (cumsum[window:] - cumsum[:-window]) / window
return result
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(moving_average(data, 3))
# [0. 0. 2. 3. 4.]
The fix: (1) prepend a 0 to cumsum so that cumsum[0] = 0, cumsum[1] = data[0], etc., and (2) start filling at index window-1 instead of window.
Level 3 - Design Challenge
You have a dataset of 100 million user events (timestamp, user_id, event_type, value). The data is stored in a CSV file (~5 GB). You need to compute:
- Hourly event counts per event type
- Per-user rolling 7-day average value
- Top 100 users by total value
Design a processing pipeline that:
- Fits in 16 GB RAM
- Completes in under 5 minutes
- Uses NumPy vectorization (not Pandas, not Spark)
Solution Sketch
Step 1: Load data efficiently
Don't load 5 GB CSV into Python strings. Use np.loadtxt or np.genfromtxt with specific dtypes, or chunk-read with np.fromfile after converting to binary format.
import numpy as np
# Pre-process: convert CSV to binary format once
# Then load as structured array
dt = np.dtype([
('timestamp', np.int64), # Unix timestamp
('user_id', np.int32),
('event_type', np.int16), # Encoded as integer
('value', np.float32),
])
# ~14 bytes per record * 100M = 1.4 GB - fits in RAM
data = np.fromfile('events.bin', dtype=dt)
Step 2: Hourly event counts (vectorized)
# Convert timestamps to hour buckets
hours = data['timestamp'] // 3600
# Use np.unique + np.bincount for grouping
event_hours = hours * 1000 + data['event_type'] # Composite key
unique_keys, counts = np.unique(event_hours, return_counts=True)
# Decode: hour = key // 1000, event_type = key % 1000
Step 3: Per-user rolling 7-day average
# Sort by (user_id, timestamp)
sort_idx = np.lexsort((data['timestamp'], data['user_id']))
sorted_data = data[sort_idx]
# For each user, compute rolling average
# Use np.split or np.searchsorted to find user boundaries
user_boundaries = np.searchsorted(
sorted_data['user_id'],
np.unique(sorted_data['user_id']),
side='left'
)
# Process each user's data using cumsum trick for rolling average
Step 4: Top 100 users by total value
unique_users, inverse = np.unique(data['user_id'], return_inverse=True)
user_totals = np.bincount(inverse, weights=data['value'].astype(np.float64))
top_100_idx = np.argpartition(user_totals, -100)[-100:]
top_100_idx = top_100_idx[np.argsort(user_totals[top_100_idx])[::-1]]
top_100 = [(unique_users[i], user_totals[i]) for i in top_100_idx]
Memory budget: ~1.4 GB for data + ~1 GB for intermediate arrays = ~2.4 GB peak. Well within 16 GB.
Performance: All operations are vectorized with O(n) or O(n log n) complexity. Expected runtime: 30-90 seconds for 100M records.
What's Next
NumPy vectorization handles the majority of numerical performance needs. But sometimes even C-level loops over arrays are not fast enough - you need custom C code. In C Extensions and FFI, you will learn to write C extensions, use ctypes and cffi, and accelerate hot paths with Cython and pybind11.
