Tensors for Deep Learning - From Matrices to Multi-Dimensional Arrays
Reading time: ~24 minutes | Level: Mathematical Foundations → ML Engineering
A batch of 32 images, each 224×224 pixels with 3 color channels, is a 4-dimensional tensor of shape (32, 224, 224, 3). A transformer's attention weight matrix is a 4D tensor of shape (batch, heads, seq_len, seq_len). Everything in deep learning is tensor algebra.
Matrices are 2D arrays. Tensors generalize this to N dimensions. But tensors are not just "bigger matrices" - they come with a precise algebra of contractions, Einstein summation notation, and broadcasting rules that determine how every layer in every neural network performs its computation.
Understanding tensors means understanding exactly what a convolutional layer does (it is a tensor contraction), what scaled dot-product attention computes (it is a batch matrix multiply expressed with einsum), and why vectorized code runs hundreds of times faster than Python loops (it is SIMD and GPU parallelism over tensor operations).
What You Will Learn
- Tensors as the generalization of scalars (0D), vectors (1D), matrices (2D) to N dimensions
- How to read and reason about tensor shapes in deep learning
- Tensor contractions: generalizing matrix multiplication to arbitrary axes
- Einstein summation notation and
np.einsum - Broadcasting: how NumPy and PyTorch extend operations across dimensions
- Vectorization: why tensor ops are fast (SIMD, GPU parallelism)
- NumPy:
einsum, broadcasting,reshapevsview - PyTorch: tensor operations, gradient tracking,
.view()vs.reshape() - ML: scaled dot-product attention from scratch using
einsum
Prerequisites
Part 1 - Tensors as Generalization: 0D to N-D
The hierarchy
| Name | Rank | Shape example | DL example |
|---|---|---|---|
| Scalar | 0D | () | Loss value, learning rate |
| Vector | 1D | (512,) | Embedding, bias vector |
| Matrix | 2D | (512, 256) | Linear layer weights |
| 3D tensor | 3D | (32, 100, 512) | Batch of token sequences |
| 4D tensor | 4D | (32, 3, 224, 224) | Batch of images (NCHW) |
| 5D tensor | 5D | (32, 8, 100, 100, 64) | Multi-head attention scores |
A tensor is simply a multi-dimensional array. Every scalar, vector, and matrix is a special case of a tensor. The number of dimensions is the tensor's rank (also called order or degree).
import numpy as np
# Scalar: rank 0
s = np.array(3.14)
print(f"Scalar: shape={s.shape}, ndim={s.ndim}") # (), 0
# Vector: rank 1
v = np.array([1.0, 2.0, 3.0, 4.0])
print(f"Vector: shape={v.shape}, ndim={v.ndim}") # (4,), 1
# Matrix: rank 2
M = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(f"Matrix: shape={M.shape}, ndim={M.ndim}") # (3, 2), 2
# 3D tensor: batch of sequences
T3 = np.random.randn(32, 100, 512) # (batch, seq_len, d_model)
print(f"3D tensor: shape={T3.shape}")
# 4D tensor: batch of images (NCHW format)
T4 = np.random.randn(32, 3, 224, 224) # (batch, channels, height, width)
print(f"4D tensor: shape={T4.shape}")
# Number of elements
print(f"\n3D tensor elements: {T3.size:,}") # 32 * 100 * 512 = 1,638,400
print(f"4D tensor elements: {T4.size:,}") # 32 * 3 * 224 * 224 = 4,816,896
print(f"4D tensor memory (float32): {T4.nbytes / 1e6:.1f} MB")
# Indexing: select from each dimension
first_image = T4[0] # shape (3, 224, 224) - first image in batch
red_channel = T4[:, 0, :, :] # shape (32, 224, 224) - red channel of all images
top_left_3x3 = T4[:, :, :3, :3] # shape (32, 3, 3, 3) - 3x3 corner of all images
Axes and their semantic meaning
In deep learning, each tensor axis has a semantic meaning. The most important conventions:
NCHW (PyTorch default for images):
N = batch size
C = channels (color / feature maps)
H = height (spatial)
W = width (spatial)
NHWC (TensorFlow default, also faster on some hardware):
N = batch size
H = height
W = width
C = channels
Transformer tensors:
(batch, seq_len, d_model) - token embeddings
(batch, heads, seq_len, d_head) - multi-head attention
(batch, heads, seq_len, seq_len) - attention weights
Always know which axis is which before writing tensor operations - shape bugs are the most common source of incorrect deep learning code.
Part 2 - Tensor Shapes and How to Read Them
import numpy as np
# Reading tensor shapes in deep learning
# A single forward pass through a transformer layer:
batch_size = 4
seq_len = 64
d_model = 256
n_heads = 8
d_head = d_model // n_heads # 32
# Input token embeddings
x = np.random.randn(batch_size, seq_len, d_model)
print(f"Input: {x.shape}") # (4, 64, 256)
# Q, K, V projections: linear layer applied to last dim
# W_Q: (d_model, d_model) → but multi-head splits it as (d_model, n_heads, d_head)
W_Q = np.random.randn(d_model, n_heads, d_head) # (256, 8, 32)
W_K = np.random.randn(d_model, n_heads, d_head)
W_V = np.random.randn(d_model, n_heads, d_head)
# After projection + reshape: (batch, n_heads, seq_len, d_head)
Q = np.einsum('bsd,dnh->bnsh', x, W_Q) # (4, 8, 64, 32)
K = np.einsum('bsd,dnh->bnsh', x, W_K)
V = np.einsum('bsd,dnh->bnsh', x, W_V)
print(f"Q shape: {Q.shape}") # (4, 8, 64, 32)
print(f"K shape: {K.shape}") # (4, 8, 64, 32)
# Attention scores: (batch, heads, seq_len, seq_len)
scores = np.einsum('bnid,bnjd->bnij', Q, K) / np.sqrt(d_head)
print(f"Attention scores: {scores.shape}") # (4, 8, 64, 64)
# Shape arithmetic: always verify
expected_params = d_model * d_model * 3 # Q, K, V projections
print(f"\nExpected QKV params: {expected_params:,}") # 196,608
print(f"Actual QKV params: {(W_Q.size + W_K.size + W_V.size):,}")
# Common shape operations
T = np.random.randn(4, 64, 256)
# Reshape: change shape, preserve total elements
T_flat = T.reshape(4, -1) # (4, 16384) - flatten seq and embedding dims
T_split = T.reshape(4, 64, 8, 32) # (4, 64, 8, 32) - split embedding into heads
print(f"\nOriginal: {T.shape}") # (4, 64, 256)
print(f"Flattened: {T_flat.shape}") # (4, 16384)
print(f"Head-split: {T_split.shape}") # (4, 64, 8, 32)
# Transpose axes
T_transposed = np.transpose(T_split, (0, 2, 1, 3)) # (batch, heads, seq, d_head)
print(f"Transposed: {T_transposed.shape}") # (4, 8, 64, 32)
Part 3 - Tensor Contractions: Generalizing Matrix Multiplication
Matrix multiplication is a contraction over one pair of indices. Tensor contractions generalize this to contracting over any pair (or multiple pairs) of indices between two tensors.
From matrix multiply to general contraction
Matrix multiply: C = A @ B
Cᵢₖ = Σⱼ Aᵢⱼ Bⱼₖ (sum over shared index j)
General tensor contraction: contract over specified shared axes.
Example 1: Dot product of vectors (contract over all indices)
result = Σᵢ uᵢ vᵢ
Example 2: Outer product (no contraction - all indices are free)
Cᵢⱼ = uᵢ vⱼ
Example 3: Batch matrix multiply (contract over last/second-to-last axes)
Cᵦᵢₖ = Σⱼ Aᵦᵢⱼ Bᵦⱼₖ (b = batch dimension, not contracted)
Example 4: Convolution (contract over kernel spatial dimensions)
Outputₙ,ₒ,ₕ,w = Σ_ᵢ Σ_kh Σ_kw Input_{n,i,h+kh,w+kw} * Kernel_{o,i,kh,kw}
This is a contraction over input channels (i), kernel height (kh), and kernel width (kw).
import numpy as np
# Tensor contractions via np.einsum
# 1. Vector dot product
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
dot = np.einsum('i,i->', u, v) # sum over all i
print(f"Dot product: {dot}") # 1*4 + 2*5 + 3*6 = 32.0
# 2. Outer product
outer = np.einsum('i,j->ij', u, v) # no summation - all indices free
print(f"Outer product shape: {outer.shape}") # (3, 3)
# 3. Matrix multiply
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
C = np.einsum('ij,jk->ik', A, B) # contract over j
print(f"MatMul shape: {C.shape}") # (3, 5)
assert np.allclose(C, A @ B)
# 4. Batch matrix multiply
# Each element in the batch multiplies its own pair of matrices
Abatch = np.random.randn(8, 3, 4) # 8 matrices of shape (3,4)
Bbatch = np.random.randn(8, 4, 5) # 8 matrices of shape (4,5)
Cbatch = np.einsum('bij,bjk->bik', Abatch, Bbatch) # b is free (not contracted)
print(f"Batch MatMul: {Cbatch.shape}") # (8, 3, 5)
assert np.allclose(Cbatch, Abatch @ Bbatch) # np.matmul handles batch dims
# 5. Multi-head attention score (the most important deep learning contraction)
batch, heads, seq, d_head = 4, 8, 64, 32
Q = np.random.randn(batch, heads, seq, d_head)
K = np.random.randn(batch, heads, seq, d_head)
# Compute Q @ K.T for all batches and heads simultaneously
# "for each batch b and head h, compute Q[b,h] @ K[b,h].T"
scores = np.einsum('bhid,bhjd->bhij', Q, K) # contract over d_head (d)
print(f"Attention scores: {scores.shape}") # (4, 8, 64, 64)
# Each score[b,h,i,j] = dot product of Q[b,h,i,:] and K[b,h,j,:]
# 6. Convolutional contraction (simplified 1D)
# Input: (batch, in_channels, length)
# Kernel: (out_channels, in_channels, kernel_size)
# Output: (batch, out_channels, length - kernel_size + 1)
batch, in_c, length = 2, 3, 10
out_c, k_size = 4, 3
inp = np.random.randn(batch, in_c, length)
kernel = np.random.randn(out_c, in_c, k_size)
# Naive convolution via einsum (pedagogical - not efficient in practice)
output = np.zeros((batch, out_c, length - k_size + 1))
for pos in range(length - k_size + 1):
patch = inp[:, :, pos:pos + k_size] # (batch, in_c, k_size)
output[:, :, pos] = np.einsum('bik,oik->bo', patch, kernel)
print(f"Conv1D output: {output.shape}") # (2, 4, 8)
Part 4 - Einstein Summation Notation
Einstein summation (einsum) is a compact notation for all tensor contractions. It was invented by Einstein for general relativity - and then adopted wholesale by the deep learning community.
The rule
np.einsum('index_string', tensor1, tensor2)
The index string has the form: 'input1_indices,input2_indices->output_indices'
- Repeated indices with no arrow: summed over (contracted)
- Indices that appear in inputs but not output: summed over
- Indices that appear in output: kept (free indices)
import numpy as np
# Einstein summation reference
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
v = np.random.randn(4)
u = np.random.randn(3)
# Matrix-vector multiply: Av
print(np.einsum('ij,j->i', A, v).shape) # (3,) - j is contracted
# Matrix-matrix multiply: AB
print(np.einsum('ij,jk->ik', A, B).shape) # (3,5) - j contracted
# Element-wise multiply: A * A (Hadamard)
print(np.einsum('ij,ij->ij', A, A).shape) # (3,4) - all indices free
# Trace: sum of diagonal
sq = np.random.randn(4, 4)
trace = np.einsum('ii->', sq)
print(f"Trace: {trace:.4f}")
assert np.isclose(trace, np.trace(sq))
# Outer product: u ⊗ v
outer = np.einsum('i,j->ij', u, v)
print(f"Outer product: {outer.shape}") # (3, 4)
# Batch outer product: for each batch element, compute outer product
batch_u = np.random.randn(8, 3) # 8 vectors of dim 3
batch_v = np.random.randn(8, 5) # 8 vectors of dim 5
batch_outer = np.einsum('bi,bj->bij', batch_u, batch_v)
print(f"Batch outer: {batch_outer.shape}") # (8, 3, 5)
# Transpose
AT = np.einsum('ij->ji', A)
print(f"Transpose: {AT.shape}") # (4, 3)
assert np.allclose(AT, A.T)
# Diagonal extraction
diag = np.einsum('ii->i', sq)
assert np.allclose(diag, np.diag(sq))
# Frobenius norm squared = sum of squared entries
frob_sq = np.einsum('ij,ij->', A, A)
assert np.isclose(frob_sq, np.linalg.norm(A, 'fro')**2)
# Quadratic form: uᵀ A u
M = np.random.randn(3, 3)
M = M.T @ M # make symmetric
u3 = np.random.randn(3)
quad = np.einsum('i,ij,j->', u3, M, u3)
assert np.isclose(quad, u3 @ M @ u3)
print("\nEinsum covers all standard linear algebra operations.")
Performance: einsum is optimized
Modern implementations of np.einsum (and torch.einsum) use optimized BLAS/cuBLAS kernels under the hood. When the optimize=True argument is passed, NumPy finds the optimal contraction order (which can be exponentially faster for multi-tensor einsum chains).
import numpy as np
import time
# Multi-tensor einsum: contraction order matters
A = np.random.randn(100, 200)
B = np.random.randn(200, 300)
C = np.random.randn(300, 50)
# Without optimization: may pick a suboptimal contraction order
t0 = time.perf_counter()
for _ in range(100):
result_naive = np.einsum('ij,jk,kl->il', A, B, C)
t1 = time.perf_counter()
print(f"Naive einsum: {(t1-t0)*10:.2f} ms")
# With optimization: NumPy finds the best contraction path
t0 = time.perf_counter()
for _ in range(100):
result_opt = np.einsum('ij,jk,kl->il', A, B, C, optimize=True)
t1 = time.perf_counter()
print(f"Optimized einsum: {(t1-t0)*10:.2f} ms")
assert np.allclose(result_naive, result_opt)
# Optimized is often 2-10x faster for multi-tensor contractions
Part 5 - Broadcasting: Extending Operations Across Dimensions
Broadcasting allows NumPy and PyTorch to perform operations between tensors of different shapes - without copying data.
The broadcasting rules
When operating on two tensors, NumPy aligns shapes from the rightmost dimension and applies these rules:
- Same size: dimensions match - proceed normally
- One is size 1: the size-1 dimension is "stretched" to match the other
- One is absent: the missing dimension is treated as size 1 and stretched
If any dimension is neither equal nor 1, the operation fails.
import numpy as np
# Rule demonstration: align from right
A = np.ones((3, 4)) # shape (3, 4)
b = np.ones((4,)) # shape (4,) - treated as (1, 4)
print(f"A + b: {(A + b).shape}") # (3, 4) - b broadcast along axis 0
B = np.ones((3, 1)) # shape (3, 1)
print(f"A + B: {(A + B).shape}") # (3, 4) - B broadcast along axis 1
# Both broadcast simultaneously
x = np.ones((5, 1, 3)) # shape (5, 1, 3)
y = np.ones((1, 4, 3)) # shape (1, 4, 3)
print(f"x + y: {(x + y).shape}") # (5, 4, 3)
# Rule: (5,1,3) + (1,4,3) → (5,4,3)
# Broadcasting failure
try:
bad = np.ones((3, 4)) + np.ones((3, 5))
except ValueError as e:
print(f"Expected error: {e}")
# ── ML examples of broadcasting ──────────────────────────────────────────
# 1. Batch bias addition in linear layer
# W: (d_in, d_out) = (512, 256), bias: (256,)
W = np.random.randn(512, 256)
bias = np.random.randn(256)
# Input batch: (batch, d_in) = (32, 512)
x_batch = np.random.randn(32, 512)
output = x_batch @ W + bias # (32, 256) + (256,) → broadcasts to (32, 256)
print(f"\nLinear layer output: {output.shape}") # (32, 256)
# 2. Scaling attention scores by sqrt(d_head)
d_head = 64
scores = np.random.randn(4, 8, 100, 100) # (batch, heads, seq, seq)
scores_scaled = scores / np.sqrt(d_head) # scalar broadcast to all dims
print(f"Scaled scores: {scores_scaled.shape}") # (4, 8, 100, 100)
# 3. Subtracting mean per feature (batch normalization step)
X = np.random.randn(32, 512) # (batch, features)
mean = X.mean(axis=0, keepdims=True) # (1, 512) - keepdims ensures (1,512) not (512,)
std = X.std(axis=0, keepdims=True) # (1, 512)
X_norm = (X - mean) / (std + 1e-8) # (32,512) - (1,512) → broadcasts
print(f"Normalized: {X_norm.shape}") # (32, 512)
# 4. Pairwise Euclidean distances (the NumPy broadcasting trick)
A = np.random.randn(100, 512) # 100 query embeddings
B = np.random.randn(200, 512) # 200 document embeddings
# Expand dims to enable broadcasting:
# A[:,np.newaxis,:] has shape (100, 1, 512)
# B[np.newaxis,:,:] has shape (1, 200, 512)
# Difference: (100, 200, 512)
diff = A[:, np.newaxis, :] - B[np.newaxis, :, :] # (100, 200, 512)
dists = np.sqrt(np.sum(diff**2, axis=-1)) # (100, 200)
print(f"Pairwise distances: {dists.shape}") # (100, 200)
# dists[i,j] = Euclidean distance between A[i] and B[j]
keepdims: the common broadcasting bug
import numpy as np
X = np.random.randn(32, 512)
# WRONG: mean has shape (512,) - not (1,512)
mean_wrong = X.mean(axis=0) # shape: (512,)
print(f"Without keepdims: {mean_wrong.shape}")
# Still works due to broadcasting, but fragile for chaining
X_centered_wrong = X - mean_wrong # (32,512) - (512,) → works, but shape of mean is ambiguous
# RIGHT: keepdims=True preserves axis structure
mean_right = X.mean(axis=0, keepdims=True) # shape: (1, 512)
print(f"With keepdims: {mean_right.shape}")
# Now the subtraction intent is unambiguous:
X_centered_right = X - mean_right # (32,512) - (1,512) → explicit broadcast
Part 6 - Vectorization: Why Tensor Ops Are Fast
The problem with Python loops
Python is an interpreted language. Every line of a Python for loop requires:
- Bytecode interpretation overhead
- Python object creation for each intermediate result
- No opportunity for SIMD (Single Instruction, Multiple Data) vectorization
- No GPU parallelism
Why tensor operations are fast
When you call A @ B in NumPy, it dispatches to BLAS (Basic Linear Algebra Subprograms), a highly optimized C/Fortran library that:
- Uses SIMD instructions (AVX-512 on modern CPUs processes 16 float32s simultaneously)
- Exploits cache hierarchy (blocks computations to maximize L1/L2/L3 cache hits)
- Can use multiple CPU cores in parallel (via OpenBLAS or Intel MKL)
On GPU (via PyTorch CUDA), the same operation uses thousands of parallel cores in cuBLAS, achieving throughput that is orders of magnitude higher.
import numpy as np
import time
n = 1000
A = np.random.randn(n, n)
B = np.random.randn(n, n)
# Python loop: O(n³) operations, all interpreted
def matmul_python_loop(A: np.ndarray, B: np.ndarray) -> np.ndarray:
n, m = A.shape[0], B.shape[1]
k = A.shape[1]
C = np.zeros((n, m))
for i in range(n):
for j in range(m):
for l in range(k):
C[i, j] += A[i, l] * B[l, j]
return C
# NumPy vectorized (BLAS)
t0 = time.perf_counter()
C_numpy = A @ B
t_numpy = time.perf_counter() - t0
print(f"NumPy @ (BLAS): {t_numpy*1000:.2f} ms")
# Partially vectorized (inner loop is vectorized, outer is not)
def matmul_partial(A: np.ndarray, B: np.ndarray) -> np.ndarray:
C = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
for j in range(B.shape[1]):
C[i, j] = A[i, :] @ B[:, j] # vectorized inner product
return C
t0 = time.perf_counter()
C_partial = matmul_partial(A[:50, :50], B[:50, :50]) # smaller for timing
t_partial = time.perf_counter() - t0
print(f"Partial vectorized: {t_partial*1000:.2f} ms (on 50x50 submatrix)")
# Speedup comparison
print(f"\nNumPy BLAS is approximately {50**3 / (t_numpy * 1e9 / n**3):.0f}x faster")
print("(SIMD + cache optimization + multi-threading vs interpreted Python)")
# GPU speedup (if torch is available)
try:
import torch
A_t = torch.randn(n, n)
B_t = torch.randn(n, n)
# CPU PyTorch
t0 = time.perf_counter()
C_cpu = A_t @ B_t
t_cpu = time.perf_counter() - t0
print(f"\nPyTorch CPU: {t_cpu*1000:.2f} ms")
if torch.cuda.is_available():
A_gpu = A_t.cuda()
B_gpu = B_t.cuda()
torch.cuda.synchronize()
t0 = time.perf_counter()
C_gpu = A_gpu @ B_gpu
torch.cuda.synchronize()
t_gpu = time.perf_counter() - t0
print(f"PyTorch GPU: {t_gpu*1000:.2f} ms")
print(f"GPU speedup: {t_cpu/t_gpu:.1f}x")
except ImportError:
print("PyTorch not available")
Memory layout and cache performance
NumPy arrays are stored in contiguous memory (unlike Python lists, which store pointers). Two layouts exist:
- C-order (row-major): row elements are contiguous - A[0,0], A[0,1], A[0,2], ...
- Fortran-order (column-major): column elements are contiguous - A[0,0], A[1,0], A[2,0], ...
import numpy as np
# C-order: iterating over rows is cache-friendly
A_C = np.ascontiguousarray(np.random.randn(1000, 1000)) # C-order
A_F = np.asfortranarray(A_C) # Fortran-order
print(f"C-order flags: {A_C.flags['C_CONTIGUOUS']}, {A_C.flags['F_CONTIGUOUS']}")
print(f"F-order flags: {A_F.flags['C_CONTIGUOUS']}, {A_F.flags['F_CONTIGUOUS']}")
# Row sum: fast for C-order (rows are contiguous)
import time
n_trials = 100
t0 = time.perf_counter()
for _ in range(n_trials):
r_C = A_C.sum(axis=1)
t1 = time.perf_counter()
print(f"Row sum, C-order: {(t1-t0)/n_trials*1000:.3f} ms")
t0 = time.perf_counter()
for _ in range(n_trials):
r_F = A_F.sum(axis=1)
t1 = time.perf_counter()
print(f"Row sum, F-order: {(t1-t0)/n_trials*1000:.3f} ms")
# C-order should be faster for row sums (rows are contiguous in memory)
Part 7 - NumPy: einsum, Broadcasting, reshape vs view
reshape vs view: the critical difference
import numpy as np
A = np.random.randn(4, 6)
# reshape: changes the shape, may or may not copy data
B = A.reshape(8, 3) # same total elements: 4*6 = 8*3 = 24
print(f"reshape: {B.shape}")
print(f"shares memory? {np.shares_memory(A, B)}") # Usually True for contiguous arrays
# A view shares the same underlying data - modifying B modifies A
B[0, 0] = 999.0
print(f"A[0,0] = {A[0,0]}") # also 999.0 if they share memory
# Transpose creates a non-contiguous view
A_T = A.T
print(f"\nA.T contiguous? {A_T.flags['C_CONTIGUOUS']}") # False
# Calling reshape on a non-contiguous array may force a copy
A_T_reshaped = A_T.reshape(-1) # may copy
# -1 in reshape: infer that dimension automatically
A_flat = A.reshape(-1) # shape (24,) - flatten everything
A_batch = A.reshape(2, -1, 6) # shape (2, 2, 6) - infer middle dim
print(f"\nFlattened: {A_flat.shape}") # (24,)
print(f"Batched: {A_batch.shape}") # (2, 2, 6)
# Transpose / permute: reorder axes
T = np.random.randn(4, 8, 64, 32) # (batch, heads, seq, d_head)
# Permute: heads before seq
T_perm = np.transpose(T, (0, 2, 1, 3)) # (batch, seq, heads, d_head)
print(f"\nPermuted: {T_perm.shape}") # (4, 64, 8, 32)
# np.moveaxis: more readable than np.transpose
T_moved = np.moveaxis(T, 1, 2) # move axis 1 (heads) to position 2
print(f"Moved: {T_moved.shape}") # (4, 64, 8, 32)
# ── np.einsum complete examples ───────────────────────────────────────────
print("\n=== einsum reference ===")
a = np.random.randn(3)
b = np.random.randn(4)
M = np.random.randn(3, 4)
N = np.random.randn(4, 5)
operations = {
"dot product": ('i,i->', a, a),
"outer product": ('i,j->ij', a, b),
"matrix-vector": ('ij,j->i', M, b),
"matrix multiply": ('ij,jk->ik', M, N),
"transpose": ('ij->ji', M, None),
"trace": ('ii->', np.random.randn(4,4), None),
"element-wise": ('ij,ij->ij', M, M.copy()),
"sum all": ('ij->', M, None),
"row sums": ('ij->i', M, None),
"col sums": ('ij->j', M, None),
}
for name, (idx, x, y) in operations.items():
if y is None:
result = np.einsum(idx, x)
else:
result = np.einsum(idx, x, y)
shape = result.shape if hasattr(result, 'shape') else ()
print(f" {name:<22}: einsum('{idx}') → {shape}")
Part 8 - PyTorch: Tensors, Gradients, and GPU Operations
# PyTorch tensor operations - GPU-accelerated, gradient-tracked
try:
import torch
import numpy as np
# ── Creating tensors ──────────────────────────────────────────────────
# From Python list
t1 = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(f"dtype: {t1.dtype}") # torch.float32
# From NumPy (zero-copy when on CPU)
np_arr = np.random.randn(3, 4)
t_from_np = torch.from_numpy(np_arr) # shares memory!
np_from_t = t1.numpy() # shares memory! (CPU only)
# Initialization patterns
zeros = torch.zeros(3, 4)
ones = torch.ones(3, 4)
rand = torch.randn(3, 4) # standard normal
eye = torch.eye(4) # identity matrix
# ── Device handling ───────────────────────────────────────────────────
device = 'cuda' if torch.cuda.is_available() else 'cpu'
t_gpu = torch.randn(3, 4, device=device)
print(f"Device: {t_gpu.device}")
# ── Gradient tracking ─────────────────────────────────────────────────
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum() # scalar output
y.backward() # compute gradients
print(f"x.grad = {x.grad}") # [2x₁, 2x₂, 2x₃] = [2.0, 4.0, 6.0]
# Gradient does not flow through numpy() - stop gradient explicitly
x_detached = x.detach().numpy() # detach from computation graph first
# ── view vs reshape ───────────────────────────────────────────────────
T = torch.randn(4, 6)
# view: zero-copy, requires contiguous memory
T_view = T.view(8, 3) # shares memory
print(f"view: {T_view.shape}")
# reshape: tries view first, copies if needed
T_reshaped = T.reshape(8, 3) # may or may not copy
# After transpose, view fails - use contiguous() first
T_t = T.t()
try:
T_t_view = T_t.view(-1) # RuntimeError: not contiguous
except RuntimeError as e:
print(f"Expected: {e}")
T_t_view = T_t.contiguous().view(-1) # fix: make contiguous first
print(f"After contiguous().view: {T_t_view.shape}")
# ── PyTorch einsum ────────────────────────────────────────────────────
A = torch.randn(4, 8, 64, 32) # (batch, heads, seq, d_head)
B = torch.randn(4, 8, 64, 32)
# Batch matrix multiply across heads: Q @ K.T
scores = torch.einsum('bhid,bhjd->bhij', A, B)
print(f"\nAttention scores: {scores.shape}") # (4, 8, 64, 64)
# torch.linalg - GPU-accelerated linear algebra
M = torch.randn(10, 8)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)
print(f"SVD: U={U.shape}, S={S.shape}, Vh={Vh.shape}")
vals = torch.linalg.eigvalsh(M.T @ M)
print(f"Eigenvalues of MᵀM: {vals.shape}")
print("\nPyTorch import successful.")
except ImportError:
print("PyTorch not installed. Install with: pip install torch")
Part 9 - ML: Scaled Dot-Product Attention with einsum
Scaled dot-product attention is the core operation in every transformer. It computes:
Attention(Q, K, V) = softmax(QKᵀ / sqrt(d_k)) V
Where:
- Q ∈ ℝ^(batch × heads × seq_len × d_k) - queries
- K ∈ ℝ^(batch × heads × seq_len × d_k) - keys
- V ∈ ℝ^(batch × heads × seq_len × d_v) - values
- Output ∈ ℝ^(batch × heads × seq_len × d_v)
import numpy as np
def scaled_dot_product_attention(
Q: np.ndarray,
K: np.ndarray,
V: np.ndarray,
mask: np.ndarray = None
) -> tuple:
"""
Scaled dot-product attention from scratch using einsum.
Q: (batch, heads, seq_len, d_k)
K: (batch, heads, seq_len, d_k)
V: (batch, heads, seq_len, d_v)
Returns:
- output: (batch, heads, seq_len, d_v) - attended values
- weights: (batch, heads, seq_len, seq_len) - attention weights
"""
d_k = Q.shape[-1]
# Step 1: Compute raw attention scores
# "for each batch b and head h, compute dot product of every query with every key"
# Q[b,h,i,:] · K[b,h,j,:] = score of query i attending to key j
scores = np.einsum('bhid,bhjd->bhij', Q, K) # (batch, heads, seq, seq)
# Step 2: Scale by sqrt(d_k)
# Without scaling, dot products grow large in magnitude as d_k increases,
# pushing softmax into saturation (near-zero gradients)
scores = scores / np.sqrt(d_k)
# Step 3: Optional causal mask (for autoregressive models)
# Prevent position i from attending to positions j > i (future tokens)
if mask is not None:
scores = scores + mask # mask contains 0 for allowed, -inf for forbidden
# Step 4: Softmax over the key dimension (last axis)
# Converts scores to probabilities (sum to 1 over all keys for each query)
scores_max = scores.max(axis=-1, keepdims=True) # for numerical stability
exp_scores = np.exp(scores - scores_max)
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True) # (batch,h,sq,sq)
# Step 5: Weighted sum of values
# "for each query position i, compute weighted average of all value vectors"
# weights[b,h,i,j] * V[b,h,j,d] summed over j
output = np.einsum('bhij,bhjd->bhid', weights, V) # (batch, heads, seq, d_v)
return output, weights
# Multi-head attention: full implementation
def multi_head_attention(
x: np.ndarray,
W_Q: np.ndarray,
W_K: np.ndarray,
W_V: np.ndarray,
W_O: np.ndarray,
n_heads: int,
causal: bool = False
) -> np.ndarray:
"""
Full multi-head attention.
x: (batch, seq_len, d_model)
W_Q: (d_model, d_model)
W_K: (d_model, d_model)
W_V: (d_model, d_model)
W_O: (d_model, d_model) - output projection
"""
batch, seq_len, d_model = x.shape
d_head = d_model // n_heads
# Linear projections
Q = x @ W_Q # (batch, seq, d_model)
K = x @ W_K
V = x @ W_V
# Reshape to (batch, n_heads, seq_len, d_head)
def split_heads(T: np.ndarray) -> np.ndarray:
# T: (batch, seq, d_model) → (batch, seq, n_heads, d_head)
T = T.reshape(batch, seq_len, n_heads, d_head)
# → (batch, n_heads, seq, d_head)
return np.transpose(T, (0, 2, 1, 3))
Q = split_heads(Q)
K = split_heads(K)
V = split_heads(V)
# Causal mask: upper triangular, -inf above diagonal
mask = None
if causal:
causal_mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
mask = causal_mask[np.newaxis, np.newaxis, :, :] # broadcast over batch, heads
# Scaled dot-product attention
attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)
# Merge heads: (batch, n_heads, seq, d_head) → (batch, seq, d_model)
attn_output = np.transpose(attn_output, (0, 2, 1, 3)) # (batch, seq, h, d_head)
attn_output = attn_output.reshape(batch, seq_len, d_model) # (batch, seq, d_model)
# Output projection
output = attn_output @ W_O # (batch, seq, d_model)
return output, attn_weights
# Test the full implementation
np.random.seed(42)
batch_size = 4
seq_len = 16
d_model = 64
n_heads = 4
x = np.random.randn(batch_size, seq_len, d_model)
# Initialize weights (in practice: Xavier / Kaiming initialization)
W_Q = np.random.randn(d_model, d_model) * 0.02
W_K = np.random.randn(d_model, d_model) * 0.02
W_V = np.random.randn(d_model, d_model) * 0.02
W_O = np.random.randn(d_model, d_model) * 0.02
output, weights = multi_head_attention(x, W_Q, W_K, W_V, W_O, n_heads, causal=True)
print(f"Input shape: {x.shape}") # (4, 16, 64)
print(f"Output shape: {output.shape}") # (4, 16, 64)
print(f"Attention weights: {weights.shape}") # (4, 4, 16, 16)
print(f"Weights sum to 1: {np.allclose(weights.sum(axis=-1), 1.0)}") # True
# Verify causal: no position attends to future positions
# weights[b, h, i, j] should be 0 for j > i
causal_check = weights[0, 0] # first batch, first head
future_weight = causal_check[np.triu_indices(seq_len, k=1)]
print(f"Future attention weights (should be ~0): max = {future_weight.max():.2e}")
Interview Questions
Q1: What is Einstein summation notation and how does np.einsum work?
Einstein summation notation is a compact way to write tensor operations. The key convention: any index that appears twice in an expression (once in each operand) is summed over. Any index that appears only once is "free" - it appears in the output.
np.einsum('ij,jk->ik', A, B) means:
- A has indices i, j
- B has indices j, k
- j appears in both inputs but not in output → sum over j
- i and k appear in output → free indices
This is equivalent to matrix multiplication: Cᵢₖ = Σⱼ Aᵢⱼ Bⱼₖ
The power of einsum is that it handles any tensor contraction in a single expression without intermediate transposes or reshapes:
'i,i->'- dot product (contract all)'i,j->ij'- outer product (no contraction)'bhid,bhjd->bhij'- batch multi-head attention scores'bsnd,nd->bsn'- batch matrix-vector product
Internally, np.einsum parses the index string, determines which indices are summed (repeated in inputs, absent in output) and which are free (in output), and generates the optimal sequence of BLAS calls to compute the result. With optimize=True, it uses dynamic programming to find the optimal contraction order for multi-tensor einsum chains.
Q2: What are the broadcasting rules in NumPy/PyTorch?
Broadcasting allows operations between arrays of different shapes by following these rules, applied to shapes aligned from the right:
- If arrays have different numbers of dimensions, prepend 1s to the shape of the smaller array
- Arrays of size 1 along a dimension are stretched to match the other array's size
- If sizes differ and neither is 1, the operation raises an error
Example: (32, 512) + (512,) → prepend: (1, 512) → stretch: (32, 512) Example: (5, 1, 3) + (1, 4, 3) → stretch both → (5, 4, 3)
Key keepdims rule: When reducing an array (e.g., mean, sum), always use keepdims=True if you intend to broadcast the result back against the original. Without keepdims, the result loses the axis, making the broadcast less explicit and potentially wrong in higher-dimensional cases.
X = np.random.randn(32, 512)
mean = X.mean(axis=0, keepdims=True) # (1, 512) - explicit
std = X.std(axis=0, keepdims=True) # (1, 512)
X_normalized = (X - mean) / (std + 1e-8) # (32,512): unambiguous broadcast
Broadcasting never copies data - it uses stride tricks to make arrays appear larger without allocating memory.
Q3: How is a convolutional layer a tensor contraction?
A 2D convolution with:
- Input: (batch N, in_channels C_in, height H, width W)
- Kernel: (out_channels C_out, in_channels C_in, kernel_height KH, kernel_width KW)
Computes:
Output[n, c_out, h, w] = Σ_{c_in} Σ_{kh} Σ_{kw}
Input[n, c_in, h + kh, w + kw] * Kernel[c_out, c_in, kh, kw]
This is a tensor contraction over three indices: c_in, kh, and kw. For each output spatial position (h, w), the convolution contracts over the entire local receptive field in the input.
The implementation strategy (im2col) converts this contraction into a single matrix multiplication:
- Extract all local patches from the input → shape (N·H_out·W_out, C_in·KH·KW)
- Reshape kernel → shape (C_in·KH·KW, C_out)
- Matrix multiply → shape (N·H_out·W_out, C_out)
- Reshape back to (N, C_out, H_out, W_out)
This is why convolutions and matrix multiplications use the same underlying BLAS/cuBLAS kernels - they are both tensor contractions.
Q4: Why is vectorization faster than Python loops?
Python loops have multiple sources of overhead compared to vectorized tensor operations:
- Interpreter overhead: each iteration interprets bytecode, checks types, dispatches function calls
- Object creation: each arithmetic operation creates a new Python object with reference counting
- No SIMD: the processor's SIMD units (AVX-512: 16 float32s per instruction) are unused
- No memory pipelining: cache prefetching and memory coalescing do not trigger for scalar access patterns
- No parallelism: a single Python thread, with the GIL blocking true multi-threading
When you call A @ B in NumPy:
- The call dispatches to BLAS (OpenBLAS / Intel MKL)
- BLAS uses SIMD instructions: one instruction multiplies 16 float32s simultaneously
- Computation is blocked for L1/L2/L3 cache efficiency
- Multiple CPU threads execute in parallel (BLAS releases the GIL)
- On GPU: thousands of CUDA cores execute in parallel via cuBLAS
The practical speedup is typically 100-1000x for medium-sized matrices. For a 1000×1000 matrix multiply:
- Python triple loop: several minutes
- NumPy BLAS: a few milliseconds
- GPU cuBLAS: under 1 millisecond
The rule: eliminate all Python loops over tensor elements. Replace them with einsum, broadcasting, or vectorized array operations. Profile with %timeit when in doubt.
Practice Challenges
Level 1: Predict
Challenge: What is the output shape of each einsum operation?
A = np.random.randn(3, 4, 5)
B = np.random.randn(4, 6)
C = np.random.randn(5, 6)
r1 = np.einsum('ijk,jl->ikl', A, B)
r2 = np.einsum('ijk,kl->ijl', A, C)
r3 = np.einsum('ijk,jl,kl->il', A, B, C)
Answer
-
'ijk,jl->ikl': A has (i=3, j=4, k=5), B has (j=4, l=6). j appears in both inputs but not output → sum over j. Free indices: i, k, l. Output: (3, 5, 6) -
'ijk,kl->ijl': A has (i=3, j=4, k=5), C has (k=5, l=6). k is contracted. Free: i, j, l. Output: (3, 4, 6) -
'ijk,jl,kl->il': j and k both appear in multiple inputs but not output → both summed. Free: i, l. This computes: Σⱼ Σₖ A_{ijk} B_{jl} C_{kl}. Output: (3, 6)
This third operation is equivalent to a 3-way tensor contraction, which would require several intermediate arrays without einsum.
Level 2: Debug
Challenge: The following attention implementation gives incorrect results. Find the bug:
def attention(Q, K, V):
d_k = Q.shape[-1]
scores = Q @ K.T # shape: (seq, seq)
weights = np.exp(scores) / np.exp(scores).sum()
return weights @ V
Answer
Three bugs:
Bug 1: Missing scaling. Scores should be divided by sqrt(d_k) before softmax:
scores = (Q @ K.T) / np.sqrt(d_k)
Without scaling, dot products grow large as d_k increases, pushing softmax toward one-hot distributions and causing vanishing gradients.
Bug 2: Softmax normalizes over wrong axis. sum() sums all elements of the 2D scores matrix. It should normalize over the key dimension (axis=-1) for each query independently:
exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True)) # stability
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
Bug 3: K.T does not work for batched or multi-head inputs. If Q, K have shape (batch, heads, seq, d_k), K.T transposes ALL dimensions, giving (d_k, seq, heads, batch). Use K.transpose(0,1,3,2) or einsum:
# For batched multi-head: use einsum
scores = np.einsum('bhid,bhjd->bhij', Q, K) / np.sqrt(d_k)
Full corrected version:
def attention_fixed(Q, K, V):
d_k = Q.shape[-1]
scores = (Q @ K.swapaxes(-1, -2)) / np.sqrt(d_k)
scores_max = scores.max(axis=-1, keepdims=True)
exp_scores = np.exp(scores - scores_max)
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
return weights @ V, weights
Level 3: Design
Challenge: Implement a convolutional layer using only np.einsum and np.lib.stride_tricks.as_strided (without scipy.signal.convolve2d or other helpers).
Answer
import numpy as np
def conv2d_einsum(
input: np.ndarray,
weight: np.ndarray,
bias: np.ndarray = None,
stride: int = 1,
padding: int = 0
) -> np.ndarray:
"""
2D convolution as a tensor contraction using einsum.
input: (N, C_in, H, W)
weight: (C_out, C_in, KH, KW)
Returns (N, C_out, H_out, W_out)
"""
N, C_in, H, W = input.shape
C_out, _, KH, KW = weight.shape
if padding > 0:
input = np.pad(input, ((0,0), (0,0), (padding,padding), (padding,padding)))
H += 2 * padding
W += 2 * padding
H_out = (H - KH) // stride + 1
W_out = (W - KW) // stride + 1
# im2col: extract all local patches
# Output shape of patches: (N, H_out, W_out, C_in, KH, KW)
strides = input.strides # (n_stride, c_stride, h_stride, w_stride)
patches = np.lib.stride_tricks.as_strided(
input,
shape=(N, H_out, W_out, C_in, KH, KW),
strides=(strides[0], strides[2] * stride, strides[3] * stride,
strides[1], strides[2], strides[3])
)
# Contraction: for each output position, contract over C_in, KH, KW
# patches: (N, H_out, W_out, C_in, KH, KW)
# weight: (C_out, C_in, KH, KW)
# output: (N, H_out, W_out, C_out) → transpose to (N, C_out, H_out, W_out)
output = np.einsum('nhwckl,ockl->nhwo', patches, weight)
output = np.transpose(output, (0, 3, 1, 2)) # → (N, C_out, H_out, W_out)
if bias is not None:
output = output + bias[np.newaxis, :, np.newaxis, np.newaxis]
return output
# Verify against scipy
from scipy.signal import correlate2d
N, C_in, H, W = 2, 3, 8, 8
C_out, KH, KW = 4, 3, 3
input_t = np.random.randn(N, C_in, H, W)
weight_t = np.random.randn(C_out, C_in, KH, KW)
output = conv2d_einsum(input_t, weight_t)
print(f"Output shape: {output.shape}") # (2, 4, 6, 6)
# Manually verify one element
h, w, co = 0, 0, 0
manual = 0.0
for ci in range(C_in):
for kh in range(KH):
for kw in range(KW):
manual += input_t[0, ci, h + kh, w + kw] * weight_t[co, ci, kh, kw]
print(f"output[0,0,0,0] = {output[0,0,0,0]:.6f}")
print(f"manual = {manual:.6f}")
print(f"Match: {np.isclose(output[0,0,0,0], manual)}") # True
Quick Reference Cheatsheet
| Operation | Math | einsum string | Notes |
|---|---|---|---|
| Dot product | u·v | 'i,i->' | Returns scalar |
| Outer product | uvᵀ | 'i,j->ij' | No contraction |
| Matrix-vector | Av | 'ij,j->i' | Contract j |
| Matrix multiply | AB | 'ij,jk->ik' | Contract j |
| Batch matmul | A[b]B[b] | 'bij,bjk->bik' | b is free |
| Transpose | Aᵀ | 'ij->ji' | No contraction |
| Trace | tr(A) | 'ii->' | Sum diagonal |
| Attention scores | QKᵀ | 'bhid,bhjd->bhij' | Contract d_head |
| Attention output | softmax(s)V | 'bhij,bhjd->bhid' | Contract seq |
Key Takeaways
- Tensors generalize scalars, vectors, and matrices to N dimensions - every deep learning computation is a tensor operation
- Each tensor axis has semantic meaning (batch, channel, spatial, head, sequence) - know your axes before writing operations
- Tensor contractions generalize matrix multiplication: sum over specified shared indices, keep free indices in output
- Einstein summation (
np.einsum) is the universal language for tensor contractions - learn it to write and read any DL operation compactly - Broadcasting rules: align from right, size-1 dimensions stretch - use
keepdims=Truewhen reducing to preserve axis semantics - Vectorization is 100-1000x faster than Python loops because BLAS and GPU kernels exploit SIMD parallelism and cache hierarchy
Next: Linear Algebra in NumPy - A Complete Engineering Reference →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Tensor Viewer demo on the EngineersOfAI Playground - no code required.
:::
