Tensors for Deep Learning - From Matrices to Multi-Dimensional Arrays

Reading time: ~24 minutes | Level: Mathematical Foundations → ML Engineering

A batch of 32 images, each 224×224 pixels with 3 color channels, is a 4-dimensional tensor of shape (32, 224, 224, 3). A transformer's attention weight matrix is a 4D tensor of shape (batch, heads, seq_len, seq_len). Everything in deep learning is tensor algebra.

Matrices are 2D arrays. Tensors generalize this to N dimensions. But tensors are not just "bigger matrices" - they come with a precise algebra of contractions, Einstein summation notation, and broadcasting rules that determine how every layer in every neural network performs its computation.

Understanding tensors means understanding exactly what a convolutional layer does (it is a tensor contraction), what scaled dot-product attention computes (it is a batch matrix multiply expressed with einsum), and why vectorized code runs hundreds of times faster than Python loops (it is SIMD and GPU parallelism over tensor operations).

What You Will Learn

Tensors as the generalization of scalars (0D), vectors (1D), matrices (2D) to N dimensions
How to read and reason about tensor shapes in deep learning
Tensor contractions: generalizing matrix multiplication to arbitrary axes
Einstein summation notation and np.einsum
Broadcasting: how NumPy and PyTorch extend operations across dimensions
Vectorization: why tensor ops are fast (SIMD, GPU parallelism)
NumPy: einsum, broadcasting, reshape vs view
PyTorch: tensor operations, gradient tracking, .view() vs .reshape()
ML: scaled dot-product attention from scratch using einsum

Prerequisites

Part 1 - Tensors as Generalization: 0D to N-D

The hierarchy

Name	Rank	Shape example	DL example
Scalar	0D	()	Loss value, learning rate
Vector	1D	(512,)	Embedding, bias vector
Matrix	2D	(512, 256)	Linear layer weights
3D tensor	3D	(32, 100, 512)	Batch of token sequences
4D tensor	4D	(32, 3, 224, 224)	Batch of images (NCHW)
5D tensor	5D	(32, 8, 100, 100, 64)	Multi-head attention scores

A tensor is simply a multi-dimensional array. Every scalar, vector, and matrix is a special case of a tensor. The number of dimensions is the tensor's rank (also called order or degree).

import numpy as np

# Scalar: rank 0
s = np.array(3.14)
print(f"Scalar: shape={s.shape}, ndim={s.ndim}")     # (), 0

# Vector: rank 1
v = np.array([1.0, 2.0, 3.0, 4.0])
print(f"Vector: shape={v.shape}, ndim={v.ndim}")     # (4,), 1

# Matrix: rank 2
M = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(f"Matrix: shape={M.shape}, ndim={M.ndim}")     # (3, 2), 2

# 3D tensor: batch of sequences
T3 = np.random.randn(32, 100, 512)   # (batch, seq_len, d_model)
print(f"3D tensor: shape={T3.shape}")

# 4D tensor: batch of images (NCHW format)
T4 = np.random.randn(32, 3, 224, 224)  # (batch, channels, height, width)
print(f"4D tensor: shape={T4.shape}")

# Number of elements
print(f"\n3D tensor elements: {T3.size:,}")      # 32 * 100 * 512 = 1,638,400
print(f"4D tensor elements: {T4.size:,}")        # 32 * 3 * 224 * 224 = 4,816,896
print(f"4D tensor memory (float32): {T4.nbytes / 1e6:.1f} MB")

# Indexing: select from each dimension
first_image = T4[0]           # shape (3, 224, 224) - first image in batch
red_channel = T4[:, 0, :, :]  # shape (32, 224, 224) - red channel of all images
top_left_3x3 = T4[:, :, :3, :3]  # shape (32, 3, 3, 3) - 3x3 corner of all images

Axes and their semantic meaning

In deep learning, each tensor axis has a semantic meaning. The most important conventions:

NCHW (PyTorch default for images):

N = batch size
C = channels (color / feature maps)
H = height (spatial)
W = width (spatial)

NHWC (TensorFlow default, also faster on some hardware):

N = batch size
H = height
W = width
C = channels

Transformer tensors:

(batch, seq_len, d_model)       - token embeddings
(batch, heads, seq_len, d_head) - multi-head attention
(batch, heads, seq_len, seq_len) - attention weights

Always know which axis is which before writing tensor operations - shape bugs are the most common source of incorrect deep learning code.

Part 2 - Tensor Shapes and How to Read Them

import numpy as np

# Reading tensor shapes in deep learning

# A single forward pass through a transformer layer:
batch_size = 4
seq_len = 64
d_model = 256
n_heads = 8
d_head = d_model // n_heads   # 32

# Input token embeddings
x = np.random.randn(batch_size, seq_len, d_model)
print(f"Input:              {x.shape}")   # (4, 64, 256)

# Q, K, V projections: linear layer applied to last dim
# W_Q: (d_model, d_model) → but multi-head splits it as (d_model, n_heads, d_head)
W_Q = np.random.randn(d_model, n_heads, d_head)  # (256, 8, 32)
W_K = np.random.randn(d_model, n_heads, d_head)
W_V = np.random.randn(d_model, n_heads, d_head)

# After projection + reshape: (batch, n_heads, seq_len, d_head)
Q = np.einsum('bsd,dnh->bnsh', x, W_Q)  # (4, 8, 64, 32)
K = np.einsum('bsd,dnh->bnsh', x, W_K)
V = np.einsum('bsd,dnh->bnsh', x, W_V)

print(f"Q shape:            {Q.shape}")   # (4, 8, 64, 32)
print(f"K shape:            {K.shape}")   # (4, 8, 64, 32)

# Attention scores: (batch, heads, seq_len, seq_len)
scores = np.einsum('bnid,bnjd->bnij', Q, K) / np.sqrt(d_head)
print(f"Attention scores:   {scores.shape}")  # (4, 8, 64, 64)

# Shape arithmetic: always verify
expected_params = d_model * d_model * 3   # Q, K, V projections
print(f"\nExpected QKV params: {expected_params:,}")  # 196,608
print(f"Actual QKV params: {(W_Q.size + W_K.size + W_V.size):,}")

# Common shape operations
T = np.random.randn(4, 64, 256)

# Reshape: change shape, preserve total elements
T_flat = T.reshape(4, -1)     # (4, 16384) - flatten seq and embedding dims
T_split = T.reshape(4, 64, 8, 32)  # (4, 64, 8, 32) - split embedding into heads

print(f"\nOriginal:   {T.shape}")    # (4, 64, 256)
print(f"Flattened:  {T_flat.shape}")  # (4, 16384)
print(f"Head-split: {T_split.shape}")  # (4, 64, 8, 32)

# Transpose axes
T_transposed = np.transpose(T_split, (0, 2, 1, 3))  # (batch, heads, seq, d_head)
print(f"Transposed: {T_transposed.shape}")  # (4, 8, 64, 32)

Part 3 - Tensor Contractions: Generalizing Matrix Multiplication

Matrix multiplication is a contraction over one pair of indices. Tensor contractions generalize this to contracting over any pair (or multiple pairs) of indices between two tensors.

From matrix multiply to general contraction

Matrix multiply: C = A @ B

Cᵢₖ = Σⱼ Aᵢⱼ Bⱼₖ   (sum over shared index j)

General tensor contraction: contract over specified shared axes.

Example 1: Dot product of vectors (contract over all indices)

result = Σᵢ uᵢ vᵢ

Example 2: Outer product (no contraction - all indices are free)

Cᵢⱼ = uᵢ vⱼ

Example 3: Batch matrix multiply (contract over last/second-to-last axes)

Cᵦᵢₖ = Σⱼ Aᵦᵢⱼ Bᵦⱼₖ   (b = batch dimension, not contracted)

Example 4: Convolution (contract over kernel spatial dimensions)

Outputₙ,ₒ,ₕ,w = Σ_ᵢ Σ_kh Σ_kw Input_{n,i,h+kh,w+kw} * Kernel_{o,i,kh,kw}

This is a contraction over input channels (i), kernel height (kh), and kernel width (kw).

import numpy as np

# Tensor contractions via np.einsum

# 1. Vector dot product
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
dot = np.einsum('i,i->', u, v)   # sum over all i
print(f"Dot product: {dot}")     # 1*4 + 2*5 + 3*6 = 32.0

# 2. Outer product
outer = np.einsum('i,j->ij', u, v)   # no summation - all indices free
print(f"Outer product shape: {outer.shape}")  # (3, 3)

# 3. Matrix multiply
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
C = np.einsum('ij,jk->ik', A, B)    # contract over j
print(f"MatMul shape: {C.shape}")   # (3, 5)
assert np.allclose(C, A @ B)

# 4. Batch matrix multiply
# Each element in the batch multiplies its own pair of matrices
Abatch = np.random.randn(8, 3, 4)   # 8 matrices of shape (3,4)
Bbatch = np.random.randn(8, 4, 5)   # 8 matrices of shape (4,5)
Cbatch = np.einsum('bij,bjk->bik', Abatch, Bbatch)  # b is free (not contracted)
print(f"Batch MatMul: {Cbatch.shape}")  # (8, 3, 5)
assert np.allclose(Cbatch, Abatch @ Bbatch)  # np.matmul handles batch dims

# 5. Multi-head attention score (the most important deep learning contraction)
batch, heads, seq, d_head = 4, 8, 64, 32
Q = np.random.randn(batch, heads, seq, d_head)
K = np.random.randn(batch, heads, seq, d_head)
# Compute Q @ K.T for all batches and heads simultaneously
# "for each batch b and head h, compute Q[b,h] @ K[b,h].T"
scores = np.einsum('bhid,bhjd->bhij', Q, K)   # contract over d_head (d)
print(f"Attention scores: {scores.shape}")     # (4, 8, 64, 64)
# Each score[b,h,i,j] = dot product of Q[b,h,i,:] and K[b,h,j,:]

# 6. Convolutional contraction (simplified 1D)
# Input: (batch, in_channels, length)
# Kernel: (out_channels, in_channels, kernel_size)
# Output: (batch, out_channels, length - kernel_size + 1)
batch, in_c, length = 2, 3, 10
out_c, k_size = 4, 3
inp = np.random.randn(batch, in_c, length)
kernel = np.random.randn(out_c, in_c, k_size)

# Naive convolution via einsum (pedagogical - not efficient in practice)
output = np.zeros((batch, out_c, length - k_size + 1))
for pos in range(length - k_size + 1):
    patch = inp[:, :, pos:pos + k_size]  # (batch, in_c, k_size)
    output[:, :, pos] = np.einsum('bik,oik->bo', patch, kernel)

print(f"Conv1D output: {output.shape}")   # (2, 4, 8)

Part 4 - Einstein Summation Notation

Einstein summation (einsum) is a compact notation for all tensor contractions. It was invented by Einstein for general relativity - and then adopted wholesale by the deep learning community.

The rule

np.einsum('index_string', tensor1, tensor2)

The index string has the form: 'input1_indices,input2_indices->output_indices'

Repeated indices with no arrow: summed over (contracted)
Indices that appear in inputs but not output: summed over
Indices that appear in output: kept (free indices)

import numpy as np

# Einstein summation reference

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
v = np.random.randn(4)
u = np.random.randn(3)

# Matrix-vector multiply: Av
print(np.einsum('ij,j->i', A, v).shape)        # (3,) - j is contracted

# Matrix-matrix multiply: AB
print(np.einsum('ij,jk->ik', A, B).shape)      # (3,5) - j contracted

# Element-wise multiply: A * A (Hadamard)
print(np.einsum('ij,ij->ij', A, A).shape)      # (3,4) - all indices free

# Trace: sum of diagonal
sq = np.random.randn(4, 4)
trace = np.einsum('ii->', sq)
print(f"Trace: {trace:.4f}")
assert np.isclose(trace, np.trace(sq))

# Outer product: u ⊗ v
outer = np.einsum('i,j->ij', u, v)
print(f"Outer product: {outer.shape}")         # (3, 4)

# Batch outer product: for each batch element, compute outer product
batch_u = np.random.randn(8, 3)   # 8 vectors of dim 3
batch_v = np.random.randn(8, 5)   # 8 vectors of dim 5
batch_outer = np.einsum('bi,bj->bij', batch_u, batch_v)
print(f"Batch outer: {batch_outer.shape}")    # (8, 3, 5)

# Transpose
AT = np.einsum('ij->ji', A)
print(f"Transpose: {AT.shape}")               # (4, 3)
assert np.allclose(AT, A.T)

# Diagonal extraction
diag = np.einsum('ii->i', sq)
assert np.allclose(diag, np.diag(sq))

# Frobenius norm squared = sum of squared entries
frob_sq = np.einsum('ij,ij->', A, A)
assert np.isclose(frob_sq, np.linalg.norm(A, 'fro')**2)

# Quadratic form: uᵀ A u
M = np.random.randn(3, 3)
M = M.T @ M   # make symmetric
u3 = np.random.randn(3)
quad = np.einsum('i,ij,j->', u3, M, u3)
assert np.isclose(quad, u3 @ M @ u3)

print("\nEinsum covers all standard linear algebra operations.")

Performance: einsum is optimized

Modern implementations of np.einsum (and torch.einsum) use optimized BLAS/cuBLAS kernels under the hood. When the optimize=True argument is passed, NumPy finds the optimal contraction order (which can be exponentially faster for multi-tensor einsum chains).

import numpy as np
import time

# Multi-tensor einsum: contraction order matters
A = np.random.randn(100, 200)
B = np.random.randn(200, 300)
C = np.random.randn(300, 50)

# Without optimization: may pick a suboptimal contraction order
t0 = time.perf_counter()
for _ in range(100):
    result_naive = np.einsum('ij,jk,kl->il', A, B, C)
t1 = time.perf_counter()
print(f"Naive einsum:     {(t1-t0)*10:.2f} ms")

# With optimization: NumPy finds the best contraction path
t0 = time.perf_counter()
for _ in range(100):
    result_opt = np.einsum('ij,jk,kl->il', A, B, C, optimize=True)
t1 = time.perf_counter()
print(f"Optimized einsum: {(t1-t0)*10:.2f} ms")

assert np.allclose(result_naive, result_opt)
# Optimized is often 2-10x faster for multi-tensor contractions

Part 5 - Broadcasting: Extending Operations Across Dimensions

Broadcasting allows NumPy and PyTorch to perform operations between tensors of different shapes - without copying data.

The broadcasting rules

When operating on two tensors, NumPy aligns shapes from the rightmost dimension and applies these rules:

Same size: dimensions match - proceed normally
One is size 1: the size-1 dimension is "stretched" to match the other
One is absent: the missing dimension is treated as size 1 and stretched

If any dimension is neither equal nor 1, the operation fails.

import numpy as np

# Rule demonstration: align from right
A = np.ones((3, 4))    # shape (3, 4)
b = np.ones((4,))      # shape    (4,) - treated as (1, 4)
print(f"A + b: {(A + b).shape}")   # (3, 4) - b broadcast along axis 0

B = np.ones((3, 1))    # shape (3, 1)
print(f"A + B: {(A + B).shape}")   # (3, 4) - B broadcast along axis 1

# Both broadcast simultaneously
x = np.ones((5, 1, 3))  # shape (5, 1, 3)
y = np.ones((1, 4, 3))  # shape (1, 4, 3)
print(f"x + y: {(x + y).shape}")   # (5, 4, 3)
# Rule: (5,1,3) + (1,4,3) → (5,4,3)

# Broadcasting failure
try:
    bad = np.ones((3, 4)) + np.ones((3, 5))
except ValueError as e:
    print(f"Expected error: {e}")

# ── ML examples of broadcasting ──────────────────────────────────────────

# 1. Batch bias addition in linear layer
# W: (d_in, d_out) = (512, 256), bias: (256,)
W = np.random.randn(512, 256)
bias = np.random.randn(256)

# Input batch: (batch, d_in) = (32, 512)
x_batch = np.random.randn(32, 512)
output = x_batch @ W + bias   # (32, 256) + (256,) → broadcasts to (32, 256)
print(f"\nLinear layer output: {output.shape}")  # (32, 256)

# 2. Scaling attention scores by sqrt(d_head)
d_head = 64
scores = np.random.randn(4, 8, 100, 100)  # (batch, heads, seq, seq)
scores_scaled = scores / np.sqrt(d_head)   # scalar broadcast to all dims
print(f"Scaled scores: {scores_scaled.shape}")  # (4, 8, 100, 100)

# 3. Subtracting mean per feature (batch normalization step)
X = np.random.randn(32, 512)      # (batch, features)
mean = X.mean(axis=0, keepdims=True)  # (1, 512) - keepdims ensures (1,512) not (512,)
std  = X.std(axis=0, keepdims=True)   # (1, 512)
X_norm = (X - mean) / (std + 1e-8)   # (32,512) - (1,512) → broadcasts
print(f"Normalized: {X_norm.shape}")  # (32, 512)

# 4. Pairwise Euclidean distances (the NumPy broadcasting trick)
A = np.random.randn(100, 512)  # 100 query embeddings
B = np.random.randn(200, 512)  # 200 document embeddings

# Expand dims to enable broadcasting:
# A[:,np.newaxis,:] has shape (100, 1, 512)
# B[np.newaxis,:,:] has shape (1, 200, 512)
# Difference: (100, 200, 512)
diff = A[:, np.newaxis, :] - B[np.newaxis, :, :]   # (100, 200, 512)
dists = np.sqrt(np.sum(diff**2, axis=-1))           # (100, 200)
print(f"Pairwise distances: {dists.shape}")  # (100, 200)
# dists[i,j] = Euclidean distance between A[i] and B[j]

keepdims: the common broadcasting bug

import numpy as np

X = np.random.randn(32, 512)

# WRONG: mean has shape (512,) - not (1,512)
mean_wrong = X.mean(axis=0)           # shape: (512,)
print(f"Without keepdims: {mean_wrong.shape}")

# Still works due to broadcasting, but fragile for chaining
X_centered_wrong = X - mean_wrong    # (32,512) - (512,) → works, but shape of mean is ambiguous

# RIGHT: keepdims=True preserves axis structure
mean_right = X.mean(axis=0, keepdims=True)  # shape: (1, 512)
print(f"With keepdims:    {mean_right.shape}")

# Now the subtraction intent is unambiguous:
X_centered_right = X - mean_right           # (32,512) - (1,512) → explicit broadcast

Part 6 - Vectorization: Why Tensor Ops Are Fast

The problem with Python loops

Python is an interpreted language. Every line of a Python for loop requires:

Bytecode interpretation overhead
Python object creation for each intermediate result
No opportunity for SIMD (Single Instruction, Multiple Data) vectorization
No GPU parallelism

Why tensor operations are fast

When you call A @ B in NumPy, it dispatches to BLAS (Basic Linear Algebra Subprograms), a highly optimized C/Fortran library that:

Uses SIMD instructions (AVX-512 on modern CPUs processes 16 float32s simultaneously)
Exploits cache hierarchy (blocks computations to maximize L1/L2/L3 cache hits)
Can use multiple CPU cores in parallel (via OpenBLAS or Intel MKL)

On GPU (via PyTorch CUDA), the same operation uses thousands of parallel cores in cuBLAS, achieving throughput that is orders of magnitude higher.

import numpy as np
import time

n = 1000
A = np.random.randn(n, n)
B = np.random.randn(n, n)

# Python loop: O(n³) operations, all interpreted
def matmul_python_loop(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    n, m = A.shape[0], B.shape[1]
    k = A.shape[1]
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for l in range(k):
                C[i, j] += A[i, l] * B[l, j]
    return C

# NumPy vectorized (BLAS)
t0 = time.perf_counter()
C_numpy = A @ B
t_numpy = time.perf_counter() - t0
print(f"NumPy @ (BLAS):      {t_numpy*1000:.2f} ms")

# Partially vectorized (inner loop is vectorized, outer is not)
def matmul_partial(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            C[i, j] = A[i, :] @ B[:, j]  # vectorized inner product
    return C

t0 = time.perf_counter()
C_partial = matmul_partial(A[:50, :50], B[:50, :50])  # smaller for timing
t_partial = time.perf_counter() - t0
print(f"Partial vectorized:  {t_partial*1000:.2f} ms (on 50x50 submatrix)")

# Speedup comparison
print(f"\nNumPy BLAS is approximately {50**3 / (t_numpy * 1e9 / n**3):.0f}x faster")
print("(SIMD + cache optimization + multi-threading vs interpreted Python)")

# GPU speedup (if torch is available)
try:
    import torch
    A_t = torch.randn(n, n)
    B_t = torch.randn(n, n)

    # CPU PyTorch
    t0 = time.perf_counter()
    C_cpu = A_t @ B_t
    t_cpu = time.perf_counter() - t0
    print(f"\nPyTorch CPU: {t_cpu*1000:.2f} ms")

    if torch.cuda.is_available():
        A_gpu = A_t.cuda()
        B_gpu = B_t.cuda()
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        C_gpu = A_gpu @ B_gpu
        torch.cuda.synchronize()
        t_gpu = time.perf_counter() - t0
        print(f"PyTorch GPU: {t_gpu*1000:.2f} ms")
        print(f"GPU speedup: {t_cpu/t_gpu:.1f}x")
except ImportError:
    print("PyTorch not available")

Memory layout and cache performance

NumPy arrays are stored in contiguous memory (unlike Python lists, which store pointers). Two layouts exist:

C-order (row-major): row elements are contiguous - A[0,0], A[0,1], A[0,2], ...
Fortran-order (column-major): column elements are contiguous - A[0,0], A[1,0], A[2,0], ...

import numpy as np

# C-order: iterating over rows is cache-friendly
A_C = np.ascontiguousarray(np.random.randn(1000, 1000))  # C-order
A_F = np.asfortranarray(A_C)  # Fortran-order

print(f"C-order flags: {A_C.flags['C_CONTIGUOUS']}, {A_C.flags['F_CONTIGUOUS']}")
print(f"F-order flags: {A_F.flags['C_CONTIGUOUS']}, {A_F.flags['F_CONTIGUOUS']}")

# Row sum: fast for C-order (rows are contiguous)
import time
n_trials = 100

t0 = time.perf_counter()
for _ in range(n_trials):
    r_C = A_C.sum(axis=1)
t1 = time.perf_counter()
print(f"Row sum, C-order: {(t1-t0)/n_trials*1000:.3f} ms")

t0 = time.perf_counter()
for _ in range(n_trials):
    r_F = A_F.sum(axis=1)
t1 = time.perf_counter()
print(f"Row sum, F-order: {(t1-t0)/n_trials*1000:.3f} ms")
# C-order should be faster for row sums (rows are contiguous in memory)

Part 7 - NumPy: einsum, Broadcasting, reshape vs view

reshape vs view: the critical difference

import numpy as np

A = np.random.randn(4, 6)

# reshape: changes the shape, may or may not copy data
B = A.reshape(8, 3)   # same total elements: 4*6 = 8*3 = 24
print(f"reshape: {B.shape}")
print(f"shares memory? {np.shares_memory(A, B)}")  # Usually True for contiguous arrays

# A view shares the same underlying data - modifying B modifies A
B[0, 0] = 999.0
print(f"A[0,0] = {A[0,0]}")  # also 999.0 if they share memory

# Transpose creates a non-contiguous view
A_T = A.T
print(f"\nA.T contiguous? {A_T.flags['C_CONTIGUOUS']}")  # False
# Calling reshape on a non-contiguous array may force a copy
A_T_reshaped = A_T.reshape(-1)  # may copy

# -1 in reshape: infer that dimension automatically
A_flat = A.reshape(-1)           # shape (24,) - flatten everything
A_batch = A.reshape(2, -1, 6)   # shape (2, 2, 6) - infer middle dim
print(f"\nFlattened: {A_flat.shape}")     # (24,)
print(f"Batched:   {A_batch.shape}")     # (2, 2, 6)

# Transpose / permute: reorder axes
T = np.random.randn(4, 8, 64, 32)  # (batch, heads, seq, d_head)

# Permute: heads before seq
T_perm = np.transpose(T, (0, 2, 1, 3))  # (batch, seq, heads, d_head)
print(f"\nPermuted: {T_perm.shape}")     # (4, 64, 8, 32)

# np.moveaxis: more readable than np.transpose
T_moved = np.moveaxis(T, 1, 2)   # move axis 1 (heads) to position 2
print(f"Moved: {T_moved.shape}")  # (4, 64, 8, 32)

# ── np.einsum complete examples ───────────────────────────────────────────
print("\n=== einsum reference ===")
a = np.random.randn(3)
b = np.random.randn(4)
M = np.random.randn(3, 4)
N = np.random.randn(4, 5)

operations = {
    "dot product":           ('i,i->',   a, a),
    "outer product":         ('i,j->ij', a, b),
    "matrix-vector":         ('ij,j->i', M, b),
    "matrix multiply":       ('ij,jk->ik', M, N),
    "transpose":             ('ij->ji', M, None),
    "trace":                 ('ii->',  np.random.randn(4,4), None),
    "element-wise":          ('ij,ij->ij', M, M.copy()),
    "sum all":               ('ij->',  M, None),
    "row sums":              ('ij->i', M, None),
    "col sums":              ('ij->j', M, None),
}

for name, (idx, x, y) in operations.items():
    if y is None:
        result = np.einsum(idx, x)
    else:
        result = np.einsum(idx, x, y)
    shape = result.shape if hasattr(result, 'shape') else ()
    print(f"  {name:<22}: einsum('{idx}') → {shape}")

Part 8 - PyTorch: Tensors, Gradients, and GPU Operations

# PyTorch tensor operations - GPU-accelerated, gradient-tracked

try:
    import torch
    import numpy as np

    # ── Creating tensors ──────────────────────────────────────────────────
    # From Python list
    t1 = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
    print(f"dtype: {t1.dtype}")          # torch.float32

    # From NumPy (zero-copy when on CPU)
    np_arr = np.random.randn(3, 4)
    t_from_np = torch.from_numpy(np_arr)     # shares memory!
    np_from_t = t1.numpy()                   # shares memory! (CPU only)

    # Initialization patterns
    zeros = torch.zeros(3, 4)
    ones  = torch.ones(3, 4)
    rand  = torch.randn(3, 4)                # standard normal
    eye   = torch.eye(4)                     # identity matrix

    # ── Device handling ───────────────────────────────────────────────────
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    t_gpu = torch.randn(3, 4, device=device)
    print(f"Device: {t_gpu.device}")

    # ── Gradient tracking ─────────────────────────────────────────────────
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    y = (x ** 2).sum()   # scalar output
    y.backward()          # compute gradients
    print(f"x.grad = {x.grad}")  # [2x₁, 2x₂, 2x₃] = [2.0, 4.0, 6.0]

    # Gradient does not flow through numpy() - stop gradient explicitly
    x_detached = x.detach().numpy()   # detach from computation graph first

    # ── view vs reshape ───────────────────────────────────────────────────
    T = torch.randn(4, 6)

    # view: zero-copy, requires contiguous memory
    T_view = T.view(8, 3)          # shares memory
    print(f"view:    {T_view.shape}")

    # reshape: tries view first, copies if needed
    T_reshaped = T.reshape(8, 3)   # may or may not copy

    # After transpose, view fails - use contiguous() first
    T_t = T.t()
    try:
        T_t_view = T_t.view(-1)    # RuntimeError: not contiguous
    except RuntimeError as e:
        print(f"Expected: {e}")
        T_t_view = T_t.contiguous().view(-1)  # fix: make contiguous first
        print(f"After contiguous().view: {T_t_view.shape}")

    # ── PyTorch einsum ────────────────────────────────────────────────────
    A = torch.randn(4, 8, 64, 32)   # (batch, heads, seq, d_head)
    B = torch.randn(4, 8, 64, 32)

    # Batch matrix multiply across heads: Q @ K.T
    scores = torch.einsum('bhid,bhjd->bhij', A, B)
    print(f"\nAttention scores: {scores.shape}")  # (4, 8, 64, 64)

    # torch.linalg - GPU-accelerated linear algebra
    M = torch.randn(10, 8)
    U, S, Vh = torch.linalg.svd(M, full_matrices=False)
    print(f"SVD: U={U.shape}, S={S.shape}, Vh={Vh.shape}")

    vals = torch.linalg.eigvalsh(M.T @ M)
    print(f"Eigenvalues of MᵀM: {vals.shape}")

    print("\nPyTorch import successful.")

except ImportError:
    print("PyTorch not installed. Install with: pip install torch")

Part 9 - ML: Scaled Dot-Product Attention with einsum

Scaled dot-product attention is the core operation in every transformer. It computes:

Attention(Q, K, V) = softmax(QKᵀ / sqrt(d_k)) V

Where:

Q ∈ ℝ^(batch × heads × seq_len × d_k) - queries
K ∈ ℝ^(batch × heads × seq_len × d_k) - keys
V ∈ ℝ^(batch × heads × seq_len × d_v) - values
Output ∈ ℝ^(batch × heads × seq_len × d_v)

import numpy as np

def scaled_dot_product_attention(
    Q: np.ndarray,
    K: np.ndarray,
    V: np.ndarray,
    mask: np.ndarray = None
) -> tuple:
    """
    Scaled dot-product attention from scratch using einsum.

    Q: (batch, heads, seq_len, d_k)
    K: (batch, heads, seq_len, d_k)
    V: (batch, heads, seq_len, d_v)

    Returns:
    - output: (batch, heads, seq_len, d_v) - attended values
    - weights: (batch, heads, seq_len, seq_len) - attention weights
    """
    d_k = Q.shape[-1]

    # Step 1: Compute raw attention scores
    # "for each batch b and head h, compute dot product of every query with every key"
    # Q[b,h,i,:] · K[b,h,j,:] = score of query i attending to key j
    scores = np.einsum('bhid,bhjd->bhij', Q, K)  # (batch, heads, seq, seq)

    # Step 2: Scale by sqrt(d_k)
    # Without scaling, dot products grow large in magnitude as d_k increases,
    # pushing softmax into saturation (near-zero gradients)
    scores = scores / np.sqrt(d_k)

    # Step 3: Optional causal mask (for autoregressive models)
    # Prevent position i from attending to positions j > i (future tokens)
    if mask is not None:
        scores = scores + mask   # mask contains 0 for allowed, -inf for forbidden

    # Step 4: Softmax over the key dimension (last axis)
    # Converts scores to probabilities (sum to 1 over all keys for each query)
    scores_max = scores.max(axis=-1, keepdims=True)  # for numerical stability
    exp_scores = np.exp(scores - scores_max)
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)  # (batch,h,sq,sq)

    # Step 5: Weighted sum of values
    # "for each query position i, compute weighted average of all value vectors"
    # weights[b,h,i,j] * V[b,h,j,d] summed over j
    output = np.einsum('bhij,bhjd->bhid', weights, V)  # (batch, heads, seq, d_v)

    return output, weights

# Multi-head attention: full implementation
def multi_head_attention(
    x: np.ndarray,
    W_Q: np.ndarray,
    W_K: np.ndarray,
    W_V: np.ndarray,
    W_O: np.ndarray,
    n_heads: int,
    causal: bool = False
) -> np.ndarray:
    """
    Full multi-head attention.

    x:   (batch, seq_len, d_model)
    W_Q: (d_model, d_model)
    W_K: (d_model, d_model)
    W_V: (d_model, d_model)
    W_O: (d_model, d_model)  - output projection
    """
    batch, seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Linear projections
    Q = x @ W_Q   # (batch, seq, d_model)
    K = x @ W_K
    V = x @ W_V

    # Reshape to (batch, n_heads, seq_len, d_head)
    def split_heads(T: np.ndarray) -> np.ndarray:
        # T: (batch, seq, d_model) → (batch, seq, n_heads, d_head)
        T = T.reshape(batch, seq_len, n_heads, d_head)
        # → (batch, n_heads, seq, d_head)
        return np.transpose(T, (0, 2, 1, 3))

    Q = split_heads(Q)
    K = split_heads(K)
    V = split_heads(V)

    # Causal mask: upper triangular, -inf above diagonal
    mask = None
    if causal:
        causal_mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
        mask = causal_mask[np.newaxis, np.newaxis, :, :]  # broadcast over batch, heads

    # Scaled dot-product attention
    attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)

    # Merge heads: (batch, n_heads, seq, d_head) → (batch, seq, d_model)
    attn_output = np.transpose(attn_output, (0, 2, 1, 3))          # (batch, seq, h, d_head)
    attn_output = attn_output.reshape(batch, seq_len, d_model)      # (batch, seq, d_model)

    # Output projection
    output = attn_output @ W_O   # (batch, seq, d_model)

    return output, attn_weights

# Test the full implementation
np.random.seed(42)
batch_size = 4
seq_len    = 16
d_model    = 64
n_heads    = 4

x = np.random.randn(batch_size, seq_len, d_model)

# Initialize weights (in practice: Xavier / Kaiming initialization)
W_Q = np.random.randn(d_model, d_model) * 0.02
W_K = np.random.randn(d_model, d_model) * 0.02
W_V = np.random.randn(d_model, d_model) * 0.02
W_O = np.random.randn(d_model, d_model) * 0.02

output, weights = multi_head_attention(x, W_Q, W_K, W_V, W_O, n_heads, causal=True)

print(f"Input shape:           {x.shape}")           # (4, 16, 64)
print(f"Output shape:          {output.shape}")       # (4, 16, 64)
print(f"Attention weights:     {weights.shape}")      # (4, 4, 16, 16)
print(f"Weights sum to 1:      {np.allclose(weights.sum(axis=-1), 1.0)}")  # True

# Verify causal: no position attends to future positions
# weights[b, h, i, j] should be 0 for j > i
causal_check = weights[0, 0]  # first batch, first head
future_weight = causal_check[np.triu_indices(seq_len, k=1)]
print(f"Future attention weights (should be ~0): max = {future_weight.max():.2e}")

Interview Questions

Q1: What is Einstein summation notation and how does np.einsum work?

Einstein summation notation is a compact way to write tensor operations. The key convention: any index that appears twice in an expression (once in each operand) is summed over. Any index that appears only once is "free" - it appears in the output.

np.einsum('ij,jk->ik', A, B) means:

A has indices i, j
B has indices j, k
j appears in both inputs but not in output → sum over j
i and k appear in output → free indices

This is equivalent to matrix multiplication: Cᵢₖ = Σⱼ Aᵢⱼ Bⱼₖ

The power of einsum is that it handles any tensor contraction in a single expression without intermediate transposes or reshapes:

'i,i->' - dot product (contract all)
'i,j->ij' - outer product (no contraction)
'bhid,bhjd->bhij' - batch multi-head attention scores
'bsnd,nd->bsn' - batch matrix-vector product

Internally, np.einsum parses the index string, determines which indices are summed (repeated in inputs, absent in output) and which are free (in output), and generates the optimal sequence of BLAS calls to compute the result. With optimize=True, it uses dynamic programming to find the optimal contraction order for multi-tensor einsum chains.

Q2: What are the broadcasting rules in NumPy/PyTorch?

Broadcasting allows operations between arrays of different shapes by following these rules, applied to shapes aligned from the right:

If arrays have different numbers of dimensions, prepend 1s to the shape of the smaller array
Arrays of size 1 along a dimension are stretched to match the other array's size
If sizes differ and neither is 1, the operation raises an error

Example: (32, 512) + (512,) → prepend: (1, 512) → stretch: (32, 512) Example: (5, 1, 3) + (1, 4, 3) → stretch both → (5, 4, 3)

Key keepdims rule: When reducing an array (e.g., mean, sum), always use keepdims=True if you intend to broadcast the result back against the original. Without keepdims, the result loses the axis, making the broadcast less explicit and potentially wrong in higher-dimensional cases.

X = np.random.randn(32, 512)
mean = X.mean(axis=0, keepdims=True)  # (1, 512) - explicit
std  = X.std(axis=0, keepdims=True)   # (1, 512)
X_normalized = (X - mean) / (std + 1e-8)  # (32,512): unambiguous broadcast

Broadcasting never copies data - it uses stride tricks to make arrays appear larger without allocating memory.

Q3: How is a convolutional layer a tensor contraction?

A 2D convolution with:

Input: (batch N, in_channels C_in, height H, width W)
Kernel: (out_channels C_out, in_channels C_in, kernel_height KH, kernel_width KW)

Computes:

Output[n, c_out, h, w] = Σ_{c_in} Σ_{kh} Σ_{kw}
    Input[n, c_in, h + kh, w + kw] * Kernel[c_out, c_in, kh, kw]

This is a tensor contraction over three indices: c_in, kh, and kw. For each output spatial position (h, w), the convolution contracts over the entire local receptive field in the input.

The implementation strategy (im2col) converts this contraction into a single matrix multiplication:

Extract all local patches from the input → shape (N·H_out·W_out, C_in·KH·KW)
Reshape kernel → shape (C_in·KH·KW, C_out)
Matrix multiply → shape (N·H_out·W_out, C_out)
Reshape back to (N, C_out, H_out, W_out)

This is why convolutions and matrix multiplications use the same underlying BLAS/cuBLAS kernels - they are both tensor contractions.

Q4: Why is vectorization faster than Python loops?

Python loops have multiple sources of overhead compared to vectorized tensor operations:

Interpreter overhead: each iteration interprets bytecode, checks types, dispatches function calls
Object creation: each arithmetic operation creates a new Python object with reference counting
No SIMD: the processor's SIMD units (AVX-512: 16 float32s per instruction) are unused
No memory pipelining: cache prefetching and memory coalescing do not trigger for scalar access patterns
No parallelism: a single Python thread, with the GIL blocking true multi-threading

When you call A @ B in NumPy:

The call dispatches to BLAS (OpenBLAS / Intel MKL)
BLAS uses SIMD instructions: one instruction multiplies 16 float32s simultaneously
Computation is blocked for L1/L2/L3 cache efficiency
Multiple CPU threads execute in parallel (BLAS releases the GIL)
On GPU: thousands of CUDA cores execute in parallel via cuBLAS

The practical speedup is typically 100-1000x for medium-sized matrices. For a 1000×1000 matrix multiply:

Python triple loop: several minutes
NumPy BLAS: a few milliseconds
GPU cuBLAS: under 1 millisecond

The rule: eliminate all Python loops over tensor elements. Replace them with einsum, broadcasting, or vectorized array operations. Profile with %timeit when in doubt.

Practice Challenges

Level 1: Predict

Challenge: What is the output shape of each einsum operation?

A = np.random.randn(3, 4, 5)
B = np.random.randn(4, 6)
C = np.random.randn(5, 6)

r1 = np.einsum('ijk,jl->ikl', A, B)
r2 = np.einsum('ijk,kl->ijl', A, C)
r3 = np.einsum('ijk,jl,kl->il', A, B, C)

Answer

'ijk,jl->ikl': A has (i=3, j=4, k=5), B has (j=4, l=6). j appears in both inputs but not output → sum over j. Free indices: i, k, l. Output: (3, 5, 6)
'ijk,kl->ijl': A has (i=3, j=4, k=5), C has (k=5, l=6). k is contracted. Free: i, j, l. Output: (3, 4, 6)
'ijk,jl,kl->il': j and k both appear in multiple inputs but not output → both summed. Free: i, l. This computes: Σⱼ Σₖ A_{ijk} B_{jl} C_{kl}. Output: (3, 6)

This third operation is equivalent to a 3-way tensor contraction, which would require several intermediate arrays without einsum.

Level 2: Debug

Challenge: The following attention implementation gives incorrect results. Find the bug:

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T  # shape: (seq, seq)
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ V

Answer

Three bugs:

Bug 1: Missing scaling. Scores should be divided by sqrt(d_k) before softmax:

scores = (Q @ K.T) / np.sqrt(d_k)

Without scaling, dot products grow large as d_k increases, pushing softmax toward one-hot distributions and causing vanishing gradients.

Bug 2: Softmax normalizes over wrong axis. sum() sums all elements of the 2D scores matrix. It should normalize over the key dimension (axis=-1) for each query independently:

exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stability
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

Bug 3: K.T does not work for batched or multi-head inputs. If Q, K have shape (batch, heads, seq, d_k), K.T transposes ALL dimensions, giving (d_k, seq, heads, batch). Use K.transpose(0,1,3,2) or einsum:

# For batched multi-head: use einsum
scores = np.einsum('bhid,bhjd->bhij', Q, K) / np.sqrt(d_k)

Full corrected version:

def attention_fixed(Q, K, V):
    d_k = Q.shape[-1]
    scores = (Q @ K.swapaxes(-1, -2)) / np.sqrt(d_k)
    scores_max = scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores - scores_max)
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    return weights @ V, weights

Level 3: Design

Challenge: Implement a convolutional layer using only np.einsum and np.lib.stride_tricks.as_strided (without scipy.signal.convolve2d or other helpers).

Answer

import numpy as np

def conv2d_einsum(
    input: np.ndarray,
    weight: np.ndarray,
    bias: np.ndarray = None,
    stride: int = 1,
    padding: int = 0
) -> np.ndarray:
    """
    2D convolution as a tensor contraction using einsum.

    input:  (N, C_in, H, W)
    weight: (C_out, C_in, KH, KW)
    Returns (N, C_out, H_out, W_out)
    """
    N, C_in, H, W = input.shape
    C_out, _, KH, KW = weight.shape

    if padding > 0:
        input = np.pad(input, ((0,0), (0,0), (padding,padding), (padding,padding)))
        H += 2 * padding
        W += 2 * padding

    H_out = (H - KH) // stride + 1
    W_out = (W - KW) // stride + 1

    # im2col: extract all local patches
    # Output shape of patches: (N, H_out, W_out, C_in, KH, KW)
    strides = input.strides   # (n_stride, c_stride, h_stride, w_stride)
    patches = np.lib.stride_tricks.as_strided(
        input,
        shape=(N, H_out, W_out, C_in, KH, KW),
        strides=(strides[0], strides[2] * stride, strides[3] * stride,
                 strides[1], strides[2], strides[3])
    )

    # Contraction: for each output position, contract over C_in, KH, KW
    # patches: (N, H_out, W_out, C_in, KH, KW)
    # weight:  (C_out, C_in, KH, KW)
    # output:  (N, H_out, W_out, C_out) → transpose to (N, C_out, H_out, W_out)
    output = np.einsum('nhwckl,ockl->nhwo', patches, weight)
    output = np.transpose(output, (0, 3, 1, 2))  # → (N, C_out, H_out, W_out)

    if bias is not None:
        output = output + bias[np.newaxis, :, np.newaxis, np.newaxis]

    return output

# Verify against scipy
from scipy.signal import correlate2d

N, C_in, H, W = 2, 3, 8, 8
C_out, KH, KW = 4, 3, 3
input_t = np.random.randn(N, C_in, H, W)
weight_t = np.random.randn(C_out, C_in, KH, KW)

output = conv2d_einsum(input_t, weight_t)
print(f"Output shape: {output.shape}")  # (2, 4, 6, 6)

# Manually verify one element
h, w, co = 0, 0, 0
manual = 0.0
for ci in range(C_in):
    for kh in range(KH):
        for kw in range(KW):
            manual += input_t[0, ci, h + kh, w + kw] * weight_t[co, ci, kh, kw]
print(f"output[0,0,0,0] = {output[0,0,0,0]:.6f}")
print(f"manual          = {manual:.6f}")
print(f"Match: {np.isclose(output[0,0,0,0], manual)}")  # True

Quick Reference Cheatsheet

Operation	Math	einsum string	Notes
Dot product	u·v	`'i,i->'`	Returns scalar
Outer product	uvᵀ	`'i,j->ij'`	No contraction
Matrix-vector	Av	`'ij,j->i'`	Contract j
Matrix multiply	AB	`'ij,jk->ik'`	Contract j
Batch matmul	A[b]B[b]	`'bij,bjk->bik'`	b is free
Transpose	Aᵀ	`'ij->ji'`	No contraction
Trace	tr(A)	`'ii->'`	Sum diagonal
Attention scores	QKᵀ	`'bhid,bhjd->bhij'`	Contract d_head
Attention output	softmax(s)V	`'bhij,bhjd->bhid'`	Contract seq

Key Takeaways

Tensors generalize scalars, vectors, and matrices to N dimensions - every deep learning computation is a tensor operation
Each tensor axis has semantic meaning (batch, channel, spatial, head, sequence) - know your axes before writing operations
Tensor contractions generalize matrix multiplication: sum over specified shared indices, keep free indices in output
Einstein summation (np.einsum) is the universal language for tensor contractions - learn it to write and read any DL operation compactly
Broadcasting rules: align from right, size-1 dimensions stretch - use keepdims=True when reducing to preserve axis semantics
Vectorization is 100-1000x faster than Python loops because BLAS and GPU kernels exploit SIMD parallelism and cache hierarchy

Next: Linear Algebra in NumPy - A Complete Engineering Reference →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Tensor Viewer demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Prerequisites​

Part 1 - Tensors as Generalization: 0D to N-D​

The hierarchy​

Axes and their semantic meaning​

Part 2 - Tensor Shapes and How to Read Them​

Part 3 - Tensor Contractions: Generalizing Matrix Multiplication​

From matrix multiply to general contraction​

Part 4 - Einstein Summation Notation​

The rule​

Performance: einsum is optimized​

Part 5 - Broadcasting: Extending Operations Across Dimensions​

The broadcasting rules​

keepdims: the common broadcasting bug​

Part 6 - Vectorization: Why Tensor Ops Are Fast​

The problem with Python loops​

Why tensor operations are fast​

Memory layout and cache performance​

Part 7 - NumPy: einsum, Broadcasting, reshape vs view​

reshape vs view: the critical difference​

Part 8 - PyTorch: Tensors, Gradients, and GPU Operations​

Part 9 - ML: Scaled Dot-Product Attention with einsum​

Interview Questions​

Practice Challenges​

Level 1: Predict​

Level 2: Debug​

Level 3: Design​

Quick Reference Cheatsheet​

Key Takeaways​

What You Will Learn

Prerequisites

Part 1 - Tensors as Generalization: 0D to N-D

The hierarchy

Axes and their semantic meaning

Part 2 - Tensor Shapes and How to Read Them

Part 3 - Tensor Contractions: Generalizing Matrix Multiplication

From matrix multiply to general contraction

Part 4 - Einstein Summation Notation

The rule

Performance: einsum is optimized

Part 5 - Broadcasting: Extending Operations Across Dimensions

The broadcasting rules

keepdims: the common broadcasting bug

Part 6 - Vectorization: Why Tensor Ops Are Fast

The problem with Python loops

Why tensor operations are fast

Memory layout and cache performance

Part 7 - NumPy: einsum, Broadcasting, reshape vs view

reshape vs view: the critical difference

Part 8 - PyTorch: Tensors, Gradients, and GPU Operations

Part 9 - ML: Scaled Dot-Product Attention with einsum

Interview Questions

Practice Challenges

Level 1: Predict

Level 2: Debug

Level 3: Design

Quick Reference Cheatsheet

Key Takeaways