64 docs tagged with "foundational-cs"

ARM vs x86 for AI Workloads

Comprehensive comparison of ARM and x86 architectures for ML workloads - ISA design, power efficiency, Apple Silicon unified memory, AWS Graviton3 inference, and performance-per-watt analysis for production AI systems.

Build Systems and CI/CD for ML

How build systems and CI/CD pipelines keep ML projects reproducible, tested, and safely deployable - covering Make, Bazel, DVC, MLflow, GitHub Actions, and canary deployments.

C and C++ for ML Systems

Learn why C and C++ form the foundation of every major ML framework, and how to read, write, and debug C++ code as an ML systems engineer.

Complexity Analysis for ML Engineers

Learn how Big-O notation, time and space complexity, and amortized analysis apply directly to ML systems - from understanding why O(n^2) attention limits transformer context length to profiling GPU kernels.

Concurrency Primitives

Master mutexes, condition variables, atomics, lock-free programming, and thread pools - the concurrency building blocks behind every high-throughput ML data pipeline and inference server.

Containers and Namespaces

How Linux namespaces, cgroups, and overlay filesystems power container isolation for multi-tenant ML serving, GPU workloads, and reproducible training environments.

CPU Pipeline and Instruction Execution

Learn how modern CPUs execute billions of instructions per second through pipelining, out-of-order execution, branch prediction, and superscalar design - and why these details matter for every ML engineer.

Cython and C Extensions

Learn how Cython bridges Python and C to deliver C-level performance in Python projects, covering type declarations, typed memoryviews, OpenMP parallelism, and raw C extension modules.

Data Structures for ML Systems

Data structures for ML infrastructure - trie for tokenizers, HNSW for vector search, inverted index for retrieval, LSM trees for feature stores, and product quantization for memory-efficient vector storage.

Dependency Management and Packaging

Master Python packaging from pyproject.toml and uv to Docker layer caching, private registries, and the CUDA version compatibility matrix that determines whether your ML environment actually works.

DNS, Service Discovery, and Consul

Master DNS and service discovery for distributed ML systems - DNS resolution chains, Kubernetes CoreDNS, Consul service mesh, etcd coordination, and how ML serving clusters register and find model endpoints dynamically.

Dynamic Programming for ML

Dynamic programming patterns in ML - edit distance for NLP evaluation, Viterbi decoding for sequence labeling, CTC for speech recognition, dynamic time warping, beam search, Bellman equations in reinforcement learning, and DP in autoregressive generation.

File Systems and IO Patterns

Master Linux file systems for ML workloads - VFS, ext4/XFS, page cache, direct I/O, mmap, io_uring, and how to tune I/O for maximum training data throughput and checkpoint speed.

Foundational CS for ML Engineers

The computer science foundations that make ML engineers dangerous - CPU and GPU architecture, operating systems, compilers, memory management, networking, algorithms, and systems programming.

Garbage Collection Algorithms

How Python's reference counting and generational garbage collector work, why GC pauses hurt ML serving latency, and how to tune or disable GC for performance-critical workloads.

Graph Algorithms and GNNs

Master graph representations, classical graph algorithms, and graph neural networks - from BFS/DFS and PageRank to GCN, GraphSAGE, and GAT with PyTorch Geometric.

gRPC and Protocol Buffers

Learn gRPC and Protocol Buffers for high-performance ML inference APIs - from protobuf wire format to bidirectional streaming, interceptors, health checks, and production deployment patterns.

Hardware Acceleration Beyond GPU

FPGA, ASIC, TPU systolic arrays, neuromorphic chips, photonic computing, and processing-in-memory for ML - when to use each, economic analysis, and the emerging hardware landscape beyond NVIDIA GPUs.

Hardware Performance Counters

Master hardware performance counters, the PMU, and Linux perf to diagnose CPU bottlenecks, optimize cache behavior, and profile ML workloads with surgical precision.

Hash Tables and Bloom Filters

Deep dive into hash table internals, consistent hashing for distributed ML, Bloom filters for training data deduplication, MinHash LSH for near-duplicate detection, and fingerprinting for dataset versioning.

Heap and Stack Memory

Learn how stack frames, heap allocation, and Python's memory model work under the hood - from C struct padding to pymalloc arenas, with production debugging techniques.

How Python Works Internally

A deep dive into CPython's architecture - from source code to bytecode execution, the GIL, memory management, and the Python object model that every serious Python engineer should understand.

HTTP/3 and QUIC

Understand HTTP/3 and QUIC - how QUIC solves TCP head-of-line blocking with UDP-based multiplexing, 0-RTT connection establishment, TLS 1.3 integration, and what it means for ML inference serving latency.

Infrastructure as Code for ML

IaC for ML infrastructure - Terraform GPU clusters on AWS/GCP/Azure, Helm charts for model serving, Pulumi Python IaC, Ansible for GPU node setup, GitOps with ArgoCD, spot instance handling, and infrastructure cost optimization.

JIT Compilation and numba

Just-in-time compilation from first principles, numba's LLVM backend and type inference system, GPU kernels with numba CUDA, and when JIT compilation delivers real performance gains.

Kernel Bypass and DPDK

Kernel bypass networking for ML clusters - DPDK architecture, RDMA and InfiniBand for GPU-to-GPU communication, NCCL's bypass path, io_uring, eBPF, and when these techniques matter for AllReduce latency.

Large-Scale Memory Optimization

Master the memory math behind training and serving large language models - from mixed precision and gradient checkpointing to ZeRO optimizer stages, KV cache management, and PagedAttention.

Linux Performance Tuning

Systematic Linux performance tuning for ML workloads - sysctl parameters, CPU governors, NUMA balancing, transparent huge pages, IRQ affinity, NIC tuning, and grub options that matter for training throughput and inference latency.

Linux Process Scheduling

Understand the Linux CFS scheduler, nice values, CPU affinity, real-time scheduling, cgroups, NUMA, and how Kubernetes CPU throttling destroys ML training throughput - with concrete fixes.

LLVM and MLIR

LLVM compiler infrastructure and MLIR multi-level IR for ML - how they power PyTorch, JAX, TensorFlow, Triton, and IREE, with SSA form, optimization passes, dialect design, and practical code generation for ML workloads.

Memory Allocators for ML

How glibc malloc, jemalloc, tcmalloc, and PyTorch's CUDA caching allocator work - with production techniques for eliminating memory fragmentation in ML training and serving.

Memory Hierarchy and Cache Design

Learn how the CPU cache hierarchy works - L1/L2/L3 structure, associativity, eviction policies, MESI coherence, NUMA topology, and how to write cache-friendly code that can run 10x to 100x faster for ML workloads.

Memory Models and Concurrency

Hardware memory models, memory barriers, atomic operations, lock-free data structures, and how memory ordering affects concurrent ML data pipelines and distributed training implementations.

Memory Profiling and Debugging

A systematic toolkit for finding and fixing memory leaks in Python ML systems - from tracemalloc snapshots to GPU memory debugging, DataLoader leaks, and long-running service monitoring.

Memory Safety and Rust

Understand memory safety bugs in C/C++, how Rust's ownership model eliminates them at compile time, and why Rust is becoming the language of choice for high-performance ML infrastructure components.

Message Queues and Kafka

Master Apache Kafka for ML data pipelines - topics, partitions, consumer groups, exactly-once semantics, real-time feature computation, prediction logging, and production patterns for ML platforms.

Module 2: Operating Systems for ML

Virtual memory, process scheduling, huge pages, memory-mapped files, and OS-level tuning - the operating system layer that determines whether your ML workload runs fast or fights the kernel.

Module 3: Compilers and Runtimes for ML

How compilers work, JIT compilation, MLIR, XLA, torch.compile, and TensorRT - understanding the compilation stack that turns your Python model into fast machine code.

Module 4: Memory Management for ML

Stack and heap allocation, Python memory model, GPU memory patterns, memory profiling, and zero-copy data transfer - debugging OOM errors and building memory-efficient pipelines.

Module 5: Networking for Distributed AI

TCP/IP fundamentals, RDMA, AllReduce algorithms, gRPC for model serving, and network bottlenecks in distributed training - the networking layer that determines whether your training job scales.

Module 6: Algorithms for ML Engineers

Algorithmic complexity in the context of ML - hash maps for embeddings, approximate nearest neighbor data structures, sampling at scale, and the algorithmic foundations of attention.

Module 7: Systems Programming for ML Engineers

C++ basics for ML engineers, Python C extensions, Cython, Pybind11, and writing custom PyTorch operators - bridging the gap between Python ML code and high-performance native implementations.

Multicore and NUMA Architecture

Learn how multicore CPUs and NUMA topology affect ML workload performance - cache coherence overhead, CPU affinity, NUMA-aware memory allocation, hyperthreading, and configuring PyTorch DataLoader for optimal hardware utilization.

Network Debugging for Distributed Training

Master distributed training network debugging - NCCL error diagnosis, AllReduce communication patterns, bandwidth testing with iperf3 and nccl-tests, RDMA diagnostics, and profiler-based timeline analysis for PyTorch DDP.

Network Security for ML Platforms

Comprehensive network security for ML infrastructure - mTLS service authentication, Kubernetes network policies, eBPF with Cilium, secrets management with Vault, zero-trust networking, and ML-specific threats including model theft and prompt injection.

Observability and Logging

Observability for ML systems - structured logging with structlog, distributed tracing with OpenTelemetry, Prometheus metrics for inference servers, Grafana dashboards, ML-specific alerting, and production profiling.

Optimization Algorithms Deep Dive

Optimization algorithms in depth - SGD, momentum, Nesterov, AdaGrad, RMSProp, Adam derivation, AdamW, learning rate schedules, second-order methods, convergence theory, and why Adam beats SGD for transformers.

Processes, Threads, and Coroutines

Learn how processes, threads, and coroutines work at the OS level, and how to choose the right concurrency model for ML workloads - data loading, inference, and async API calls.

Profiling Python and C Code

Master the complete profiling toolkit - cProfile, line_profiler, py-spy, Scalene, Valgrind, and PyTorch Profiler - to find and eliminate bottlenecks in Python and ML training code.

Randomized Algorithms and Sketching

Randomized algorithms in ML - reservoir sampling for streaming data, Johnson-Lindenstrauss projections, Count-Min Sketch, HyperLogLog, randomized SVD, and locality-sensitive hashing for approximate nearest neighbor search.

Serialization and Data Formats

Master serialization formats for ML systems - Protocol Buffers, Apache Arrow, safetensors, Parquet, HDF5, MessagePack, and pickle - with performance benchmarks, security considerations, and schema evolution strategies.

Service Mesh and Load Balancing

Master service mesh architecture and load balancing for ML serving - Istio, Envoy, traffic management, mTLS, canary deployments, circuit breaking, and Kubernetes networking for production AI systems.

Shell Scripting for ML Workflows

Bash scripting for ML engineers - automating training launches, multi-node coordination, GPU monitoring, checkpoint management, parallel data downloads, and writing robust production-grade shell scripts.

Signals and IPC for ML

Unix signals, graceful shutdown patterns, shared memory, pipes, Unix domain sockets, and ZeroMQ for building reliable multi-process ML training and serving systems.

SIMD and Vectorization

Learn how SIMD instruction sets (SSE, AVX2, AVX-512) let CPUs process 4 to 16 single-precision floating-point values per instruction, why NumPy and PyTorch use them by default, and how to write code that compilers can auto-vectorize.

Sorting and Search for ML

Sorting algorithms and search techniques for ML engineers - from Timsort internals and top-k selection to binary search for hyperparameter tuning, FAISS IVF indexes, and beam search with priority queues.

Static Analysis and Type Systems

Build type-safe ML codebases using Python type hints, mypy strict mode, pydantic v2 validation, Protocol types, jaxtyping tensor shape annotations, and ruff for fast linting.

Storage Hierarchy: SSD and NVMe

Deep dive into SSD and NVMe storage architecture for ML workloads - NAND flash physics, NVMe protocol, io_uring async I/O, memory-mapped datasets, and designing storage systems for large-scale training.

System Calls and Linux API

Learn how Linux system calls underpin every ML workload - from dataset loading with mmap to epoll-based inference servers, seccomp sandboxing, and io_uring async I/O.

TCP/IP Fundamentals for ML

Master the networking layer that underpins every distributed training run and ML serving system - from TCP handshakes to jumbo frames and congestion control algorithms used in modern GPU clusters.

torch.compile and XLA

Deep dive into PyTorch's torch.compile architecture - TorchDynamo graph capture, AOTAutograd, TorchInductor code generation, XLA for TPU/GPU, and when compiler-based optimization delivers real ML performance gains.

Virtual Memory and Page Faults

Understand virtual memory layout, page tables, TLB, huge pages, and page faults - and how these OS mechanisms directly affect PyTorch training, large model loading, and ML dataset memory mapping.

Zero-Copy and Data Transfer

How to eliminate unnecessary memory copies in ML data pipelines - from sendfile() and mmap() to NumPy views, PyTorch pinned memory, and Apache Arrow Flight for zero-copy data serving.