64 docs tagged with "foundational-cs"

ARM vs x86 for AI Workloads

Comprehensive comparison of ARM and x86 architectures for ML workloads - ISA design, power efficiency, Apple Silicon unified memory, AWS Graviton3 inference, and performance-per-watt analysis for production AI systems.

Build Systems and CI/CD for ML

How build systems and CI/CD pipelines keep ML projects reproducible, tested, and safely deployable - covering Make, Bazel, DVC, MLflow, GitHub Actions, and canary deployments.

C and C++ for ML Systems

Learn why C and C++ form the foundation of every major ML framework, and how to read, write, and debug C++ code as an ML systems engineer.

Complexity Analysis for ML Engineers

Learn how Big-O notation, time and space complexity, and amortized analysis apply directly to ML systems - from understanding why O(n^2) attention limits transformer context length to profiling GPU kernels.

Concurrency Primitives

Master mutexes, condition variables, atomics, lock-free programming, and thread pools - the concurrency building blocks behind every high-throughput ML data pipeline and inference server.

Containers and Namespaces

How Linux namespaces, cgroups, and overlay filesystems power container isolation for multi-tenant ML serving, GPU workloads, and reproducible training environments.

CPU Pipeline and Instruction Execution

Learn how modern CPUs execute billions of instructions per second through pipelining, out-of-order execution, branch prediction, and superscalar design - and why these details matter for every ML engineer.

Cython and C Extensions

Learn how Cython bridges Python and C to deliver C-level performance in Python projects, covering type declarations, typed memoryviews, OpenMP parallelism, and raw C extension modules.

Data Structures for ML Systems

Data structures for ML infrastructure - tries for tokenizers, HNSW for vector search, inverted indexes for retrieval, LSM trees for feature stores, and product quantization for memory-efficient vector storage.

Dependency Management and Packaging

Master Python packaging from pyproject.toml and uv to Docker layer caching, private registries, and the CUDA version compatibility matrix that determines whether your ML environment actually works.

DNS, Service Discovery, and Consul

Master DNS and service discovery for distributed ML systems - DNS resolution chains, Kubernetes CoreDNS, Consul service mesh, etcd coordination, and how ML serving clusters register and find model endpoints dynamically.

Dynamic Programming for ML

Dynamic programming patterns in ML - edit distance for NLP evaluation, Viterbi decoding for sequence labeling, CTC for speech recognition, dynamic time warping, beam search, Bellman equations in reinforcement learning, and DP in autoregressive generation.

File Systems and IO Patterns

Master Linux file systems for ML workloads - VFS, ext4/XFS, page cache, direct I/O, mmap, io_uring, and how to tune I/O for maximum training data throughput and checkpoint speed.

Foundational CS for ML Engineers

The computer science foundations that make ML engineers dangerous - CPU and GPU architecture, operating systems, compilers, memory management, networking, algorithms, and systems programming.

Garbage Collection Algorithms

How Python's reference counting and generational garbage collector work, why GC pauses hurt ML serving latency, and how to tune or disable GC for performance-critical workloads.

Graph Algorithms and GNNs

Master graph representations, classical graph algorithms, and graph neural networks - from BFS/DFS and PageRank to GCN, GraphSAGE, and GAT with PyTorch Geometric.

gRPC and Protocol Buffers

Learn gRPC and Protocol Buffers for high-performance ML inference APIs - from protobuf wire format to bidirectional streaming, interceptors, health checks, and production deployment patterns.

Hardware Acceleration Beyond GPU

FPGA, ASIC, TPU systolic arrays, neuromorphic chips, photonic computing, and processing-in-memory for ML - when to use each, economic analysis, and the emerging hardware landscape beyond NVIDIA GPUs.

Hardware Performance Counters

Master hardware performance counters, the PMU, and Linux perf to diagnose CPU bottlenecks, optimize cache behavior, and profile ML workloads with surgical precision.

Hash Tables and Bloom Filters

Deep dive into hash table internals, consistent hashing for distributed ML, Bloom filters for training data deduplication, MinHash LSH for near-duplicate detection, and fingerprinting for dataset versioning.

Heap and Stack Memory

Learn how stack frames, heap allocation, and Python's memory model work under the hood - from C struct padding to pymalloc arenas, with production debugging techniques.

How Python Works Internally

A deep dive into CPython's architecture - from source code to bytecode execution, the GIL, memory management, and the Python object model that every serious Python engineer should understand.

HTTP/3 and QUIC

Understand HTTP/3 and QUIC - how QUIC solves TCP head-of-line blocking with UDP-based multiplexing, 0-RTT connection establishment, TLS 1.3 integration, and what it means for ML inference serving latency.

Infrastructure as Code for ML

IaC for ML infrastructure - Terraform GPU clusters on AWS/GCP/Azure, Helm charts for model serving, Pulumi Python IaC, Ansible for GPU node setup, GitOps with ArgoCD, spot instance handling, and infrastructure cost optimization.

JIT Compilation and numba

Just-in-time compilation from first principles, numba's LLVM backend and type inference system, GPU kernels with numba CUDA, and when JIT compilation delivers real performance gains.

Kernel Bypass and DPDK

Kernel bypass networking for ML clusters - DPDK architecture, RDMA and InfiniBand for GPU-to-GPU communication, NCCL's bypass path, io_uring, eBPF, and when these techniques matter for AllReduce latency.

Large-Scale Memory Optimization

Master the memory math behind training and serving large language models - from mixed precision and gradient checkpointing to ZeRO optimizer stages, KV cache management, and PagedAttention.

Linux Performance Tuning

Systematic Linux performance tuning for ML workloads - sysctl parameters, CPU governors, NUMA balancing, transparent huge pages, IRQ affinity, NIC tuning, and GRUB boot options that matter for training throughput and inference latency.

Linux Process Scheduling

Understand the Linux CFS scheduler, nice values, CPU affinity, real-time scheduling, cgroups, NUMA, and how Kubernetes CPU throttling destroys ML training throughput - with concrete fixes.

LLVM and MLIR

LLVM compiler infrastructure and MLIR multi-level IR for ML - how they power PyTorch, JAX, TensorFlow, Triton, and IREE, with SSA form, optimization passes, dialect design, and practical code generation for ML workloads.

Memory Allocators for ML

How glibc malloc, jemalloc, tcmalloc, and PyTorch's CUDA caching allocator work - with production techniques for eliminating memory fragmentation in ML training and serving.

Memory Hierarchy and Cache Design

Learn how the CPU cache hierarchy works - L1/L2/L3 structure, associativity, eviction policies, MESI coherence, NUMA topology, and how to write cache-friendly code that can run 10x to 100x faster for ML workloads.

Memory Models and Concurrency

Hardware memory models, memory barriers, atomic operations, lock-free data structures, and how memory ordering affects concurrent ML data pipelines and distributed training implementations.

Memory Profiling and Debugging

A systematic toolkit for finding and fixing memory leaks in Python ML systems - from tracemalloc snapshots to GPU memory debugging, DataLoader leaks, and long-running service monitoring.

Memory Safety and Rust

Understand memory safety bugs in C/C++, how Rust's ownership model eliminates them at compile time, and why Rust is becoming the language of choice for high-performance ML infrastructure components.

Message Queues and Kafka

Master Apache Kafka for ML data pipelines - topics, partitions, consumer groups, exactly-once semantics, real-time feature computation, prediction logging, and production patterns for ML platforms.

Module 1: Computer Architecture for ML Engineers

CPU architecture, memory hierarchy, SIMD vectorization, NUMA, and hardware performance analysis - understanding the machine your ML code runs on.

Module 2: Operating Systems for ML

Virtual memory, process scheduling, huge pages, memory-mapped files, and OS-level tuning - the operating system layer that determines whether your ML workload runs fast or fights the kernel.

Module 3: Compilers and Runtimes for ML

How compilers work, JIT compilation, MLIR, XLA, torch.compile, and TensorRT - understanding the compilation stack that turns your Python model into fast machine code.

Module 4: Memory Management for ML

Stack and heap allocation, Python memory model, GPU memory patterns, memory profiling, and zero-copy data transfer - debugging OOM errors and building memory-efficient pipelines.

Module 5: Networking for Distributed AI

TCP/IP fundamentals, RDMA, AllReduce algorithms, gRPC for model serving, and network bottlenecks in distributed training - the networking layer that determines whether your training job scales.

Module 6: Algorithms for ML Engineers

Algorithmic complexity in the context of ML - hash maps for embeddings, approximate nearest neighbor data structures, sampling at scale, and the algorithmic foundations of attention.

Module 7: Systems Programming for ML Engineers

C++ basics for ML engineers, Python C extensions, Cython, pybind11, and writing custom PyTorch operators - bridging the gap between Python ML code and high-performance native implementations.

Multicore and NUMA Architecture

Learn how multicore CPUs and NUMA topology affect ML workload performance - cache coherence overhead, CPU affinity, NUMA-aware memory allocation, hyperthreading, and configuring PyTorch DataLoader for optimal hardware utilization.

Network Debugging for Distributed Training

Master distributed training network debugging - NCCL error diagnosis, AllReduce communication patterns, bandwidth testing with iperf3 and nccl-tests, RDMA diagnostics, and profiler-based timeline analysis for PyTorch DDP.

Network Security for ML Platforms

Comprehensive network security for ML infrastructure - mTLS service authentication, Kubernetes network policies, eBPF with Cilium, secrets management with Vault, zero-trust networking, and ML-specific threats including model theft and prompt injection.