01 Module 05 - Information Theory
How Shannon's information theory underpins every loss function, compression algorithm, and generative model in modern ML engineering.

02 Entropy and Information
Shannon entropy, self-information, binary entropy, differential entropy, and why uncertainty quantification drives decision trees, perplexity, and Bayesian ML.

03 KL Divergence
Kullback-Leibler divergence - asymmetry, forward vs reverse KL, Jensen-Shannon divergence, and applications in VAEs and PPO reinforcement learning.

04 Cross-Entropy and Loss Functions
Cross-entropy loss derived from KL divergence and maximum likelihood estimation - binary cross-entropy, categorical cross-entropy, focal loss, and label smoothing.

05 Mutual Information
Mutual information, feature selection, pointwise mutual information in word2vec, and the information bottleneck principle in deep learning.

06 Data Compression Fundamentals
Shannon's source coding theorem, Huffman coding, arithmetic coding, lossless vs lossy compression, and why language model perplexity is a compression measure.

07 Information Geometry
Statistical manifolds, Fisher information matrix, natural gradient descent, and why second-order optimization methods like K-FAC and Shampoo are geometrically principled.

08 Minimum Description Length
MDL principle, Kolmogorov complexity, regularization as compression, and information-theoretic model selection - Occam's razor formalized.