Deep Learning
How do you train interdependent neural networks without them destabilizing each other?
Neural networks overfit by co-adapting neurons. Randomly drop units during training to regularize and approximate ensemble averaging.
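A minimal NumPy sketch of inverted dropout (drop rate and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, train=True):
    """Inverted dropout: zero each unit with prob p, scale survivors by 1/(1-p)
    so expected activations match test time, when the full net runs unchanged."""
    if not train or p == 0.0:
        return x  # test time: the intact network approximates the ensemble average
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

h = np.ones((4, 8))
out = dropout(h, p=0.5)   # surviving units are rescaled to 2.0
```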
Self-attention is permutation invariant and has no notion of token order. Inject position information to preserve sequence structure.
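A sketch of the sinusoidal positional encoding from the Transformer paper (dimensions illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle).
    Added to token embeddings so attention can distinguish positions."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(50, 16)
```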
RNNs process sequences serially and struggle with long-range dependencies. Use self-attention to process all positions in parallel.
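A minimal single-head scaled dot-product self-attention in NumPy (sizes and weights illustrative): every position attends to every other in one matrix product, with no recurrence.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """softmax(QK^T / sqrt(d)) V over all positions in parallel."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))                      # 5 tokens, model dim 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out, attn = self_attention(x, Wq, Wk, Wv)
```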
Fixed-size representations bottleneck sequence-to-sequence models. Dynamically attend to relevant parts of the input at each decoding step.
Self-attention costs O(n²) in sequence length. Use sparse or linear approximations to handle longer sequences.
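A sketch of one linear-attention variant (kernel feature map φ(x) = elu(x)+1, in the style of Katharopoulos et al.); by associating the products as φ(Q)(φ(K)ᵀV), cost drops from O(n²·d) to O(n·d²):

```python
import numpy as np

def linear_attention(q, k, v):
    """Kernelized attention: phi(q) @ (phi(k).T @ v), normalized per query.
    The (d, d_v) summary phi(k).T @ v never materializes an n x n matrix."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, positive
    qp, kp = phi(q), phi(k)
    kv = kp.T @ v                      # (d, d_v) summary of keys/values
    z = qp @ kp.sum(axis=0)            # per-query normalizer
    return (qp @ kv) / z[:, None]

rng = np.random.default_rng(2)
n, d = 100, 8
out = linear_attention(rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)))
```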
Fully connected layers ignore spatial structure and have too many parameters for grid data. Use local weight-sharing filters that exploit translation invariance.
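A bare-bones 2D convolution in NumPy (loop version for clarity; filter and image are illustrative) showing one shared filter sliding over the grid:

```python
import numpy as np

def conv2d(img, kernel):
    """One shared filter slides over the image: local connectivity plus
    weight sharing, so the same pattern is detected at every location."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

edge = np.array([[1.0, 0.0, -1.0]] * 3)    # 3x3 vertical-edge filter
img = np.zeros((6, 6)); img[:, 3:] = 1.0   # step edge at column 3
resp = conv2d(img, edge)                   # fires only where the window spans the edge
```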
Language models only use left context, missing bidirectional understanding. Mask random tokens and train to predict them using full context.
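A sketch of the masking step in masked-language-model pretraining (the mask id, rate, and ignore index are illustrative; BERT also sometimes substitutes random or unchanged tokens, omitted here):

```python
import numpy as np

rng = np.random.default_rng(3)
MASK_ID = 0   # hypothetical id for the [MASK] token

def mask_tokens(ids, mask_prob=0.15):
    """Hide a random subset of tokens; targets are the originals at masked
    positions, so predicting them requires both left and right context."""
    ids = ids.copy()
    positions = rng.random(ids.shape) < mask_prob
    labels = np.where(positions, ids, -100)   # -100 = ignored by the loss
    ids[positions] = MASK_ID
    return ids, labels

tokens = rng.integers(1, 1000, size=200)
inputs, labels = mask_tokens(tokens)
```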
Standard neural networks can't operate on graph-structured data. Generalize convolutions to graphs by aggregating neighbor features.
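A minimal graph-convolution layer in the Kipf & Welling style (graph and weights illustrative): each node averages its neighbors' features, then a shared linear map is applied.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN layer: symmetric-normalized neighbor aggregation, shared
    weights W, ReLU nonlinearity."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(deg ** -0.5)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # D^-1/2 (A + I) D^-1/2
    return np.maximum(A_norm @ X @ W, 0.0)

A = np.array([[0, 1, 0],                       # 3-node path graph 0-1-2
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(4)
H = gcn_layer(A, rng.standard_normal((3, 5)), rng.standard_normal((5, 4)))
```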
How do you quantify prediction error to guide optimization? Choose objective functions that align with the task and have good gradient properties.
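A sketch of softmax cross-entropy, the standard classification objective (logits illustrative); its gradient with respect to the logits is the clean softmax(logits) − one_hot(target).

```python
import numpy as np

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits),
    computed via the log-sum-exp trick for numerical stability."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

logits = np.array([2.0, 1.0, 0.1])
loss = cross_entropy(logits, target=0)   # small: the top logit is the target
```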
Neural networks output confident but unreliable probabilities. Adjust predicted probabilities to match true outcome frequencies.
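A sketch of temperature scaling, a simple post-hoc calibration method (the temperature here is illustrative; in practice T is fitted on a held-out validation set):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def temperature_scale(logits, T):
    """Divide logits by T > 1: the argmax (accuracy) is unchanged, but
    confidence is softened toward the observed outcome frequencies."""
    return softmax(logits / T)

logits = np.array([4.0, 1.0, 0.0])
p_raw = softmax(logits)                    # overconfident
p_cal = temperature_scale(logits, T=2.0)   # same ranking, lower peak
```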
Fully connected networks ignore spatial structure and have too many parameters for images. Use local receptive fields with shared weights for spatial hierarchy.
Most generative models can't compute exact likelihoods. Use invertible transformations to get both exact density evaluation and efficient sampling.
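A toy 1-D normalizing flow (a single invertible affine map; real flows stack many such layers): the change-of-variables formula gives an exact log-density, and sampling is the inverse map applied to base samples.

```python
import numpy as np

LOG_2PI = np.log(2.0 * np.pi)

def affine_flow_logpdf(x, mu, log_sigma):
    """Invertible map z = (x - mu)/sigma with base N(0, 1).
    Exact density: log p(x) = log N(z; 0, 1) - log|dz/dx|^-1 = ... - log sigma."""
    z = (x - mu) * np.exp(-log_sigma)
    return -0.5 * (z**2 + LOG_2PI) - log_sigma

def affine_flow_sample(n, mu, log_sigma, rng):
    """Sampling = inverse transform of base samples: x = mu + sigma * eps."""
    return mu + np.exp(log_sigma) * rng.standard_normal(n)

logp = affine_flow_logpdf(x=1.0, mu=1.0, log_sigma=0.0)   # density at the mode
```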
Deeper networks are more expressive but harder to train due to vanishing/exploding gradients and optimization challenges.
A single global learning rate is suboptimal for all parameters. Adapt learning rates per-parameter based on gradient history.
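A minimal Adam update in NumPy (hyperparameters are the usual defaults; the quadratic objective is illustrative): per-parameter step sizes come from running first and second gradient moments.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: momentum (m), squared-gradient scale (v), bias correction."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
```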
Deep networks face vanishing/exploding gradients, saddle points, and ill-conditioned loss landscapes.
Internal covariate shift slows convergence and requires careful tuning. Normalize activations to stabilize and accelerate training.
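A sketch of batch normalization at training time (running statistics for inference are omitted; shapes illustrative): each feature is standardized over the batch, then rescaled by learnable γ and β.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension, then affine-transform
    with learnable gamma/beta so the layer can still represent any scale."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(5)
x = rng.standard_normal((64, 10)) * 50 + 3   # badly scaled activations
y = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
```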
The forward pass samples from distributions whose parameters you're optimizing, which blocks gradient flow. Reparameterize the sampling (z = μ + σ·ε, ε ~ N(0, 1)) so the loss stays differentiable in those parameters.
Need gradients through non-differentiable stochastic operations. Use the log-derivative trick to estimate gradients from samples.
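A sketch of the score-function (REINFORCE / log-derivative) estimator with a mean baseline for variance reduction (distribution and objective illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def score_function_grad(theta, f, n=200_000):
    """grad_theta E_p[f(x)] = E_p[f(x) * d log p(x)/d theta].
    Here p = N(theta, 1), so d log p / d theta = (x - theta); f itself
    need not be differentiated, only log p."""
    x = rng.normal(theta, 1.0, size=n)
    fx = f(x)
    baseline = fx.mean()                     # reduces estimator variance
    return ((fx - baseline) * (x - theta)).mean()

# E[x^2] under N(theta, 1) is theta^2 + 1, so the true gradient is 2*theta
g = score_function_grad(theta=2.0, f=lambda x: x**2)   # should be near 4
```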
Autoencoders don't provide a proper generative model with meaningful latent space. Optimize a variational lower bound for principled generation.
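The KL regularizer in the VAE's variational lower bound (ELBO = E_q[log p(x|z)] − KL(q(z|x) ‖ p(z))) has a closed form for a diagonal Gaussian posterior against a standard-normal prior; a sketch (latent size illustrative):

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) )
    = 0.5 * sum( exp(log_var) + mu^2 - 1 - log_var ). Zero iff q equals the prior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

kl_zero = kl_diag_gaussian(np.zeros(4), np.zeros(4))   # q == prior
kl_pos = kl_diag_gaussian(np.ones(4), np.zeros(4))     # shifted mean, KL = 2
```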
Poor initialization causes exploding or vanishing activations. Initialize weights to preserve signal variance across layers.
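A sketch of He initialization (Var(W) = 2/fan_in for ReLU; Xavier uses 1/fan_in for linear/tanh), checking that signal magnitude stays order-one through a deep ReLU stack (depth and widths illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

def he_init(fan_in, fan_out):
    """Scale weights so ReLU layers preserve activation variance:
    the factor 2 compensates for ReLU zeroing half the pre-activations."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

x = rng.standard_normal((512, 256))
h = x
for _ in range(20):                          # 20 ReLU layers, no explosion/vanishing
    h = np.maximum(h @ he_init(256, 256), 0.0)
```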
Discriminative models can't generate new data or capture the full data distribution. Generative models enable sampling, density estimation, and unsupervised learning.
Batch normalization depends on batch statistics and fails with small batches or recurrent nets. Normalize across features within each example instead.
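A sketch of layer normalization (shapes illustrative): statistics come from each example's own features, so it works with batch size 1 and at every RNN time step.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize across the feature dimension of EACH example independently;
    no batch statistics, unlike batch norm."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])          # batch of one example
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```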