Deep Learning

Topics

Notes

Linked

Multi-Network Training with Moving Average Target

How do you train interdependent neural networks without them destabilizing each other? Give each network a slowly updated (moving average) copy of the others' parameters to train against.
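A minimal sketch of the exponential-moving-average target update (as used for, e.g., DQN target networks); the parameter dict layout and the smoothing rate `tau` are illustrative choices, not a fixed API:

```python
import numpy as np

def ema_update(target, online, tau=0.005):
    """Move each target parameter a small step toward the online network.

    The target changes slowly, so the online network trains against a
    near-stationary objective instead of chasing its own updates.
    """
    return {k: (1 - tau) * target[k] + tau * online[k] for k in target}

online = {"w": np.array([1.0, 2.0])}
target = {"w": np.array([0.0, 0.0])}
for _ in range(1000):
    target = ema_update(target, online)
# after many steps the target has drifted close to the online weights
```

Some methods instead copy the online weights wholesale every N steps; the EMA form is the continuous version of that idea.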

Dropout

Neural networks overfit by co-adapting neurons. Randomly drop units during training to regularize and approximate ensemble averaging.

Positional Encoding

Self-attention is permutation invariant and has no notion of token order. Inject position information to preserve sequence structure.
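The sinusoidal scheme from "Attention Is All You Need", sketched in NumPy: even dimensions get sines, odd dimensions get cosines, at geometrically spaced frequencies:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encodings: each position maps to a unique
    vector, and relative offsets correspond to linear transformations."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(50, 16)   # added to the token embeddings before attention
```

Learned position embeddings are the common alternative; the sinusoidal form needs no parameters and extrapolates to unseen lengths.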

Transformers

RNNs process sequences serially and struggle with long-range dependencies. Use self-attention to process all positions in parallel.

Attention

Fixed-size representations bottleneck sequence-to-sequence models. Dynamically attend to relevant parts of the input at each decoding step.
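A sketch of scaled dot-product attention (single head, no masking) showing how every query gets its own weighting over the input positions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each output is a weighted average of V,
    with weights from query-key similarity, scaled by sqrt(d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 8))     # 5 positions, 8-dim vectors
out, w = attention(Q, K, V)
```

The same mechanism replaces the fixed encoder vector in seq2seq: the decoder re-queries the full input at every step.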

Convolution

Fully connected layers ignore spatial structure and have too many parameters for grid data. Use local weight-sharing filters that exploit translation invariance.
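A 1-D sketch of the weight-sharing idea: one small filter reused at every position, so shifting the input just shifts the output:

```python
import numpy as np

def conv1d(x, w):
    """Valid-mode 1-D convolution (cross-correlation): slide one shared
    filter w over x, using the same weights at every position."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

x = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # an impulse
w = np.array([1.0, 2.0, 3.0])            # one 3-tap filter, 3 parameters total
y = conv1d(x, w)                         # vs. 5x3 = 15 for a dense layer
```

A fully connected layer mapping the same shapes would need a separate weight per input-output pair; the filter needs only `k` parameters regardless of input length.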

BERT

Language models only use left context, missing bidirectional understanding. Mask random tokens and train to predict them using full context.
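A sketch of the masked-language-model data preparation (simplified: BERT also sometimes keeps or randomizes the chosen tokens); `mask_id` and the `-100` ignore-label convention are assumptions borrowed from common implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, mask_id, p=0.15):
    """Pick a random p of positions, record their original ids as labels,
    and replace them with [MASK]; unmasked positions get an ignore label."""
    tokens = tokens.copy()
    mask = rng.random(len(tokens)) < p
    labels = np.where(mask, tokens, -100)   # -100 = "no loss here" (assumed convention)
    tokens[mask] = mask_id
    return tokens, labels

tokens = np.arange(1000)                    # toy token ids
masked, labels = mask_tokens(tokens, mask_id=0)
```

The model then predicts the original id at each masked position using both left and right context, which a left-to-right language model cannot do.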

Graph Convolutional Networks (GCN)

Standard neural networks can't operate on graph-structured data. Generalize convolutions to graphs by aggregating neighbor features.
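One GCN layer in the Kipf & Welling form, sketched on a 3-node path graph; the toy features and weights are illustrative:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).
    Each node averages its own and its neighbors' features (symmetrically
    normalized by degree), then applies a shared linear map."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    D = np.diag(d_inv_sqrt)
    return np.maximum(0, D @ A_hat @ D @ H @ W)

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)       # path graph 0-1-2
H = np.eye(3)                                # one-hot node features
W = np.ones((3, 2))                          # shared weights across all nodes
out = gcn_layer(A, H, W)
```

The weight matrix `W` plays the role of the shared convolution filter: the same transformation at every node, with the graph supplying the neighborhood.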

Loss Functions

How do you quantify prediction error to guide optimization? Choose objective functions that align with the task and have good gradient properties.
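The two workhorse objectives, sketched in NumPy: MSE for regression, cross-entropy for classification (whose gradient with respect to the logits is simply `probs - targets`):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the maximum-likelihood loss under Gaussian noise."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_onehot, probs, eps=1e-12):
    """Average negative log-likelihood of the true class; eps avoids log(0)."""
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

y = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[0.9, 0.1], [0.2, 0.8]])
ce = cross_entropy(y, p)   # = -(ln 0.9 + ln 0.8) / 2
```

"Good gradient properties" is why cross-entropy is paired with softmax: the combination avoids the vanishing gradients that squared error on probabilities produces.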

Calibration

Neural networks output confident but unreliable probabilities. Adjust predicted probabilities to match true outcome frequencies.
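A sketch of temperature scaling, the simplest post-hoc calibration method: divide the logits by a scalar T fitted on a validation set (T > 1 softens overconfident predictions without changing the predicted class):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T):
    """Rescale logits by a single scalar T before softmax; T is normally
    fitted to minimize NLL on held-out data (here chosen by hand)."""
    return softmax(logits / T)

logits = np.array([[4.0, 1.0, 0.0]])
p_raw = temperature_scale(logits, T=1.0)   # sharp, possibly overconfident
p_cal = temperature_scale(logits, T=2.0)   # softer, same argmax
```

Because T is a single monotone rescaling, accuracy is untouched; only the confidence of each prediction moves toward the true outcome frequencies.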

Convolutional Neural Networks (CNN)

Fully connected networks ignore spatial structure and have too many parameters for images. Use local receptive fields with shared weights for spatial hierarchy.

Normalizing Flows

Most generative models can't compute exact likelihoods. Use invertible transformations to get both exact density evaluation and efficient sampling.
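The change-of-variables idea behind flows, sketched with the simplest invertible transform (a 1-D affine map of a standard normal): the exact density is the base density minus the log-determinant of the Jacobian:

```python
import numpy as np

def affine_flow_logpdf(x, a, b):
    """Exact log-density of x = a*z + b with z ~ N(0,1):
    log p(x) = log N(z; 0, 1) - log|a|,  where z = (x - b) / a
    (the -log|a| term is log|det J| of the inverse map)."""
    z = (x - b) / a
    log_pz = -0.5 * (z ** 2 + np.log(2 * np.pi))
    return log_pz - np.log(np.abs(a))

# sampling is just the forward map: draw z, push it through
rng = np.random.default_rng(0)
x_sample = 2.0 * rng.normal() + 0.5
```

Real flows stack many such invertible layers (coupling layers, autoregressive transforms) whose Jacobian determinants stay cheap to compute; the arithmetic is the same.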

Depth and Trainability

Deeper networks are more expressive but harder to train due to vanishing/exploding gradients and optimization challenges. Use residual connections, normalization, and careful initialization to keep depth trainable.

Adaptive Learning Rate Optimizers

A single global learning rate is suboptimal for all parameters. Adapt learning rates per-parameter based on gradient history.
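A sketch of the Adam update rule, the standard example of per-parameter adaptation: first and second moment estimates of the gradient, bias-corrected, set each parameter's effective step size:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (t counts from 1). Parameters with consistently
    large gradients get smaller effective steps, and vice versa."""
    m = b1 * m + (1 - b1) * grad            # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2       # second moment (scale)
    m_hat = m / (1 - b1 ** t)               # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# toy use: minimize f(x) = x^2 from x = 1; gradient is 2x
theta = np.array([1.0])
m = v = np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
```

RMSProp is Adam without the first-moment term; AdaGrad accumulates `v` without decay, which makes its steps shrink permanently.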

Challenges of optimizing deep models

Deep networks face vanishing/exploding gradients, saddle points, and ill-conditioned loss landscapes.

Normalization

Internal covariate shift slows convergence and requires careful tuning. Normalize activations to stabilize and accelerate training.
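A sketch of the batch-norm forward pass (training mode, without the learned scale/shift update or running statistics): each feature is standardized over the batch axis:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature to zero mean / unit variance over the batch,
    then rescale by gamma and shift by beta (learnable in practice)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(3.0, 5.0, size=(64, 8))   # badly shifted, badly scaled activations
y = batch_norm(x)                        # standardized per feature
```

At inference, running averages of `mu` and `var` collected during training are used instead of batch statistics.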

Pathwise Gradient Estimator

The forward pass samples from distributions whose parameters you're optimizing, and the loss is differentiable in the sample. Reparameterize the sample as a deterministic function of the parameters and independent noise so gradients flow through the sampling step.
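A sketch of the reparameterization trick for a Gaussian: writing z = mu + sigma*eps makes the sample differentiable in mu, so the gradient of E[f(z)] is just the expectation of f'(z) over samples (here f(z) = z², with known gradient 2*mu):

```python
import numpy as np

rng = np.random.default_rng(0)

def pathwise_grad(mu, sigma, n=100_000):
    """Estimate d/dmu E[z^2] for z ~ N(mu, sigma^2) via z = mu + sigma*eps.
    Since dz/dmu = 1, the estimator is mean of d(z^2)/dz = 2z."""
    eps = rng.normal(size=n)       # noise independent of the parameters
    z = mu + sigma * eps           # differentiable path from mu to the sample
    return np.mean(2 * z)          # true value: 2*mu

g = pathwise_grad(mu=1.5, sigma=0.3)
```

This estimator typically has far lower variance than REINFORCE, but it requires both a reparameterizable distribution and a differentiable loss.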

REINFORCE - Score Function Estimator

Need gradients through non-differentiable stochastic operations. Use the log-derivative trick to estimate gradients from samples.
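The score-function estimator on the same toy problem as above (d/dmu E[z²] = 2*mu for z ~ N(mu, σ²)), sketched in NumPy; note it never differentiates f:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(mu, sigma=1.0, n=200_000):
    """Score-function (log-derivative trick) estimate of d/dmu E[f(z)]:
    grad = E[ f(z) * d log p(z|mu) / dmu ], needing only samples of f."""
    z = rng.normal(mu, sigma, size=n)
    f = z ** 2                          # f could be a black box or reward
    score = (z - mu) / sigma ** 2       # d log N(z; mu, sigma^2) / d mu
    return np.mean(f * score)

g = reinforce_grad(mu=1.5)              # true gradient: 2 * 1.5 = 3.0
```

The price of this generality is variance: practical versions subtract a baseline from f to keep the estimate usable, which is why `n` here is large.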

Variational Autoencoders

Autoencoders don't provide a proper generative model with meaningful latent space. Optimize a variational lower bound for principled generation.
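The regularizer in the VAE's variational lower bound has a closed form for Gaussian encoders; a sketch of that KL term, parameterized by mean and log-variance as is conventional:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), the term in the ELBO that pulls the
    encoder's posterior toward the standard-normal prior:
    -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar))

kl_at_prior = gaussian_kl(np.zeros(4), np.zeros(4))   # posterior == prior -> 0
```

The full ELBO adds a reconstruction term, and the sampling step inside the encoder uses the pathwise (reparameterization) estimator noted above.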

Weight Initialization in Deep Neural Networks

Poor initialization causes exploding or vanishing activations. Initialize weights to preserve signal variance across layers.
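A sketch of He initialization for ReLU networks: drawing weights with variance 2/fan_in keeps the activation magnitude roughly constant through a deep stack (the layer sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    """He initialization: std sqrt(2/fan_in) compensates for ReLU zeroing
    half the pre-activations, preserving signal variance layer to layer."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

x = rng.normal(size=(1024, 512))
h = x
for _ in range(10):                       # a 10-layer ReLU stack
    h = np.maximum(0, h @ he_init(h.shape[1], 512))
# with He init, h.std() stays order 1 instead of exploding or vanishing
```

Xavier/Glorot initialization is the tanh/sigmoid analogue, using 1/fan_in (or 2/(fan_in+fan_out)) since those activations don't discard half the signal.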

Why Generative Models

Discriminative models can't generate new data or capture the full data distribution. Generative models enable sampling, density estimation, and unsupervised learning.

Layer Normalization

Batch normalization depends on batch statistics and fails with small batches or recurrent nets. Normalize across features within each example instead.
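A sketch contrasting with batch norm: layer norm reduces over the last (feature) axis, so each example is normalized independently and a batch of one works fine:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize across features within each example. No batch statistics,
    so the result is identical for batch size 1 or 1000, train or test."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])     # a single example is enough
y = layer_norm(x)
```

Practical versions add learnable per-feature gain and bias, as in batch norm; the only structural difference is the axis of reduction.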