Deep Learning
How do you train interdependent neural networks without them destabilizing each other?
Neural networks overfit by co-adapting neurons. Randomly drop units during training to regularize and approximate ensemble averaging.
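A minimal NumPy sketch of inverted dropout (drop rate and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, train=True):
    """Inverted dropout: zero each unit with prob p, scale survivors by 1/(1-p)
    so expected activations match test time, when the full net runs unchanged."""
    if not train or p == 0.0:
        return x  # test time: the intact network approximates the ensemble average
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

h = np.ones((4, 8))
out = dropout(h, p=0.5)   # surviving units are rescaled to 2.0
```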
Self-attention is permutation invariant and has no notion of token order. Inject position information to preserve sequence structure.
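A sketch of the sinusoidal positional encoding from the Transformer paper (dimensions illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle).
    Added to token embeddings so attention can distinguish positions."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(50, 16)
```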
RNNs process sequences serially and struggle with long-range dependencies. Use self-attention to process all positions in parallel.
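A minimal single-head scaled dot-product self-attention in NumPy (sizes and weights illustrative): every position attends to every other in one matrix product, with no recurrence.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """softmax(QK^T / sqrt(d)) V over all positions in parallel."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))                      # 5 tokens, model dim 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out, attn = self_attention(x, Wq, Wk, Wv)
```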
Fixed-size representations bottleneck sequence-to-sequence models. Dynamically attend to relevant parts of the input at each decoding step.
Self-attention costs O(n²) in sequence length. Use sparse or linear approximations to handle longer sequences.
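A sketch of one linear-attention variant (kernel feature map φ(x) = elu(x)+1, in the style of Katharopoulos et al.); by associating the products as φ(Q)(φ(K)ᵀV), cost drops from O(n²·d) to O(n·d²):

```python
import numpy as np

def linear_attention(q, k, v):
    """Kernelized attention: phi(q) @ (phi(k).T @ v), normalized per query.
    The (d, d_v) summary phi(k).T @ v never materializes an n x n matrix."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, positive
    qp, kp = phi(q), phi(k)
    kv = kp.T @ v                      # (d, d_v) summary of keys/values
    z = qp @ kp.sum(axis=0)            # per-query normalizer
    return (qp @ kv) / z[:, None]

rng = np.random.default_rng(2)
n, d = 100, 8
out = linear_attention(rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)))
```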
Fully connected layers ignore spatial structure and have too many parameters for grid data. Use local weight-sharing filters that exploit translation invariance.
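A bare-bones 2D convolution in NumPy (loop version for clarity; filter and image are illustrative) showing one shared filter sliding over the grid:

```python
import numpy as np

def conv2d(img, kernel):
    """One shared filter slides over the image: local connectivity plus
    weight sharing, so the same pattern is detected at every location."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

edge = np.array([[1.0, 0.0, -1.0]] * 3)    # 3x3 vertical-edge filter
img = np.zeros((6, 6)); img[:, 3:] = 1.0   # step edge at column 3
resp = conv2d(img, edge)                   # fires only where the window spans the edge
```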
Language models only use left context, missing bidirectional understanding. Mask random tokens and train to predict them using full context.
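A sketch of the masking step in masked-language-model pretraining (the mask id, rate, and ignore index are illustrative; BERT also sometimes substitutes random or unchanged tokens, omitted here):

```python
import numpy as np

rng = np.random.default_rng(3)
MASK_ID = 0   # hypothetical id for the [MASK] token

def mask_tokens(ids, mask_prob=0.15):
    """Hide a random subset of tokens; targets are the originals at masked
    positions, so predicting them requires both left and right context."""
    ids = ids.copy()
    positions = rng.random(ids.shape) < mask_prob
    labels = np.where(positions, ids, -100)   # -100 = ignored by the loss
    ids[positions] = MASK_ID
    return ids, labels

tokens = rng.integers(1, 1000, size=200)
inputs, labels = mask_tokens(tokens)
```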
Standard neural networks can't operate on graph-structured data. Generalize convolutions to graphs by aggregating neighbor features.
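A minimal graph-convolution layer in the Kipf & Welling style (graph and weights illustrative): each node averages its neighbors' features, then a shared linear map is applied.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN layer: symmetric-normalized neighbor aggregation, shared
    weights W, ReLU nonlinearity."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(deg ** -0.5)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # D^-1/2 (A + I) D^-1/2
    return np.maximum(A_norm @ X @ W, 0.0)

A = np.array([[0, 1, 0],                       # 3-node path graph 0-1-2
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(4)
H = gcn_layer(A, rng.standard_normal((3, 5)), rng.standard_normal((5, 4)))
```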
How do you quantify prediction error to guide optimization? Choose objective functions that align with the task and have good gradient properties.
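A sketch of softmax cross-entropy, the standard classification objective (logits illustrative); its gradient with respect to the logits is the clean softmax(logits) − one_hot(target).

```python
import numpy as np

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits),
    computed via the log-sum-exp trick for numerical stability."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

logits = np.array([2.0, 1.0, 0.1])
loss = cross_entropy(logits, target=0)   # small: the top logit is the target
```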
Neural networks output confident but unreliable probabilities. Adjust predicted probabilities to match true outcome frequencies.
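A sketch of temperature scaling, a simple post-hoc calibration method (the temperature here is illustrative; in practice T is fitted on a held-out validation set):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def temperature_scale(logits, T):
    """Divide logits by T > 1: the argmax (accuracy) is unchanged, but
    confidence is softened toward the observed outcome frequencies."""
    return softmax(logits / T)

logits = np.array([4.0, 1.0, 0.0])
p_raw = softmax(logits)                    # overconfident
p_cal = temperature_scale(logits, T=2.0)   # same ranking, lower peak
```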
Fully connected networks ignore spatial structure and have too many parameters for images. Use local receptive fields with shared weights for spatial hierarchy.
Most generative models can't compute exact likelihoods. Use invertible transformations to get both exact density evaluation and efficient sampling.
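A toy 1-D normalizing flow (a single invertible affine map; real flows stack many such layers): the change-of-variables formula gives an exact log-density, and sampling is the inverse map applied to base samples.

```python
import numpy as np

LOG_2PI = np.log(2.0 * np.pi)

def affine_flow_logpdf(x, mu, log_sigma):
    """Invertible map z = (x - mu)/sigma with base N(0, 1).
    Exact density: log p(x) = log N(z; 0, 1) - log|dz/dx|^-1 = ... - log sigma."""
    z = (x - mu) * np.exp(-log_sigma)
    return -0.5 * (z**2 + LOG_2PI) - log_sigma

def affine_flow_sample(n, mu, log_sigma, rng):
    """Sampling = inverse transform of base samples: x = mu + sigma * eps."""
    return mu + np.exp(log_sigma) * rng.standard_normal(n)

logp = affine_flow_logpdf(x=1.0, mu=1.0, log_sigma=0.0)   # density at the mode
```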
Deeper networks are more expressive but harder to train due to vanishing/exploding gradients and optimization challenges.
A single global learning rate is suboptimal for all parameters. Adapt learning rates per-parameter based on gradient history.
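A minimal Adam update in NumPy (hyperparameters are the usual defaults; the quadratic objective is illustrative): per-parameter step sizes come from running first and second gradient moments.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: momentum (m), squared-gradient scale (v), bias correction."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
```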
Deep networks face vanishing/exploding gradients, saddle points, and ill-conditioned loss landscapes.
Internal covariate shift slows convergence and requires careful tuning. Normalize activations to stabilize and accelerate training.
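A sketch of batch normalization at training time (running statistics for inference are omitted; shapes illustrative): each feature is standardized over the batch, then rescaled by learnable γ and β.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension, then affine-transform
    with learnable gamma/beta so the layer can still represent any scale."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(5)
x = rng.standard_normal((64, 10)) * 50 + 3   # badly scaled activations
y = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
```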
The forward pass samples from distributions whose parameters you're optimizing, which blocks gradient flow. Reparameterize the sampling (z = μ + σ·ε, ε ~ N(0, 1)) so the loss stays differentiable in those parameters.
Need gradients through non-differentiable stochastic operations. Use the log-derivative trick to estimate gradients from samples.
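A sketch of the score-function (REINFORCE / log-derivative) estimator with a mean baseline for variance reduction (distribution and objective illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def score_function_grad(theta, f, n=200_000):
    """grad_theta E_p[f(x)] = E_p[f(x) * d log p(x)/d theta].
    Here p = N(theta, 1), so d log p / d theta = (x - theta); f itself
    need not be differentiated, only log p."""
    x = rng.normal(theta, 1.0, size=n)
    fx = f(x)
    baseline = fx.mean()                     # reduces estimator variance
    return ((fx - baseline) * (x - theta)).mean()

# E[x^2] under N(theta, 1) is theta^2 + 1, so the true gradient is 2*theta
g = score_function_grad(theta=2.0, f=lambda x: x**2)   # should be near 4
```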
Autoencoders don't provide a proper generative model with meaningful latent space. Optimize a variational lower bound for principled generation.
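The KL regularizer in the VAE's variational lower bound (ELBO = E_q[log p(x|z)] − KL(q(z|x) ‖ p(z))) has a closed form for a diagonal Gaussian posterior against a standard-normal prior; a sketch (latent size illustrative):

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) )
    = 0.5 * sum( exp(log_var) + mu^2 - 1 - log_var ). Zero iff q equals the prior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

kl_zero = kl_diag_gaussian(np.zeros(4), np.zeros(4))   # q == prior
kl_pos = kl_diag_gaussian(np.ones(4), np.zeros(4))     # shifted mean, KL = 2
```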
Poor initialization causes exploding or vanishing activations. Initialize weights to preserve signal variance across layers.
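A sketch of He initialization (Var(W) = 2/fan_in for ReLU; Xavier uses 1/fan_in for linear/tanh), checking that signal magnitude stays order-one through a deep ReLU stack (depth and widths illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

def he_init(fan_in, fan_out):
    """Scale weights so ReLU layers preserve activation variance:
    the factor 2 compensates for ReLU zeroing half the pre-activations."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

x = rng.standard_normal((512, 256))
h = x
for _ in range(20):                          # 20 ReLU layers, no explosion/vanishing
    h = np.maximum(h @ he_init(256, 256), 0.0)
```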
Discriminative models can't generate new data or capture the full data distribution. Generative models enable sampling, density estimation, and unsupervised learning.
Batch normalization depends on batch statistics and fails with small batches or recurrent nets. Normalize across features within each example instead.
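A sketch of layer normalization (shapes illustrative): statistics come from each example's own features, so it works with batch size 1 and at every RNN time step.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize across the feature dimension of EACH example independently;
    no batch statistics, unlike batch norm."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])          # batch of one example
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```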