Machine Learning
Topics
Notes
Linked
Input distribution changes between training and test time while the conditional label distribution stays the same.
Training and deployment data come from different distributions, degrading model performance.
How to quantify how alike two data points or distributions are? Use distance metrics, kernels, or divergences depending on the setting.
Single models have high variance and limited accuracy. Combine multiple models to reduce error through averaging or boosting.
CNNs lose spatial hierarchy and part-whole relationships through pooling. Use capsules with routing-by-agreement to preserve equivariance.
Cross-entropy loss weighs all polynomial terms equally. Reweight the polynomial expansion of the loss for task-specific improvement.
Training data has highly skewed class distribution, biasing the model toward majority classes.
Model is uncertain or likely wrong on some inputs. Learn when to defer predictions to a human expert.
Models produce point predictions without knowing how confident they are. Estimate prediction uncertainty for safer decision-making.
Complex models overfit while simple models underfit. Use Bayesian model evidence to automatically trade off fit and complexity.
Manual labeling is expensive. Use existing knowledge bases to automatically generate noisy training labels.
How to build an associative memory that stores and retrieves patterns? Use a recurrent network with symmetric weights that minimizes an energy function.
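A minimal NumPy sketch of this idea, assuming binary ±1 patterns, Hebbian weight storage, and synchronous sign updates; the pattern set and update schedule are illustrative choices, not part of the original note.

```python
import numpy as np

def store(patterns):
    """Hebbian storage: symmetric weights, zero diagonal."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, probe, steps=20):
    """Iteratively descend the energy E(s) = -0.5 * s^T W s."""
    s = probe.copy()
    for _ in range(steps):
        s_new = np.sign(W @ s)
        s_new[s_new == 0] = 1          # break ties consistently
        if np.array_equal(s_new, s):   # fixed point reached
            break
        s = s_new
    return s

# Store two +/-1 patterns and recover one from a corrupted probe.
patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, 1, -1, -1, -1]])
W = store(patterns)
noisy = patterns[0].copy()
noisy[0] *= -1                          # flip one bit
print(recall(W, noisy))                 # recovers patterns[0]
```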
Feedforward networks can't handle variable-length sequences or temporal dependencies. Use recurrent connections to maintain hidden state over time.
Standard learning trains from scratch for each new task. Learn to learn so new tasks can be solved with very few examples.
Learned representations entangle multiple factors of variation. Separate independent generative factors into distinct latent dimensions.
How to quantify prediction error to guide optimization? Choose objective functions that align with the task and have good gradient properties.
Linear transformations alone can't learn nonlinear decision boundaries. Apply nonlinear functions element-wise to enable universal approximation.
Fully connected networks ignore spatial structure and have too many parameters for images. Use local receptive fields with shared weights for spatial hierarchy.
Standard CNNs are only translation equivariant. Generalize convolutions to be equivariant to rotations, reflections, and other symmetry groups.
Computing exact gradients over full datasets is too expensive. Use random mini-batch samples to get unbiased gradient estimates.
Exact posterior inference is intractable for complex models. Approximate the posterior with a simpler distribution by minimizing KL divergence.
Explicit density models are restricted by tractability requirements. Implicit models like GANs generate samples without computing likelihoods.
How to generate realistic samples without tractable density estimation? Train a generator and discriminator adversarially to learn the data distribution.
Most generative models can't compute exact likelihoods. Use invertible transformations to get both exact density evaluation and efficient sampling.
How to estimate model parameters from observed data? Find parameters that maximize the probability of the observations.
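As a concrete instance, a short NumPy example fitting a Gaussian by maximum likelihood; the synthetic data and the choice of a Gaussian model are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic observations

# For a Gaussian, maximizing the log-likelihood has a closed form:
mu_hat = x.mean()                      # MLE of the mean
var_hat = ((x - mu_hat) ** 2).mean()   # MLE of the variance (biased, divides by N)

print(mu_hat, var_hat)
```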
How to model complex joint distributions tractably? Factor into a product of conditionals and model each sequentially.
Standard backprop doesn't handle recurrent connections over time. Unroll the recurrence and backpropagate through the full sequence.
How to learn an undirected probabilistic generative model? Use stochastic binary units with symmetric connections to model the data distribution.
MCMC-based learning in energy models requires expensive sampling to convergence. Approximate the gradient using only a few Gibbs sampling steps.
Observed data alone may not capture underlying structure. Introduce hidden variables to explain data through simpler latent factors.
Autoencoders don't provide a proper generative model with meaningful latent space. Optimize a variational lower bound for principled generation.
Many phenomena cluster around a mean with known spread. The Gaussian is the maximum entropy distribution for known mean and variance.
How to learn compact representations without labels? Train a network to reconstruct its input through a bottleneck layer.
How to model complex distributions without explicit normalization? Assign low energy to likely configurations and learn the energy function.
Exact computation of the maximum likelihood estimate is not possible for some models.
LSTMs are effective but have many parameters. Simplify the gating mechanism while retaining the ability to capture long-range dependencies.
How to understand the implicit smoothing behavior of a regression model? Express predictions as a kernel-weighted sum of training targets.
Point estimates don't capture parameter uncertainty. Use Bayes' theorem to maintain a full posterior distribution over parameters.
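A minimal sketch of keeping a full posterior instead of a point estimate, using the conjugate Beta-Bernoulli pair; the uniform prior and the coin-flip data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
flips = rng.binomial(1, p=0.7, size=50)      # observed coin flips

# Beta(a, b) prior on the success probability theta.
a, b = 1.0, 1.0                              # uniform prior

# Conjugacy: posterior is Beta(a + #heads, b + #tails).
a_post = a + flips.sum()
b_post = b + len(flips) - flips.sum()

posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)
```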
How to model binary classification probabilities? Apply a sigmoid to a linear function and optimize with cross-entropy loss.
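A small NumPy sketch of logistic regression trained by gradient descent on the cross-entropy loss; the toy data, learning rate, and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)   # synthetic binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)
    # Gradient of the mean cross-entropy loss w.r.t. w and b.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)
```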
Ordinary linear regression gives point estimates with no uncertainty. Place priors on weights to get a full posterior and predictive distribution.
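A sketch of the closed-form posterior for Bayesian linear regression, assuming a zero-mean isotropic Gaussian prior on the weights and known noise precision; alpha and beta are illustrative hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = 1.5 * X[:, 0] + rng.normal(scale=0.2, size=30)

Phi = np.hstack([np.ones((30, 1)), X])      # bias + linear feature
alpha, beta = 2.0, 25.0                     # prior precision, noise precision

# Posterior over weights: N(m_N, S_N).
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

# Predictive distribution at a new input.
phi_star = np.array([1.0, 0.5])
pred_mean = phi_star @ m_N
pred_var = 1.0 / beta + phi_star @ S_N @ phi_star
print(pred_mean, pred_var)
```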
Simple models underfit (high bias) and complex models overfit (high variance). Understanding this tradeoff guides model selection.
Linear models can't capture nonlinear relationships. Transform inputs through nonlinear basis functions to enable nonlinear modeling.
Evaluating on training data overestimates performance. Hold out different data subsets in rotation to get reliable generalization estimates.
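A bare-bones k-fold cross-validation loop in NumPy, assuming a plain least-squares fit as the model being evaluated; the fold count and synthetic data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

k = 5
indices = rng.permutation(len(y))
folds = np.array_split(indices, k)

scores = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Plain least-squares fit on the training folds.
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    mse = np.mean((X[val_idx] @ w - y[val_idx]) ** 2)
    scores.append(mse)

print(np.mean(scores))   # average held-out error across folds
```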
MLE overfits with limited data by ignoring prior knowledge. Incorporate a prior and find the mode of the posterior instead.
Ordinary least squares overfits with limited data or many features. Add a penalty on weight magnitudes to constrain model complexity.
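A sketch of the ridge (L2-penalized) closed-form solution; the penalty strength and synthetic data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

lam = 1.0   # penalty strength on the squared weight norm

# Closed form: w = (X^T X + lam * I)^(-1) X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
print(w_ridge)
```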
Full-batch gradient descent is too slow for large datasets. Update parameters using gradients from random mini-batches.
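A minimal mini-batch SGD loop on a least-squares objective; batch size, learning rate, and epoch count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -1.0, 0.5, 2.0, 0.0]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
lr, batch_size = 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Unbiased estimate of the full-batch gradient of the MSE loss.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad

print(w)
```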
Many decision boundaries separate the training data. Find the maximum-margin hyperplane for best generalization.
How to optimize a function subject to equality constraints? Introduce multipliers to convert into an unconstrained saddle-point problem.
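A tiny worked instance using SymPy (an assumed dependency, not named in the note): maximize f(x, y) = xy subject to x + y = 1 by finding stationary points of the Lagrangian.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = x * y                      # objective
g = x + y - 1                  # equality constraint g = 0

# Stationary points of the Lagrangian L = f - lam * g.
L = f - lam * g
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, lam], dict=True)
print(sol)   # x = y = 1/2, lam = 1/2
```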
A single model struggles with heterogeneous data. Route inputs to specialized expert sub-models via a learned gating function.
High-dimensional data is hard to visualize and process. Project onto directions of maximum variance to reduce dimensionality.
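A short NumPy PCA sketch: center the data, eigendecompose the sample covariance, and project onto the top components; the number of components kept is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features

Xc = X - X.mean(axis=0)                     # center
cov = Xc.T @ Xc / (len(X) - 1)              # sample covariance
eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues

k = 2
components = eigvecs[:, -k:][:, ::-1]       # top-k directions of variance
Z = Xc @ components                         # reduced representation
print(Z.shape, eigvals[::-1][:k])
```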
Discriminative models can't model the data-generating process. Learn the joint distribution for generation, missing data handling, and outlier detection.
A single Gaussian can't model multi-modal data. Use a weighted mixture of Gaussians to represent complex distributions.
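A small NumPy example of a two-component Gaussian mixture, showing sampling and density evaluation; the weights, means, and variances are made-up parameters, and fitting (e.g. by EM) is left aside.

```python
import numpy as np

weights = np.array([0.3, 0.7])
means = np.array([-2.0, 3.0])
stds = np.array([0.5, 1.0])

def mixture_pdf(x):
    """Weighted sum of Gaussian densities, one per component."""
    comps = np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return comps @ weights

def sample(n, rng):
    """Pick a component per draw, then sample from that Gaussian."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comp], stds[comp])

rng = np.random.default_rng(0)
xs = sample(5000, rng)
print(mixture_pdf(np.array([-2.0, 0.5, 3.0])))
```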
Operating in high-dimensional feature spaces is computationally expensive. Use the kernel trick to compute inner products implicitly.
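A quick NumPy check of the kernel trick for the polynomial kernel k(x, z) = (x · z)^2, whose explicit feature map contains all pairwise coordinate products; the vectors are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

kernel_value = (x @ z) ** 2          # implicit: one inner product, then square
explicit_value = phi(x) @ phi(z)     # explicit: 9-dimensional feature space
print(kernel_value, explicit_value)  # identical up to floating point
```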
How to directly assign inputs to classes without modeling probabilities? Learn functions that map inputs to class-specific scores.
How to apply regression to classification? Fit class labels as continuous targets, though this approach has known limitations for multi-class problems.
How to learn a linear decision boundary from data? Iteratively adjust weights on misclassified examples.
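The classic perceptron update as a short NumPy loop, assuming linearly separable ±1 labels; the data and number of passes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # separable +/-1 labels

w = np.zeros(2)
b = 0.0
for _ in range(20):                           # passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:            # misclassified (or on boundary)
            w += yi * xi                      # nudge boundary toward the example
            b += yi

print(w, b)
```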
How to make optimal decisions under uncertainty? Combine probability estimates with loss functions to minimize expected risk.
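A tiny example of minimizing expected risk: combine class posteriors with an asymmetric loss matrix and pick the action with the lowest expected loss; all numbers are made up.

```python
import numpy as np

# Posterior over classes (e.g. "disease", "healthy") for one input.
posterior = np.array([0.3, 0.7])

# loss[action, true_class]: missing the disease is costlier than a false alarm.
loss = np.array([[0.0, 1.0],     # action 0: treat
                 [10.0, 0.0]])   # action 1: do nothing

expected_risk = loss @ posterior  # expected loss of each action
print(expected_risk, expected_risk.argmin())  # treat, despite "healthy" being likelier
```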