Machine Learning

Covariate Shift

Input distribution changes between training and test time while the conditional label distribution stays the same.

Distribution Shift

Training and deployment data come from different distributions, degrading model performance.

Similarity Measures

How to quantify how alike two data points or distributions are? Use distance metrics, kernels, or divergences depending on the setting.
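
A minimal sketch of two common point-wise measures in plain Python (function names are illustrative):

```python
import math

def euclidean(a, b):
    # Distance metric: smaller means more alike.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Angle-based similarity: 1 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(a, b))          # ≈ 3.742: nonzero distance
print(cosine_similarity(a, b))  # ≈ 1.0: same direction
```

Note how the two disagree here: `b = 2a` is far in Euclidean terms but maximally similar by angle — which measure is right depends on the setting.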

Ensemble Methods

Single models have high variance and limited accuracy. Combine multiple models to reduce error through averaging or boosting.

Capsule Networks (CapsNet)

CNNs lose spatial hierarchy and part-whole relationships through pooling. Use capsules with routing-by-agreement to preserve equivariance.

PolyLoss

Cross-entropy implicitly fixes the coefficients of its polynomial expansion in the prediction error. Reweight the leading polynomial terms for task-specific improvement.

Class Imbalance

Training data has highly skewed class distribution, biasing the model toward majority classes.

Learning to Defer

Model is uncertain or likely wrong on some inputs. Learn when to defer predictions to a human expert.

Uncertainty in Machine Learning

Models produce point predictions without knowing how confident they are. Estimate prediction uncertainty for safer decision-making.

Model Complexity and Occam's Razor

Complex models overfit while simple models underfit. Use Bayesian model evidence to automatically trade off fit and complexity.

Distant Supervision

Manual labeling is expensive. Use existing knowledge bases to automatically generate noisy training labels.

Hopfield Networks

How to build an associative memory that stores and retrieves patterns? Use a recurrent network with symmetric weights that minimizes an energy function.
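
A toy illustration of the idea, assuming a single stored pattern, Hebbian weights, and synchronous sign updates (all names are illustrative):

```python
def train(patterns):
    # Hebbian rule: symmetric weights, zero diagonal.
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / len(patterns)
    return W

def recall(W, state, steps=10):
    # Sign updates lower the energy until a fixed point is reached.
    n = len(state)
    for _ in range(steps):
        state = [1 if sum(W[i][j] * state[j] for j in range(n)) >= 0 else -1
                 for i in range(n)]
    return state

stored = [1, -1, 1, -1, 1, -1]
W = train([stored])
noisy = [1, -1, -1, -1, 1, -1]  # one bit flipped
print(recall(W, noisy))         # recovers the stored pattern
```

The symmetric weights guarantee an energy function that the update rule never increases, so retrieval from a corrupted cue settles into the nearest stored pattern.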

Recurrent Neural Networks (RNN)

Feedforward networks can't handle variable-length sequences or temporal dependencies. Use recurrent connections to maintain hidden state over time.

Meta Learning

Standard learning trains from scratch for each new task. Learn to learn so new tasks can be solved with very few examples.

Disentangled Representations

Learned representations entangle multiple factors of variation. Separate independent generative factors into distinct latent dimensions.

Loss Functions

How to quantify prediction error to guide optimization? Choose objective functions that align with the task and have good gradient properties.

Activation Functions

Linear transformations alone can't learn nonlinear decision boundaries. Apply nonlinear functions element-wise to enable universal approximation.
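
Two standard examples, sketched in plain Python:

```python
import math

def relu(x):
    # Rectified linear unit: zero for negatives, identity for positives.
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

# Without a nonlinearity, stacked linear layers collapse into a single
# linear map; applying relu/sigmoid element-wise between layers breaks
# that collapse and enables nonlinear decision boundaries.
print([relu(x) for x in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 3.0]
print(sigmoid(0.0))                         # 0.5
```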

Convolutional Neural Networks (CNN)

Fully connected networks ignore spatial structure and have too many parameters for images. Use local receptive fields with shared weights for spatial hierarchy.

Group Equivariant Convolutional Neural Networks

Standard CNNs are only translation equivariant. Generalize convolutions to be equivariant to rotations, reflections, and other symmetry groups.

Stochastic Gradients

Computing exact gradients over full datasets is too expensive. Use random mini-batch samples to get unbiased gradient estimates.

Variational Inference

Exact posterior inference is intractable for complex models. Approximate the posterior with a simpler distribution by minimizing KL divergence.

Why Implicit Density Models?

Explicit density models are restricted by tractability requirements. Implicit models like GANs generate samples without computing likelihoods.

Generative Adversarial Networks

How to generate realistic samples without tractable density estimation? Train a generator and discriminator adversarially to learn the data distribution.

Normalizing Flows

Most generative models can't compute exact likelihoods. Use invertible transformations to get both exact density evaluation and efficient sampling.

Maximum Likelihood Estimation

How to estimate model parameters from observed data? Find parameters that maximize the probability of the observations.
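
For a Gaussian model the maximization has a closed form — the sample mean and (biased) sample variance — sketched here in plain Python:

```python
def gaussian_mle(data):
    # Maximizing the log-likelihood of N(mu, sigma^2) over the data
    # gives mu = sample mean and sigma^2 = biased sample variance.
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, var

mu, var = gaussian_mle([2.0, 4.0, 6.0, 8.0])
print(mu, var)  # 5.0 5.0
```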

Autoregressive Models

How to model complex joint distributions tractably? Factor into a product of conditionals and model each sequentially.

Backpropagation Through Time (BPTT)

Standard backprop doesn't handle recurrent connections over time. Unroll the recurrence and backpropagate through the full sequence.

Boltzmann Machines

How to learn an undirected probabilistic generative model? Use stochastic binary units with symmetric connections to model the data distribution.

Contrastive Divergence

MCMC-based learning in energy models requires expensive sampling to convergence. Approximate the gradient using only a few Gibbs sampling steps.

Latent Variable Models

Observed data alone may not capture underlying structure. Introduce hidden variables to explain data through simpler latent factors.

Variational Autoencoders

Autoencoders don't provide a proper generative model with meaningful latent space. Optimize a variational lower bound for principled generation.

Gaussian Distribution

Many phenomena cluster around a mean with known spread. The Gaussian is the maximum entropy distribution for known mean and variance.

Autoencoders

How to learn compact representations without labels? Train a network to reconstruct its input through a bottleneck layer.

Energy-Based Models

How to model complex distributions without explicit normalization? Assign low energy to likely configurations and learn the energy function.

Expectation Maximization

Direct maximum likelihood is intractable when the model has latent variables. Alternate between inferring the latent posterior (E-step) and maximizing the expected log-likelihood (M-step).

Gated Recurrent Units (GRU)

LSTMs are effective but have many parameters. Simplify the gating mechanism while retaining the ability to capture long-range dependencies.

Equivalent Kernel

How to understand the implicit smoothing behavior of a regression model? Express predictions as a kernel-weighted sum of training targets.

Bayesian Estimation

Point estimates don't capture parameter uncertainty. Use Bayes' theorem to maintain a full posterior distribution over parameters.

Logistic Regression

How to model binary classification probabilities? Apply a sigmoid to a linear function and optimize with cross-entropy loss.
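
A minimal sketch on a 1-D toy problem, assuming full-batch-free per-example updates (hyperparameters are illustrative); the cross-entropy gradient reduces to the simple form `(p - y) * x`:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, lr=0.5, epochs=500):
    # Gradient descent on cross-entropy; its gradient w.r.t. the
    # weights is (p - y) * x, with p the sigmoid output.
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = fit(xs, ys)
print(sigmoid(w * 2.0 + b) > 0.5)  # True: positive side of the boundary
```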

Bayesian Linear Regression

Ordinary linear regression gives point estimates with no uncertainty. Place priors on weights to get a full posterior and predictive distribution.

Bias vs Variance in Machine Learning

Simple models underfit (high bias) and complex models overfit (high variance). Understanding this tradeoff guides model selection.

Basis Functions

Linear models can't capture nonlinear relationships. Transform inputs through nonlinear basis functions to enable nonlinear modeling.

Cross Validation

Evaluating on training data overestimates performance. Hold out different data subsets in rotation to get reliable generalization estimates.
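
A minimal k-fold sketch in plain Python, using a trivial mean predictor so the rotation logic stays visible (names are illustrative):

```python
def k_fold_indices(n, k):
    # Partition [0, n) into k contiguous folds of near-equal size.
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_mse(data, k=3):
    # Rotate each fold as the held-out test set; fit a mean predictor
    # on the remaining folds and score it on the held-out one.
    folds = k_fold_indices(len(data), k)
    scores = []
    for test in folds:
        train = [data[i] for i in range(len(data)) if i not in test]
        pred = sum(train) / len(train)
        scores.append(sum((data[i] - pred) ** 2 for i in test) / len(test))
    return sum(scores) / k

print(cross_val_mse([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3))  # 6.25
```

Averaging over all k rotations uses every point for testing exactly once, giving a less optimistic estimate than scoring on the training data.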

Maximum A Posteriori (MAP)

MLE overfits with limited data by ignoring prior knowledge. Incorporate a prior and find the mode of the posterior instead.

Regularized Least Squares

Ordinary least squares overfits with limited data or many features. Add a penalty on weight magnitudes to constrain model complexity.
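
In one dimension the ridge solution has a simple closed form, which makes the shrinkage effect easy to see (a sketch; function name is illustrative):

```python
def ridge_1d(xs, ys, lam):
    # Closed form for min_w  sum (y - w*x)^2 + lam * w^2:
    #   w = (x . y) / (x . x + lam)
    # lam > 0 shrinks the weight toward zero.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(ridge_1d(xs, ys, 0.0))   # 2.0: ordinary least squares
print(ridge_1d(xs, ys, 14.0))  # 1.0: the penalty halves the weight
```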

Stochastic Gradient Descent

Full-batch gradient descent is too slow for large datasets. Update parameters using gradients from random mini-batches.
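
A minimal sketch minimizing the mean squared distance f(w) = avg (w − x)² with mini-batch gradients (hyperparameters and names are illustrative):

```python
import random

def sgd(data, lr=0.1, batch_size=2, epochs=200, seed=0):
    # Each step uses the gradient of a random mini-batch — an unbiased
    # but noisy estimate of the full-data gradient.
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        batch = rng.sample(data, batch_size)
        grad = sum(2 * (w - x) for x in batch) / batch_size
        w -= lr * grad
    return w

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sgd(data))  # close to 3.0, the full-data minimizer
```

The iterates hover around the minimizer rather than landing on it exactly — the price of noisy gradients, usually reduced in practice by decaying the learning rate.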

Support Vector Machines (SVM)

Many decision boundaries separate the training data. Find the maximum-margin hyperplane for best generalization.

Lagrange Multipliers

How to optimize a function subject to equality constraints? Introduce multipliers to convert into an unconstrained saddle-point problem.

Mixture of Experts

A single model struggles with heterogeneous data. Route inputs to specialized expert sub-models via a learned gating function.

Principal Component Analysis (PCA)

High-dimensional data is hard to visualize and process. Project onto directions of maximum variance to reduce dimensionality.
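
A sketch of finding the first principal component: center the data, form the covariance matrix, and run power iteration for its leading eigenvector (a plain-Python illustration, not an efficient implementation):

```python
import math

def leading_component(data, iters=100):
    # Center, build the covariance matrix, then power-iterate to get
    # the direction of maximum variance (leading eigenvector).
    n, d = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(d)]
    centered = [[row[i] - means[i] for i in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Points spread along the line y = x: the first PC is roughly [0.71, 0.70].
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
print(leading_component(data))
```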

Probabilistic Generative Models

Discriminative models can't model the data-generating process. Learn the joint distribution for generation, missing data handling, and outlier detection.

Gaussian Mixture Model

A single Gaussian can't model multi-modal data. Use a weighted mixture of Gaussians to represent complex distributions.

Kernel Methods

Operating in high-dimensional feature spaces is computationally expensive. Use the kernel trick to compute inner products implicitly.
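
A small demonstration of the trick with a degree-2 polynomial kernel: the kernel value equals the inner product of explicit quadratic features, computed without ever building them (function names are illustrative):

```python
def poly2_features(x):
    # Explicit degree-2 feature map: all pairwise products x_i * x_j.
    return [x[i] * x[j] for i in range(len(x)) for j in range(len(x))]

def poly2_kernel(x, y):
    # Kernel trick: (x . y)^2 equals the inner product of the explicit
    # features above, at the cost of one dot product.
    return sum(a * b for a, b in zip(x, y)) ** 2

x, y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
explicit = sum(a * b for a, b in zip(poly2_features(x), poly2_features(y)))
print(explicit, poly2_kernel(x, y))  # 1024.0 1024.0 — identical
```

The explicit map is quadratic in the input dimension; the kernel side stays linear, which is the whole point when the feature space is large or infinite (as with the RBF kernel).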

Discriminant Functions

How to directly assign inputs to classes without modeling probabilities? Learn functions that map inputs to class-specific scores.

Least Squares for Classification

How to apply regression to classification? Fit class labels as continuous targets, though this is sensitive to outliers and can mask classes in the multi-class case.

Perceptron

How to learn a linear decision boundary from data? Iteratively adjust weights on misclassified examples.
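
The classic update rule in plain Python, assuming ±1 labels: on each misclassified example, nudge the weights toward it (`w += y*x`, `b += y`):

```python
def perceptron(xs, ys, epochs=20):
    # Labels are +1/-1; an example is misclassified when y * score <= 0.
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

xs = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
ys = [1, 1, -1, -1]
w, b = perceptron(xs, ys)
print(all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0
          for x, y in zip(xs, ys)))  # True: training data separated
```

For linearly separable data the perceptron convergence theorem guarantees a finite number of mistakes; for non-separable data the loop never settles, which motivates the maximum-margin view of SVMs.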

Decision Theory

How to make optimal decisions under uncertainty? Combine probability estimates with loss functions to minimize expected risk.