Machine Learning
Topics
Notes
Linked
Input distribution changes between training and test time while the conditional label distribution stays the same.
Training and deployment data come from different distributions, degrading model performance.
How to quantify how alike two data points or distributions are? Use distance metrics, kernels, or divergences depending on the setting.
Single models have high variance and limited accuracy. Combine multiple models to reduce error through averaging or boosting.
CNNs lose spatial hierarchy and part-whole relationships through pooling. Use capsules with routing-by-agreement to preserve equivariance.
Cross-entropy loss weighs all polynomial terms equally. Reweight the polynomial expansion of the loss for task-specific improvement.
Training data has highly skewed class distribution, biasing the model toward majority classes.
Model is uncertain or likely wrong on some inputs. Learn when to defer predictions to a human expert.
Models produce point predictions without knowing how confident they are. Estimate prediction uncertainty for safer decision-making.
Complex models overfit while simple models underfit. Use Bayesian model evidence to automatically trade off fit and complexity.
Manual labeling is expensive. Use existing knowledge bases to automatically generate noisy training labels.
How to build an associative memory that stores and retrieves patterns? Use a recurrent network with symmetric weights that minimizes an energy function.
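A minimal NumPy sketch of this idea, assuming binary ±1 patterns, Hebbian weight storage, and synchronous sign updates; the pattern set and update schedule are illustrative choices, not part of the original note.

```python
import numpy as np

def store(patterns):
    """Hebbian storage: symmetric weights, zero diagonal."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, probe, steps=20):
    """Iteratively descend the energy E(s) = -0.5 * s^T W s."""
    s = probe.copy()
    for _ in range(steps):
        s_new = np.sign(W @ s)
        s_new[s_new == 0] = 1          # break ties consistently
        if np.array_equal(s_new, s):   # fixed point reached
            break
        s = s_new
    return s

# Store two +/-1 patterns and recover one from a corrupted probe.
patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, 1, -1, -1, -1]])
W = store(patterns)
noisy = patterns[0].copy()
noisy[0] *= -1                          # flip one bit
print(recall(W, noisy))                 # recovers patterns[0]
```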
Feedforward networks can't handle variable-length sequences or temporal dependencies. Use recurrent connections to maintain hidden state over time.
Standard learning trains from scratch for each new task. Learn to learn so new tasks can be solved with very few examples.
Learned representations entangle multiple factors of variation. Separate independent generative factors into distinct latent dimensions.
How to quantify prediction error to guide optimization? Choose objective functions that align with the task and have good gradient properties.
Linear transformations alone can't learn nonlinear decision boundaries. Apply nonlinear functions element-wise to enable universal approximation.
Fully connected networks ignore spatial structure and have too many parameters for images. Use local receptive fields with shared weights for spatial hierarchy.
Standard CNNs are only translation equivariant. Generalize convolutions to be equivariant to rotations, reflections, and other symmetry groups.
Computing exact gradients over full datasets is too expensive. Use random mini-batch samples to get unbiased gradient estimates.
Exact posterior inference is intractable for complex models. Approximate the posterior with a simpler distribution by minimizing KL divergence.
Explicit density models are restricted by tractability requirements. Implicit models like GANs generate samples without computing likelihoods.
How to generate realistic samples without tractable density estimation? Train a generator and discriminator adversarially to learn the data distribution.
Most generative models can't compute exact likelihoods. Use invertible transformations to get both exact density evaluation and efficient sampling.
How to estimate model parameters from observed data? Find parameters that maximize the probability of the observations.
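As a concrete instance, a short NumPy example fitting a Gaussian by maximum likelihood; the synthetic data and the choice of a Gaussian model are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic observations

# For a Gaussian, maximizing the log-likelihood has a closed form:
mu_hat = x.mean()                      # MLE of the mean
var_hat = ((x - mu_hat) ** 2).mean()   # MLE of the variance (biased, divides by N)

print(mu_hat, var_hat)
```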
How to model complex joint distributions tractably? Factor into a product of conditionals and model each sequentially.
Standard backprop doesn't handle recurrent connections over time. Unroll the recurrence and backpropagate through the full sequence.
How to learn an undirected probabilistic generative model? Use stochastic binary units with symmetric connections to model the data distribution.
MCMC-based learning in energy models requires expensive sampling to convergence. Approximate the gradient using only a few Gibbs sampling steps.
Observed data alone may not capture underlying structure. Introduce hidden variables to explain data through simpler latent factors.
Autoencoders don't provide a proper generative model with meaningful latent space. Optimize a variational lower bound for principled generation.
Many phenomena cluster around a mean with known spread. The Gaussian is the maximum entropy distribution for known mean and variance.
How to learn compact representations without labels? Train a network to reconstruct its input through a bottleneck layer.
How to model complex distributions without explicit normalization? Assign low energy to likely configurations and learn the energy function.
Exact computation of the maximum likelihood estimate is not possible for some models.
LSTMs are effective but have many parameters. Simplify the gating mechanism while retaining the ability to capture long-range dependencies.
How to understand the implicit smoothing behavior of a regression model? Express predictions as a kernel-weighted sum of training targets.
Point estimates don't capture parameter uncertainty. Use Bayes' theorem to maintain a full posterior distribution over parameters.
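A minimal sketch of keeping a full posterior instead of a point estimate, using the conjugate Beta-Bernoulli pair; the uniform prior and the coin-flip data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
flips = rng.binomial(1, p=0.7, size=50)      # observed coin flips

# Beta(a, b) prior on the success probability theta.
a, b = 1.0, 1.0                              # uniform prior

# Conjugacy: posterior is Beta(a + #heads, b + #tails).
a_post = a + flips.sum()
b_post = b + len(flips) - flips.sum()

posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)
```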
How to model binary classification probabilities? Apply a sigmoid to a linear function and optimize with cross-entropy loss.
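A small NumPy sketch of logistic regression trained by gradient descent on the cross-entropy loss; the toy data, learning rate, and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)   # synthetic binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)
    # Gradient of the mean cross-entropy loss w.r.t. w and b.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)
```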
Ordinary linear regression gives point estimates with no uncertainty. Place priors on weights to get a full posterior and predictive distribution.
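A sketch of the closed-form posterior for Bayesian linear regression, assuming a zero-mean isotropic Gaussian prior on the weights and known noise precision; alpha and beta are illustrative hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = 1.5 * X[:, 0] + rng.normal(scale=0.2, size=30)

Phi = np.hstack([np.ones((30, 1)), X])      # bias + linear feature
alpha, beta = 2.0, 25.0                     # prior precision, noise precision

# Posterior over weights: N(m_N, S_N).
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

# Predictive distribution at a new input.
phi_star = np.array([1.0, 0.5])
pred_mean = phi_star @ m_N
pred_var = 1.0 / beta + phi_star @ S_N @ phi_star
print(pred_mean, pred_var)
```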
Simple models underfit (high bias) and complex models overfit (high variance). Understanding this tradeoff guides model selection.
Linear models can't capture nonlinear relationships. Transform inputs through nonlinear basis functions to enable nonlinear modeling.
Evaluating on training data overestimates performance. Hold out different data subsets in rotation to get reliable generalization estimates.
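A bare-bones k-fold cross-validation loop in NumPy, assuming a plain least-squares fit as the model being evaluated; the fold count and synthetic data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

k = 5
indices = rng.permutation(len(y))
folds = np.array_split(indices, k)

scores = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Plain least-squares fit on the training folds.
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    mse = np.mean((X[val_idx] @ w - y[val_idx]) ** 2)
    scores.append(mse)

print(np.mean(scores))   # average held-out error across folds
```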
MLE overfits with limited data by ignoring prior knowledge. Incorporate a prior and find the mode of the posterior instead.
Ordinary least squares overfits with limited data or many features. Add a penalty on weight magnitudes to constrain model complexity.
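A sketch of the ridge (L2-penalized) closed-form solution; the penalty strength and synthetic data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

lam = 1.0   # penalty strength on the squared weight norm

# Closed form: w = (X^T X + lam * I)^(-1) X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
print(w_ridge)
```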
Full-batch gradient descent is too slow for large datasets. Update parameters using gradients from random mini-batches.
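A minimal mini-batch SGD loop on a least-squares objective; batch size, learning rate, and epoch count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -1.0, 0.5, 2.0, 0.0]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
lr, batch_size = 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Unbiased estimate of the full-batch gradient of the MSE loss.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad

print(w)
```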
Many decision boundaries separate the training data. Find the maximum-margin hyperplane for best generalization.
How to optimize a function subject to equality constraints? Introduce multipliers to convert into an unconstrained saddle-point problem.
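A tiny worked instance using SymPy (an assumed dependency, not named in the note): maximize f(x, y) = xy subject to x + y = 1 by finding stationary points of the Lagrangian.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = x * y                      # objective
g = x + y - 1                  # equality constraint g = 0

# Stationary points of the Lagrangian L = f - lam * g.
L = f - lam * g
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, lam], dict=True)
print(sol)   # x = y = 1/2, lam = 1/2
```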
A single model struggles with heterogeneous data. Route inputs to specialized expert sub-models via a learned gating function.
High-dimensional data is hard to visualize and process. Project onto directions of maximum variance to reduce dimensionality.
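A short NumPy PCA sketch: center the data, eigendecompose the sample covariance, and project onto the top components; the number of components kept is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features

Xc = X - X.mean(axis=0)                     # center
cov = Xc.T @ Xc / (len(X) - 1)              # sample covariance
eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues

k = 2
components = eigvecs[:, -k:][:, ::-1]       # top-k directions of variance
Z = Xc @ components                         # reduced representation
print(Z.shape, eigvals[::-1][:k])
```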
Discriminative models can't model the data-generating process. Learn the joint distribution for generation, missing data handling, and outlier detection.
A single Gaussian can't model multi-modal data. Use a weighted mixture of Gaussians to represent complex distributions.
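A small NumPy example of a two-component Gaussian mixture, showing sampling and density evaluation; the weights, means, and variances are made-up parameters, and fitting (e.g. by EM) is left aside.

```python
import numpy as np

weights = np.array([0.3, 0.7])
means = np.array([-2.0, 3.0])
stds = np.array([0.5, 1.0])

def mixture_pdf(x):
    """Weighted sum of Gaussian densities, one per component."""
    comps = np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return comps @ weights

def sample(n, rng):
    """Pick a component per draw, then sample from that Gaussian."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comp], stds[comp])

rng = np.random.default_rng(0)
xs = sample(5000, rng)
print(mixture_pdf(np.array([-2.0, 0.5, 3.0])))
```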
Operating in high-dimensional feature spaces is computationally expensive. Use the kernel trick to compute inner products implicitly.
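A quick NumPy check of the kernel trick for the polynomial kernel k(x, z) = (x · z)^2, whose explicit feature map contains all pairwise coordinate products; the vectors are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

kernel_value = (x @ z) ** 2          # implicit: one inner product, then square
explicit_value = phi(x) @ phi(z)     # explicit: 9-dimensional feature space
print(kernel_value, explicit_value)  # identical up to floating point
```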
How to directly assign inputs to classes without modeling probabilities? Learn functions that map inputs to class-specific scores.
How to apply regression to classification? Fit class labels as continuous targets, though this approach has known limitations for multi-class problems.
How to learn a linear decision boundary from data? Iteratively adjust weights on misclassified examples.
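The classic perceptron update as a short NumPy loop, assuming linearly separable ±1 labels; the data and number of passes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # separable +/-1 labels

w = np.zeros(2)
b = 0.0
for _ in range(20):                           # passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:            # misclassified (or on boundary)
            w += yi * xi                      # nudge boundary toward the example
            b += yi

print(w, b)
```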
How to make optimal decisions under uncertainty? Combine probability estimates with loss functions to minimize expected risk.
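A tiny example of minimizing expected risk: combine class posteriors with an asymmetric loss matrix and pick the action with the lowest expected loss; all numbers are made up.

```python
import numpy as np

# Posterior over classes (e.g. "disease", "healthy") for one input.
posterior = np.array([0.3, 0.7])

# loss[action, true_class]: missing the disease is costlier than a false alarm.
loss = np.array([[0.0, 1.0],     # action 0: treat
                 [10.0, 0.0]])   # action 1: do nothing

expected_risk = loss @ posterior  # expected loss of each action
print(expected_risk, expected_risk.argmin())  # treat, despite "healthy" being likelier
```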