Similarity Measures

Cosine Similarity

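Cosine similarity is the dot product of two vectors, normalized by the product of their lengths (Euclidean norms). It equals the cosine of the angle between the two vectors and is therefore independent of their lengths. Its values range from -1 to 1; for vectors with non-negative components, such as term-frequency vectors, they range from 0 to 1.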

$$ \cos (\theta)=\frac{\sum_{k} v_{1,k} \, v_{2,k}}{\sqrt{\sum_{k} v_{1,k}^{2}} \, \sqrt{\sum_{k} v_{2,k}^{2}}} $$
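
The formula can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `cosine_similarity` and the choice to raise an error for zero vectors (where the angle is undefined) are assumptions made for this example.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length numeric sequences."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    # Numerator: dot product of the two vectors.
    dot = sum(a * b for a, b in zip(v1, v2))
    # Denominator: product of the Euclidean norms.
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Assumption for this sketch: treat a zero vector as an error,
        # since the angle is undefined in that case.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Parallel vectors of different lengths score (approximately) 1,
# orthogonal vectors score 0, and opposite vectors score -1.
parallel = cosine_similarity([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 0], [-1, 0])
```

Scaling either input by a positive constant leaves the result unchanged, which is the length-independence described above.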

Jaccard Similarity

Jaccard similarity measures how similar two sets are, defined as the ratio of their intersection to their union:

$$ J(A, B)=\frac{|A \cap B|}{|A \cup B|} $$

It ranges from 0 (no overlap) to 1 (identical sets). It is commonly used in document comparison, recommendation systems, and clustering, where you need to measure overlap between categorical or binary features.
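
As a rough sketch, the definition maps directly onto Python's built-in set operations. The function name `jaccard_similarity` and the token sets are illustrative, and returning 1.0 when both sets are empty is a convention assumed here (the ratio is otherwise 0/0).

```python
def jaccard_similarity(a, b):
    """Ratio of intersection size to union size for two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Two toy "documents" represented as sets of tokens.
doc1 = {"the", "cat", "sat"}
doc2 = {"the", "cat", "ran"}

# Intersection is {"the", "cat"} (2 items); union has 4 items.
score = jaccard_similarity(doc1, doc2)  # 2 / 4 = 0.5
```

Because the inputs are converted to sets, repeated elements are ignored; only the presence or absence of each item matters.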

KL Divergence Distance

Optimal Transport Based

  • Wasserstein Distance