Collaborative filtering

  • Collaborative (Social) filtering
  • Leverage ratings of a user u + other users in the system
  • Key Idea: if users $u$ and $v$ have rated items similarly in the past, and $v$ has rated an item that $u$ has not, then $u$'s rating for that item is likely to be close to $v$'s
  • Content of items no longer needed!
    • Content may be a bad indicator depending on the domain/circumstances

Methodology

  • Neighbourhood-based
    • Use stored ratings directly
    • Nearest neighbours
  • Model-based
    • Learn a predictive model of user-item interactions (e.g. latent factors)
    • Predict new/incomplete ratings using the trained model

What is being recommended

  • User based
    • Use other users to infer ratings of an item for a user
    • 'Neighbours' - users who have similar ratings
  • Item-based
    • Use ratings of similar items (that user has rated) to predict the rating for a given item

Types of ratings:

  • Explicit Ratings
    • Like/Dislike; IMDb ratings
    • Might not be available
  • Implicit
    • Time spent on web-page; Clicks
    • More available, but intention is unclear (suffers from biases and ambiguity)

Neighbourhood-based recommenders

Assume explicit ratings and a rating matrix that is not too sparse.

$$ \begin{array}{c||c|c|c|c|c} \hline & \begin{array}{c} \text { The } \\ \text { Matrix } \end{array} & \text { Titanic } & \begin{array}{c} \text { Die } \\ \text { Hard } \end{array} & \begin{array}{c} \text { Forrest } \\ \text { Gump } \end{array} & \text { Wall-E } \\ \hline \hline \text { John } & 5 & 1 & & 2 & 2 \\ \text { Lucy } & 1 & 5 & 2 & 5 & 5 \\ \text { Eric } & 2 & ? & 3 & 5 & 4 \\ \text { Diane } & 4 & 3 & 5 & 3 & \\ \hline \end{array} $$
  • Eric and Lucy have similar tastes
  • Eric and Diane have different tastes
  • What should the rating be for Eric for Titanic?
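To make "similar tastes" concrete, one common choice (an assumption here; the notes have not fixed a similarity measure at this point) is cosine similarity computed over the items two users have both rated:

```python
import numpy as np

# Rows of the rating table above; None marks a missing rating.
ratings = {
    "John":  [5, 1, None, 2, 2],
    "Lucy":  [1, 5, 2, 5, 5],
    "Eric":  [2, None, 3, 5, 4],
    "Diane": [4, 3, 5, 3, None],
}

def cosine_sim(u, v):
    """Cosine similarity over the items both users have rated."""
    common = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    a = np.array([c[0] for c in common], dtype=float)
    b = np.array([c[1] for c in common], dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine_sim(ratings["Eric"], ratings["Lucy"]), 2))   # 0.97
print(round(cosine_sim(ratings["Eric"], ratings["Diane"]), 2))  # 0.87
```

Both values come out high because raw ratings are all positive, but Lucy is still closer to Eric than Diane is; the per-user normalisation discussed later spreads these similarities out.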

User representation: a rating vector (one dimension per item).
Consider the k-nearest neighbours of user $u$ who have rated item $i$: $\mathcal{N}_{i}(u)$

$$ r_{u i}=\frac{1}{\left|\mathcal{N}_{i}(u)\right|} \sum_{v \in \mathcal{N}_{i}(u)} r_{v i} $$

  • But Lucy is more similar to Eric than Diane is, yet the simple average weights them equally

Consider the similarity

$$ r_{u i}=\frac{\sum_{v \in \mathcal{N}_{i}(u)} w_{u v} r_{v i}}{\sum_{v \in \mathcal{N}_{i}(u)}\left|w_{u v}\right|} $$

Given similarities $\langle$ Eric, Lucy $\rangle = 0.75$ and $\langle$ Eric, Diane $\rangle = 0.15$:
$$ r=\frac{0.75 \times 5+0.15 \times 3}{0.75+0.15} \approx 4.67 $$
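As a sketch, the weighted prediction above can be written directly (neighbour ratings and similarities are the example values from the text):

```python
def predict(neighbour_ratings, similarities):
    """Similarity-weighted average of neighbours' ratings for one item."""
    numerator = sum(w * r for w, r in zip(similarities, neighbour_ratings))
    denominator = sum(abs(w) for w in similarities)
    return numerator / denominator

# Lucy (rating 5, similarity 0.75) and Diane (rating 3, similarity 0.15)
print(round(predict([5, 3], [0.75, 0.15]), 2))  # 4.67
```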

  • Users may use rating values differently: a '5' from John (who always rates $1/2/5$) might not mean the same as a '5' from Diane (who rates $3/4/5$)
  • Solution: normalise the ratings per user, e.g. mean centering or z-score normalization
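A minimal sketch of both normalisation schemes, using NumPy's nan-aware reductions to skip unrated items:

```python
import numpy as np

def mean_center(ratings):
    """Subtract the user's mean rating; NaN marks unrated items."""
    r = np.asarray(ratings, dtype=float)
    return r - np.nanmean(r)

def z_score(ratings):
    """Mean-center, then divide by the user's rating standard deviation."""
    r = np.asarray(ratings, dtype=float)
    return (r - np.nanmean(r)) / np.nanstd(r)

john = [5, 1, np.nan, 2, 2]   # John's row from the table (mean 2.5)
print(mean_center(john))      # [2.5, -1.5, nan, -0.5, -0.5]
```

After mean centering, John's '5' and Diane's '5' are compared as deviations from each user's own average rather than as raw values.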

Other problems:

  • Limited coverage - users can be neighbours only if they have rated items in common
  • Sparsity makes this worse as the number of items increases!

Matrix Factorization

  • Attempt to solve the sparsity/coverage problems by projecting user/item vectors into a dense latent space, either by
    • Decomposing the rating matrix
    • Decomposing the similarity matrix

Given $R$, a $|\mathscr{U}| \times |\mathcal{I}|$ rating matrix of rank $n$

  • Approximate it by a rank-$k$ ($k<n$) factorization $\hat{R}=P Q^{T}$, as in truncated Singular Value Decomposition (SVD)
  • $P$ is a $|\mathscr{U}| \times k$ matrix of user factors
  • $Q$ is a $|\mathcal{I}| \times k$ matrix of item factors
    $$ \operatorname{err}(P, Q)=\left\|R-P Q^{T}\right\|_{F}^{2}=\sum_{u, i}\left(r_{u i}-p_{u} q_{i}^{T}\right)^{2} $$
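Plain SVD is undefined when entries are missing, so in practice the error above is minimised over observed entries only, e.g. by stochastic gradient descent. A minimal sketch (hyperparameters `k`, `steps`, `lr` are illustrative choices, not from the notes):

```python
import numpy as np

def factorize(R, k=2, steps=2000, lr=0.01, seed=0):
    """Fit R ≈ P @ Q.T by gradient steps on observed (non-NaN) entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))   # user factors
    Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors
    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if not np.isnan(R[u, i])]
    for _ in range(steps):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]   # r_ui - p_u q_i^T
            p_u = P[u].copy()
            P[u] += lr * err * Q[i]       # gradient step on user factors
            Q[i] += lr * err * p_u        # gradient step on item factors
    return P, Q

# The rating matrix from the table above; NaN marks missing entries
R = np.array([[5, 1, np.nan, 2, 2],
              [1, 5, 2, 5, 5],
              [2, np.nan, 3, 5, 4],
              [4, 3, 5, 3, np.nan]])
P, Q = factorize(R)
print(P[2] @ Q[1])   # predicted rating for Eric on Titanic
```

The trained factors fill in the missing cells: every user-item pair gets a prediction $p_u q_i^T$, including pairs with no overlap of rated items, which is how factorization addresses the coverage problem above.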

Advantages and Disadvantages

Pros:

  • No feature engineering required—works directly with user-item interactions
  • Content-independent—doesn't need item descriptions or metadata
  • Discovers unexpected connections—can recommend items that seem unrelated but appeal to similar users
  • Addresses filter bubbles by introducing diversity beyond content similarity

Cons:

  • Sparsity problem—struggles when user-item interaction data is sparse
  • Cold start problem—cannot recommend to new users or new items without interaction history
  • Popularity bias—tends to favor popular items over niche ones
  • Limited help for users with unique preferences—performs poorly for outlier taste profiles