Learning to Rank
"... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance." - Liu [2009]
Representation:
Represent the document and query in a format that a ML model can use:
a numerical vector $\vec{x} \in \mathbb{R}^{n}$
Prediction:
- Then a ranking model $f: \vec{x} \rightarrow \mathbb{R}$ is optimized to score each document-query combination so that relevant documents are scored higher.
- In mathematical terms: $f$ maps a feature vector to a real-valued score.
Models can be trained on different data:
Offline or Supervised LTR: learn from annotated data.
- Expensive and time-consuming.
- Provides ground-truth.
Online/Counterfactual LTR: learn from user interactions.
- Virtually free and easy to obtain.
- Hard to interpret.
Thus we have:
- Feature representation of document-query pairs: $\vec{x}_{q, d} \in \mathbb{R}^{n}$.
- Labels indicating the relevance of document-query pairs: $y_{q, d} \in\{0, 1, 2, 3, 4\}$.
And we want:
- A function $f: \vec{x} \rightarrow \mathbb{R}$ that scores documents.
- To get the best ranking by sorting according to $f(\vec{x})$.
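As a sketch of this setup, assuming a toy linear scoring function and hypothetical feature vectors (all names and values are illustrative):

```python
import numpy as np

# Hypothetical feature vectors for 4 documents of one query (n = 3 features each).
X = np.array([
    [0.2, 0.7, 0.1],
    [0.9, 0.1, 0.3],
    [0.5, 0.5, 0.5],
    [0.1, 0.2, 0.8],
])

# A toy linear ranking model f(x) = w . x (weights chosen arbitrarily).
w = np.array([1.0, 2.0, -0.5])

scores = X @ w                 # f(x) for every document
ranking = np.argsort(-scores)  # document indices sorted by descending score

print(ranking)  # → [0 2 1 3]
```

Whatever the model class, ranking always works this way: score every document, then sort.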
How do we find $f$?

Pointwise Approach
- Predict the relevance per item; simple but very naive.
- Ignores that the ordering of items is what matters.

Pairwise Approach
- Loss based on document pairs; minimizes the number of incorrect inversions.
- Ignores that not all document pairs have the same impact.
- Often used in industry.

Listwise Approach
- Tries to optimize IR metrics directly, but they are not differentiable.
- Approximations via heuristics, bounding, or probabilistic approaches to ranking.
- The best approach of the three.
The Pointwise Approach
Regression loss
Given $\langle q, d\rangle$ predict the value of $y_{q, d}$
e.g., square loss for binary or categorical labels:
$$ \mathcal{L}_{\text{squared}} = \left\| y_{q, d} - f(\vec{x}_{q, d}) \right\|^{2}, $$
where $y_{q, d}$ is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label.
Classification loss
Given $\langle q, d\rangle$ predict the class $y_{q, d}$
E.g., Cross-Entropy with Softmax over categorical labels $Y$ [Li et al., 2008]:
$$ \mathcal{L}_{\text{CE}} = -\log \frac{e^{s_{y_{q, d}}}}{\sum_{y \in Y} e^{s_{y}}}, $$
where $s_{y_{q, d}}$ is the model's score for label $y_{q, d}$.
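Both pointwise losses can be sketched in a few lines (toy labels and scores; the 5-grade label set and all variable names are assumptions):

```python
import numpy as np

# Toy example: one query-document pair with graded label y in {0, 1, 2, 3, 4}.
y = 3

# Regression view: the model predicts one real value, penalized by squared loss.
f_x = 2.4                      # hypothetical model output
squared_loss = (y - f_x) ** 2  # (3 - 2.4)^2 = 0.36

# Classification view: the model scores every label, cross-entropy with softmax.
s = np.array([0.1, 0.3, 0.8, 1.5, 0.2])  # hypothetical per-label scores
log_softmax = s - np.log(np.sum(np.exp(s)))
ce_loss = -log_softmax[y]                # -log P(label = 3)

print(squared_loss, ce_loss)
```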
Issues with Pointwise LTR
- Class Imbalance:
- many irrelevant documents and very few relevant documents.
- Query-level feature normalization is needed:
- the distribution of features differs greatly per query.
These can be overcome. But what is fundamentally wrong with these methods?
Ranking is not a regression or classification problem.
- A document-level loss does not work for ranking problems because document scores should not be considered independently.
- In other words, pointwise methods do not directly optimize ranking quality. A lower loss does not mean better ranking!
The Pairwise Approach
Instead of looking at the document level, consider pairs of documents.
Naive Pairwise Model
A naive model takes a pair of documents as input and predicts which one is preferred. This method would be quadratic in complexity, $O\left(N^{2}\right)$, during inference.
Pair-preferences also have to be aggregated somehow, which can lead to paradoxical situations, e.g., cyclic preferences: $d_{1} \succ d_{2}$, $d_{2} \succ d_{3}$, but $d_{3} \succ d_{1}$.
Pairwise Loss Functions
The scoring model remains unchanged: documents are still scored independently, $f\left(\vec{x}_{d_{i}}\right)=s_{i}$.
But the loss function is based on document pairs $\left(d_{i}, d_{j}\right)$ where $d_{i} \succ_{q} d_{j}$:
Thus we still score documents and then order according to scores.
Pairwise loss minimizes the average number of inversions in ranking:
- $d_{i} \succ_{q} d_{j}$ but $d_{j}$ is ranked higher than $d_{i}$
Pairwise loss generally has the following form [Chen et al., 2009],
$$ \mathcal{L}_{\text {pairwise }}=\sum_{d_{i} \succ d_{j}} \phi\left(s_{i}-s_{j}\right) $$
where $\phi$ can be:
- Hinge function [Herbrich et al., 2000]: $\phi(z)=\max (0,1-z)$
- Exponential function [Freund et al., 2003]: $\quad \phi(z)=e^{-z}$
- Logistic function [Burges et al., 2005]: $\phi(z)=\log \left(1+e^{-z}\right)$
- etc.
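The three choices of $\phi$ above can be compared directly on toy data (document scores and preference pairs are made up):

```python
import numpy as np

# The three common choices of phi, applied to a score margin z = s_i - s_j.
hinge    = lambda z: np.maximum(0.0, 1.0 - z)  # Herbrich et al., 2000
expo     = lambda z: np.exp(-z)                # Freund et al., 2003
logistic = lambda z: np.log1p(np.exp(-z))      # Burges et al., 2005 (RankNet)

# Toy preference pairs (i preferred over j) with hypothetical document scores.
s = np.array([2.0, 1.0, 0.5])     # scores for documents d0, d1, d2
pairs = [(0, 1), (0, 2), (1, 2)]  # d0 > d1, d0 > d2, d1 > d2

for phi, name in [(hinge, "hinge"), (expo, "exp"), (logistic, "logistic")]:
    loss = sum(phi(s[i] - s[j]) for i, j in pairs)
    print(name, loss)
```

All three are decreasing in the margin $s_i - s_j$: correctly ordered pairs with a large margin contribute little loss, inverted pairs contribute a lot.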
State of the art Pairwise LTR model - RankNet
Issues with Pairwise LTR
What is the fundamental problem with Pairwise LTR?
- The scoring model scores documents independently: $f\left(\vec{x}_{d_{i}}\right)=s_{i}$. The loss is based on document pairs, and minimizes the number of incorrect inversions:
$$ \mathcal{L}_{\text {pairwise }}=\sum_{d_{i} \succ d_{j}} \phi\left(s_{i}-s_{j}\right) $$
- However, not every document pair is equally important.
- It is much more important to get the correct ordering of the top documents than of the bottom documents.
- For instance, the order of the top-5 is much more important than the order of documents after position 10.
The Listwise Approach
The fundamental problem with the approaches so far is that they did not optimize ranking quality directly.
A LTR method should directly optimize the ranking metric we care about.
Ranking metrics can range from simple, e.g., precision at $k$, to much more complex, e.g., Discounted Cumulative Gain:
$$ DCG@k = \sum_{i=1}^{k} \frac{2^{rel_{i}} - 1}{\log_{2}(i + 1)} $$
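As an illustration, DCG with the common $2^{rel}-1$ gain and $\log_2(i+1)$ discount can be computed as follows (the ranked label list is a toy example):

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain over the top-k of a ranked relevance list."""
    rel = np.asarray(relevances[:k], dtype=float)
    positions = np.arange(1, rel.size + 1)
    return np.sum((2.0 ** rel - 1.0) / np.log2(positions + 1))

# Hypothetical graded relevance labels in ranked order (grades 0-4).
ranked_labels = [3, 2, 3, 0, 1]
print(dcg_at_k(ranked_labels, 5))
```

Swapping two documents changes the sum in a discrete jump: the metric depends on scores only through the sorted order, which is why no useful gradient exists.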
How do we optimize for these metrics?
- These metrics are non-continuous and non-differentiable, so we cannot compute a gradient.
State of the art listwise LTR models:
- LambdaRank
- Still considered a state-of-the-art ranking method.
- ListNet and ListMLE
- Create a probabilistic model for ranking, which is differentiable.
Some remarks - the distinction between pairwise and listwise is not that helpful:
- The derivatives of listwise losses always reduce to weighted pairwise derivatives (LambdaRank is explicit, ListNet does this implicitly).
- LambdaRank with the Average Relevant Rank metric is equivalent to the pairwise RankNet loss.
- In the field there are often categorization mistakes, e.g. calling LambdaRank a pairwise method.
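To make the first remark concrete: a LambdaRank-style update is the RankNet pairwise gradient scaled by the metric change of swapping the pair. A minimal sketch with two documents (using $|\Delta DCG|$ as the weight; all names and values are illustrative):

```python
import numpy as np

def delta_dcg(labels, ranks, i, j):
    """|Change in DCG| if the documents at positions i and j swapped places."""
    gain = lambda y: 2.0 ** y - 1.0
    disc = lambda r: 1.0 / np.log2(r + 1)
    return abs((gain(labels[i]) - gain(labels[j])) * (disc(ranks[i]) - disc(ranks[j])))

# Two documents: current model scores, graded labels, and 1-based ranks by score.
s = np.array([2.0, 1.0])
labels = np.array([1, 3])  # the lower-scored document is actually more relevant
ranks = np.array([1, 2])

# RankNet pairwise gradient for the preferred document (index 1 here) ...
ranknet_grad = -1.0 / (1.0 + np.exp(s[1] - s[0]))
# ... weighted by the DCG change of swapping the pair, as LambdaRank does.
lam = ranknet_grad * delta_dcg(labels, ranks, 0, 1)
print(lam)
```

The plain RankNet gradient would treat this inversion like any other; the $|\Delta DCG|$ factor makes inversions near the top of the ranking matter more.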