Learning to Rank

"... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance." - Liu [2009]

Representation:
Represent the document and query in a format that an ML model can use:
a numerical vector $\vec{x} \in \mathbb{R}^{n}$.

Prediction:

  • A ranking model $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ is then optimized to score each document-query
    combination so that relevant documents receive higher scores.
  • In mathematical terms: $f$ maps a feature vector to a real-valued score.
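
For intuition, a minimal sketch of this setup (plain numpy and a linear $f$ are illustrative assumptions, not part of the original material):

```python
import numpy as np

# Hypothetical feature vectors for four candidate documents of one query.
X = np.array([[0.2, 1.3, 0.0],
              [0.9, 0.1, 0.4],
              [0.5, 0.5, 0.5],
              [0.0, 0.2, 1.1]])

w = np.array([1.0, 0.5, -0.2])  # an (illustrative) learned weight vector
scores = X @ w                  # f(x) = w . x, one real-valued score per document
ranking = np.argsort(-scores)   # rank documents by descending score
print(ranking)
```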

Models can be trained on different data:

Offline or Supervised LTR: learn from annotated data.

  • Expensive and time-consuming.
  • Provides ground-truth.

Online/Counterfactual LTR: learn from user interactions.

  • Virtually free and easy to obtain.
  • Hard to interpret.

Thus we have:

  • Feature representations of document-query pairs: $\vec{x}_{q, d} \in \mathbb{R}^{n}$.
  • Labels indicating the relevance of document-query pairs: $y_{q, d} \in\{0, 1, \ldots, 4\}$.

And we want:

  • A function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ that scores documents.

  • To get the best ranking by sorting documents according to $f(\vec{x})$.

How do we find $f$?

  • Pointwise Approach

    • Predict the relevance of each item independently; simple but naive.
    • Ignores that ordering of items is what matters.
  • Pairwise Approach

    • Loss based on document pairs, minimize the number of incorrect inversions.
    • Ignores that not all document pairs have the same impact.
    • Often used in industry.
  • Listwise Approach

    • Tries to optimize for IR metrics, but they are not differentiable.
    • Approximated via heuristics, bounds, or probabilistic approaches to ranking.
    • Generally the best-performing of the three.

The Pointwise Approach

Regression loss

Given $\langle q, d\rangle$, predict the value of $y_{q, d}$,
e.g., with a squared loss over binary or categorical labels:

$$ \mathcal{L}_{\text {Squared }}\left(q, d, y_{q, d}\right)=\left\|y_{q, d}-f\left(\vec{x}_{q, d}\right)\right\|^{2} $$

where $y_{q, d}$ is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label.
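
As a quick sketch (numpy and the linear scorer are illustrative assumptions), the squared loss for a single $\langle q, d\rangle$ pair:

```python
import numpy as np

def squared_loss(f, x, y):
    """Pointwise regression loss ||y - f(x)||^2 for one <q, d> pair."""
    return float(np.sum((y - f(x)) ** 2))

f = lambda x: x @ np.array([0.8, 0.3])             # illustrative linear scorer
print(squared_loss(f, np.array([1.0, 2.0]), 2.0))  # (2.0 - 1.4)^2 = 0.36
```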

Classification loss

Given $\langle q, d\rangle$, predict the class $y_{q, d}$,
e.g., with cross-entropy over a softmax of the categorical labels $Y$ [Li et al., 2008]:

$$ \mathcal{L}_{\mathrm{CE}}\left(q, d, y_{q, d}\right)=-\log \left(p\left(y_{q, d} \mid q, d\right)\right)=-\log \left(\frac{e^{\gamma \cdot s_{y_{q, d}}}}{\sum_{y \in Y} e^{\gamma \cdot s_{y}}}\right) $$

where $s_{y_{q, d}}$ is the model's score for label $y_{q, d}$.
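
A matching sketch of the cross-entropy loss (the label scores and $\gamma$ are made-up numbers):

```python
import numpy as np

def cross_entropy_loss(label_scores, y, gamma=1.0):
    """-log softmax(gamma * s)[y], with y indexing the categorical labels Y."""
    z = gamma * np.asarray(label_scores, dtype=float)
    z = z - z.max()                        # stabilize the softmax
    log_p = z - np.log(np.sum(np.exp(z)))  # log softmax
    return -log_p[y]

# Model scores for labels 0..4; the true label of this <q, d> pair is 3.
print(cross_entropy_loss([0.1, 0.4, 1.2, 2.0, 0.3], y=3))
```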

Issues with Pointwise LTR

  • Class Imbalance:
    • many irrelevant documents and very few relevant documents.
  • Query-level feature normalization is needed:
    • the distribution of features differs greatly per query.

These can be overcome. But what is fundamentally wrong with these methods?

Ranking is not a regression or classification problem.

  • A document-level loss does not work for ranking problems because document scores should not be considered independently.
  • In other words, pointwise methods do not directly optimize ranking quality. A lower loss does not mean better ranking!
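
A tiny worked example (the numbers are invented) shows the mismatch: model B attains a lower squared loss than model A, yet A ranks the two documents correctly and B does not.

```python
import numpy as np

y = np.array([3.0, 1.0])          # true relevance labels of two documents

scores_a = np.array([3.0, 2.9])   # correct ordering, poorly calibrated values
scores_b = np.array([2.0, 2.1])   # inverted ordering, values closer to labels

mse = lambda s: float(np.mean((y - s) ** 2))
print(mse(scores_a))  # 1.805 -> higher pointwise loss, correct ranking
print(mse(scores_b))  # 1.105 -> lower pointwise loss, wrong ranking
```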

The Pairwise Approach

Instead of looking at the document level, consider pairs of documents.

$$ y_{q, d_{i}}>y_{q, d_{j}} \rightarrow d_{i} \succ d_{j} $$

Naive Pairwise Model

$$ P\left(d_{i} \succ d_{j}\right)=f\left(\vec{x}_{i}, \vec{x}_{j}\right) $$

This method would be quadratic in complexity, $O(N^{2})$, during inference.
Pairwise preferences also have to be aggregated somehow, which can lead to paradoxical situations:

$$ \begin{array}{l} d_{1} \succ d_{2} \\ d_{2} \succ d_{3} \\ d_{3} \succ d_{1} \end{array} $$
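
One way to see the paradox (a sketch using the standard library; the document names are illustrative): model the preferences as a directed graph and try to find a consistent order, which fails exactly when the preferences form a cycle.

```python
from graphlib import TopologicalSorter, CycleError

# Each key lists the documents that must be ranked above it:
# d1 > d2, d2 > d3, d3 > d1.
preferences = {"d2": {"d1"}, "d3": {"d2"}, "d1": {"d3"}}

try:
    print("Consistent ranking:", list(TopologicalSorter(preferences).static_order()))
except CycleError:
    print("No consistent ranking exists: the preferences are cyclic.")
```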

Pairwise Loss Functions

The scoring model remains unchanged:

$$ f\left(\vec{x}_{i}\right)=s_{i} $$

But the loss function is based on document pairs:

$$ \mathcal{L}_{\text {pairwise }}=\sum_{d_{i} \succ d_{j}} \phi\left(s_{i}-s_{j}\right) $$

Thus we still score documents individually and then order them according to their scores.

Pairwise loss minimizes the average number of inversions in a ranking:

  • an inversion is a pair with $d_{i} \succ_{q} d_{j}$ where $d_{j}$ is ranked higher than $d_{i}$.

Pairwise loss generally has the following form [Chen et al., 2009]:

$$ \mathcal{L}_{\text {pairwise }}=\phi\left(s_{i}-s_{j}\right) $$

where $\phi$ can be:

  • the hinge function [Herbrich et al., 2000]: $\phi(z)=\max (0,1-z)$
  • the exponential function [Freund et al., 2003]: $\phi(z)=e^{-z}$
  • the logistic function [Burges et al., 2005]: $\phi(z)=\log \left(1+e^{-z}\right)$, used by RankNet (sketched below)
  • etc.

State-of-the-art pairwise LTR model: RankNet
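
A minimal sketch of the RankNet idea (the linear scorer, toy data, and plain gradient descent are simplifying assumptions, not the original neural architecture): score documents independently and minimize the logistic pairwise loss above over all preference pairs.

```python
import numpy as np

def ranknet_step(w, X, pairs, lr=0.1):
    """One gradient step on sum log(1 + exp(-(s_i - s_j))) over pairs (i, j)
    where d_i is preferred over d_j; f is assumed linear, s = X @ w."""
    s = X @ w
    grad = np.zeros_like(w)
    for i, j in pairs:
        # d/dz log(1 + e^{-z}) = -1 / (1 + e^{z}), with z = s_i - s_j
        grad += (-1.0 / (1.0 + np.exp(s[i] - s[j]))) * (X[i] - X[j])
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # toy features; preferences: d0 > d1 > d2 > d3
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
w = np.zeros(3)
for _ in range(200):
    w = ranknet_step(w, X, pairs)
print("learned ranking:", np.argsort(-(X @ w)))  # ideally [0 1 2 3]
```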

Issues with Pairwise LTR

What is the fundamental problem with Pairwise LTR?

  • The scoring model scores documents independently: $f\left(\vec{x}_{d_{i}}\right)=s_{i}$. The loss is based on document pairs and minimizes the number of incorrect inversions:

    $$ \mathcal{L}_{\text {pairwise }}=\sum_{d_{i} \succ d_{j}} \phi\left(s_{i}-s_{j}\right) $$
  • However, not every document pair is equally important.

  • It is much more important to get the correct ordering of top documents than of the bottom documents.

  • For instance, the order of the top 5 is much more important than the order of
    documents after position 10.

The Listwise Approach

The fundamental problem with the approaches so far is that they do not optimize ranking quality directly.

An LTR method should directly optimize the ranking metric we care about.

Ranking metrics can range from simple:

$$ \operatorname{precision}(R)=\frac{1}{|R|} \sum_{i} \operatorname{relevance}\left(R_{i}\right) $$

to much more complex, e.g., Discounted Cumulative Gain (DCG):

$$ \mathrm{DCG}(R)=\sum_{i} \frac{2^{\text {relevance }\left(R_{i}\right)}-1}{\log _{2}(i+1)} $$
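
A sketch of both metrics in code (the function names are ours; the DCG variant follows the formula above):

```python
import numpy as np

def precision(rels):
    """Fraction of retrieved documents that are relevant (binary labels)."""
    return float(np.mean(np.asarray(rels, dtype=float)))

def dcg(rels):
    """Sum of (2^relevance - 1) / log2(i + 1) over ranks i = 1..|R|."""
    rels = np.asarray(rels, dtype=float)
    ranks = np.arange(1, len(rels) + 1)
    return float(np.sum((2.0 ** rels - 1.0) / np.log2(ranks + 1)))

print(precision([1, 0, 1, 0]))  # 0.5
print(dcg([3, 2, 0, 1]))        # ~9.32
```

Swapping any two adjacent labels in the DCG example changes the value in a discrete jump, which is exactly why these metrics have no useful gradient.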

How do we optimize for these metrics?

  • These metrics are discontinuous and non-differentiable: we cannot take a gradient.

State-of-the-art listwise LTR models:

  • LambdaRank
    • Still used as a state-of-the-art ranking method.
  • ListNet and ListMLE
    • Create a probabilistic model for ranking, which is differentiable (see the sketch below).
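
A hedged sketch of the ListNet top-one idea (plain numpy; the numbers are illustrative): turn both the labels and the model scores into softmax distributions over documents and take the cross-entropy between them, which is differentiable in the scores.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()              # stabilize
    e = np.exp(z)
    return e / e.sum()

def listnet_loss(labels, scores):
    """Cross-entropy between top-one distributions of labels and scores."""
    return float(-np.sum(softmax(labels) * np.log(softmax(scores))))

print(listnet_loss([3, 1, 0], [2.5, 1.0, -0.5]))  # smaller when orderings agree
```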

Some remarks: the distinction between pairwise and listwise is not as helpful as it seems:

  • The derivatives of listwise losses always reduce to weighted pairwise derivatives (LambdaRank makes this explicit, ListNet does it implicitly); see the sketch after this list.
  • LambdaRank with the average relevant rank metric is equivalent to the pairwise RankNet loss.
  • In the field there are often categorization mistakes, e.g., calling LambdaRank a pairwise method.
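
To make the first remark concrete, a simplified sketch of LambdaRank-style gradients (the $|\Delta \mathrm{DCG}|$ swap weighting follows the usual formulation; everything else, including the toy inputs and sign convention, is illustrative): each document's "lambda" is a sum of RankNet's pairwise derivatives, scaled by how much the metric would change if the pair swapped positions.

```python
import numpy as np

def lambdas(scores, rels):
    """Per-document gradients: RankNet pairwise terms weighted by |delta DCG|."""
    scores, rels = np.asarray(scores, float), np.asarray(rels, float)
    order = np.argsort(-scores)            # current ranking by score
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(len(scores))  # rank position of each document
    lam = np.zeros_like(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rels[i] > rels[j]:          # d_i should be ranked above d_j
                gain = (2.0 ** rels[i] - 1.0) - (2.0 ** rels[j] - 1.0)
                disc = 1.0 / np.log2(ranks[i] + 2) - 1.0 / np.log2(ranks[j] + 2)
                delta = abs(gain * disc)   # |DCG change| if i and j swap
                rho = -1.0 / (1.0 + np.exp(scores[i] - scores[j]))  # RankNet term
                lam[i] += rho * delta      # descending on this pushes s_i up
                lam[j] -= rho * delta      # and s_j down
    return lam

print(lambdas([0.2, 1.0, -0.3], [2, 0, 1]))
```

With the metric weight set to a constant, the lambdas reduce to the plain RankNet gradient, which is the reduction the first remark describes.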