Learning to Rank

"... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance." - Liu [2009]

Representation:
Represent the document and query in a format that an ML model can use:
a numerical vector $\vec{x} \in \mathbb{R}^{n}$.

Prediction:

  • A ranking model $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ is then optimized to score each document-query
    combination so that relevant documents receive higher scores.
  • In mathematical terms: $f$ maps a feature vector to a real-valued score.
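
For intuition, a minimal sketch of this setup (plain numpy and a linear $f$ are illustrative assumptions, not part of the original material):

```python
import numpy as np

# Hypothetical feature vectors for four candidate documents of one query.
X = np.array([[0.2, 1.3, 0.0],
              [0.9, 0.1, 0.4],
              [0.5, 0.5, 0.5],
              [0.0, 0.2, 1.1]])

w = np.array([1.0, 0.5, -0.2])  # an (illustrative) learned weight vector
scores = X @ w                  # f(x) = w . x, one real-valued score per document
ranking = np.argsort(-scores)   # rank documents by descending score
print(ranking)
```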

Models can be trained on different data:

Offline or Supervised LTR: learn from annotated data.

  • Expensive and time-consuming.
  • Provides ground-truth.

Online/Counterfactual LTR: learn from user interactions.

  • Virtually free and easy to obtain.
  • Hard to interpret.

Thus we have:

  • Feature representations of document-query pairs: $\vec{x}_{q, d} \in \mathbb{R}^{n}$.
  • Labels indicating the relevance of document-query pairs: $y_{q, d} \in\{0, 1, \ldots, 4\}$.

And we want:

  • A function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ that scores documents.

  • To get the best ranking by sorting documents according to $f(\vec{x})$.

How do we find $f$?

  • Pointwise Approach

    • Predict the relevance of each item independently; simple but naive.
    • Ignores that ordering of items is what matters.
  • Pairwise Approach

    • Loss based on document pairs, minimize the number of incorrect inversions.
    • Ignores that not all document pairs have the same impact.
    • Often used in industry.
  • Listwise Approach

    • Tries to optimize for IR metrics, but they are not differentiable.
    • Approximated via heuristics, bounds, or probabilistic approaches to ranking.
    • Generally the best-performing of the three.

The Pointwise Approach

Regression loss

Given $\langle q, d\rangle$, predict the value of $y_{q, d}$,
e.g., with a squared loss over binary or categorical labels:

$$ \mathcal{L}_{\text {Squared }}\left(q, d, y_{q, d}\right)=\left\|y_{q, d}-f\left(\vec{x}_{q, d}\right)\right\|^{2} $$

where $y_{q, d}$ is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label.
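
As a quick sketch (numpy and the linear scorer are illustrative assumptions), the squared loss for a single $\langle q, d\rangle$ pair:

```python
import numpy as np

def squared_loss(f, x, y):
    """Pointwise regression loss ||y - f(x)||^2 for one <q, d> pair."""
    return float(np.sum((y - f(x)) ** 2))

f = lambda x: x @ np.array([0.8, 0.3])             # illustrative linear scorer
print(squared_loss(f, np.array([1.0, 2.0]), 2.0))  # (2.0 - 1.4)^2 = 0.36
```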

Classification loss

Given $\langle q, d\rangle$, predict the class $y_{q, d}$,
e.g., with cross-entropy over a softmax of the categorical labels $Y$ [Li et al., 2008]:

$$ \mathcal{L}_{\mathrm{CE}}\left(q, d, y_{q, d}\right)=-\log \left(p\left(y_{q, d} \mid q, d\right)\right)=-\log \left(\frac{e^{\gamma \cdot s_{y_{q, d}}}}{\sum_{y \in Y} e^{\gamma \cdot s_{y}}}\right) $$

where $s_{y_{q, d}}$ is the model's score for label $y_{q, d}$.
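
A matching sketch of the cross-entropy loss (the label scores and $\gamma$ are made-up numbers):

```python
import numpy as np

def cross_entropy_loss(label_scores, y, gamma=1.0):
    """-log softmax(gamma * s)[y], with y indexing the categorical labels Y."""
    z = gamma * np.asarray(label_scores, dtype=float)
    z = z - z.max()                        # stabilize the softmax
    log_p = z - np.log(np.sum(np.exp(z)))  # log softmax
    return -log_p[y]

# Model scores for labels 0..4; the true label of this <q, d> pair is 3.
print(cross_entropy_loss([0.1, 0.4, 1.2, 2.0, 0.3], y=3))
```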

Issues with Pointwise LTR

  • Class Imbalance:
    • many irrelevant documents and very few relevant documents.
  • Query-level feature normalization is needed:
    • the distribution of features differs greatly per query.

These can be overcome. But what is fundamentally wrong with these methods?

Ranking is not a regression or classification problem.

  • A document-level loss does not work for ranking problems because document scores should not be considered independently.
  • In other words, pointwise methods do not directly optimize ranking quality. A lower loss does not mean better ranking!
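
A tiny worked example (the numbers are invented) shows the mismatch: model B attains a lower squared loss than model A, yet A ranks the two documents correctly and B does not.

```python
import numpy as np

y = np.array([3.0, 1.0])          # true relevance labels of two documents

scores_a = np.array([3.0, 2.9])   # correct ordering, poorly calibrated values
scores_b = np.array([2.0, 2.1])   # inverted ordering, values closer to labels

mse = lambda s: float(np.mean((y - s) ** 2))
print(mse(scores_a))  # 1.805 -> higher pointwise loss, correct ranking
print(mse(scores_b))  # 1.105 -> lower pointwise loss, wrong ranking
```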

The Pairwise Approach

Instead of looking at the document level, consider pairs of documents.

$$ y_{q, d_{i}}>y_{q, d_{j}} \rightarrow d_{i} \succ d_{j} $$

Naive Pairwise Model

$$ P\left(d_{i} \succ d_{j}\right)=f\left(\vec{x}_{i}, \vec{x}_{j}\right) $$

This method would be quadratic in complexity, $O(N^{2})$, during inference.
Pairwise preferences also have to be aggregated somehow, which can lead to paradoxical situations:

$$ \begin{array}{l} d_{1} \succ d_{2} \\ d_{2} \succ d_{3} \\ d_{3} \succ d_{1} \end{array} $$
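
One way to see the paradox (a sketch using the standard library; the document names are illustrative): model the preferences as a directed graph and try to find a consistent order, which fails exactly when the preferences form a cycle.

```python
from graphlib import TopologicalSorter, CycleError

# Each key lists the documents that must be ranked above it:
# d1 > d2, d2 > d3, d3 > d1.
preferences = {"d2": {"d1"}, "d3": {"d2"}, "d1": {"d3"}}

try:
    print("Consistent ranking:", list(TopologicalSorter(preferences).static_order()))
except CycleError:
    print("No consistent ranking exists: the preferences are cyclic.")
```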

Pairwise Loss Functions

The scoring model remains unchanged:

$$ f\left(\vec{x}_{i}\right)=s_{i} $$

But the loss function is based on document pairs:

$$ \mathcal{L}_{\text {pairwise }}=\sum_{d_{i} \succ d_{j}} \phi\left(s_{i}-s_{j}\right) $$

Thus we still score documents individually and then order them according to their scores.

Pairwise loss minimizes the average number of inversions in a ranking:

  • an inversion is a pair with $d_{i} \succ_{q} d_{j}$ where $d_{j}$ is ranked higher than $d_{i}$.

Pairwise loss generally has the following form [Chen et al., 2009]:

$$ \mathcal{L}_{\text {pairwise }}=\phi\left(s_{i}-s_{j}\right) $$

where $\phi$ can be:

  • the hinge function [Herbrich et al., 2000]: $\phi(z)=\max (0,1-z)$
  • the exponential function [Freund et al., 2003]: $\phi(z)=e^{-z}$
  • the logistic function [Burges et al., 2005]: $\phi(z)=\log \left(1+e^{-z}\right)$, used by RankNet (sketched below)
  • etc.

State-of-the-art pairwise LTR model: RankNet
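
A minimal sketch of the RankNet idea (the linear scorer, toy data, and plain gradient descent are simplifying assumptions, not the original neural architecture): score documents independently and minimize the logistic pairwise loss above over all preference pairs.

```python
import numpy as np

def ranknet_step(w, X, pairs, lr=0.1):
    """One gradient step on sum log(1 + exp(-(s_i - s_j))) over pairs (i, j)
    where d_i is preferred over d_j; f is assumed linear, s = X @ w."""
    s = X @ w
    grad = np.zeros_like(w)
    for i, j in pairs:
        # d/dz log(1 + e^{-z}) = -1 / (1 + e^{z}), with z = s_i - s_j
        grad += (-1.0 / (1.0 + np.exp(s[i] - s[j]))) * (X[i] - X[j])
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # toy features; preferences: d0 > d1 > d2 > d3
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
w = np.zeros(3)
for _ in range(200):
    w = ranknet_step(w, X, pairs)
print("learned ranking:", np.argsort(-(X @ w)))  # ideally [0 1 2 3]
```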

Issues with Pairwise LTR

What is the fundamental problem with Pairwise LTR?

  • The scoring model scores documents independently: $f\left(\vec{x}_{d_{i}}\right)=s_{i}$. The loss is based on document pairs and minimizes the number of incorrect inversions:

    $$ \mathcal{L}_{\text {pairwise }}=\sum_{d_{i} \succ d_{j}} \phi\left(s_{i}-s_{j}\right) $$
  • However, not every document pair is equally important.

  • It is much more important to get the correct ordering of top documents than of the bottom documents.

  • For instance, the order of the top 5 is much more important than the order of
    documents after position 10.

The Listwise Approach

The fundamental problem with the approaches so far is that they do not optimize ranking quality directly.

An LTR method should directly optimize the ranking metric we care about.

Ranking metrics can range from simple:

$$ \operatorname{precision}(R)=\frac{1}{|R|} \sum_{i} \operatorname{relevance}\left(R_{i}\right) $$

to much more complex, e.g., Discounted Cumulative Gain (DCG):

$$ \mathrm{DCG}(R)=\sum_{i} \frac{2^{\text {relevance }\left(R_{i}\right)}-1}{\log _{2}(i+1)} $$
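
A sketch of both metrics in code (the function names are ours; the DCG variant follows the formula above):

```python
import numpy as np

def precision(rels):
    """Fraction of retrieved documents that are relevant (binary labels)."""
    return float(np.mean(np.asarray(rels, dtype=float)))

def dcg(rels):
    """Sum of (2^relevance - 1) / log2(i + 1) over ranks i = 1..|R|."""
    rels = np.asarray(rels, dtype=float)
    ranks = np.arange(1, len(rels) + 1)
    return float(np.sum((2.0 ** rels - 1.0) / np.log2(ranks + 1)))

print(precision([1, 0, 1, 0]))  # 0.5
print(dcg([3, 2, 0, 1]))        # ~9.32
```

Swapping any two adjacent labels in the DCG example changes the value in a discrete jump, which is exactly why these metrics have no useful gradient.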

How do we optimize for these metrics?

  • These metrics are discontinuous and non-differentiable: we cannot take a gradient.

State-of-the-art listwise LTR models:

  • LambdaRank
    • Still used as a state-of-the-art ranking method.
  • ListNet and ListMLE
    • Create a probabilistic model for ranking, which is differentiable (see the sketch below).
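
A hedged sketch of the ListNet top-one idea (plain numpy; the numbers are illustrative): turn both the labels and the model scores into softmax distributions over documents and take the cross-entropy between them, which is differentiable in the scores.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()              # stabilize
    e = np.exp(z)
    return e / e.sum()

def listnet_loss(labels, scores):
    """Cross-entropy between top-one distributions of labels and scores."""
    return float(-np.sum(softmax(labels) * np.log(softmax(scores))))

print(listnet_loss([3, 1, 0], [2.5, 1.0, -0.5]))  # smaller when orderings agree
```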

Some remarks: the distinction between pairwise and listwise is not as helpful as it seems:

  • The derivatives of listwise losses always reduce to weighted pairwise derivatives (LambdaRank makes this explicit, ListNet does it implicitly); see the sketch after this list.
  • LambdaRank with the average relevant rank metric is equivalent to the pairwise RankNet loss.
  • In the field there are often categorization mistakes, e.g., calling LambdaRank a pairwise method.
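
To make the first remark concrete, a simplified sketch of LambdaRank-style gradients (the $|\Delta \mathrm{DCG}|$ swap weighting follows the usual formulation; everything else, including the toy inputs and sign convention, is illustrative): each document's "lambda" is a sum of RankNet's pairwise derivatives, scaled by how much the metric would change if the pair swapped positions.

```python
import numpy as np

def lambdas(scores, rels):
    """Per-document gradients: RankNet pairwise terms weighted by |delta DCG|."""
    scores, rels = np.asarray(scores, float), np.asarray(rels, float)
    order = np.argsort(-scores)            # current ranking by score
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(len(scores))  # rank position of each document
    lam = np.zeros_like(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rels[i] > rels[j]:          # d_i should be ranked above d_j
                gain = (2.0 ** rels[i] - 1.0) - (2.0 ** rels[j] - 1.0)
                disc = 1.0 / np.log2(ranks[i] + 2) - 1.0 / np.log2(ranks[j] + 2)
                delta = abs(gain * disc)   # |DCG change| if i and j swap
                rho = -1.0 / (1.0 + np.exp(scores[i] - scores[j]))  # RankNet term
                lam[i] += rho * delta      # descending on this pushes s_i up
                lam[j] -= rho * delta      # and s_j down
    return lam

print(lambdas([0.2, 1.0, -0.3], [2, 0, 1]))
```

With the metric weight set to a constant, the lambdas reduce to the plain RankNet gradient, which is the reduction the first remark describes.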