RankNet

RankNet [Burges et al., 2005] is a pairwise loss function, a popular choice for training neural LTR models and an industry favourite [Burges, 2015].

$$ \begin{array}{l} \text { Predicted probabilities: } P_{i j}=P\left(s_{i}>s_{j}\right) \equiv \frac{e^{\gamma \cdot s_{i}}}{e^{\gamma \cdot s_{i}}+e^{\gamma \cdot s_{j}}}=\frac{1}{1+e^{-\gamma\left(s_{i}-s_{j}\right)}} \\ \text { and } P_{j i} \equiv \frac{1}{1+e^{-\gamma\left(s_{j}-s_{i}\right)}} \end{array} $$

Desired probabilities: $\bar{P}_{i j}=1$ and $\bar{P}_{j i}=0$
Computing cross-entropy between $\bar{P}$ and $P$,

$$ \begin{aligned} \mathcal{L}_{\text {RankNet }} &=-\bar{P}_{i j} \log \left(P_{i j}\right)-\bar{P}_{j i} \log \left(P_{j i}\right) \\ &=-\log \left(P_{i j}\right) \\ &=\log \left(1+e^{-\gamma\left(s_{i}-s_{j}\right)}\right) \end{aligned} $$
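For a pair where $d_i$ is known to be preferred, the loss reduces to the softplus term above. A minimal NumPy sketch (the function name and the default $\gamma = 1$ are illustrative choices, not part of the original formulation):

```python
import numpy as np

def ranknet_loss(s_i, s_j, gamma=1.0):
    """RankNet loss for a pair where d_i is preferred over d_j."""
    # log(1 + exp(-gamma * (s_i - s_j))), via logaddexp for numerical stability
    return np.logaddexp(0.0, -gamma * (s_i - s_j))
```

When the scores tie, the loss is $\log 2$; it decays toward $0$ as $s_i$ grows past $s_j$ and grows linearly when the pair is mis-ordered.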

Let $S_{i j} \in\{-1,0,1\}$ indicate the preference between $d_{i}$ and $d_{j}$. Then the desired probability for a pair is:

$$ \bar{P}\left(d_{i} \succ d_{j}\right)=\frac{1}{2}\left(1+S_{i j}\right) $$

The predicted probability is:

$$ P\left(d_{i} \succ d_{j}\right)=\frac{1}{1+e^{-\gamma\left(s_{i}-s_{j}\right)}} $$

The cross-entropy loss is then:

$$ \mathcal{L}_{i j}=\frac{1}{2}\left(1-S_{i j}\right) \gamma\left(s_{i}-s_{j}\right)+\log \left(1+e^{-\gamma\left(s_{i}-s_{j}\right)}\right) $$

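The general pair loss above can be coded directly from the formula; `ranknet_pair_loss` is a hypothetical name, and the softplus term is again computed with `logaddexp` for stability:

```python
import numpy as np

def ranknet_pair_loss(s_i, s_j, S_ij, gamma=1.0):
    """General RankNet pair loss; S_ij in {-1, 0, 1} encodes the preference."""
    diff = gamma * (s_i - s_j)
    # 0.5*(1 - S_ij)*gamma*(s_i - s_j) + log(1 + exp(-gamma*(s_i - s_j)))
    return 0.5 * (1.0 - S_ij) * diff + np.logaddexp(0.0, -diff)
```

With $S_{ij} = 1$ this reduces to the earlier $\log(1 + e^{-\gamma(s_i - s_j)})$, and swapping the pair while flipping $S_{ij}$ leaves the loss unchanged.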

The derivative with respect to $s_{i}$ is:

$$ \frac{\delta \mathcal{L}_{i j}}{\delta s_{i}}=\gamma\left(\frac{1}{2}\left(1-S_{i j}\right)-\frac{1}{1+e^{\gamma\left(s_{i}-s_{j}\right)}}\right)=-\frac{\delta \mathcal{L}_{i j}}{\delta s_{j}} $$

Using the chain rule, the gradient with respect to the model weights $w$ factorizes as:

$$ \frac{\delta \mathcal{L}_{i j}}{\delta w}=\frac{\delta \mathcal{L}_{i j}}{\delta s_{i}} \frac{\delta s_{i}}{\delta w}+\frac{\delta \mathcal{L}_{i j}}{\delta s_{j}} \frac{\delta s_{j}}{\delta w}=\gamma\left(\frac{1}{2}\left(1-S_{i j}\right)-\frac{1}{1+e^{\gamma\left(s_{i}-s_{j}\right)}}\right)\left(\frac{\delta s_{i}}{\delta w}-\frac{\delta s_{j}}{\delta w}\right) $$

We define $\lambda_{i j}$ so that:

$$ \frac{\delta \mathcal{L}_{i j}}{\delta w}=\lambda_{i j}\left(\frac{\delta s_{i}}{\delta w}-\frac{\delta s_{j}}{\delta w}\right) $$

where:

$$ \lambda_{i j}=\gamma\left(\frac{1}{2}\left(1-S_{i j}\right)-\frac{1}{1+e^{\gamma\left(s_{i}-s_{j}\right)}}\right) $$
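A small sketch of the pairwise lambda, i.e. the gradient of the pair loss with respect to $s_i$ (`lambda_ij` is an illustrative name; $\gamma = 1$ is an assumed default):

```python
import numpy as np

def lambda_ij(s_i, s_j, S_ij, gamma=1.0):
    """Gradient of the RankNet pair loss w.r.t. s_i (the 'force' on d_i)."""
    # gamma * (0.5*(1 - S_ij) - 1 / (1 + exp(gamma*(s_i - s_j))))
    return gamma * (0.5 * (1.0 - S_ij)
                    - 1.0 / (1.0 + np.exp(gamma * (s_i - s_j))))
```

Sanity checks: for $S_{ij} = 1$ and tied scores, $\lambda_{ij} = -\gamma/2$ (a gradient-descent step then pushes $s_i$ up); once $s_i \gg s_j$ the force shrinks toward $0$.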

These lambdas act like forces pushing pairs of documents apart or together.
At the document level, the pairwise lambdas can be aggregated:

$$ \lambda_{i}=\sum_{j} \lambda_{i j} $$
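The aggregation above can be sketched as a loop over all ordered pairs; `per_document_lambdas` and the $S$ matrix layout are assumptions for illustration:

```python
import numpy as np

def per_document_lambdas(scores, S, gamma=1.0):
    """Accumulate lambda_i = sum_j lambda_ij over all ordered pairs.

    scores: array of shape (n,); S: (n, n) matrix with S[i, j] in {-1, 0, 1}.
    """
    n = len(scores)
    lam = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # lambda_ij as derived above
            lam[i] += gamma * (0.5 * (1.0 - S[i, j])
                               - 1.0 / (1.0 + np.exp(gamma * (scores[i] - scores[j]))))
    return lam
```

For two tied documents with $d_0 \succ d_1$, this yields equal and opposite forces on the pair, as the "push apart" intuition suggests.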

Issues with RankNet:

  • RankNet is based on virtual probabilities $P\left(d_{i} \succ d_{j}\right)$ that the model itself never outputs.
  • In reality, the ranking produced by sorting the scores does not follow these probabilities.
  • Not elegant, but not a big deal in practice.