BLEU

  • Bilingual Evaluation Understudy (BLEU), introduced by Papineni et al. (2002)

  • BLEU is a precision-oriented metric

    • How many of the n-grams occurring in the translation occur in any of the reference translations?
    • Considers n-grams of several lengths, typically 1 to 4-grams
  • N-gram precision: $p(i)=\frac{\text { correct }_{i}}{\text { total }_{i}}$

  • For each $n$-gram occurrence $\bar{w}_{i}$ of order $i$ in the translation:

    • if $\bar{w}_{i}$ occurs in any of the reference translations: $\text{correct}_{i}$++ and $\text{total}_{i}$++
    • else: only $\text{total}_{i}$++

  • Brevity penalty (BP): Compensates for the tendency of shorter translations to achieve higher precision

    $$ \operatorname{BP}\left(l_{t}, l_{r}\right)=\left\{\begin{array}{lll} 1 & \text { if } l_{t} \geq l_{r} & \left(l_{r}=\text { reference length }\right) \\ \exp \left(1-\frac{l_{r}}{l_{t}}\right) & \text { if } l_{t}<l_{r} & \left(l_{t}=\text { translation length }\right) \end{array}\right. $$
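The brevity penalty defined above can be sketched in a few lines of Python (a minimal illustration, not any particular toolkit's implementation):

```python
import math

def brevity_penalty(trans_len: int, ref_len: int) -> float:
    """BP(l_t, l_r): 1 if the translation is at least as long as the
    reference, otherwise an exponential penalty for being too short."""
    if trans_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / trans_len)

brevity_penalty(7, 7)   # equal lengths: no penalty, BP = 1.0
brevity_penalty(5, 10)  # translation half as long: exp(1 - 2) = exp(-1)
```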

Example

Translation pair data:

[Figure: BLEU-example (reference translations for the candidate below)]

Candidate translation: the cat sat on a mat.

N-gram Precisions:

  • $p(1)=\frac{6}{7}$ : the; cat; sat; on; a; mat; .
  • $p(2)=\frac{4}{6}$ : the cat; cat sat; sat on; on a; a mat; mat .
  • $p(3)=\frac{3}{5}$ : the cat sat; cat sat on; sat on a; on a mat; a mat .
  • $p(4)=\frac{1}{4}$ : the cat sat on; cat sat on a; sat on a mat; on a mat .

So BLEU score is calculated as:

$$ \begin{aligned} \operatorname{BLEU}\left(t, R_{f}\right) &=\mathrm{BP}(7,7) \cdot \prod_{i=1}^{4} p(i)^{\frac{1}{4}}=\mathrm{BP}(7,7) \cdot \left(\frac{6}{7}\right)^{\frac{1}{4}} \cdot \left(\frac{4}{6}\right)^{\frac{1}{4}} \cdot \left(\frac{3}{5}\right)^{\frac{1}{4}} \cdot \left(\frac{1}{4}\right)^{\frac{1}{4}} \\ &=\mathrm{BP}(7,7) \cdot 0.5411=1 \cdot 0.5411=0.5411 \end{aligned} $$
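The combination step can be checked numerically by plugging the example's precisions and lengths into the formula (the helper name `bleu_from_precisions` is ours, introduced only for this sketch):

```python
import math

def bleu_from_precisions(precisions, trans_len, ref_len):
    """Combine per-order n-gram precisions into a BLEU score:
    brevity penalty times the geometric mean of the precisions."""
    bp = 1.0 if trans_len >= ref_len else math.exp(1 - ref_len / trans_len)
    geo_mean = math.prod(p ** (1 / len(precisions)) for p in precisions)
    return bp * geo_mean

# Precisions from the worked example above; both lengths are 7.
score = bleu_from_precisions([6/7, 4/6, 3/5, 1/4], 7, 7)
print(round(score, 4))  # 0.5411
```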

Adjustments

  • N-gram precision: $p(i)=\frac{\text { correct }_{i}}{\text { total }_{i}}$

    • Candidate translation $t$ : the the the the the the the
    • $p(1)=\frac{\text { correct }_{1}}{\text { total }_{1}}=\frac{7}{7}=1$, since every occurrence of "the" is found in some reference, even though the translation is useless
  • Use clipped-counts: Let $\text{ref\_count}\left(\bar{w}_{i}\right)$ be the maximum number of times $\bar{w}_{i}$ occurs in any individual reference and $\text{trans\_count}\left(\bar{w}_{i}\right)$ the number of times $\bar{w}_{i}$ occurs in the translation.

  • For each n-gram $\bar{w}_{i}$ of order $i$ in the translation, $\text{correct\_clipped}_{i}+=\min \left(\operatorname{trans\_ count}\left(\bar{w}_{i}\right), \text{ref\_count}\left(\bar{w}_{i}\right)\right)$

  • $p(i)=\frac{\text { correct\_clipped }_{i}}{\text { total}_i}$
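A minimal sketch of clipped unigram precision for the degenerate candidate above. The reference sentence used here ("the cat is on the mat") is an assumed example, not taken from this document:

```python
from collections import Counter

def clipped_unigram_precision(candidate: str, references: list[str]) -> float:
    """Clipped precision: each candidate unigram is credited at most
    ref_count times, the maximum count in any single reference."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter()
    for ref in references:
        for word, c in Counter(ref.split()).items():
            ref_counts[word] = max(ref_counts[word], c)
    correct_clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    total = sum(cand_counts.values())
    return correct_clipped / total

# "the" occurs at most twice in the (assumed) reference, so clipping
# caps correct_1 at 2 out of 7 candidate tokens.
p1 = clipped_unigram_precision(
    "the the the the the the the",
    ["the cat is on the mat"],
)
print(p1)  # 2/7 instead of the unclipped 7/7
```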

The brevity penalty compares the length of the translation candidate with the length of the reference translation.
If there are multiple reference translations, there are multiple ways to define $l_{r}$:

  • Shortest reference
  • Average reference length
  • Closest reference: The length of the reference which is closest in length to the translation
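Selecting the closest reference length can be sketched as below. Note that the tie-breaking rule (preferring the shorter reference when two are equally close) is an assumption here; conventions differ between BLEU implementations:

```python
def closest_ref_length(trans_len: int, ref_lens: list[int]) -> int:
    """Return the reference length closest to the translation length,
    breaking ties toward the shorter reference (assumed convention)."""
    return min(ref_lens, key=lambda r: (abs(r - trans_len), r))

closest_ref_length(12, [10, 13, 20])  # -> 13 (distance 1 beats distance 2)
closest_ref_length(11, [10, 12])      # -> 10 (tie broken toward shorter)
```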

Granularity of BLEU

BLEU scores can be computed on the

  • sentence-level
  • document-level
  • corpus-level
    Sentence-level BLEU is rather unstable: translation candidates with only minor differences can receive very different sentence-level BLEU scores
  • Better apply BLEU on the document or corpus level
  • BLEU formulation remains unchanged, but all counts and lengths are computed over the entire document or corpus
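The corpus-level computation can be sketched as follows, assuming for simplicity a single reference per candidate: clipped counts and lengths are accumulated over all sentence pairs before the precisions and brevity penalty are computed.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU, one reference per candidate (sketch).
    Counts and lengths are summed over the whole corpus first."""
    correct = [0] * max_n
    total = [0] * max_n
    trans_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        c_tok, r_tok = cand.split(), ref.split()
        trans_len += len(c_tok)
        ref_len += len(r_tok)
        for n in range(1, max_n + 1):
            c_ngr, r_ngr = ngrams(c_tok, n), ngrams(r_tok, n)
            correct[n - 1] += sum(min(c, r_ngr[g]) for g, c in c_ngr.items())
            total[n - 1] += sum(c_ngr.values())
    bp = 1.0 if trans_len >= ref_len else math.exp(1 - ref_len / trans_len)
    precisions = [correct[i] / total[i] for i in range(max_n)]
    return bp * math.prod(p ** (1 / max_n) for p in precisions)
```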

Advantages and disadvantages of BLEU

  • BLEU generally correlates relatively well with human judgments
  • By far the most commonly used MT evaluation metric
  • Absolute BLEU scores in isolation are not very meaningful
  • Comparison of BLEU scores across language pairs not meaningful
  • BLEU is less useful for translating into morphologically rich languages
  • Good translations that are dissimilar to all reference translations are penalized
  • BLEU is not differentiable, so it cannot be used directly as a training objective!