BLEU
- Bilingual Evaluation Understudy (BLEU), introduced by Papineni et al. (2002)
- BLEU is a precision-oriented metric
- How many of the n-grams occurring in the translation occur in any of the reference translations?
- Considers n-grams of several lengths, typically 1 to 4-grams
- N-gram precision: $p(i)=\frac{\text{correct}_{i}}{\text{total}_{i}}$
- For each $n$-gram occurrence $\bar{w}_{i}$ of order $i$ in the translation:
  - $\text{correct}_{i}$++ and $\text{total}_{i}$++ if $\bar{w}_{i}$ occurs in any of the reference translations
  - else only $\text{total}_{i}$++
- Brevity penalty (BP): compensates for the tendency of shorter translations to achieve higher precision
$$ \operatorname{BP}\left(l_{t}, l_{r}\right)=\left\{\begin{array}{lll} 1 & \text { if } l_{t} \geq l_{r} & \left(l_{r}=\text { reference length }\right) \\ \exp \left(1-\frac{l_{r}}{l_{t}}\right) & \text { if } l_{t}<l_{r} & \left(l_{t}=\text { translation length }\right) \end{array}\right. $$
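The case distinction above translates directly into code. A minimal sketch of the brevity penalty as defined here (function and argument names are my own):

```python
import math

def brevity_penalty(trans_len: int, ref_len: int) -> float:
    """BP from the definition above: 1 if the translation is at least
    as long as the reference, otherwise exp(1 - l_r / l_t)."""
    if trans_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / trans_len)
```

For example, a 5-token translation against a 10-token reference is penalized with $\exp(1-2)=e^{-1}\approx 0.37$, while any translation at least as long as the reference is not penalized at all.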
Example
Translation pair data:

Candidate translation: the cat sat on a mat.
N-gram Precisions:
- $p(1)=\frac{6}{7}$ : the; cat; sat; on; a; mat; .
- $p(2)=\frac{4}{6}$ : the cat; cat sat; sat on; on a; a mat; mat .
- $p(3)=\frac{3}{5}$ : the cat sat; cat sat on; sat on a; on a mat; a mat .
- $p(4)=\frac{1}{4}$ : the cat sat on; cat sat on a; sat on a mat; on a mat .
So the BLEU score is the brevity penalty times the geometric mean of the n-gram precisions:
$$\mathrm{BLEU}=\mathrm{BP}\left(l_{t}, l_{r}\right) \cdot \exp \left(\sum_{n=1}^{N} \frac{1}{N} \log p(n)\right) \quad \text{(typically } N=4\text{)}$$
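Assuming the standard BLEU combination (brevity penalty times the geometric mean of the n-gram precisions), the precisions from the example combine as follows. Since the reference translation is not shown above, BP = 1.0 is an assumption here (i.e. the candidate is at least as long as the reference):

```python
import math

# N-gram precisions from the worked example above.
precisions = [6/7, 4/6, 3/5, 1/4]

bp = 1.0  # assumed: candidate not shorter than the reference
bleu = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
# bleu is the geometric mean of the four precisions, about 0.54
```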
Adjustments
- N-gram precision: $p(i)=\frac{\text{correct}_{i}}{\text{total}_{i}}$
- Candidate translation $t$ : the the the the the the the
- $p(1)=\frac{\text{correct}_{1}}{\text{total}_{1}}=\frac{7}{7}=1$
- Use clipped counts: Let $\text{ref\_count}\left(\bar{w}_{i}\right)$ be the maximum number of times $\bar{w}_{i}$ occurs in any individual reference and $\text{trans\_count}\left(\bar{w}_{i}\right)$ the number of times $\bar{w}_{i}$ occurs in the translation.
- For each n-gram $\bar{w}_{i}$ of order $i$ in the translation, $\text{correct\_clipped}_{i} \mathrel{+}= \min \left(\text{trans\_count}\left(\bar{w}_{i}\right), \text{ref\_count}\left(\bar{w}_{i}\right)\right)$
- $p(i)=\frac{\text{correct\_clipped}_{i}}{\text{total}_{i}}$
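Clipped precision is easy to implement with multiset counts. A minimal sketch (helper names are my own); the reference sentence in the usage example is an illustrative assumption, not given in the slides:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """p(n) with clipped counts: each candidate n-gram is credited at most
    as many times as it occurs in the single reference containing it most."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    correct = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    return correct / sum(cand_counts.values())

# Degenerate candidate vs. an assumed reference: "the" occurs at most
# twice in the reference, so the clipped unigram precision is 2/7, not 1.
cand = "the the the the the the the".split()
ref = "the cat is on the mat".split()
p1 = clipped_precision(cand, [ref], 1)
```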
The brevity penalty compares the length of the translation candidate with the length of the reference translation
If we have multiple reference translations, there are multiple ways to define $l_{r}:$
- Shortest reference
- Average reference length
- Closest reference: The length of the reference which is closest in length to the translation
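The "closest reference" variant can be sketched in one line; breaking ties toward the shorter reference is a common convention, not mandated by the definition above:

```python
def closest_ref_length(trans_len, ref_lens):
    """Length of the reference closest in length to the translation;
    ties go to the shorter reference (an assumed convention)."""
    return min(ref_lens, key=lambda r: (abs(r - trans_len), r))
```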
Granularity of BLEU
BLEU scores can be computed on the
- sentence-level
- document-level
- corpus-level
Sentence-level BLEU is rather unstable: translation candidates with minor differences can receive very different sentence-level BLEU scores
- Better to apply BLEU at the document or corpus level
- BLEU formulation remains unchanged, but all counts and lengths are computed over the entire document or corpus
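A minimal corpus-level sketch along these lines, accumulating clipped counts, totals, and lengths over all sentence pairs before computing any precision. It assumes a single reference per candidate and nonzero precisions at every order (real implementations handle multiple references and smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU sketch: counts and lengths are pooled over the
    whole corpus; assumes one reference per candidate and p(n) > 0."""
    correct = [0] * max_n
    total = [0] * max_n
    trans_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        trans_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            correct[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    precisions = [c / t for c, t in zip(correct, total)]
    bp = 1.0 if trans_len >= ref_len else math.exp(1.0 - ref_len / trans_len)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate corpus identical to its references scores exactly 1.0, which is a useful sanity check for any implementation.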
Advantages and disadvantages of BLEU
- BLEU generally correlates relatively well with human judgments
- By far the most commonly used MT evaluation metric
- Absolute BLEU scores in isolation are not very meaningful
- Comparison of BLEU scores across language pairs not meaningful
- BLEU is less useful for translating into morphologically rich languages
- Good translations that are dissimilar to all reference translations are penalized
- BLEU is not differentiable, so it cannot be used directly as a training objective!