BLEU

  • Bilingual Evaluation Understudy (BLEU), introduced by Papineni et al. (2002)

  • BLEU is a precision-oriented metric

    • How many of the n-grams occurring in the translation occur in any of the reference translations?
    • Considers n-grams of several lengths, typically 1 to 4-grams
  • N-gram precision: $p(i)=\frac{\text { correct }_{i}}{\text { total }_{i}}$

  • For each $n$-gram occurrence $\bar{w}_{i}$ of order $i$ in the translation:

    • if $\bar{w}_{i}$ occurs in any of the reference translations: $\text{correct}_{i}$++ and $\text{total}_{i}$++
    • else: only $\text{total}_{i}$++

  • Brevity penalty (BP): Compensates for the tendency of shorter translations to achieve higher precision

    $$ \operatorname{BP}\left(l_{t}, l_{r}\right)=\left\{\begin{array}{lll} 1 & \text { if } l_{t} \geq l_{r} & \left(l_{r}=\text { reference length }\right) \\ \exp \left(1-\frac{l_{r}}{l_{t}}\right) & \text { if } l_{t}<l_{r} & \left(l_{t}=\text { translation length }\right) \end{array}\right. $$
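The brevity penalty defined above can be sketched in a few lines of Python (a minimal illustration, not any particular toolkit's implementation):

```python
import math

def brevity_penalty(trans_len: int, ref_len: int) -> float:
    """BP(l_t, l_r): 1 if the translation is at least as long as the
    reference, otherwise an exponential penalty for being too short."""
    if trans_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / trans_len)

brevity_penalty(7, 7)   # equal lengths: no penalty, BP = 1.0
brevity_penalty(5, 10)  # translation half as long: exp(1 - 2) = exp(-1)
```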

Example

Translation pair data:

[Figure: BLEU-example (reference translations for the candidate below)]

Candidate translation: the cat sat on a mat.

N-gram Precisions:

  • $p(1)=\frac{6}{7}$ : the; cat; sat; on; a; mat; .
  • $p(2)=\frac{4}{6}$ : the cat; cat sat; sat on; on a; a mat; mat .
  • $p(3)=\frac{3}{5}$ : the cat sat; cat sat on; sat on a; on a mat; a mat .
  • $p(4)=\frac{1}{4}$ : the cat sat on; cat sat on a; sat on a mat; on a mat .

So BLEU score is calculated as:

$$ \begin{aligned} \operatorname{BLEU}\left(t, R_{f}\right) &=\mathrm{BP}(7,7) \cdot \prod_{i=1}^{4} p(i)^{\frac{1}{4}}=\mathrm{BP}(7,7) \cdot \left(\frac{6}{7}\right)^{\frac{1}{4}} \cdot \left(\frac{4}{6}\right)^{\frac{1}{4}} \cdot \left(\frac{3}{5}\right)^{\frac{1}{4}} \cdot \left(\frac{1}{4}\right)^{\frac{1}{4}} \\ &=\mathrm{BP}(7,7) \cdot 0.5411=1 \cdot 0.5411=0.5411 \end{aligned} $$
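The combination step can be checked numerically by plugging the example's precisions and lengths into the formula (the helper name `bleu_from_precisions` is ours, introduced only for this sketch):

```python
import math

def bleu_from_precisions(precisions, trans_len, ref_len):
    """Combine per-order n-gram precisions into a BLEU score:
    brevity penalty times the geometric mean of the precisions."""
    bp = 1.0 if trans_len >= ref_len else math.exp(1 - ref_len / trans_len)
    geo_mean = math.prod(p ** (1 / len(precisions)) for p in precisions)
    return bp * geo_mean

# Precisions from the worked example above; both lengths are 7.
score = bleu_from_precisions([6/7, 4/6, 3/5, 1/4], 7, 7)
print(round(score, 4))  # 0.5411
```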

Adjustments

  • N-gram precision: $p(i)=\frac{\text { correct }_{i}}{\text { total }_{i}}$

    • Candidate translation $t$ : the the the the the the the
    • $p(1)=\frac{\text { correct }_{1}}{\text { total }_{1}}=\frac{7}{7}=1$, since every occurrence of "the" is found in some reference, even though the translation is useless
  • Use clipped-counts: Let $\text{ref\_count}\left(\bar{w}_{i}\right)$ be the maximum number of times $\bar{w}_{i}$ occurs in any individual reference and $\text{trans\_count}\left(\bar{w}_{i}\right)$ the number of times $\bar{w}_{i}$ occurs in the translation.

  • For each n-gram $\bar{w}_{i}$ of order $i$ in the translation, $\text{correct\_clipped}_{i}+=\min \left(\operatorname{trans\_ count}\left(\bar{w}_{i}\right), \text{ref\_count}\left(\bar{w}_{i}\right)\right)$

  • $p(i)=\frac{\text { correct\_clipped }_{i}}{\text { total}_i}$
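A minimal sketch of clipped unigram precision for the degenerate candidate above. The reference sentence used here ("the cat is on the mat") is an assumed example, not taken from this document:

```python
from collections import Counter

def clipped_unigram_precision(candidate: str, references: list[str]) -> float:
    """Clipped precision: each candidate unigram is credited at most
    ref_count times, the maximum count in any single reference."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter()
    for ref in references:
        for word, c in Counter(ref.split()).items():
            ref_counts[word] = max(ref_counts[word], c)
    correct_clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    total = sum(cand_counts.values())
    return correct_clipped / total

# "the" occurs at most twice in the (assumed) reference, so clipping
# caps correct_1 at 2 out of 7 candidate tokens.
p1 = clipped_unigram_precision(
    "the the the the the the the",
    ["the cat is on the mat"],
)
print(p1)  # 2/7 instead of the unclipped 7/7
```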

The brevity penalty compares the length of the translation candidate with the length of the reference translation.
If there are multiple reference translations, there are multiple ways to define $l_{r}$:

  • Shortest reference
  • Average reference length
  • Closest reference: The length of the reference which is closest in length to the translation
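Selecting the closest reference length can be sketched as below. Note that the tie-breaking rule (preferring the shorter reference when two are equally close) is an assumption here; conventions differ between BLEU implementations:

```python
def closest_ref_length(trans_len: int, ref_lens: list[int]) -> int:
    """Return the reference length closest to the translation length,
    breaking ties toward the shorter reference (assumed convention)."""
    return min(ref_lens, key=lambda r: (abs(r - trans_len), r))

closest_ref_length(12, [10, 13, 20])  # -> 13 (distance 1 beats distance 2)
closest_ref_length(11, [10, 12])      # -> 10 (tie broken toward shorter)
```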

Granularity of BLEU

BLEU scores can be computed on the

  • sentence-level
  • document-level
  • corpus-level
    Sentence-level BLEU is rather unstable: translation candidates with only minor differences can receive very different sentence-level BLEU scores
  • Better apply BLEU on the document or corpus level
  • BLEU formulation remains unchanged, but all counts and lengths are computed over the entire document or corpus
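The corpus-level computation can be sketched as follows, assuming for simplicity a single reference per candidate: clipped counts and lengths are accumulated over all sentence pairs before the precisions and brevity penalty are computed.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU, one reference per candidate (sketch).
    Counts and lengths are summed over the whole corpus first."""
    correct = [0] * max_n
    total = [0] * max_n
    trans_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        c_tok, r_tok = cand.split(), ref.split()
        trans_len += len(c_tok)
        ref_len += len(r_tok)
        for n in range(1, max_n + 1):
            c_ngr, r_ngr = ngrams(c_tok, n), ngrams(r_tok, n)
            correct[n - 1] += sum(min(c, r_ngr[g]) for g, c in c_ngr.items())
            total[n - 1] += sum(c_ngr.values())
    bp = 1.0 if trans_len >= ref_len else math.exp(1 - ref_len / trans_len)
    precisions = [correct[i] / total[i] for i in range(max_n)]
    return bp * math.prod(p ** (1 / max_n) for p in precisions)
```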

Advantages and disadvantages of BLEU

  • BLEU generally correlates relatively well with human judgments
  • By far the most commonly used MT evaluation metric
  • Absolute BLEU scores in isolation are not very meaningful
  • Comparison of BLEU scores across language pairs not meaningful
  • BLEU is less useful for translating into morphologically rich languages
  • Good translations that are dissimilar to all reference translations are penalized
  • BLEU is not differentiable, so it cannot be used directly as a training objective!