SentencePiece - Unigram LM Encoding

The unigram LM method, in contrast to the bottom-up merge process of Byte Pair Encoding, begins with a large superset of the final vocabulary and iteratively prunes it to the desired size, removing the pieces whose loss of removal hurts the corpus likelihood the least.
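The seed-then-prune shape of the algorithm can be sketched as follows. This is a deliberately simplified illustration, not the procedure from the paper: the real method scores pieces by their contribution to the corpus likelihood under an EM-trained unigram LM, whereas this sketch substitutes raw substring frequency as the pruning criterion. The function names and the `max_piece_len` parameter are illustrative choices.

```python
from collections import Counter

def seed_vocab(corpus, max_piece_len=4):
    # Superset vocabulary: every substring up to max_piece_len, with its frequency.
    counts = Counter()
    for word in corpus:
        for i in range(len(word)):
            for j in range(i + 1, min(i + max_piece_len, len(word)) + 1):
                counts[word[i:j]] += 1
    return counts

def prune_to_size(counts, target_size):
    # Always keep single characters so any string remains segmentable, then
    # retain the highest-scoring multi-character pieces up to target_size.
    chars = {p for p in counts if len(p) == 1}
    multi = sorted((p for p in counts if len(p) > 1),
                   key=counts.__getitem__, reverse=True)
    return chars | set(multi[:max(0, target_size - len(chars))])

vocab = prune_to_size(seed_vocab(["low", "lower", "lowest"]), target_size=10)
```

Keeping all single characters in the vocabulary mirrors what SentencePiece does in spirit: it guarantees every input string has at least one valid segmentation, so pruning can never make a string untokenizable.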

Unigram LM Encoding

Unigram LM tokenization takes a vocabulary $V$ and unigram LM parameters $\theta$ (a log-probability for each piece in $V$) and performs Viterbi inference to decode the segmentation with maximum likelihood under $\theta$: because the model is a unigram LM, a segmentation's log-likelihood is simply the sum of its pieces' log-probabilities, which makes the maximization a shortest-path problem over the input string.
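The Viterbi decode above can be sketched as a standard dynamic program over end positions of the input string. This is a minimal illustration, not SentencePiece's actual implementation; `log_probs` stands in for the trained parameters $\theta$, and the input is assumed to be segmentable with the given vocabulary.

```python
import math

def viterbi_segment(text, log_probs):
    """Max-likelihood segmentation of `text` under a unigram LM.

    `log_probs` maps each vocabulary piece to its log-probability;
    assumes `text` is segmentable with the given pieces.
    """
    max_len = max(len(p) for p in log_probs)
    n = len(text)
    # best[i] = (log-likelihood of best segmentation of text[:i],
    #            start index of the last piece in that segmentation)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Follow back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]
```

For example, with `log_probs = {"hello": -1.0, "world": -1.5}` plus every single character at `-5.0`, `viterbi_segment("helloworld", log_probs)` returns `["hello", "world"]` (total log-likelihood `-2.5`), since any character-level segmentation scores far worse.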


References

  1. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates https://arxiv.org/abs/1804.10959
  2. Byte Pair Encoding is Suboptimal for Language Model Pretraining https://arxiv.org/abs/2004.03720
  3. SentencePiece Tokenizer Demystified https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15