SentencePiece - Unigram LM Encoding
The unigram LM method, in contrast to the bottom-up merge process of Byte Pair Encoding, begins with a large superset of the final vocabulary and iteratively prunes it to the desired size, at each step discarding the pieces whose removal least reduces the likelihood of the training corpus.
Unigram LM tokenization takes the vocabulary $V$ and unigram LM parameters $\theta$ and performs Viterbi inference to decode the segmentation with maximum likelihood under $\theta$.
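The Viterbi decoding above can be sketched as a simple dynamic program: for each prefix of the input, keep the best log-likelihood of any segmentation ending there, then recover the pieces via backpointers. This is a minimal illustration, not SentencePiece's actual implementation; the toy vocabulary and its probabilities are invented for the example (in practice $\theta$ is estimated by EM during training).

```python
import math

# Hypothetical toy vocabulary with unigram log-probabilities.
# (Real parameters theta come from SentencePiece's EM training.)
log_probs = {
    "h": math.log(0.05), "e": math.log(0.05), "l": math.log(0.05),
    "o": math.log(0.05), "he": math.log(0.10), "ll": math.log(0.10),
    "lo": math.log(0.10), "hell": math.log(0.15), "hello": math.log(0.30),
}

def viterbi_segment(text, log_probs):
    """Return the max-likelihood segmentation of `text` under a unigram LM."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i] = best log-prob for text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # start index of the last piece in the best path
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    # Walk backpointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return list(reversed(pieces))

print(viterbi_segment("hello", log_probs))  # the single piece "hello" wins here
```

Because whole-word pieces are assigned higher probability in this toy table, the single token `hello` beats any multi-piece split; with other parameters the same routine would return a finer segmentation.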
References
- Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates https://arxiv.org/abs/1804.10959
- Byte Pair Encoding is Suboptimal for Language Model Pretraining https://arxiv.org/abs/2004.03720
- SentencePiece Tokenizer Demystified https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15