SentencePiece - Unigram LM Encoding

The unigram LM method, in contrast to the bottom-up merge process of Byte Pair Encoding, begins with a large superset of the final vocabulary and iteratively prunes it to the desired size, removing the pieces whose loss of removal hurts the corpus likelihood the least.
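The seed-then-prune shape of the algorithm can be sketched as follows. This is a deliberately simplified illustration, not the procedure from the paper: the real method scores pieces by their contribution to the corpus likelihood under an EM-trained unigram LM, whereas this sketch substitutes raw substring frequency as the pruning criterion. The function names and the `max_piece_len` parameter are illustrative choices.

```python
from collections import Counter

def seed_vocab(corpus, max_piece_len=4):
    # Superset vocabulary: every substring up to max_piece_len, with its frequency.
    counts = Counter()
    for word in corpus:
        for i in range(len(word)):
            for j in range(i + 1, min(i + max_piece_len, len(word)) + 1):
                counts[word[i:j]] += 1
    return counts

def prune_to_size(counts, target_size):
    # Always keep single characters so any string remains segmentable, then
    # retain the highest-scoring multi-character pieces up to target_size.
    chars = {p for p in counts if len(p) == 1}
    multi = sorted((p for p in counts if len(p) > 1),
                   key=counts.__getitem__, reverse=True)
    return chars | set(multi[:max(0, target_size - len(chars))])

vocab = prune_to_size(seed_vocab(["low", "lower", "lowest"]), target_size=10)
```

Keeping all single characters in the vocabulary mirrors what SentencePiece does in spirit: it guarantees every input string has at least one valid segmentation, so pruning can never make a string untokenizable.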

Unigram LM Encoding

Unigram LM tokenization takes a vocabulary $V$ and unigram LM parameters $\theta$ (a log-probability for each piece in $V$) and performs Viterbi inference to decode the segmentation with maximum likelihood under $\theta$: because the model is a unigram LM, a segmentation's log-likelihood is simply the sum of its pieces' log-probabilities, which makes the maximization a shortest-path problem over the input string.
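The Viterbi decode above can be sketched as a standard dynamic program over end positions of the input string. This is a minimal illustration, not SentencePiece's actual implementation; `log_probs` stands in for the trained parameters $\theta$, and the input is assumed to be segmentable with the given vocabulary.

```python
import math

def viterbi_segment(text, log_probs):
    """Max-likelihood segmentation of `text` under a unigram LM.

    `log_probs` maps each vocabulary piece to its log-probability;
    assumes `text` is segmentable with the given pieces.
    """
    max_len = max(len(p) for p in log_probs)
    n = len(text)
    # best[i] = (log-likelihood of best segmentation of text[:i],
    #            start index of the last piece in that segmentation)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Follow back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]
```

For example, with `log_probs = {"hello": -1.0, "world": -1.5}` plus every single character at `-5.0`, `viterbi_segment("helloworld", log_probs)` returns `["hello", "world"]` (total log-likelihood `-2.5`), since any character-level segmentation scores far worse.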


References

  1. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates https://arxiv.org/abs/1804.10959
  2. Byte Pair Encoding is Suboptimal for Language Model Pretraining https://arxiv.org/abs/2004.03720
  3. SentencePiece Tokenizer Demystified https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15