Natural Language Processing

Tokenization

BERT SentencePiece - Unigram LM Encoding Transformers Byte Pair Encoding

Tokenization Natural Language Processing

Fixed-vocabulary tokenizers can't handle unseen words. Starting from characters, iteratively merge the most frequent adjacent symbol pair to build a subword vocabulary.
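A minimal sketch of the merge loop described above, assuming the corpus is given as a word-to-count dictionary and words start out split into single characters. The name `learn_bpe` and its parameters are illustrative and not taken from any library. Ties between equally frequent pairs go to the pair seen first.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dict (illustrative helper)."""
    # Each word starts as a tuple of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite each word, replacing occurrences of the best pair with one merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab
```

For example, `learn_bpe({"ab": 3, "abc": 2, "bc": 1}, 2)` first merges `a` and `b` (count 5), then `ab` and `c` (count 2). An unseen word can then be segmented by replaying the learned merges in order.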

BERT Tokenization

Natural Language Processing

BLEU

How can machine translation quality be evaluated automatically? Compare n-gram overlap between generated and reference translations.
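A rough sketch of the n-gram overlap idea, assuming a single reference, pre-tokenized input, and no smoothing. It computes clipped n-gram precision for n = 1 to 4, combines the precisions with a geometric mean, and applies a brevity penalty. The function names are illustrative. Established toolkits handle multiple references, corpus-level aggregation, and smoothing, which this sketch leaves out.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as often as it occurs in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, unsmoothed (a simplifying assumption)."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)
```

Clipping is what stops a degenerate candidate such as "the the the the the the the" from scoring a perfect unigram precision against "the cat is on the mat": only two of its seven tokens are credited.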

Natural Language Processing Convolutional Neural Networks (CNN)

CNNs for NLP

RNNs are slow for text classification due to sequential processing. Apply convolutions over word embeddings to capture local n-gram features in parallel.
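A small pure-Python sketch of the idea, assuming each filter spans k consecutive word embeddings and is followed by ReLU and max-over-time pooling. The function name and argument layout are illustrative. A real model would use a tensor library and learn the filter weights. Each window is scored independently of the others, which is what makes the computation parallelizable.

```python
def conv1d_max_pool(embeddings, filters, biases):
    """Apply each filter over every window of k consecutive embeddings, then max-pool over time.

    embeddings: list of T vectors, each of dimension d
    filters:    list of F filters, each a k x d weight matrix (list of k rows)
    biases:     list of F floats
    Returns one pooled feature per filter.
    """
    features = []
    for filt, bias in zip(filters, biases):
        k = len(filt)  # filter width, i.e. the n in "n-gram"
        activations = []
        for start in range(len(embeddings) - k + 1):
            window = embeddings[start:start + k]
            # Dot product between the filter and the k stacked word vectors.
            score = bias + sum(
                w * x
                for w_row, x_row in zip(filt, window)
                for w, x in zip(w_row, x_row)
            )
            activations.append(max(0.0, score))  # ReLU
        # Max-over-time pooling keeps the strongest n-gram match; 0.0 if the text is shorter than k.
        features.append(max(activations) if activations else 0.0)
    return features
```

With several filters of different widths, the pooled outputs form a fixed-length feature vector that a classifier layer can consume regardless of sentence length.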

Convolutional Neural Networks (CNN) Graph Convolutional Networks (GCN) Recurrent Neural Networks (RNN)

Natural Language Processing Deep Learning Language Models (Classical)

BERT

Left-to-right language models use only left context, missing bidirectional understanding. Mask random tokens and train the model to predict them using full context.
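A simplified sketch of how masked training examples can be built, assuming the input is already tokenized. The name `mask_tokens` and the `[MASK]` string constant are illustrative. The default rate of 15% follows the BERT paper. The paper additionally swaps some selected tokens for random tokens or leaves them unchanged, which this sketch omits.

```python
import random

MASK = "[MASK]"  # placeholder symbol; a real tokenizer defines its own mask token


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels); labels hold the original token at masked positions, else None."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            labels[i] = tok  # the model is trained to recover this token
    return masked, labels
```

The loss is computed only at positions where the label is not `None`, so the model has to use both left and right context to fill in each masked slot.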

Transformers

Natural Language Processing

Compositional semantics and sentence representations

Words combine into sentences with complex meaning. Build structured representations that capture compositional semantics.

LSTM Recurrent Neural Networks (RNN)

Natural Language Processing

Coreference Resolution

Multiple expressions in text refer to the same entity. Identify which mentions correspond to the same real-world referent.

Natural Language Processing
