Natural Language Processing

Topics

Notes

Linked

Byte Pair Encoding

Fixed-vocabulary tokenizers can't handle unseen words. Iteratively merge the most frequent character pairs to build a subword vocabulary.

BLEU

How to automatically evaluate machine translation quality? Compare n-gram overlap between generated and reference translations.

CNNs for NLP

RNNs are slow for text classification due to sequential processing. Apply convolutions over word embeddings to capture local n-gram features in parallel.

BERT

Language models only use left context, missing bidirectional understanding. Mask random tokens and train to predict them using full context.

Compositional semantics and sentence representations

Words combine into sentences with complex meaning. Build structured representations that capture compositional semantics.

Coreference Resolution

Multiple expressions in text refer to the same entity. Identify which mentions correspond to the same real-world referent.