Natural Language Processing
Byte-Pair Encoding (BPE). Fixed-vocabulary tokenizers can't handle unseen words. Iteratively merge the most frequent adjacent symbol pairs to build a subword vocabulary that can segment any string.
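A minimal sketch of the BPE merge-learning loop. The function name `learn_bpe`, the toy corpus, and the merge count are illustrative, not from any particular library:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    # Represent each word as a tuple of symbols, with frequency counts.
    vocab = Counter()
    for word in corpus:
        vocab[tuple(word)] += 1
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy corpus: "newest" is frequent, so its character pairs merge early.
merges = learn_bpe(["low", "low", "lower", "newest", "newest", "newest"], 4)
```

Real tokenizers record the learned merge list and replay it, in order, to segment new text.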
BLEU. How can machine translation quality be evaluated automatically? Compare n-gram overlap between the generated and reference translations, with a brevity penalty so that overly short outputs don't score well.
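A sketch of sentence-level BLEU for a single reference: clipped n-gram precisions combined by geometric mean, times a brevity penalty. The function name and smoothing-free handling of zero precisions are simplifications of the full corpus-level metric:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    # Modified n-gram precision: counts are clipped to the reference,
    # so repeating a correct word cannot inflate the score.
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any empty n-gram level zeroes the score
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("the cat sat on the mat".split(),
             "the cat sat on the mat".split())  # 1.0 for an exact match
```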
TextCNN. RNNs are slow for text classification because they process tokens sequentially. Apply convolutions over word embeddings to capture local n-gram features, computing every window position in parallel.
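A NumPy sketch of the TextCNN feature extractor's forward pass (no training loop): each filter width scores every sliding window of word embeddings, then max-over-time pooling keeps the strongest activation per filter. All names and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def textcnn_forward(embeddings, filters, widths):
    # embeddings: (seq_len, emb_dim) word vectors for one sentence.
    # filters: one weight matrix per width, shape (num_filters, width * emb_dim).
    features = []
    for W, width in zip(filters, widths):
        seq_len, emb_dim = embeddings.shape
        # Every window position is scored independently, so this step
        # parallelizes — unlike an RNN's sequential recurrence.
        windows = np.stack([embeddings[i:i + width].ravel()
                            for i in range(seq_len - width + 1)])
        acts = np.maximum(windows @ W.T, 0.0)  # ReLU, (positions, num_filters)
        features.append(acts.max(axis=0))      # max-over-time pooling
    return np.concatenate(features)

emb = rng.normal(size=(7, 8))                  # 7 tokens, 8-dim embeddings
widths = [2, 3]                                # bigram and trigram filters
filters = [rng.normal(size=(4, w * 8)) for w in widths]
feats = textcnn_forward(emb, filters, widths)  # 8 pooled features
```

A linear classifier over `feats` completes the model; in practice the filters are learned end-to-end.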
Masked language modeling (BERT). Left-to-right language models use only left context, missing bidirectional understanding. Mask random tokens and train the model to predict them from the full surrounding context.
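A sketch of BERT-style input corruption: about 15% of positions become prediction targets, and of those, 80% are replaced with a mask token, 10% with a random token, and 10% left unchanged. The function and its defaults are illustrative, not a library API:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", vocab=None, rng=None):
    rng = rng or random.Random(0)
    vocab = vocab or tokens
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                  # the model must predict this
            r = rng.random()
            if r < 0.8:
                inputs.append(mask_token)       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)              # 10%: keep the original
        else:
            inputs.append(tok)
            labels.append(None)                 # not a prediction target
    return inputs, labels

tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens)
```

Keeping some targets unchanged forces the model to produce useful representations for every position, not only masked ones.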
Compositional semantics. Words combine into sentences whose meaning is more than a bag of words. Build structured representations, for example over a parse tree, that capture how the meanings of parts compose.
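One concrete instance is a recursive neural network over a binary parse tree: leaves are word embeddings, and each internal node combines its children through a shared learned transform. This sketch uses random (untrained) parameters purely to show the composition structure:

```python
import numpy as np

def compose(node, embeddings, W):
    # Leaves are words; internal nodes are (left, right) pairs.
    # The sentence vector is built bottom-up along the parse tree.
    if isinstance(node, str):
        return embeddings[node]
    left = compose(node[0], embeddings, W)
    right = compose(node[1], embeddings, W)
    return np.tanh(W @ np.concatenate([left, right]))

rng = np.random.default_rng(0)
dim = 4
embeddings = {w: rng.normal(size=dim) for w in ["the", "cat", "sat"]}
W = rng.normal(size=(dim, 2 * dim))  # shared composition weights (untrained here)
# Binary parse of "the cat sat": ((the cat) sat)
vec = compose((("the", "cat"), "sat"), embeddings, W)
```

Because `W` is shared across all nodes, the same composition function applies at every level of the tree.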
Coreference resolution. Multiple expressions in a text can refer to the same entity. Identify which mentions correspond to the same real-world referent and group them into clusters.
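A deliberately toy mention-pair baseline: link mentions that string-match (exact or substring), then take connected components via union-find as entity clusters. Real systems also resolve pronouns and use learned scorers; this heuristic and its example mentions are illustrative only:

```python
def coref_clusters(mentions):
    # Union-find over mention indices.
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Link every pair whose surface strings match or nest.
    for i in range(len(mentions)):
        for j in range(i):
            a, b = mentions[i].lower(), mentions[j].lower()
            if a == b or a in b or b in a:
                union(i, j)

    # Connected components become entity clusters.
    clusters = {}
    for i, m in enumerate(mentions):
        clusters.setdefault(find(i), []).append(m)
    return list(clusters.values())

mentions = ["Marie Curie", "Curie", "she", "the chemist"]
clusters = coref_clusters(mentions)
```

Note the failure mode: "she" and "the chemist" stay singletons because no string cue links them, which is exactly the gap that learned coreference models address.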