CNNs for NLP

In NLP, the input is 1D (a sequence of tokens) rather than 2D as in Computer Vision (an image).

  • The patches correspond to n-grams
  • Input is not of fixed size
  • Input channels correspond to word embeddings instead of RGB channels

Convolutions with different kernel sizes, i.e., different n-gram orders, can be applied in parallel.

  • The resulting feature maps can vary in size, depending on padding and stride
  • By max-over-time pooling, we obtain a fixed-size representation that does not depend on the input length (see the sketch below).
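
A minimal PyTorch sketch of this (dimensions are illustrative, not from the notes): the convolution output length varies with the input length, but max-over-time pooling yields the same fixed-size vector in both cases.

```python
import torch
import torch.nn as nn

# Illustrative setup: 50-dim word embeddings as input channels,
# 100 feature maps, kernel size 3 (a trigram convolution).
emb_dim, n_filters, kernel_size = 50, 100, 3
conv = nn.Conv1d(in_channels=emb_dim, out_channels=n_filters,
                 kernel_size=kernel_size)

for seq_len in (7, 23):                    # two inputs of different lengths
    x = torch.randn(1, emb_dim, seq_len)   # (batch, channels, time)
    h = conv(x)                            # (1, 100, seq_len - 2): still varies
    pooled = h.max(dim=2).values           # max over time: (1, 100), fixed size
    print(seq_len, tuple(h.shape), tuple(pooled.shape))
```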

CNNs for sequence classification

The CNN of Yoon Kim (2014) for sentence classification:
  • 2-gram and 3-gram convolutions
  • 2 types of input channels: learnable and fixed (pre-trained) embeddings
  • Results: a simple CNN model performs very well (a sketch follows below)
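
A sketch of such a model with the ingredients above: parallel 2-gram and 3-gram convolutions over two embedding channels, one learnable and one frozen (pre-trained). This is a simplified variant that concatenates the two channels along the embedding dimension rather than using Kim's 2D multichannel convolution; the class name and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimStyleCNN(nn.Module):
    """Simplified Kim (2014)-style sentence classifier (illustrative)."""
    def __init__(self, vocab_size, emb_dim=300, n_filters=100,
                 kernel_sizes=(2, 3), n_classes=2, pretrained=None):
        super().__init__()
        # Two input channels: one embedding table is fine-tuned,
        # the other is kept fixed (e.g., pre-trained vectors).
        self.emb_learn = nn.Embedding(vocab_size, emb_dim)
        self.emb_fixed = nn.Embedding(vocab_size, emb_dim)
        if pretrained is not None:
            self.emb_learn.weight.data.copy_(pretrained)
            self.emb_fixed.weight.data.copy_(pretrained)
        self.emb_fixed.weight.requires_grad = False
        # One convolution per n-gram order, applied in parallel.
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * emb_dim, n_filters, k) for k in kernel_sizes
        )
        self.out = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, tokens):  # tokens: (batch, seq_len) of word ids
        # Concatenate the two channels along the embedding dimension.
        x = torch.cat([self.emb_learn(tokens), self.emb_fixed(tokens)], dim=-1)
        x = x.transpose(1, 2)   # (batch, 2 * emb_dim, seq_len)
        # Convolve with each kernel size, then max-over-time pool.
        feats = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.out(torch.cat(feats, dim=1))  # (batch, n_classes)
```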

CNNs for Morphology

  • What about units smaller than words?

    • relevant when modeling unseen/rare inflections of words
    • relevant for robustness to noisy input containing typos
  • CNNs are popular for modeling sub-word units, especially at the character level

  • Instead of learning a word embedding directly

    • split input into tokens
    • split each token into characters
    • apply convolutions at character level
    • for each token: combine by pooling
  • Here we assume that word boundaries are given

  • Each token is represented by a fixed number of character embeddings

  • Results in one representation per word, which can feed into a downstream network of choice (CNNs, RNNs, ...); a minimal sketch follows below
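
A minimal sketch of this character-level pipeline, assuming word boundaries are given and every token is padded to a fixed number of characters; the class name, dimensions, and padding scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Sketch: build a word representation from its characters."""
    def __init__(self, n_chars, char_dim=16, n_filters=64, kernel_size=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size, padding=1)

    def forward(self, char_ids):
        # char_ids: (batch, n_words, max_word_len) - each token padded to a
        # fixed number of characters, word boundaries assumed given.
        b, w, c = char_ids.shape
        x = self.char_emb(char_ids.view(b * w, c))   # (b*w, c, char_dim)
        x = x.transpose(1, 2)                        # (b*w, char_dim, c)
        h = torch.relu(self.conv(x))                 # convolve over characters
        pooled = h.max(dim=2).values                 # pool per token: (b*w, n_filters)
        # One fixed-size representation per word, ready for a downstream
        # network of choice (CNNs, RNNs, ...).
        return pooled.view(b, w, -1)
```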

CNNs and Contexts

  • CNNs allow us to model local context using neighboring words

    • neighborhood limited by kernel size
  • I disagree with the other reviewers who say that this camera is not great.

    • is max-over-time pooling sufficient to model this dependency?
    • How about:
      • I disagree with most reviews but I agree with the reviewers who say that this camera is not great.
      • I agree with most reviews but I disagree with the reviewers who say that this camera is not great.
  • Here both agree and disagree are outside of (reasonably sized) kernels

    • max-over-time pooling essentially summarizes a bag of convolution outputs
  • How to model larger contexts?

  • Increase kernel size?

    • downside: becomes more sensitive to positional information
  • Stack (many) CNNs?

    • downside: input sequences have different lengths, while the network topology is fixed (see the stacking sketch after this list)
  • Instead of using flat n-grams, use linguistic structure to define contexts
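
A quick sketch of the stacking option (sizes are illustrative): with stride 1 and no dilation, the receptive field grows only linearly with depth, rf = 1 + n_layers * (kernel_size - 1), so the context is still fixed by the chosen depth.

```python
import torch
import torch.nn as nn

# Receptive field of stacked 1D convolutions (stride 1, no dilation):
# rf = 1 + n_layers * (kernel_size - 1). With kernel size 3:
for n_layers in (1, 2, 4, 8):
    print(n_layers, "layer(s) ->", 1 + n_layers * 2, "tokens of context")

# The same in network form: four stacked convolutions see 9 tokens,
# no matter how long the input sequence is.
stack = nn.Sequential(*[
    nn.Conv1d(64, 64, kernel_size=3, padding=1) for _ in range(4)
])
x = torch.randn(1, 64, 30)   # (batch, channels, time)
print(stack(x).shape)        # torch.Size([1, 64, 30])
```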

Advantages and disadvantages

Advantages of CNNs for sequence classification

  • simple architecture performs very well for many sequence classification tasks
  • captures local context
  • no feature engineering
  • can benefit from pre-trained embeddings

Disadvantages of CNNs for sequence classification

  • limited receptive fields: only local patterns are captured initially, so many layers are required to model long-range dependencies
  • max-pooling discards fine-grained temporal and positional information that is often crucial for sequence understanding (demonstrated below)
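
The loss of order information is easy to demonstrate: permuting the convolution outputs along the time axis leaves the max-over-time pooled representation unchanged. A minimal sketch (tensor sizes are illustrative):

```python
import torch

# feats stands in for convolution outputs: (batch, filters, time).
feats = torch.randn(1, 100, 20)
perm = torch.randperm(20)                       # shuffle the time steps
pooled = feats.max(dim=2).values
pooled_shuffled = feats[:, :, perm].max(dim=2).values
print(torch.equal(pooled, pooled_shuffled))     # True: order is discarded
```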

Why CNNs are not ideal for Structural Modeling of Language

  • To model language properly, we need to be able to...
  1. Consider unbounded histories
    • n-grams, fixed limited kernels cannot achieve this
    • stacking of convolutions expands the fixed-sized context, but is still fixed
    • max-over-time pooling is unbounded, but ...
  2. Consider structural/hierarchical properties of language
    • n-grams, fixed limited kernels can only model fixed-sized, local structural properties
    • max-over-time pooling is flat and order-insensitive
  3. Input can be enriched by adding syntactic information, e.g., syntactic dependency information, but:
    • syntactic parsers are only available for some languages
    • not clear which syntactic information is really required for which task (feature engineering)