CNNs for NLP

In NLP, the input is 1D (a sequence of tokens) rather than 2D as in Computer Vision (an image).

  • The patches correspond to n-grams
  • Input is not of fixed size
  • Input channels correspond to word embeddings instead of RGB channels

Convolutions with different kernel sizes, i.e., different n-gram orders, can be applied in parallel.

  • The resulting feature maps can vary in size, depending on padding and stride
  • By max-over-time pooling, we obtain a fixed-size representation that does not depend on the input length (see the sketch below).
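
A minimal PyTorch sketch of this (dimensions are illustrative, not from the notes): the convolution output length varies with the input length, but max-over-time pooling yields the same fixed-size vector in both cases.

```python
import torch
import torch.nn as nn

# Illustrative setup: 50-dim word embeddings as input channels,
# 100 feature maps, kernel size 3 (a trigram convolution).
emb_dim, n_filters, kernel_size = 50, 100, 3
conv = nn.Conv1d(in_channels=emb_dim, out_channels=n_filters,
                 kernel_size=kernel_size)

for seq_len in (7, 23):                    # two inputs of different lengths
    x = torch.randn(1, emb_dim, seq_len)   # (batch, channels, time)
    h = conv(x)                            # (1, 100, seq_len - 2): still varies
    pooled = h.max(dim=2).values           # max over time: (1, 100), fixed size
    print(seq_len, tuple(h.shape), tuple(pooled.shape))
```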

CNNs for sequence classification

The CNN of Yoon Kim (2014) for sentence classification:
  • 2-gram and 3-gram convolutions
  • 2 types of input channels: learnable and fixed (pre-trained) embeddings
  • Results: a simple CNN model performs very well (a sketch follows below)
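
A sketch of such a model with the ingredients above: parallel 2-gram and 3-gram convolutions over two embedding channels, one learnable and one frozen (pre-trained). This is a simplified variant that concatenates the two channels along the embedding dimension rather than using Kim's 2D multichannel convolution; the class name and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimStyleCNN(nn.Module):
    """Simplified Kim (2014)-style sentence classifier (illustrative)."""
    def __init__(self, vocab_size, emb_dim=300, n_filters=100,
                 kernel_sizes=(2, 3), n_classes=2, pretrained=None):
        super().__init__()
        # Two input channels: one embedding table is fine-tuned,
        # the other is kept fixed (e.g., pre-trained vectors).
        self.emb_learn = nn.Embedding(vocab_size, emb_dim)
        self.emb_fixed = nn.Embedding(vocab_size, emb_dim)
        if pretrained is not None:
            self.emb_learn.weight.data.copy_(pretrained)
            self.emb_fixed.weight.data.copy_(pretrained)
        self.emb_fixed.weight.requires_grad = False
        # One convolution per n-gram order, applied in parallel.
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * emb_dim, n_filters, k) for k in kernel_sizes
        )
        self.out = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, tokens):  # tokens: (batch, seq_len) of word ids
        # Concatenate the two channels along the embedding dimension.
        x = torch.cat([self.emb_learn(tokens), self.emb_fixed(tokens)], dim=-1)
        x = x.transpose(1, 2)   # (batch, 2 * emb_dim, seq_len)
        # Convolve with each kernel size, then max-over-time pool.
        feats = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.out(torch.cat(feats, dim=1))  # (batch, n_classes)
```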

CNNs for Morphology

  • What about units smaller than words?

    • relevant when modeling unseen/rare inflections of words
    • relevant for robustness to noisy input containing typos
  • CNNs are popular for modeling sub-word units, especially at the character level

  • Instead of learning a word embedding directly

    • split input into tokens
    • split each token into characters
    • apply convolutions at character level
    • for each token: combine by pooling
  • Here we assume that word boundaries are given

  • Each token is represented by a fixed number of character embeddings

  • Results in one representation per word, which can feed into a downstream network of choice (CNNs, RNNs, ...); a minimal sketch follows below
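
A minimal sketch of this character-level pipeline, assuming word boundaries are given and every token is padded to a fixed number of characters; the class name, dimensions, and padding scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Sketch: build a word representation from its characters."""
    def __init__(self, n_chars, char_dim=16, n_filters=64, kernel_size=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size, padding=1)

    def forward(self, char_ids):
        # char_ids: (batch, n_words, max_word_len) - each token padded to a
        # fixed number of characters, word boundaries assumed given.
        b, w, c = char_ids.shape
        x = self.char_emb(char_ids.view(b * w, c))   # (b*w, c, char_dim)
        x = x.transpose(1, 2)                        # (b*w, char_dim, c)
        h = torch.relu(self.conv(x))                 # convolve over characters
        pooled = h.max(dim=2).values                 # pool per token: (b*w, n_filters)
        # One fixed-size representation per word, ready for a downstream
        # network of choice (CNNs, RNNs, ...).
        return pooled.view(b, w, -1)
```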

CNNs and Contexts

  • CNNs allow us to model local context using neighboring words

    • neighborhood limited by kernel size
  • I disagree with the other reviewers who say that this camera is not great.

    • is max-over-time pooling sufficient to model this dependency?
    • How about:
      • I disagree with most reviews but I agree with the reviewers who say that this camera is not great.
      • I agree with most reviews but I disagree with the reviewers who say that this camera is not great.
  • Here both agree and disagree are outside of (reasonably sized) kernels

    • max-over-time pooling essentially summarizes a bag of convolution outputs
  • How to model larger contexts?

  • Increase kernel size?

    • downside: becomes more sensitive to positional information
  • Stack (many) CNNs?

    • downside: input sequences have different lengths, while the network topology is fixed (see the stacking sketch after this list)
  • Instead of using flat n-grams, use linguistic structure to define contexts
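
A quick sketch of the stacking option (sizes are illustrative): with stride 1 and no dilation, the receptive field grows only linearly with depth, rf = 1 + n_layers * (kernel_size - 1), so the context is still fixed by the chosen depth.

```python
import torch
import torch.nn as nn

# Receptive field of stacked 1D convolutions (stride 1, no dilation):
# rf = 1 + n_layers * (kernel_size - 1). With kernel size 3:
for n_layers in (1, 2, 4, 8):
    print(n_layers, "layer(s) ->", 1 + n_layers * 2, "tokens of context")

# The same in network form: four stacked convolutions see 9 tokens,
# no matter how long the input sequence is.
stack = nn.Sequential(*[
    nn.Conv1d(64, 64, kernel_size=3, padding=1) for _ in range(4)
])
x = torch.randn(1, 64, 30)   # (batch, channels, time)
print(stack(x).shape)        # torch.Size([1, 64, 30])
```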

Advantages and disadvantages

Advantages of CNNs for sequence classification

  • simple architecture performs very well for many sequence classification tasks
  • captures local context
  • no feature engineering
  • can benefit from pre-trained embeddings

Disadvantages of CNNs for sequence classification

  • limited receptive fields: only local patterns are captured initially, so many layers are required to model long-range dependencies
  • max-pooling discards fine-grained temporal and positional information that is often crucial for sequence understanding (demonstrated below)
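
The loss of order information is easy to demonstrate: permuting the convolution outputs along the time axis leaves the max-over-time pooled representation unchanged. A minimal sketch (tensor sizes are illustrative):

```python
import torch

# feats stands in for convolution outputs: (batch, filters, time).
feats = torch.randn(1, 100, 20)
perm = torch.randperm(20)                       # shuffle the time steps
pooled = feats.max(dim=2).values
pooled_shuffled = feats[:, :, perm].max(dim=2).values
print(torch.equal(pooled, pooled_shuffled))     # True: order is discarded
```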

Why CNNs are not ideal for Structural Modeling of Language

  • To model language properly, we need to be able to...
  1. Consider unbounded histories
    • n-grams, fixed limited kernels cannot achieve this
    • stacking of convolutions expands the fixed-sized context, but is still fixed
    • max-over-time pooling is unbounded, but ...
  2. Consider structural/hierarchical properties of language
    • n-grams, fixed limited kernels can only model fixed-sized, local structural properties
    • max-over-time pooling is flat and order-insensitive
  3. Input can be enriched by adding syntactic information, e.g., syntactic dependency information, but:
    • syntactic parsers are only available for some languages
    • not clear which syntactic information is really required for which task (feature engineering)