BERT

  • BERT (Bidirectional Encoder Representations from Transformers) is a large Transformer encoder model pre-trained on two unsupervised tasks:
    • Masked language modeling
    • Next sentence prediction
  • General NLP model that can be used to
    • fine-tune task-specific models
    • create contextualized word embeddings (as ELMo does) or sentence embeddings (see the sketch below)
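  • A minimal sketch of embedding extraction, assuming the Hugging Face transformers library with a PyTorch backend (an assumption; these notes do not prescribe a toolkit):

      from transformers import BertTokenizer, BertModel
      import torch

      tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
      model = BertModel.from_pretrained("bert-base-uncased")

      inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
      with torch.no_grad():
          outputs = model(**inputs)

      token_vectors = outputs.last_hidden_state   # one contextual vector per token
      sentence_vector = token_vectors[0, 0]       # [CLS] vector, a common sentence embedding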

Architecture

  • Uses only the encoder stack of the originally proposed Transformer (no decoder)
  • Accepts input sequences of up to 512 tokens
    (Figure: BERT architecture)
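  • A minimal sketch of such an encoder-only stack in PyTorch (illustrative names, not the official implementation):

      import torch
      import torch.nn as nn

      class TinyBertEncoder(nn.Module):
          def __init__(self, vocab_size=30522, hidden=768, layers=12,
                       heads=12, max_len=512):
              super().__init__()
              self.tok_emb = nn.Embedding(vocab_size, hidden)
              self.pos_emb = nn.Embedding(max_len, hidden)  # the 512-token budget
              layer = nn.TransformerEncoderLayer(
                  d_model=hidden, nhead=heads,
                  dim_feedforward=4 * hidden, batch_first=True)
              self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

          def forward(self, token_ids):
              positions = torch.arange(token_ids.size(1), device=token_ids.device)
              return self.encoder(self.tok_emb(token_ids) + self.pos_emb(positions))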

BERT Base

  • Comparable in size to the OpenAI Transformer (GPT), allowing a direct performance comparison
  • 12 Transformer layers, 12 self-attention heads, and 768 hidden dimensions
  • 110 million parameters (see the parameter-count sketch after the BERT Large specs)

BERT Large

  • The model that achieved the state-of-the-art results reported in the paper
  • 24 Transformer layers, 16 self-attention heads, and 1024 hidden dimensions
  • 340 million parameters
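  • As a rough check on both parameter counts, instantiating untrained models from configuration alone (no weight download) lands near 110M and 340M; a sketch assuming the Hugging Face transformers library:

      from transformers import BertConfig, BertModel

      base = BertModel(BertConfig())  # defaults: 12 layers, 12 heads, hidden 768
      large = BertModel(BertConfig(hidden_size=1024, num_hidden_layers=24,
                                   num_attention_heads=16, intermediate_size=4096))
      for name, model in [("base", base), ("large", large)]:
          print(name, sum(p.numel() for p in model.parameters()))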

Training

Pre-training

  • Fairly expensive (4 days on 16 TPUs), but a one-time procedure per language
  • Masked language modeling
    • Select 15% of the input tokens at random; of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, and the model must predict the original token (see the sketch after this list)
  • Next sentence prediction
    • Given two sentences A and B, is B the sentence that actually follows A? During training, B is the true next sentence 50% of the time and a random sentence from the corpus otherwise
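  • A minimal sketch of the masking rule (a hypothetical helper on word strings; the real implementation works on WordPiece ids):

      import random

      def mask_tokens(tokens, vocab, select_prob=0.15):
          masked, labels = list(tokens), [None] * len(tokens)
          for i, tok in enumerate(tokens):
              if random.random() < select_prob:         # pick ~15% of positions
                  labels[i] = tok                       # model must predict this token
                  r = random.random()
                  if r < 0.8:
                      masked[i] = "[MASK]"              # 80%: replace with [MASK]
                  elif r < 0.9:
                      masked[i] = random.choice(vocab)  # 10%: random token
                  # remaining 10%: keep the original token unchanged
          return masked, labels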

Fine-tuning

  • Inexpensive: all results in the paper can be replicated in at most 1 hour on a single TPU
  • Can be used in multiple ways to train task-specific models
    (Figure: task-specific BERT fine-tuning setups)
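  • A minimal fine-tuning sketch, assuming the Hugging Face transformers library as a stand-in (the original implementation is TensorFlow): a classification head on top of pretrained BERT, trained end to end:

      import torch
      from transformers import BertTokenizer, BertForSequenceClassification

      tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
      model = BertForSequenceClassification.from_pretrained(
          "bert-base-uncased", num_labels=2)
      optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

      batch = tokenizer(["a great movie", "a dull movie"],
                        padding=True, return_tensors="pt")
      labels = torch.tensor([1, 0])

      model.train()
      loss = model(**batch, labels=labels).loss  # cross-entropy over the labels
      loss.backward()
      optimizer.step()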
