BERT
Created May 28, 2021 · Updated March 4, 2026
- BERT (Bidirectional Encoder Representations from Transformers) is a large Transformer model pre-trained on two unsupervised tasks:
- Masked language modeling
- Next sentence prediction
- General-purpose NLP model that can be used for
- fine-tuning task-specific models
- creating contextualized word embeddings (like ELMo) or sentence embeddings
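The second usage mode can be sketched without a full model: given the per-token contextual vectors BERT produces for a sentence, each vector serves as a word embedding, and mean pooling over them is one common way to get a sentence embedding. The random array below is only a stand-in for real BERT hidden states.

```python
import numpy as np

# Hypothetical stand-in for BERT's final-layer hidden states
# for a 5-token sentence (BERT Base hidden size is 768).
rng = np.random.default_rng(0)
token_vecs = rng.standard_normal((5, 768))

# Contextualized word embeddings: each token's vector, used directly
# (variants concatenate or sum several of the last layers).
word_embeddings = token_vecs

# Sentence embedding: mean pooling over the token vectors is a
# simple, widely used recipe.
sentence_embedding = token_vecs.mean(axis=0)
```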
Architecture
- Uses only the encoder stack of the originally proposed Transformer
- Accepts input sequences of up to 512 tokens

BERT Base
- Comparable in size to the OpenAI Transformer (GPT) so that performance can be compared
- 12 Transformer layers, 12 self-attention heads, and 768 hidden dimensions
- 110 million parameters
BERT Large
- The model that achieved the state-of-the-art results reported in the paper
- 24 Transformer layers, 16 self-attention heads, and 1024 hidden dimensions
- 340 million parameters
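The ~110M figure for BERT Base can be sanity-checked from the hyperparameters above. The back-of-the-envelope count below covers the embedding tables and the attention and feed-forward weight matrices; it ignores LayerNorm parameters and the pooler, so it lands slightly under the reported total.

```python
# Rough parameter count for BERT Base from its published hyperparameters:
# vocab 30522, 12 layers, hidden size 768, feed-forward size 3072, max 512 positions.
V, L, H, FFN, MAX_POS = 30522, 12, 768, 3072, 512

embeddings = (V + MAX_POS + 2) * H      # token + position + segment embedding tables
per_layer = (
    4 * (H * H + H)                     # Q, K, V and attention output projections
    + (H * FFN + FFN)                   # feed-forward up-projection
    + (FFN * H + H)                     # feed-forward down-projection
)
total = embeddings + L * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # close to the reported 110 million
```

Plugging in BERT Large's values (24 layers, hidden 1024, FFN 4096) gives a number near the reported 340M in the same way.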
Training
Pre-training
- Fairly expensive (about 4 days on 4 Cloud TPUs for BERT Base, 16 for BERT Large) but a one-time procedure for each language
- Masked language modeling
- 15% of input tokens are selected for prediction; of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged
- Next sentence prediction
- Given two sentences A and B, is B likely to be the sentence that follows A or not?
Fine-tuning
- Inexpensive: all results in the paper can be replicated in at most 1 hour on a single Cloud TPU
- Can be used in multiple ways to train task-specific models
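For classification tasks, the standard recipe is to feed the final hidden state of the [CLS] token through a small task-specific linear layer added on top of the encoder. A minimal numpy sketch, with a random vector standing in for the real [CLS] output and illustrative weight names:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 768          # BERT Base hidden size
num_labels = 2        # e.g. a binary sentiment task

cls_vector = rng.standard_normal(hidden)            # stand-in for BERT's [CLS] output
W = rng.standard_normal((num_labels, hidden)) * 0.02  # new, randomly initialized head
b = np.zeros(num_labels)

logits = W @ cls_vector + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax over the task labels
```

During fine-tuning, both the new head and the pre-trained encoder weights are updated end to end.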
References
- The Illustrated BERT http://jalammar.github.io/illustrated-bert/
- Original Tensorflow implementation https://github.com/google-research/bert