BERT
Created May 28, 2021 · Updated March 4, 2026
- BERT (Bidirectional Encoder Representations from Transformers) is a large Transformer model pre-trained on two unsupervised tasks:
- Masked language modeling
- Next sentence prediction
- General-purpose NLP model that can be used for
- fine-tuning task-specific models
- creating contextualized word embeddings (like ELMo) or sentence embeddings
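The second usage mode can be sketched without a full model: given the per-token contextual vectors BERT produces for a sentence, each vector serves as a word embedding, and mean pooling over them is one common way to get a sentence embedding. The random array below is only a stand-in for real BERT hidden states.

```python
import numpy as np

# Hypothetical stand-in for BERT's final-layer hidden states
# for a 5-token sentence (BERT Base hidden size is 768).
rng = np.random.default_rng(0)
token_vecs = rng.standard_normal((5, 768))

# Contextualized word embeddings: each token's vector, used directly
# (variants concatenate or sum several of the last layers).
word_embeddings = token_vecs

# Sentence embedding: mean pooling over the token vectors is a
# simple, widely used recipe.
sentence_embedding = token_vecs.mean(axis=0)
```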
Architecture
- Uses only the encoder stack of the originally proposed Transformer
- Accepts input sequences of up to 512 tokens

BERT Base
- Comparable in size to the OpenAI Transformer (GPT) so that performance can be compared
- 12 Transformer layers, 12 self-attention heads, and 768 hidden dimensions
- 110 million parameters
BERT Large
- The model that achieved the state-of-the-art results reported in the paper
- 24 Transformer layers, 16 self-attention heads, and 1024 hidden dimensions
- 340 million parameters
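The ~110M figure for BERT Base can be sanity-checked from the hyperparameters above. The back-of-the-envelope count below covers the embedding tables and the attention and feed-forward weight matrices; it ignores LayerNorm parameters and the pooler, so it lands slightly under the reported total.

```python
# Rough parameter count for BERT Base from its published hyperparameters:
# vocab 30522, 12 layers, hidden size 768, feed-forward size 3072, max 512 positions.
V, L, H, FFN, MAX_POS = 30522, 12, 768, 3072, 512

embeddings = (V + MAX_POS + 2) * H      # token + position + segment embedding tables
per_layer = (
    4 * (H * H + H)                     # Q, K, V and attention output projections
    + (H * FFN + FFN)                   # feed-forward up-projection
    + (FFN * H + H)                     # feed-forward down-projection
)
total = embeddings + L * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # close to the reported 110 million
```

Plugging in BERT Large's values (24 layers, hidden 1024, FFN 4096) gives a number near the reported 340M in the same way.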
Training
Pre-training
- Fairly expensive (about 4 days on 4 Cloud TPUs for BERT Base, 16 for BERT Large) but a one-time procedure for each language
- Masked language modeling
- 15% of input tokens are selected for prediction; of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged
- Next sentence prediction
- Given two sentences A and B, is B likely to be the sentence that follows A or not?
Fine-tuning
- Inexpensive: all results in the paper can be replicated in at most 1 hour on a single Cloud TPU
- Can be used in multiple ways to train task-specific models
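For classification tasks, the standard recipe is to feed the final hidden state of the [CLS] token through a small task-specific linear layer added on top of the encoder. A minimal numpy sketch, with a random vector standing in for the real [CLS] output and illustrative weight names:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 768          # BERT Base hidden size
num_labels = 2        # e.g. a binary sentiment task

cls_vector = rng.standard_normal(hidden)            # stand-in for BERT's [CLS] output
W = rng.standard_normal((num_labels, hidden)) * 0.02  # new, randomly initialized head
b = np.zeros(num_labels)

logits = W @ cls_vector + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax over the task labels
```

During fine-tuning, both the new head and the pre-trained encoder weights are updated end to end.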
References
- The Illustrated BERT http://jalammar.github.io/illustrated-bert/
- Original Tensorflow implementation https://github.com/google-research/bert