Capsule Networks (CapsNet)

CapsNets are good at scene parsing (disentangling the part-whole hierarchy) and at viewpoint equivariance, unlike CNNs.

Capsules

  • Each capsule is an N-dimensional vector that represents an entity in the input as a "coordinate frame".
  • The orientation of the capsule vector represents the properties of the entity, also called "instantiation parameters". Each basis direction of the capsule vector space represents one factor of this coordinate frame.
  • The length (norm) of the capsule vector represents the existence probability of the entity.
    • But this can be problematic: if the coordinate frame represents scale, a larger entity should also have a larger norm, which conflicts with reading the norm as a probability.
    • Alternatively, existence can be separated out into its own parameter, though this complicates optimization.

Squashed CapsNet with Dynamic Routing

  • Introduced in Dynamic Routing Between Capsules, Sabour, Frosst and Hinton, NeurIPS 2017.
  • Only the transformation matrices are learned.

Agreement

  • Use cosine similarity as the agreement measure.
  • To interpret the norm (i.e. length) as a proper probability, renormalize and squash the vectors so their norms stay below 1.
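The squashing non-linearity from the paper can be sketched in NumPy (a minimal version; the epsilon is added here only for numerical safety):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash a capsule vector so its norm lies in [0, 1).

    Short vectors shrink toward zero; long vectors approach unit
    length, so the norm can be read as an existence probability.
    """
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)
```

For example, a vector of norm 5 is squashed to norm 25/26 ≈ 0.96, safely below 1.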

Assignment (Dynamic Routing)

  • For each parent capsule:
    • From the set of child capsules, find the capsule with the highest dot product with the parent (equivalent to cosine similarity for normalized vectors).
    • Add this child capsule to the parent capsule (vector sum).
    • Remove this child capsule from the set of child capsules.
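Note the paper's routing-by-agreement is usually written as an iterative softmax over agreement logits rather than a greedy selection; a minimal NumPy sketch of that loop (array shapes are illustrative):

```python
import numpy as np

def squash(s, eps=1e-8):
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Route child predictions u_hat (num_child, num_parent, dim) to parents.

    b holds the routing logits; the agreement (dot product) between a
    child's prediction and the current parent output raises its logit.
    """
    num_child, num_parent, _ = u_hat.shape
    b = np.zeros((num_child, num_parent))
    for _ in range(num_iters):
        # softmax over parents: each child distributes its vote
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)   # weighted sum per parent
        v = squash(s)                            # parent outputs, norm < 1
        b = b + (u_hat * v[None]).sum(axis=-1)   # agreement update
    return v
```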

Gaussian CapsNet with EM Routing

  • Introduced in Matrix Capsules with EM Routing, Hinton, Sabour and Frosst, ICLR 2018.
  • Separates the existence probability out of the capsule vector into its own scalar (the activation).
    • Capsules can then be thought of as Gaussian blobs: their position is determined by the pose vectors and their spread by the variances, with existence kept as a separate activation probability.
  • Similar to Gaussian Mixture Model > Expectation Maximization for Gaussian mixtures

Agreement (M step)

  • Use Euclidean distance as the measure of agreement.
  • If the predicted coordinate frames agree, they form a cluster.
  • Fit a Gaussian: the center of each Gaussian is the (responsibility-weighted) average of the points, and the spread is their weighted variance.
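The Gaussian fit in the M step amounts to a responsibility-weighted mean and variance; a minimal sketch for one parent (the function name and the diagonal-covariance assumption are illustrative):

```python
import numpy as np

def m_step(votes, r):
    """Fit one Gaussian to child votes.

    votes: (num_child, dim) predicted poses for this parent.
    r:     (num_child,) responsibilities (assignment weights).
    Returns the weighted mean and per-dimension variance
    (i.e. a diagonal Gaussian).
    """
    r = r / r.sum()
    mu = (r[:, None] * votes).sum(axis=0)
    var = (r[:, None] * (votes - mu) ** 2).sum(axis=0)
    return mu, var
```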

Assignment (E step)

  • Each parent capsule is initialized with the center of its Gaussian from the M step.
  • For each parent capsule:
    • From the set of child capsules, find the capsule with the maximum probability under the parent's Gaussian.
    • Update the parent's Gaussian with the child's vote.
    • Remove this child capsule from the set of child capsules.
  • Gaussians with few children and a large standard deviation get deactivated.
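The assignment step corresponds to computing the responsibility of each parent Gaussian for each child vote, as in EM for a Gaussian mixture; a hedged sketch (names and the diagonal-covariance form are assumptions):

```python
import numpy as np

def e_step(votes, mus, vars_, activations):
    """Responsibility of each parent Gaussian for each child vote.

    votes: (num_child, dim); mus, vars_: (num_parent, dim);
    activations: (num_parent,) existence probabilities used as priors.
    Returns r: (num_child, num_parent), rows summing to 1.
    """
    # log N(vote | mu_j, diag(var_j)) for every (child, parent) pair
    log_p = -0.5 * (((votes[:, None] - mus[None]) ** 2) / vars_[None]
                    + np.log(2 * np.pi * vars_[None])).sum(axis=-1)
    log_p = log_p + np.log(activations)[None]
    log_p = log_p - log_p.max(axis=1, keepdims=True)  # stabilize exp
    r = np.exp(log_p)
    return r / r.sum(axis=1, keepdims=True)
```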

Stacked Capsule Autoencoders

  • Introduced in Stacked Capsule Autoencoders, Kosiorek et al., NeurIPS 2019.
  • In the EM step, a loop figures out the centers of the capsule Gaussians.
  • The parent of each capsule is figured out using an MLP.
  • Optimize the mixture model log-likelihood.
  • Can be trained in an unsupervised way.
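A mixture-model log-likelihood of this kind can be computed stably with the log-sum-exp trick; a generic isotropic-Gaussian sketch (not SCAE's exact likelihood, which is defined over its decoder's templates):

```python
import numpy as np

def mixture_log_likelihood(x, mus, var, weights):
    """Total log-likelihood of points x under an isotropic Gaussian mixture.

    x: (n, d) points; mus: (k, d) component means;
    var: shared scalar variance; weights: (k,) mixing weights.
    """
    d = x.shape[-1]
    # log of each weighted component density, per (point, component)
    log_comp = (np.log(weights)[None]
                - 0.5 * d * np.log(2 * np.pi * var)
                - 0.5 * ((x[:, None] - mus[None]) ** 2).sum(-1) / var)
    # log-sum-exp over components, then sum over points
    m = log_comp.max(axis=1, keepdims=True)
    return (m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1))).sum()
```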

Issues with capsule networks

Insights and Possible improvements

  • The number of iterations in dynamic routing controls the sparsity of the connections from lower-level to higher-level capsules. Zhao et al. introduced a $\lambda$ parameter in the softmax to adjust how sparse the connections are without running a large number of iterations. A large value acts like max pooling, while smaller values act like average pooling. They set it to 5 using cross-validation.
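The effect of such a $\lambda$ can be illustrated with a temperature-scaled softmax (a generic sketch; the exact formulation in Zhao et al. may differ):

```python
import numpy as np

def sparse_softmax(logits, lam):
    """Softmax with inverse-temperature lam (illustrative name).

    Large lam concentrates mass on the biggest logit (max-pooling-like);
    lam near 0 spreads mass uniformly (average-pooling-like).
    """
    z = lam * logits
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

With logits [1, 2, 3], lam = 100 gives nearly all mass to the last entry, while lam = 0.01 gives a nearly uniform distribution.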
