Hierarchical Reasoning Model (HRM)

Date: 2025
URL: https://arxiv.org/abs/2506.21734 (Hierarchical Reasoning Model)

Note
  • 2026-01-12: The ARC Prize team did ablations and analysis, finding that the major gains come just from plain old recurrence: "The Hidden Drivers of HRM's Performance on ARC-AGI". Refining your predictions works, who knew!
  • 2026-01-12: The Tiny Recursive Model (TRM) refines and simplifies this further without having to rely on fishy biological inspirations, while achieving significant scores on the same benchmarks!

The human brain provides a compelling blueprint for achieving the effective computational depth that contemporary artificial models lack. It organizes computation hierarchically across cortical regions operating at different timescales, enabling deep, multi-stage reasoning [20, 21, 22]. Recurrent feedback loops iteratively refine internal representations, allowing slow, higher-level areas to guide, and fast, lower-level circuits to execute, subordinate processing while preserving global coherence [23, 24, 25]. Notably, the brain achieves such depth without incurring the prohibitive credit-assignment costs that typically hamper recurrent networks trained with backpropagation through time [19, 26].

They cite the brain as inspiration for a few key components: a hierarchy of processing at different timescales, iterative refinement of representations, bypassing Backpropagation Through Time (BPTT) with a single gradient update (no unrolling), and dynamic compute allocation through a Q-learning mechanism. Notably, there is no ablation study.

Four learnable networks (sketched in code below):

  • input network
  • low-level recurrent module (L-module) that updates at every timestep, running T timesteps per cycle, and maintains a hidden state
  • high-level recurrent module (H-module) that updates once per "high-level cycle", for N cycles total, and also maintains a hidden state
  • output network

At each timestep i, the L-module updates its state conditioned on its own previous state, the H-module's current state (which remains fixed throughout the cycle), and the input representation $\tilde{x} = f_I(x; \theta_I)$. The H-module only updates once per cycle (i.e., every T timesteps), using the L-module's final state at the end of that cycle:

$z_L^i = f_L(z_L^{i-1}, z_H^{i-1}, \tilde{x}; \theta_L)$

$z_H^i = f_H(z_H^{i-1}, z_L^i; \theta_H)$ if $i \equiv 0 \pmod{T}$, else $z_H^i = z_H^{i-1}$

Finally, after N full cycles, a prediction $\hat{y}$ is extracted from the hidden state of the H-module:

$\hat{y} = f_O(z_H^{NT}; \theta_O)$
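To make the loop concrete, here is a minimal PyTorch sketch of the four networks and the N×T schedule. Everything below is my own reading: the paper uses transformer blocks where I use GRU cells, and operates sequence-to-sequence where I pool the input; dimensions and names are invented.

```python
import torch
import torch.nn as nn

class HRM(nn.Module):
    """Minimal sketch of the HRM forward structure. The real model uses
    transformer blocks and full token sequences; GRU cells and a pooled
    input stand in here for brevity."""

    def __init__(self, vocab_size: int, d: int = 256, N: int = 2, T: int = 4):
        super().__init__()
        self.N, self.T = N, T                    # high-level cycles, timesteps per cycle
        self.f_I = nn.Embedding(vocab_size, d)   # input network
        self.f_L = nn.GRUCell(2 * d, d)          # L-module: conditioned on (x_tilde, z_H)
        self.f_H = nn.GRUCell(d, d)              # H-module: conditioned on final z_L
        self.f_O = nn.Linear(d, vocab_size)      # output network

    def forward(self, x, z_H, z_L):
        x_tilde = self.f_I(x).mean(dim=1)        # pooled input representation
        for _ in range(self.N):                  # N high-level cycles
            for _ in range(self.T):              # T low-level steps; z_H frozen within a cycle
                z_L = self.f_L(torch.cat([x_tilde, z_H], dim=-1), z_L)
            z_H = self.f_H(z_L, z_H)             # one H-update per cycle, from final z_L
        return self.f_O(z_H), z_H, z_L           # y_hat read off the H-state
```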

They call the entire NT-timestep process a single forward pass. They term this iterative refinement process, which looks a lot like Expectation Maximization, hierarchical convergence: the L-module converges within each cycle, and each H-update kicks it onto a new trajectory.

They also get rid of BPTT, arguing that if an RNN converges to a fixed point, unrolling its state sequence is unnecessary, and citing brain research that "cortical credit assignment relies on local mechanisms". Seems to be based on Deep Equilibrium Models, where the implicit function theorem yields gradients at the fixed point without any unrolling.

we propose a one-step approximation of the HRM gradient, using the gradient of the last state of each module and treating other states as constant. The gradient path is, therefore:

Output head → final state of the H-module → final state of the L-module → input embedding
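How this plays out against the HRM class sketched above: run every update except the last L-step and H-step under no_grad, so backprop only touches the final state of each module. This is my reconstruction of the idea, not the paper's code.

```python
def one_step_grad_forward(model, x, z_H, z_L):
    """One-step gradient approximation: all but the final L-step and
    H-update run without building a graph."""
    x_tilde = model.f_I(x).mean(dim=1)           # kept in the graph for the last L-step
    with torch.no_grad():
        for n in range(model.N):
            steps = model.T - 1 if n == model.N - 1 else model.T
            for _ in range(steps):
                z_L = model.f_L(torch.cat([x_tilde, z_H], dim=-1), z_L)
            if n < model.N - 1:
                z_H = model.f_H(z_L, z_H)        # all but the final H-update treated as constant
    # gradient path: output head -> final z_H -> final z_L -> input embedding
    z_L = model.f_L(torch.cat([x_tilde, z_H], dim=-1), z_L)
    z_H = model.f_H(z_L, z_H)
    return model.f_O(z_H), z_H, z_L
```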

Next they introduce deep supervision, which runs multiple forward passes ("segments") per sample with a parameter update after each, but crucially detaches z before the update so no gradient flows between segments (sketch below). They say this provides more frequent feedback to the H-module and serves as a regularization mechanism, but the real justification is probably just superior empirical performance.
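A minimal sketch of that training loop, assuming the HRM class from above, zero-initialized states, and plain cross-entropy as the per-segment loss:

```python
def deep_supervision_step(model, optimizer, x, y, n_segments: int = 4):
    """Deep-supervision sketch: several full forward passes per example,
    with a parameter update after each and the carried state detached so
    no gradient crosses segment boundaries."""
    d = model.f_H.hidden_size
    z_H = torch.zeros(x.size(0), d)              # assumed zero init of both states
    z_L = torch.zeros(x.size(0), d)
    for _ in range(n_segments):
        y_hat, z_H, z_L = model(x, z_H, z_L)     # one full N*T forward pass
        loss = nn.functional.cross_entropy(y_hat, y)
        loss.backward()                          # gradients for this segment only
        optimizer.step()
        optimizer.zero_grad()
        z_H, z_L = z_H.detach(), z_L.detach()    # the crucial detach between segments
```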

The Q-learning mechanism takes the final state of the H-module and predicts Q-values for "halt" and "continue" actions through a linear layer (sketch below). The loop stops whenever the segment count exceeds the N_supervision maximum, or when the halt Q-value exceeds the continue Q-value. Not sure if this is strictly RL or just plain supervised learning.
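A sketch of that halting head; the names (M_max, should_halt) and the batch-averaging are mine, not the paper's notation:

```python
class HaltingHead(nn.Module):
    """Linear head on the final H-state producing Q-values for "halt" and
    "continue"; the outer loop stops on a budget cap or when halt wins."""

    def __init__(self, d: int, M_max: int = 8):
        super().__init__()
        self.q_head = nn.Linear(d, 2)            # -> [Q_halt, Q_continue]
        self.M_max = M_max

    def should_halt(self, z_H, segment: int) -> bool:
        q = self.q_head(z_H).mean(dim=0)         # batch-averaged to keep the sketch scalar
        return segment >= self.M_max or q[0].item() > q[1].item()
```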

The final loss is a sequence-to-sequence loss with the typical softmax output, combined with BCE for training the Q-values:

$L_{\mathrm{ACT}}^m = \operatorname{Loss}\left(\hat{y}^m, y\right) + \operatorname{BinaryCrossEntropy}\left(\hat{Q}^m, \hat{G}^m\right)$.
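In code, per segment m, the equation above amounts to something like the following (on my reading, the target $\hat{G}$ is prediction correctness for "halt" and a bootstrapped next-segment estimate for "continue"):

```python
def act_loss(y_hat, y, q_hat, g_hat):
    """Combined objective: softmax sequence loss plus BCE between the
    predicted Q-values q_hat and their targets g_hat."""
    seq_loss = nn.functional.cross_entropy(
        y_hat.reshape(-1, y_hat.size(-1)), y.reshape(-1))
    q_loss = nn.functional.binary_cross_entropy_with_logits(q_hat, g_hat)
    return seq_loss + q_loss
```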
