Dreamer
Dreamer trains an actor-critic entirely on imagined rollouts from a learned world model, never touching real environment transitions during policy learning. This is a MAJOR difference from policy gradient methods: because the world model (dynamics + reward) is fully differentiable, Dreamer backpropagates analytically through the imagined trajectories to update the actor, letting the actor learn the causal effect of its actions rather than just the correlation between policy and reward.
In comparison to actor-critic algorithms that learn online or from experience replay, world models can interpolate between past experiences and offer analytic gradients of multi-step returns for efficient policy optimization.
This means credit assignment can, in principle, happen all the way back to the start of the episode! That's the real win of world models: they are differentiable whereas the real world is not, which means you can learn with full causal understanding.
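The analytic-gradient idea can be sketched in a few lines. This is a toy illustration, not Dreamer's actual architecture: the latent dynamics, reward head, and actor here are hypothetical single linear layers, and the horizon is tiny. The point is only that the imagined return is a differentiable function of the actor's parameters, so `backward()` assigns credit through every imagined step.

```python
import torch
import torch.nn as nn

# Illustrative sizes (not Dreamer's); all modules are stand-ins.
latent_dim, action_dim, horizon = 8, 2, 5

dynamics = nn.Linear(latent_dim + action_dim, latent_dim)  # s' = f(s, a)
reward_head = nn.Linear(latent_dim, 1)                     # r = r(s)
actor = nn.Linear(latent_dim, action_dim)                  # a = pi(s)

s = torch.zeros(1, latent_dim)
imagined_return = torch.zeros(1, 1)
for t in range(horizon):
    a = torch.tanh(actor(s))                       # differentiable action
    s = dynamics(torch.cat([s, a], dim=-1))        # imagined next latent state
    imagined_return = imagined_return + reward_head(s)

# Maximize the imagined return: the gradient flows back through the
# whole rollout, so the actor receives multi-step causal credit.
actor_loss = -imagined_return.mean()
actor_loss.backward()
```

In a real environment this loop is impossible to differentiate through, which is exactly the advantage the notes above describe: the learned model replaces the non-differentiable world with a differentiable surrogate.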
However, the world model still relies on environment rewards during training, so it's not learning the dynamics alone.
We use dense neural networks for the action and value models with parameters φ and ψ, respectively. The action model outputs a tanh-transformed Gaussian (Haarnoja et al., 2018) with sufficient statistics predicted by the neural network. This allows for reparameterized sampling (Kingma and Welling, 2013; Rezende et al., 2014) that views sampled actions as deterministically dependent on the neural network output, allowing us to backpropagate analytic gradients through the sampling operation.
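A minimal sketch of that tanh-Gaussian head, assuming PyTorch. The variable names are illustrative; the key call is `rsample()`, which draws the sample as a deterministic function of the predicted mean, the predicted std, and an independent noise term, so gradients flow from the squashed action back into the policy parameters.

```python
import torch

# Hypothetical predicted sufficient statistics (in Dreamer these come
# from the action network; here they are free parameters for clarity).
mean = torch.zeros(3, requires_grad=True)
log_std = torch.zeros(3, requires_grad=True)

base = torch.distributions.Normal(mean, log_std.exp())
u = base.rsample()       # u = mean + std * eps, with eps ~ N(0, 1)
action = torch.tanh(u)   # squash the sample into (-1, 1)

# Because the sample is a deterministic function of (mean, std, eps),
# backprop reaches the distribution parameters through the sampling step.
action.sum().backward()
```

With `sample()` instead of `rsample()` the graph would be cut at the draw and `mean.grad` would stay `None`; the reparameterization trick is what makes the analytic actor gradient above possible.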
The world-model learning process itself is "active": in a sense it performs actions in the latent space and predicts the rewards of those actions, rather than just conditioning on the actions taken in the dataset. It is still doing reinforcement learning, but in the latent space.
The world-model learning objective is still reconstruction (plus reward prediction); contrastive methods didn't help.
The paper shows clear differences in task performance for different representation learning approaches, with pixel reconstruction outperforming contrastive estimation on most tasks. This suggests that future improvements in representation learning are likely to translate to higher task performance with Dreamer. Reward prediction alone was not sufficient in their experiments.