Deep-Q-Network (DQN)
Approximates the Q-function with a deep neural network. Using a nonlinear function approximator introduces instability into training; DQN addresses this with the following additions:
Experience Replay Buffer
DQN stores its experiences in an experience replay buffer and learns on batches sampled uniformly at random from it. The buffer stores tuples of (state, action, reward, done, next_state) and supports sampling a batch of a given batch_size. Random sampling breaks the correlation between consecutive transitions, which would otherwise destabilize training.
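A minimal sketch of such a buffer (the class and method names here are illustrative, not from a specific library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, done, next_state) tuples."""

    def __init__(self, capacity):
        # deque with maxlen discards the oldest experience once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, done, next_state):
        self.buffer.append((state, action, reward, done, next_state))

    def sample(self, batch_size):
        # uniform random sampling breaks temporal correlations
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, dones, next_states = zip(*batch)
        return states, actions, rewards, dones, next_states

    def __len__(self):
        return len(self.buffer)
```

The agent interacts with the environment, pushes each transition, and once the buffer holds at least batch_size transitions it can sample minibatches for gradient updates.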
DQN Loss function
The "target" y for a transition (s, a, r, done, s') is calculated using the Bellman equation: y = r + γ · max_a' Q_target(s', a'), with y = r when the episode terminated (done).
Optimization is then done with Stochastic Gradient Descent in a familiar supervised-learning fashion, minimizing Loss Functions > Mean-Squared-Error Loss between the prediction Q(s, a) and the target: L = (y − Q(s, a))².
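The target and loss computations above can be sketched on a batch with NumPy (the network outputs are passed in as plain arrays; function names are illustrative):

```python
import numpy as np

def dqn_targets(rewards, dones, next_q_values, gamma=0.99):
    """Bellman targets y = r + gamma * max_a' Q_target(s', a'),
    truncated to y = r on terminal transitions (done == 1)."""
    max_next_q = next_q_values.max(axis=1)      # max over actions a'
    return rewards + gamma * (1.0 - dones) * max_next_q

def mse_loss(q_taken, targets):
    # mean-squared error between Q(s, a) and the Bellman targets
    return np.mean((targets - q_taken) ** 2)
```

In a real training step, next_q_values comes from the target network, q_taken is Q(s, a) gathered from the online network for the actions actually taken, and the loss is minimized by SGD (or a variant such as Adam).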
Target Network
To avoid convergence problems caused by chasing a moving target, a separate target network with the same architecture is created. Its weights stay fixed during learning and are periodically reset to the current Q-network's weights.