Deep-Q-Network (DQN)

DQN approximates the Q-function with a deep neural network. This introduces instability into training, which DQN addresses with the following additions:

Experience Replay Buffer

DQN stores its experiences in an Experience Replay Buffer and learns on randomly sampled batches from this buffer. The buffer stores tuples of (state, action, reward, done, next_state) and supports sampling a batch of a given batch_size.
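
A minimal sketch of such a buffer, assuming Python with NumPy; the class and method names here are illustrative rather than taken from any particular implementation:

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    def __init__(self, capacity):
        # Oldest transitions are discarded once capacity is exceeded.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, done, next_state):
        # Store one transition tuple (state, action, reward, done, next_state).
        self.buffer.append((state, action, reward, done, next_state))

    def sample(self, batch_size):
        # Draw a uniformly random batch to break temporal correlations.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, dones, next_states = map(np.array, zip(*batch))
        return states, actions, rewards, dones, next_states

    def __len__(self):
        return len(self.buffer)
```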

DQN Loss function

The "target" is calculated using the Bellman equation:

$$ Q(s, a) \leftarrow r+\gamma \max _{a^{\prime} \in A} Q\left(s^{\prime}, a^{\prime}\right) $$

Optimization is then done with Stochastic Gradient Descent in familiar supervised-learning fashion, using the Loss Functions > Mean-Squared-Error Loss:

$$ L=\left(Q(s, a)-\left(r+\gamma \max _{a^{\prime} \in A} Q\left(s^{\prime}, a^{\prime}\right)\right)\right)^{2} $$
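
A minimal sketch of this loss computation, assuming PyTorch and that `q_net` and `target_net` are modules mapping a batch of states to per-action Q-values; all names are illustrative:

```python
import torch
import torch.nn.functional as F


def dqn_loss(q_net, target_net, states, actions, rewards, dones, next_states, gamma=0.99):
    # Q(s, a) for the actions that were actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target: r + gamma * max_a' Q_target(s', a'),
    # with no bootstrap term on terminal transitions.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * next_q * (1.0 - dones)

    # Mean-squared error between the current estimate and the target.
    return F.mse_loss(q_values, target)
```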

Target Network

To avoid problems with convergence, a separate network with the same structure is created. The target network's weights are kept fixed during learning and are periodically reset to the current Q-network's weights.
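
A minimal sketch of this periodic synchronization, assuming PyTorch; the network architecture and the `sync_every` interval are illustrative placeholders:

```python
import copy

import torch.nn as nn

# Illustrative Q-network; any architecture works the same way.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)  # same structure, same initial weights

sync_every = 1000  # illustrative update interval

for step in range(10_000):
    # ... sample a batch from the replay buffer and take an SGD step on q_net ...
    if step % sync_every == 0:
        # Periodically reset the target network to the current Q-network weights.
        target_net.load_state_dict(q_net.state_dict())
```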

(Figure: target network in DQN)