This chapter discusses function approximation methods for scaling RL to large state spaces, including Deep Q-Networks (DQN) for learning value functions with neural networks and the Expected SARSA algorithm for stable policy evaluation.
For continuous state-action spaces, a tabular representation would require infinite memory. More importantly, function approximation generalizes across states, and can be applied to value functions, policies, and models. To form the least-squares update we need the true target, but we do not know it, since that is exactly what we are trying to estimate. So we use the TD target instead:
\[\begin{align*} w_{t+1} &= w_t - \frac{1}{2} \alpha \nabla_{w_t} \left[ q_\star(s_t, a_t) - Q(s_t, a_t) \right]^2 && \text{we don't know } q_\star \\ &= w_t - \frac{1}{2} \alpha \nabla_{w_t} \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]^2 \end{align*}\]Although $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$ is itself a function of $w_t$, we ignore that dependence and treat it as a constant (a semi-gradient update). If we are using a linear function approximation for Q:
\[Q(s, a) = \phi(s, a)^\top w_t\]i.e., if $Q$ is linear in the weights $w_t$ with feature vector $\phi(s, a)$, then
\[\begin{align*} \nabla_{w_t} \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]^2 &= \nabla_{w_t} \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - \phi(s_t, a_t)^\top w_t \right]^2 \\ &= -2 \phi(s_t, a_t) \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \\ &= -2 \phi(s_t, a_t) \delta_t \end{align*}\]Note that in the second line, $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$ is treated as a constant. Then we have,
\[w_{t+1} = w_t + \alpha \delta_t \phi(s_t, a_t)\]Linear function approximators are restrictive: they can only model functions that are linear in the features. Non-linear approximators such as neural networks can model complex functions and are very powerful; while they generalize more effectively to unseen states, they also require far more data and compute.
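For concreteness, here is a minimal sketch of this linear semi-gradient Q-learning step in Python; the feature function `phi(s, a)` (assumed to return a 1-D NumPy vector) and the action set are illustrative assumptions, and terminal-state handling is omitted:

```python
def linear_q_update(w, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning step with a linear approximator Q(s, a) = phi(s, a)^T w."""
    q_sa = phi(s, a) @ w                               # current estimate Q(s_t, a_t)
    # TD target: treated as a constant (semi-gradient), even though it depends on w.
    q_next = max(phi(s_next, b) @ w for b in actions)
    delta = r + gamma * q_next - q_sa                  # TD error delta_t
    return w + alpha * delta * phi(s, a)               # w_{t+1} = w_t + alpha * delta_t * phi(s_t, a_t)
```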
The neural network takes the state as input and produces a Q-value for each possible action in a single forward pass.
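A minimal sketch of such a Q-network in PyTorch; the layer sizes and hidden width are illustrative assumptions:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)              # shape: (batch, n_actions)
```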
\[w_{t+1} = w_t - \frac{1}{2} \alpha \nabla_{w_t} \left[ r_{t+1} + \gamma \max_{a} \hat{q}(s_{t+1}, a) - \hat{q}(s_t, a_t) \right]^2\]Since we are using semi-gradient methods, the target is treated as a constant. But it is not truly a constant target, because $\hat{q}$ depends on $w_t$, i.e., the targets are non-stationary. In addition, sequential experiences in the environment can be highly correlated (e.g., consecutive frames of a game), so training on them directly means training on biased data and the gradient estimates can get skewed; DQN mitigates this by sampling past transitions uniformly from an experience replay buffer.
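A minimal sketch of such a replay buffer, assuming transitions are stored as `(state, action, reward, next_state, done)` tuples; the capacity is an arbitrary choice:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions sampled uniformly at random,
    which breaks up the correlation between consecutive experiences."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)
```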
Freeze targets: the target network has the same architecture as the Q-network but is updated less frequently, by copying the weights from the Q-network every $n$ steps.
\[\left[ r_{t+1} + \gamma \max_{a} \hat{q}(s_{t+1}, a; w^-) - \hat{q}(s_t, a_t; w) \right]\]Therefore $\max_{a} \hat{q}(s_{t+1}, a; w^-)$ stays fixed for $n$ steps.
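A sketch of the resulting semi-gradient DQN loss with a frozen target network; it assumes `q_net` and `target_net` are two PyTorch modules with the same architecture and `batch` holds tensors of states, actions, rewards, next states, and done flags:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Semi-gradient DQN loss: the target is computed with the frozen network w^-."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # q_hat(s_t, a_t; w)
    with torch.no_grad():                                    # target uses w^- and gets no gradient
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q_sa, target)

# Every n steps the frozen target network is refreshed:
#   target_net.load_state_dict(q_net.state_dict())
```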
Unlike tabular Double Q-learning, where the two estimators are updated with equal probability, in Double DQN only the online network is trained by gradient descent; the target network is periodically updated towards the online network, exactly as in the DQN algorithm.
\[Y_t^{\text{DoubleDQN}} = R_{t+1} + \gamma\, Q\big(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta_t^-\big)\]where $\theta_t$ are the online network parameters and $\theta_t^-$ are the target network parameters.
\[\text{TD target} = r(s, a) + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\big)\]Here the $\arg\max$ is computed with the online DQN, which chooses the action for the next state, while the outer $Q$ is the target network that evaluates the Q-value of taking that action in that state.
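A sketch of this Double DQN target computation, under the same assumptions about `q_net` (online, parameters $\theta$) and `target_net` (parameters $\theta^-$):

```python
import torch

@torch.no_grad()
def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Double DQN target: the online network selects the action,
    the target network evaluates it."""
    best_a = q_net(s_next).argmax(dim=1, keepdim=True)        # argmax_a Q(s', a; theta)
    q_eval = target_net(s_next).gather(1, best_a).squeeze(1)  # Q(s', best_a; theta^-)
    return r + gamma * (1.0 - done) * q_eval
```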
Sometimes it is not necessary to estimate the value of every state-action pair separately. So define the advantage, the relative improvement of taking action $a$ over following $\pi$:
\[A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)\]Note that under the optimal policy $V_\star(s) = \max_a q_\star(s, a)$, so the maximum value $A_\star$ can take is 0. Hence $Q_\pi(s, a) = V_\pi(s) + A_\pi(s, a)$. Problem: we cannot uniquely recover $V$ and $A$ from $Q$ in the aggregation layer, since shifting $V$ up and shifting $A$ down by the same constant yields the same $Q$ value. But we need $Q = V$ for the greedy action at optimality. So instead, we use:
\[Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\Big)\]where $\theta$ are the shared network parameters, $\alpha$ are the advantage-stream parameters and $\beta$ are the value-stream parameters.
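A sketch of the dueling architecture with this mean-centred aggregation in PyTorch; the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture: shared trunk, then separate value and advantage streams,
    combined with a mean-centred advantage."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())  # theta
        self.value = nn.Linear(hidden, 1)                                    # beta
        self.advantage = nn.Linear(hidden, n_actions)                        # alpha

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        v = self.value(h)                                # V(s; theta, beta), shape (B, 1)
        a = self.advantage(h)                            # A(s, a; theta, alpha), shape (B, |A|)
        return v + (a - a.mean(dim=1, keepdim=True))     # Q(s, a; theta, alpha, beta)
```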
The SARSA update rule is:
\[Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]\]Sampling $a_{t+1}$ introduces variance that slows down convergence.
Expected SARSA:
\[\begin{align*} Q(s_t, a_t) &= Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma E_\pi \big[ Q(S_{t+1}, A_{t+1}) \mid S_{t+1} \big] - Q(s_t, a_t) \right] \\ &= Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \sum_{a'} \pi(a' \mid s_{t+1}) Q(s_{t+1}, a') - Q(s_t, a_t) \right] \end{align*}\]This is still on-policy because the expectation is taken with respect to $\pi$, so you are learning the value function corresponding to the behaviour policy.
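A minimal tabular sketch of the Expected SARSA update, assuming `Q` is a NumPy array of shape `(n_states, n_actions)` and $\pi$ is $\varepsilon$-greedy with respect to `Q`:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, eps=0.1):
    """One tabular Expected SARSA step with an epsilon-greedy policy pi."""
    n_actions = Q.shape[1]
    # pi(a' | s_{t+1}) for an epsilon-greedy policy derived from Q
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - eps
    expected_q = probs @ Q[s_next]                  # E_pi[Q(S_{t+1}, A_{t+1}) | S_{t+1}]
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q
```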