This section explains n-step bootstrapping techniques, which generalize TD learning by updating value estimates using returns accumulated over multiple steps, balancing bias and variance in learning.
What lies in the space of methods between TD(0) and MC? The 1-step truncated return,
\[G_t^{(1)} = r_{t+1} + \gamma V_t(S_{t+1}),\]
is what we use in TD(0). Instead, consider the n-step truncated return:
\[G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(S_{t+n})\]
We bootstrap with the value function because we don't want to wait until the end of the episode. However, the value-function estimate could be incorrect, and so the return estimate could be incorrect as well.
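As a concrete illustration, here is a minimal Python sketch that computes the n-step truncated return from a buffer of observed rewards and the current value estimates; the function name and arguments are illustrative, not from any particular library.

```python
def n_step_return(rewards, V, s_next, n, gamma):
    """Compute the n-step truncated return G_t^(n).

    rewards : the n rewards r_{t+1}, ..., r_{t+n} observed after S_t
    V       : mapping from states to current value estimates V_t
    s_next  : the state S_{t+n} reached after n steps (used to bootstrap)
    n       : number of real rewards before bootstrapping
    gamma   : discount factor
    """
    assert len(rewards) == n
    # Discounted sum of the n observed rewards.
    g = sum(gamma**k * r for k, r in enumerate(rewards))
    # Bootstrap the rest of the return with the current value estimate.
    g += gamma**n * V[s_next]
    return g
```

With n = 1 this reduces to the TD(0) target, and as n grows toward the episode length it approaches the Monte Carlo return.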
Eligibility traces offer another way to span this spectrum: the trace e_t(s) indicates the degree to which each state is eligible for undergoing learning changes:
\[e_t(s) = \begin{cases} \gamma \lambda e_{t-1}(s) & \text{if } s \ne s_t \\ 1 + \gamma \lambda e_{t-1}(s) & \text{if } s = s_t \end{cases}\]
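To show how these traces are used, here is a sketch of one episode of TD(lambda) prediction with accumulating traces, applying the recursion above; the `env`, `policy`, and method names are assumed placeholders rather than a specific library API.

```python
from collections import defaultdict

def td_lambda_episode(env, policy, V, alpha, gamma, lam):
    """One episode of TD(lambda) prediction with accumulating traces.

    env    : assumed to expose reset() -> state and
             step(action) -> (next_state, reward, done)
    policy : function mapping a state to an action
    V      : defaultdict(float) of value estimates, updated in place
    """
    e = defaultdict(float)          # eligibility trace e(s), initially 0
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        # One-step TD error delta_t.
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]
        # Trace update: decay every trace, then bump the visited state.
        for state in list(e):
            e[state] *= gamma * lam
        e[s] += 1.0
        # Every state is updated in proportion to its eligibility.
        for state, trace in e.items():
            V[state] += alpha * delta * trace
        s = s_next
```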
Double Q-learning maintains two independent action-value estimates, Q1 and Q2, using one to select the greedy action and the other to evaluate it:

For each episode:
    Initialize S
    For each step of the episode:
        Choose A from S using a policy that is epsilon-greedy in Q1 + Q2
        Take A, observe R, S'
        With probability 0.5:
            Q1(S,A) ← Q1(S,A) + α[R + γ Q2(S', argmax_a Q1(S',a)) − Q1(S,A)]
        else:
            Q2(S,A) ← Q2(S,A) + α[R + γ Q1(S', argmax_a Q2(S',a)) − Q2(S,A)]
        S ← S'

Both Q1 and Q2 converge to q*.
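A minimal tabular implementation of the pseudocode above might look like the following sketch; the `env` interface (reset, step, actions) is an assumed placeholder, not a specific library.

```python
import random
from collections import defaultdict

def double_q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Double Q-learning sketch.

    env is assumed to provide reset() -> state,
    step(action) -> (next_state, reward, done), and a list env.actions.
    """
    Q1 = defaultdict(float)
    Q2 = defaultdict(float)

    def epsilon_greedy(s):
        # Act epsilon-greedily with respect to the sum Q1 + Q2.
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q1[(s, a)] + Q2[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)
            if random.random() < 0.5:
                # Q1 selects the greedy next action, Q2 evaluates it.
                a_star = max(env.actions, key=lambda x: Q1[(s_next, x)])
                target = r + (0.0 if done else gamma * Q2[(s_next, a_star)])
                Q1[(s, a)] += alpha * (target - Q1[(s, a)])
            else:
                # Roles reversed: Q2 selects, Q1 evaluates.
                b_star = max(env.actions, key=lambda x: Q2[(s_next, x)])
                target = r + (0.0 if done else gamma * Q1[(s_next, b_star)])
                Q2[(s, a)] += alpha * (target - Q2[(s, a)])
            s = s_next
    return Q1, Q2
```

Decoupling action selection from action evaluation in this way counteracts the maximization bias of ordinary Q-learning.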