This section explains n-step bootstrapping techniques, which generalize TD learning by updating value estimates using returns accumulated over multiple steps, balancing bias and variance in learning.
What lies in the space of methods between TD(0) and MC? The 1-step truncated return,
\[G_t^{(1)} = r_{t+1} + \gamma V_t(S_{t+1}),\]
is what we use in TD(0). Instead, consider the n-step truncated return:
\[G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(S_{t+n})\]
We bootstrap with the value function because we don't want to wait until the end of the episode. However, the value-function estimate could be incorrect, and so the return estimate could be incorrect as well.
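As a concrete illustration, here is a minimal Python sketch that computes the n-step truncated return from a buffer of observed rewards and the current value estimates; the function name and arguments are illustrative, not from any particular library.

```python
def n_step_return(rewards, V, s_next, n, gamma):
    """Compute the n-step truncated return G_t^(n).

    rewards : the n rewards r_{t+1}, ..., r_{t+n} observed after S_t
    V       : mapping from states to current value estimates V_t
    s_next  : the state S_{t+n} reached after n steps (used to bootstrap)
    n       : number of real rewards before bootstrapping
    gamma   : discount factor
    """
    assert len(rewards) == n
    # Discounted sum of the n observed rewards.
    g = sum(gamma**k * r for k, r in enumerate(rewards))
    # Bootstrap the rest of the return with the current value estimate.
    g += gamma**n * V[s_next]
    return g
```

With n = 1 this reduces to the TD(0) target, and as n grows toward the episode length it approaches the Monte Carlo return.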
Eligibility traces offer another way to span this spectrum: the trace e_t(s) indicates the degree to which each state is eligible for undergoing learning changes:
\[e_t(s) = \begin{cases} \gamma \lambda e_{t-1}(s) & \text{if } s \ne s_t \\ 1 + \gamma \lambda e_{t-1}(s) & \text{if } s = s_t \end{cases}\]
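To show how these traces are used, here is a sketch of one episode of TD(lambda) prediction with accumulating traces, applying the recursion above; the `env`, `policy`, and method names are assumed placeholders rather than a specific library API.

```python
from collections import defaultdict

def td_lambda_episode(env, policy, V, alpha, gamma, lam):
    """One episode of TD(lambda) prediction with accumulating traces.

    env    : assumed to expose reset() -> state and
             step(action) -> (next_state, reward, done)
    policy : function mapping a state to an action
    V      : defaultdict(float) of value estimates, updated in place
    """
    e = defaultdict(float)          # eligibility trace e(s), initially 0
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        # One-step TD error delta_t.
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]
        # Trace update: decay every trace, then bump the visited state.
        for state in list(e):
            e[state] *= gamma * lam
        e[s] += 1.0
        # Every state is updated in proportion to its eligibility.
        for state, trace in e.items():
            V[state] += alpha * delta * trace
        s = s_next
```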
Double Q-learning maintains two independent action-value estimates, Q1 and Q2, using one to select the greedy action and the other to evaluate it:

For each episode:
    Initialize S
    For each step of the episode:
        Choose A from S using a policy that is epsilon-greedy in Q1 + Q2
        Take A, observe R, S'
        With probability 0.5:
            Q1(S,A) ← Q1(S,A) + α[R + γ Q2(S', argmax_a Q1(S',a)) − Q1(S,A)]
        else:
            Q2(S,A) ← Q2(S,A) + α[R + γ Q1(S', argmax_a Q2(S',a)) − Q2(S,A)]
        S ← S'

Both Q1 and Q2 converge to q*.
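A minimal tabular implementation of the pseudocode above might look like the following sketch; the `env` interface (reset, step, actions) is an assumed placeholder, not a specific library.

```python
import random
from collections import defaultdict

def double_q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Double Q-learning sketch.

    env is assumed to provide reset() -> state,
    step(action) -> (next_state, reward, done), and a list env.actions.
    """
    Q1 = defaultdict(float)
    Q2 = defaultdict(float)

    def epsilon_greedy(s):
        # Act epsilon-greedily with respect to the sum Q1 + Q2.
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q1[(s, a)] + Q2[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)
            if random.random() < 0.5:
                # Q1 selects the greedy next action, Q2 evaluates it.
                a_star = max(env.actions, key=lambda x: Q1[(s_next, x)])
                target = r + (0.0 if done else gamma * Q2[(s_next, a_star)])
                Q1[(s, a)] += alpha * (target - Q1[(s, a)])
            else:
                # Roles reversed: Q2 selects, Q1 evaluates.
                b_star = max(env.actions, key=lambda x: Q2[(s_next, x)])
                target = r + (0.0 if done else gamma * Q1[(s_next, b_star)])
                Q2[(s, a)] += alpha * (target - Q2[(s, a)])
            s = s_next
    return Q1, Q2
```

Decoupling action selection from action evaluation in this way counteracts the maximization bias of ordinary Q-learning.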