Chapter 5: n-step Bootstrapping

This section explains n-step bootstrapping techniques, which generalize TD learning by updating value estimates using returns accumulated over multiple steps, balancing bias and variance in learning.

§5.01: n-step TD Prediction
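This section is unfilled; a sketch of the standard n-step TD formulation (in the usual Sutton & Barto notation) that it would cover: the n-step return truncates after n rewards and bootstraps off the current value estimate of the state reached n steps later,

\[G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n})\]

and the value update moves toward that target:

\[V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right]\]

Setting n = 1 recovers one-step TD, while letting n run to the end of the episode recovers the Monte Carlo update, so n trades off bias (small n) against variance (large n).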


§5.02: TD(λ) (Forward View)
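This section is unfilled; a standard statement of the λ-return (again assuming the usual Sutton & Barto notation) that it would cover: the forward view averages all n-step returns with geometrically decaying weights,

\[G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}\]

With λ = 0 this collapses to the one-step TD return, and with λ = 1 it becomes the Monte Carlo return, so λ interpolates between the two without committing to a single n.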


§5.03: Eligibility Traces (Backward View)

Eligibility traces indicate the degree to which each state is eligible for undergoing learning changes. With accumulating traces, every trace decays by γλ each step, and the trace of the currently visited state is additionally incremented by 1:

\[e_t(s) = \begin{cases} \gamma \lambda e_{t-1}(s) & \text{if } s \ne s_t \\ 1 + \gamma \lambda e_{t-1}(s) & \text{if } s = s_t \end{cases}\]
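The trace update above, combined with a TD-error-weighted value update, can be sketched as a single backward-view TD(λ) step. This is a minimal illustration, not a full agent; the function name, the dict-based tables, and the tiny A → B transition are all assumptions for the example.

```python
from collections import defaultdict

def td_lambda_step(V, e, s, r, s_next, alpha, gamma, lam, terminal=False):
    """One backward-view TD(lambda) update with accumulating traces.

    V: state -> value estimate, e: state -> eligibility trace
    (both defaultdict(float)); mutated in place.
    """
    # TD error for the current transition
    delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]
    # Accumulating trace: visited state gets e = 1 + gamma*lam*e_prev
    # (the gamma*lam decay was applied at the end of the previous step)
    e[s] += 1.0
    for state in list(e):
        # Every state is updated in proportion to its eligibility
        V[state] += alpha * delta * e[state]
        # Decay all traces by gamma*lam for the next step
        e[state] *= gamma * lam

# Tiny usage example: one hypothetical transition A -> B with reward 1
V, e = defaultdict(float), defaultdict(float)
td_lambda_step(V, e, 'A', 1.0, 'B', alpha=0.5, gamma=0.9, lam=0.8)
print(V['A'], e['A'])
```

Updating every state with a nonzero trace after each step is what lets a single TD error propagate credit backward to all recently visited states at once.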

§5.04: Maximization Bias & Double Q-Learning

Maximization bias arises because the same estimates are used both to select the greedy action and to evaluate it, so estimation noise is systematically overcounted. Double Q-Learning removes this by maintaining two independent estimates, using one to pick the action and the other to evaluate it:

For each episode:
  Initialize S
  For each step of the episode:
    Choose A from S using an epsilon-greedy policy on Q1 + Q2
    Take action A, observe R, S'
    With probability 0.5:
      Q1(S,A) ← Q1(S,A) + α[R + γ Q2(S', argmax_a Q1(S', a)) − Q1(S,A)]
    else:
      Q2(S,A) ← Q2(S,A) + α[R + γ Q1(S', argmax_a Q2(S', a)) − Q2(S,A)]
    S ← S'
  until S is terminal

Both Q1 and Q2 converge to q*.
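The pseudocode above can be sketched as runnable Python on a toy deterministic chain MDP. All environment details here (states 0–3, terminal at state 3, reward −1 per step, left/right actions) are assumptions chosen for illustration, not part of the original notes.

```python
import random

N_STATES = 4          # states 0..3; state 3 is terminal
ACTIONS = [0, 1]      # 0 = left, 1 = right (hypothetical toy chain MDP)
GAMMA, ALPHA, EPS = 1.0, 0.1, 0.1

def step(s, a):
    """Deterministic chain: right moves toward the terminal state; reward -1 per step."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, -1.0, s2 == N_STATES - 1

Q1 = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
Q2 = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]

def eps_greedy(s):
    # Behaviour policy is epsilon-greedy with respect to the sum Q1 + Q2
    if random.random() < EPS:
        return random.choice(ACTIONS)
    sums = [Q1[s][a] + Q2[s][a] for a in ACTIONS]
    return sums.index(max(sums))

random.seed(0)
for episode in range(2000):
    s, done = 0, False
    while not done:
        a = eps_greedy(s)
        s2, r, done = step(s, a)
        if random.random() < 0.5:
            # Q1 selects the greedy next action, Q2 evaluates it
            a_star = Q1[s2].index(max(Q1[s2]))
            target = r + (0.0 if done else GAMMA * Q2[s2][a_star])
            Q1[s][a] += ALPHA * (target - Q1[s][a])
        else:
            # Roles swapped: Q2 selects, Q1 evaluates
            a_star = Q2[s2].index(max(Q2[s2]))
            target = r + (0.0 if done else GAMMA * Q1[s2][a_star])
            Q2[s][a] += ALPHA * (target - Q2[s][a])
        s = s2

# Greedy policy should prefer moving right in every non-terminal state
print([Q1[s][1] + Q2[s][1] > Q1[s][0] + Q2[s][0] for s in range(3)])
```

Decoupling action selection from action evaluation is what suppresses the overestimation: noise that inflates Q1's argmax is unlikely to also inflate Q2's estimate of that same action.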