Chapter 4: Temporal-Difference Learning

This chapter covers Temporal-Difference (TD) learning methods that combine ideas from Monte Carlo and dynamic programming, enabling agents to learn directly from raw experience without a model of the environment.

§4.01: Recap of Dynamic Programming and Monte-Carlo

Monte-Carlo (MC) vs. Dynamic Programming (DP):
    - MC learns directly from experience; DP needs a complete model of the environment.
    - MC is model-free; DP is model-based.
    - MC does not bootstrap; DP bootstraps, i.e. it updates estimates using other estimates.
    - MC uses samples; DP does not sample.
    - MC uses sample backups; DP uses full backups.
    - MC uses deep backups over the entire trajectory; DP uses shallow one-step backups.

§4.02: Temporal Difference

TD-Prediction vs. TD-Control:
    - Goal: prediction estimates the value function; control estimates the value function and improves the policy.
    - Algorithms: TD(0) and TD(1) for prediction; SARSA and Q-Learning for control.
    - Usage: prediction evaluates a given policy; control finds the optimal policy.
    - Both need no knowledge of the transition probabilities p or rewards r; they learn from samples. Control methods do this by estimating action values Q(s, a).
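
To make the prediction case concrete, here is a minimal Python sketch of tabular TD(0) policy evaluation. The environment interface (env.reset(), env.step(action) returning (next_state, reward, done), env.num_states) and the policy(state) function are assumptions for illustration, not part of the notes.

import numpy as np

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    # Tabular TD(0): estimate V(s) for a fixed policy from sampled experience.
    # Assumes integer states in range(env.num_states).
    V = np.zeros(env.num_states)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # Bootstrapped target: R + gamma * V(S'), with V(S') dropped at terminal states.
            target = r + gamma * V[s_next] * (not done)
            # Move V(S) a step of size alpha toward the target (the TD error).
            V[s] += alpha * (target - V[s])
            s = s_next
    return V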

§4.03: SARSA: On-Policy TD-Control

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]\]

Pseudocode for SARSA

For each episode  
    - Initialise S; choose A from S using a policy derived from Q (e.g. ε-greedy)
    For each step of the episode
        - Take action A, observe R, S'
        - Choose A' from S' using the policy derived from Q (improvement)
        - Update Q(S, A) with the SARSA rule above (evaluation)
        - S = S'; A = A'
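
The following Python sketch implements the loop above. The environment interface (env.reset(), env.step(action) returning (next_state, reward, done), env.num_states, env.num_actions) is an assumed placeholder, not from the notes.

import numpy as np

def sarsa(env, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    # On-policy TD control: behaviour and target policy are the same eps-greedy policy.
    Q = np.zeros((env.num_states, env.num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(env.num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)  # improvement: choose A' from the current Q
            # Evaluation: update Q(S, A) toward R + gamma * Q(S', A').
            target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q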

§4.04: Q-Learning: Off-Policy TD Control

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]\]

Pseudocode for Q-Learning

For each episode
    - Initialise S
    For each step of the episode
        - Choose A from S using a policy derived from Q (ε-greedy)
        - Take action A, observe R, S'
        - Update Q(S, A) with the Q-Learning rule above
        - S = S'
Convergence: all state-action pairs must continue to be updated, i.e. every (S, A) pair must keep being visited.
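
For comparison, a sketch of tabular Q-Learning under the same assumed environment interface as the SARSA sketch. The only change is that the update bootstraps from max_a Q(S', a) instead of the action actually taken next, which is what makes it off-policy.

import numpy as np

def q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Off-policy TD control: behave eps-greedily, but bootstrap from the greedy target.
    Q = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Eps-greedy behaviour keeps visiting all (S, A) pairs, the convergence condition above.
            if np.random.rand() < epsilon:
                a = np.random.randint(env.num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Greedy target: bootstrap from max_a Q(S', a), regardless of the next action taken.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q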