Chapter 4: Temporal-Difference Learning

This chapter covers Temporal-Difference (TD) learning methods that combine ideas from Monte Carlo and dynamic programming, enabling agents to learn directly from raw experience without a model of the environment.

§4.01: Recap of Dynamic Programming and Monte-Carlo

Monte-Carlo (MC) vs. Dynamic Programming (DP):
    - MC learns directly from experience; DP needs a complete model of the environment.
    - MC is model-free; DP is model-based.
    - MC does not bootstrap; DP bootstraps, i.e. it updates estimates using other estimates.
    - MC uses samples; DP does not sample.
    - MC uses sample backups; DP uses full backups.
    - MC uses deep backups over the entire trajectory; DP uses shallow one-step backups.

§4.02: Temporal Difference

TD-Prediction vs. TD-Control:
    - Goal: prediction estimates the value function; control estimates the value function and improves the policy.
    - Algorithms: TD(0) and TD(1) for prediction; SARSA and Q-Learning for control.
    - Usage: prediction evaluates a given policy; control finds the optimal policy.
    - Both need no knowledge of the transition probabilities p or rewards r; they learn from samples. Control methods do this by estimating action values Q(s, a).
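
To make the prediction case concrete, here is a minimal Python sketch of tabular TD(0) policy evaluation. The environment interface (env.reset(), env.step(action) returning (next_state, reward, done), env.num_states) and the policy(state) function are assumptions for illustration, not part of the notes.

import numpy as np

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    # Tabular TD(0): estimate V(s) for a fixed policy from sampled experience.
    # Assumes integer states in range(env.num_states).
    V = np.zeros(env.num_states)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # Bootstrapped target: R + gamma * V(S'), with V(S') dropped at terminal states.
            target = r + gamma * V[s_next] * (not done)
            # Move V(S) a step of size alpha toward the target (the TD error).
            V[s] += alpha * (target - V[s])
            s = s_next
    return V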

§4.03: SARSA: On-Policy TD-Control

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]\]

Pseudocode for SARSA

For each episode  
    - Initialise S; choose A from S using a policy derived from Q (e.g. ε-greedy)
    For each step of the episode
        - Take action A, observe R, S'
        - Choose A' from S' using the policy derived from Q (improvement)
        - Update Q(S, A) with the SARSA rule above (evaluation)
        - S = S'; A = A'
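
The following Python sketch implements the loop above. The environment interface (env.reset(), env.step(action) returning (next_state, reward, done), env.num_states, env.num_actions) is an assumed placeholder, not from the notes.

import numpy as np

def sarsa(env, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    # On-policy TD control: behaviour and target policy are the same eps-greedy policy.
    Q = np.zeros((env.num_states, env.num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(env.num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)  # improvement: choose A' from the current Q
            # Evaluation: update Q(S, A) toward R + gamma * Q(S', A').
            target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q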

§4.04: Q-Learning: Off-Policy TD Control

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]\]

Pseudocode for Q-Learning

For each episode
    - Initialise S
    For each step of the episode
        - Choose A from S using a policy derived from Q (ε-greedy)
        - Take action A, observe R, S'
        - Update Q(S, A) with the Q-Learning rule above
        - S = S'
Convergence: all state-action pairs must continue to be updated, i.e. every (S, A) pair must keep being visited.
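
For comparison, a sketch of tabular Q-Learning under the same assumed environment interface as the SARSA sketch. The only change is that the update bootstraps from max_a Q(S', a) instead of the action actually taken next, which is what makes it off-policy.

import numpy as np

def q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Off-policy TD control: behave eps-greedily, but bootstrap from the greedy target.
    Q = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Eps-greedy behaviour keeps visiting all (S, A) pairs, the convergence condition above.
            if np.random.rand() < epsilon:
                a = np.random.randint(env.num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Greedy target: bootstrap from max_a Q(S', a), regardless of the next action taken.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q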