This chapter covers Temporal-Difference (TD) learning methods that combine ideas from Monte Carlo and dynamic programming, enabling agents to learn directly from raw experience without a model of the environment.
| Monte Carlo | Dynamic Programming |
|---|---|
| Learns directly from experience | Plans from a model rather than from experience |
| Model-free | Needs a complete model |
| No bootstrapping | Bootstraps, i.e. updates using other estimates |
| Uses samples | Does not sample |
| Sample backups | Full backups |
| Deep backups, i.e. over the entire trajectory | Shallow (one-step) backups |
| TD-Prediction | TD-Control |
|---|---|
| Goal: estimate the value function | Goal: estimate the value function and improve the policy |
| Algorithms: TD(0), TD(1) | Algorithms: Q-learning, SARSA |
| Usage: evaluate a given policy | Usage: find the optimal policy |
| No knowledge of p and r; learns from samples | Estimates action values |
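As a concrete sketch of TD-prediction, below is a minimal tabular TD(0) evaluator; the `env.reset()`/`env.step()` interface and the `policy` callable are assumptions for illustration, not part of the notes above.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): estimate V(s) for a fixed policy from sampled experience.

    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done, info).
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            # TD(0) update: bootstrap from the current estimate of V(s')
            target = r + gamma * V[s_next] * (not done)
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```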
Monte Carlo: on finite (batch) data, MC converges to the value estimates that minimize the mean-squared error on the observed returns.
Temporal Difference: on a finite sample, TD(0) converges to the certainty-equivalence estimate, i.e. the value function of the maximum-likelihood model of the MDP built from the data.
For infinite data, both converge to the same answer (the true value function).
If the process is Markov, we expect TD to produce lower error on future data, even though MC fits the existing data better.
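As a sketch of the certainty-equivalence idea (assuming a fixed policy, so the data forms a Markov reward process), build the maximum-likelihood model from counts and solve it exactly:

$$
\hat{P}(s' \mid s) = \frac{N(s \to s')}{N(s)}, \qquad
\hat{r}(s) = \frac{1}{N(s)} \sum_{t:\, S_t = s} R_{t+1}, \qquad
\hat{V} = (I - \gamma \hat{P})^{-1}\, \hat{r}
$$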
SARSA (on-policy TD control):
For each episode
- Initialise S; choose A from S using a policy derived from Q (e.g. ε-greedy)
For each step
- Take action A, observe R and S'
- Choose A' from S' using the policy derived from Q (policy improvement)
- Update Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)] (policy evaluation)
- S ← S', A ← A'
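A minimal tabular SARSA sketch, assuming a gym-style `env.reset()`/`env.step()` interface with discrete actions (all names are illustrative):

```python
import random
from collections import defaultdict

def sarsa(env, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA (on-policy TD control).

    Assumes env.reset() -> state, env.step(a) -> (next_state, reward, done, info),
    and env.action_space.n discrete actions.
    """
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = eps_greedy(s_next)            # improvement: next action from the current policy
            target = r + gamma * Q[s_next][a_next] * (not done)
            Q[s][a] += alpha * (target - Q[s][a])  # evaluation: TD update toward the SARSA target
            s, a = s_next, a_next
    return Q
```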
Q-learning (off-policy TD control):
For each episode
- Initialise S
For each step
- Choose A from S using a policy derived from Q (e.g. ε-greedy)
- Take action A, observe R and S'
- Update Q(S,A) ← Q(S,A) + α[R + γ max_a' Q(S',a') - Q(S,A)]
- S ← S'
Convergence: all state-action pairs must continue to be updated (i.e. sufficient exploration) for Q-learning to converge to the optimal action values.
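For comparison, a minimal tabular Q-learning sketch under the same assumed environment interface as the SARSA sketch above; the behaviour policy is ε-greedy while the update bootstraps from the greedy (max) action, which is what makes it off-policy:

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning (off-policy TD control)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: ε-greedy with respect to the current Q
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s_next, r, done, _ = env.step(a)
            # Target policy: greedy over next actions (max), independent of the action actually taken next
            target = r + gamma * max(Q[s_next]) * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```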