Chapter 1: Introduction & Multi-Armed Bandits

This chapter introduces the fundamental concepts of Reinforcement Learning, including its key characteristics of trial-and-error search and delayed rewards. It also introduces Multi-Armed Bandits, the exploration-exploitation tradeoff, and various methods for action-value estimation.

§1.01: Introduction

Exploration-Exploitation Tradeoff

This trade-off is pivotal, as excessive exploration may hinder the agent from exploiting its current knowledge effectively, while an overly exploitative approach may lead to suboptimal performance due to a lack of new learning.


§1.02: Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment.


§1.03: Multi-Armed Bandits

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior.

A $k$-armed Bandit Problem

Consider the following learning problem. You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.

If we knew the value of each action, i.e. the expected reward given that the action is selected, $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$, then the problem would be trivial to solve: we would always select the action with the highest value. In most scenarios, however, $q_*(a)$ is not known; instead we maintain an estimate $Q_t(a)$ that we would like to be close to $q_*(a)$.

If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions. When you select one of these actions, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring, because this enables you to improve your estimate of the nongreedy action’s value.
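To make the setup concrete, here is a minimal sketch of a stationary $k$-armed bandit testbed in Python. The class name, the Gaussian reward model, and the default of ten arms are illustrative assumptions, not something prescribed by the text.

```python
import numpy as np

class Bandit:
    """Minimal stationary k-armed bandit testbed (illustrative sketch).

    True action values q_*(a) are drawn once from a standard normal
    distribution; each pull of arm a returns q_*(a) plus unit-variance
    Gaussian noise.
    """

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values

    def pull(self, a):
        """Return a noisy reward for selecting action a."""
        return self.rng.normal(self.q_star[a], 1.0)

    def optimal_action(self):
        """Index of the arm with the highest true value (for evaluation only)."""
        return int(np.argmax(self.q_star))
```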


§1.04: Action-Value Methods

A natural way to estimate the true value of an action is by averaging the rewards actually received when that action was taken:

\[Q_t(a) = \frac{\text{sum of rewards when }a \text{ taken prior to }t}{\text{number of times }a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1}R_i \, 1_{A_i = a}}{\sum_{i=1}^{t-1}1_{A_i = a}}\]

Then a greedy strategy for action selection would be

\[A_t = \text{argmax}_a Q_t(a)\]

Since greedy actions exploit current knowledge without spending time on exploration, a simple alternative is to behave greedily most of the time but, with a small probability $\epsilon$, choose one of the non-greedy actions at random instead.

An asymptotic guarantee of the $\epsilon$-greedy method is that, over a large number of time steps, every action is sampled infinitely often, so all $Q_t(a) \rightarrow q_*(a)$ and the probability of selecting the optimal action converges to greater than $1-\epsilon$, i.e. to near certainty.
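A minimal sketch of $\epsilon$-greedy action selection with sample-average estimates, reusing the `Bandit` sketch above (all names and default values are illustrative):

```python
import numpy as np

def epsilon_greedy_run(bandit, k=10, steps=1000, epsilon=0.1, seed=None):
    """Run one ε-greedy agent that estimates Q_t(a) by sample averages."""
    rng = np.random.default_rng(seed)
    sums = np.zeros(k)               # sum of rewards per action
    counts = np.zeros(k, dtype=int)  # number of times each action was taken
    Q = np.zeros(k)                  # current action-value estimates
    rewards = np.zeros(steps)
    for t in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))                           # explore
        else:
            a = int(rng.choice(np.flatnonzero(Q == Q.max())))  # exploit, random tie-break
        r = bandit.pull(a)
        sums[a] += r
        counts[a] += 1
        Q[a] = sums[a] / counts[a]                             # sample-average estimate
        rewards[t] = r
    return Q, rewards
```

With $\epsilon = 0.1$ the agent keeps exploring forever; with $\epsilon = 0$ it reduces to the pure greedy rule above.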


§1.05: Incremental Implementation

Note that, for a given action, if $Q_n$ denotes the estimate of its value after it has been selected $n-1$ times and $R_i$ the reward received after its $i$-th selection, then

\[Q_n = \frac{R_1 + R_2 + \dots + R_{n-1}}{n-1}\]

\[Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\Big(R_n + \sum_{i=1}^{n-1} R_i\Big)\]

\[= \frac{1}{n}\Big(R_n + (n-1)\,\frac{1}{n-1} \sum_{i=1}^{n-1}R_i\Big) = \frac{1}{n}\big(R_n + (n-1)Q_n\big)\]

\[= Q_n + \frac{1}{n}\big[R_n - Q_n\big]\]

The update rule is of the form,

New Estimate = Old Estimate + Step Size [Target - Old Estimate]

The expression [Target − Old Estimate] is the error in the estimate; it is reduced by taking a step toward the “Target.” The target is presumed to indicate a desirable direction in which to move, though it may be noisy. More generally, we can denote the step size at time step $t$ by $\alpha_t(a)$.
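The derivation above translates into a constant-memory update. Here is a small sketch of the general New Estimate = Old Estimate + Step Size [Target − Old Estimate] rule (the function name is mine); the sample average is recovered with step size $1/n$:

```python
def incremental_update(old_estimate, target, step_size):
    """New estimate = old estimate + step size * (target - old estimate)."""
    return old_estimate + step_size * (target - old_estimate)

# Sample-average case: after the n-th reward for an action, use step size 1/n.
Q, n = 0.0, 0
for reward in [1.0, 0.0, 2.0]:        # example reward sequence for one action
    n += 1
    Q = incremental_update(Q, reward, 1.0 / n)
print(Q)  # 1.0, i.e. (1.0 + 0.0 + 2.0) / 3, the sample average
```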


§1.06: Tracking a Non-Stationary Problem

For non-stationary bandit problems, in which the reward distributions change over time, it makes more sense to give more weight to recent rewards than to long-past ones. One way to achieve this is to use a constant step size.

\[Q_{n+1} = Q_n + \alpha[R_n - Q_n], \alpha \in (0,1)\]

For example, take $\alpha = 0.9$. Then,

\[Q_3 = Q_2 + 0.9[R_2 - Q_2] = 0.9R_2 + 0.1Q_2\]

\[= 0.9R_2 + 0.1(Q_1 + 0.9[R_1 - Q_1])\]

\[= 0.9R_2 + 0.1(0.1Q_1 + 0.9R_1)\]

\[= 0.9R_2 + 0.09R_1 + 0.01Q_1\]

The general form of the above expansion is:

\[Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^n \alpha (1-\alpha)^{n-i}R_i\]
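To check that the weights in this expansion sum to one (for $0 < \alpha \le 1$), sum the geometric series:

\[(1-\alpha)^n + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i} = (1-\alpha)^n + \alpha\sum_{j=0}^{n-1}(1-\alpha)^{j} = (1-\alpha)^n + \alpha \cdot \frac{1-(1-\alpha)^n}{\alpha} = 1\]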

In the example above, the weight given to $R_2$ (0.9) is far larger than that given to $R_1$ (0.09); in general, the weight on $R_i$ decays exponentially in the distance $n-i$. $Q_{n+1}$ is a weighted average because the sum of the weights equals 1, as the geometric-series check confirms; this scheme is also called an exponential recency-weighted average. Note that with a constant step size the estimates never completely converge but continue to vary in response to the most recently received rewards.
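A brief sketch of tracking a single drifting arm with a constant step size (the random-walk drift model and every name below are my own assumptions for illustration):

```python
import numpy as np

def track_drifting_arm(steps=10_000, alpha=0.1, seed=0):
    """Constant step-size (exponential recency-weighted average) estimate
    of one arm whose true value slowly drifts as a random walk."""
    rng = np.random.default_rng(seed)
    q_true, Q = 0.0, 0.0
    for _ in range(steps):
        q_true += rng.normal(0.0, 0.01)   # slow drift of the true value
        reward = rng.normal(q_true, 1.0)  # noisy observed reward
        Q += alpha * (reward - Q)         # Q_{n+1} = Q_n + α [R_n − Q_n]
    return Q, q_true
```

Because recent rewards dominate, the estimate keeps following the drifting value, whereas a sample average with step size $1/n$ would respond ever more slowly.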


§1.07: Optimistic Initial Values


§1.08: Upper Confidence Bounds

Exploration is vital because the action-value estimates are uncertain: the greedy action looks best at the moment, but one of the alternatives may actually be better. $\epsilon$-greedy explores the non-greedy actions indiscriminately. It would be wiser to select among them according to their potential for actually being optimal, taking into account both how close their estimates are to the maximum and the uncertainty in those estimates. One effective way to do this is to select actions according to:

\[A_t = \text{argmax}_a [Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}]\]

The upper confidence bound (UCB) action selection method leverages the square-root term to gauge the uncertainty in an action’s estimated value. The quantity being maximized acts as an upper limit on the plausible true value of the action, with the confidence level controlled by the parameter $c$. When action $a$ is chosen, its uncertainty diminishes because $N_t(a)$, which appears in the denominator, increases. Conversely, when other actions are selected, $t$ increases while $N_t(a)$ remains constant, so the uncertainty estimate rises. The natural logarithm ensures that this growth slows over time, although it remains unbounded. Consequently, all actions will eventually be chosen, but those with lower value estimates or higher selection counts will be selected less and less frequently over time.
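A hedged sketch of UCB action selection (the function name and tie handling are mine). Actions that have never been tried are treated as maximizing, corresponding to the convention that $N_t(a) = 0$ makes the bound infinite:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ].

    Q: array of action-value estimates, N: array of selection counts,
    t: current time step (1-indexed), c: exploration strength.
    """
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])            # try every action at least once
    bonus = c * np.sqrt(np.log(t) / N)    # uncertainty term
    return int(np.argmax(Q + bonus))
```

After each step, $N_t(a)$ and $Q_t(a)$ for the chosen action are updated exactly as in the incremental implementation above.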


§1.09: Contextual Bandits

In a contextual bandit scenario, an agent is repeatedly presented with a set of choices or actions, akin to the arms of a multi-armed bandit. However, there is an additional crucial element: each decision is made in the presence of a context, a set of features describing the environment or situation in which the choice is to be made.

Each time the learner faces a decision, it is provided with a context or a feature vector. This context represents the state of the environment and helps the learner make informed choices.
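A rough sketch of this interaction loop (the linear scoring model, the $\epsilon$-greedy choice rule, and the `(context, reward_fn)` interface are all illustrative assumptions, not a prescribed contextual-bandit algorithm):

```python
import numpy as np

def contextual_epsilon_greedy(rounds, k, d, epsilon=0.1, alpha=0.1, seed=0):
    """rounds: iterable of (x, reward_fn) pairs, where x is a length-d
    context vector and reward_fn(a) returns the reward for action a."""
    rng = np.random.default_rng(seed)
    W = np.zeros((k, d))                     # one linear scorer per action
    for x, reward_fn in rounds:
        scores = W @ x                       # estimated value of each action given x
        if rng.random() < epsilon:
            a = int(rng.integers(k))         # explore
        else:
            a = int(np.argmax(scores))       # exploit the current model
        r = reward_fn(a)                     # reward revealed for the chosen action only
        W[a] += alpha * (r - scores[a]) * x  # update only the chosen action's weights
    return W
```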