Chapter 1: Introduction & Multi-Armed Bandits

This chapter introduces the fundamental concepts of Reinforcement Learning, including its key characteristics of trial-and-error search and delayed rewards. It also introduces Multi-Armed Bandits, the exploration-exploitation tradeoff, and various methods for action-value estimation.

§1.01: Introduction

Exploration-Exploitation Tradeoff

This trade-off is pivotal, as excessive exploration may hinder the agent from exploiting its current knowledge effectively, while an overly exploitative approach may lead to suboptimal performance due to a lack of new learning.


§1.02: Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment.


§1.03: Multi-Armed Bandits

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior.

A $k$-armed Bandit Problem

Consider the following learning problem. You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.

If we knew the value of each action, i.e. the expected reward given that the action is selected, $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$, then the problem would be trivial to solve: we would always select the action with the highest value. In most scenarios, however, $q_*(a)$ is not known; instead we maintain an estimate $Q_t(a)$ that we would like to be close to $q_*(a)$.

If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions. When you select one of these actions, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring, because this enables you to improve your estimate of the nongreedy action’s value.
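To make the setup concrete, here is a minimal sketch of a stationary $k$-armed bandit testbed in Python. The class name, the Gaussian reward model, and the default of ten arms are illustrative assumptions, not something prescribed by the text.

```python
import numpy as np

class Bandit:
    """Minimal stationary k-armed bandit testbed (illustrative sketch).

    True action values q_*(a) are drawn once from a standard normal
    distribution; each pull of arm a returns q_*(a) plus unit-variance
    Gaussian noise.
    """

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values

    def pull(self, a):
        """Return a noisy reward for selecting action a."""
        return self.rng.normal(self.q_star[a], 1.0)

    def optimal_action(self):
        """Index of the arm with the highest true value (for evaluation only)."""
        return int(np.argmax(self.q_star))
```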


§1.04: Action-Value Methods

A natural way to estimate the true value of an action is by averaging the rewards actually received when that action was taken:

\[Q_t(a) = \frac{\text{sum of rewards when }a \text{ taken prior to }t}{\text{number of times }a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1}R_i \, 1_{A_i = a}}{\sum_{i=1}^{t-1}1_{A_i = a}}\]

Then a greedy strategy for action selection would be

\[A_t = \text{argmax}_a Q_t(a)\]

Since greedy actions exploit current knowledge without spending time on exploration, a simple alternative is to behave greedily most of the time but, with a small probability $\epsilon$, choose one of the non-greedy actions at random instead.

An asymptotic guarantee of the $\epsilon$-greedy method is that, over a large number of time steps, every action is sampled infinitely often, so all $Q_t(a) \rightarrow q_*(a)$ and the probability of selecting the optimal action converges to greater than $1-\epsilon$, i.e. to near certainty.
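A minimal sketch of $\epsilon$-greedy action selection with sample-average estimates, reusing the `Bandit` sketch above (all names and default values are illustrative):

```python
import numpy as np

def epsilon_greedy_run(bandit, k=10, steps=1000, epsilon=0.1, seed=None):
    """Run one ε-greedy agent that estimates Q_t(a) by sample averages."""
    rng = np.random.default_rng(seed)
    sums = np.zeros(k)               # sum of rewards per action
    counts = np.zeros(k, dtype=int)  # number of times each action was taken
    Q = np.zeros(k)                  # current action-value estimates
    rewards = np.zeros(steps)
    for t in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))                           # explore
        else:
            a = int(rng.choice(np.flatnonzero(Q == Q.max())))  # exploit, random tie-break
        r = bandit.pull(a)
        sums[a] += r
        counts[a] += 1
        Q[a] = sums[a] / counts[a]                             # sample-average estimate
        rewards[t] = r
    return Q, rewards
```

With $\epsilon = 0.1$ the agent keeps exploring forever; with $\epsilon = 0$ it reduces to the pure greedy rule above.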


§1.05: Incremental Implementation

Note that, for a given action, if $Q_n$ denotes the estimate of its value after it has been selected $n-1$ times and $R_i$ the reward received after its $i$-th selection, then

\[Q_n = \frac{R_1 + R_2 + \dots + R_{n-1}}{n-1}\]

\[Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\Big(R_n + \sum_{i=1}^{n-1} R_i\Big)\]

\[= \frac{1}{n}\Big(R_n + (n-1)\,\frac{1}{n-1} \sum_{i=1}^{n-1}R_i\Big) = \frac{1}{n}\big(R_n + (n-1)Q_n\big)\]

\[= Q_n + \frac{1}{n}\big[R_n - Q_n\big]\]

The update rule is of the form,

New Estimate = Old Estimate + Step Size [Target - Old Estimate]

The expression [Target − Old Estimate] is the error in the estimate; it is reduced by taking a step toward the “Target.” The target is presumed to indicate a desirable direction in which to move, though it may be noisy. More generally, we can denote the step size at time step $t$ by $\alpha_t(a)$.
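The derivation above translates into a constant-memory update. Here is a small sketch of the general New Estimate = Old Estimate + Step Size [Target − Old Estimate] rule (the function name is mine); the sample average is recovered with step size $1/n$:

```python
def incremental_update(old_estimate, target, step_size):
    """New estimate = old estimate + step size * (target - old estimate)."""
    return old_estimate + step_size * (target - old_estimate)

# Sample-average case: after the n-th reward for an action, use step size 1/n.
Q, n = 0.0, 0
for reward in [1.0, 0.0, 2.0]:        # example reward sequence for one action
    n += 1
    Q = incremental_update(Q, reward, 1.0 / n)
print(Q)  # 1.0, i.e. (1.0 + 0.0 + 2.0) / 3, the sample average
```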


§1.06: Tracking a Non-Stationary Problem

For non-stationary bandit problems, in which the reward distributions change over time, it makes more sense to give more weight to recent rewards than to long-past ones. One way to achieve this is to use a constant step size.

\[Q_{n+1} = Q_n + \alpha[R_n - Q_n], \alpha \in (0,1)\]

For example, take $\alpha = 0.9$. Then,

\[Q_3 = Q_2 + 0.9[R_2 - Q_2] = 0.9R_2 + 0.1Q_2\]

\[= 0.9R_2 + 0.1(Q_1 + 0.9[R_1 - Q_1])\]

\[= 0.9R_2 + 0.1(0.1Q_1 + 0.9R_1)\]

\[= 0.9R_2 + 0.09R_1 + 0.01Q_1\]

The general form of the above expansion is:

\[Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^n \alpha (1-\alpha)^{n-i}R_i\]
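To check that the weights in this expansion sum to one (for $0 < \alpha \le 1$), sum the geometric series:

\[(1-\alpha)^n + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i} = (1-\alpha)^n + \alpha\sum_{j=0}^{n-1}(1-\alpha)^{j} = (1-\alpha)^n + \alpha \cdot \frac{1-(1-\alpha)^n}{\alpha} = 1\]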

In the example above, the weight given to $R_2$ (0.9) is far larger than that given to $R_1$ (0.09); in general, the weight on $R_i$ decays exponentially in the distance $n-i$. $Q_{n+1}$ is a weighted average because the sum of the weights equals 1, as the geometric-series check confirms; this scheme is also called an exponential recency-weighted average. Note that with a constant step size the estimates never completely converge but continue to vary in response to the most recently received rewards.
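A brief sketch of tracking a single drifting arm with a constant step size (the random-walk drift model and every name below are my own assumptions for illustration):

```python
import numpy as np

def track_drifting_arm(steps=10_000, alpha=0.1, seed=0):
    """Constant step-size (exponential recency-weighted average) estimate
    of one arm whose true value slowly drifts as a random walk."""
    rng = np.random.default_rng(seed)
    q_true, Q = 0.0, 0.0
    for _ in range(steps):
        q_true += rng.normal(0.0, 0.01)   # slow drift of the true value
        reward = rng.normal(q_true, 1.0)  # noisy observed reward
        Q += alpha * (reward - Q)         # Q_{n+1} = Q_n + α [R_n − Q_n]
    return Q, q_true
```

Because recent rewards dominate, the estimate keeps following the drifting value, whereas a sample average with step size $1/n$ would respond ever more slowly.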


§1.07: Optimistic Initial Values


§1.08: Upper Confidence Bounds

Exploration is vital because the action-value estimates are uncertain: the greedy action looks best at the moment, but one of the alternatives may actually be better. $\epsilon$-greedy explores the non-greedy actions indiscriminately. It would be wiser to select among them according to their potential for actually being optimal, taking into account both how close their estimates are to the maximum and the uncertainty in those estimates. One effective way to do this is to select actions according to:

\[A_t = \text{argmax}_a [Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}]\]

The upper confidence bound (UCB) action selection method leverages the square-root term to gauge the uncertainty in an action’s estimated value. The quantity being maximized acts as an upper limit on the plausible true value of the action, with the confidence level controlled by the parameter $c$. When action $a$ is chosen, its uncertainty diminishes because $N_t(a)$, which appears in the denominator, increases. Conversely, when other actions are selected, $t$ increases while $N_t(a)$ remains constant, so the uncertainty estimate rises. The natural logarithm ensures that this growth slows over time, although it remains unbounded. Consequently, all actions will eventually be chosen, but those with lower value estimates or higher selection counts will be selected less and less frequently over time.
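A hedged sketch of UCB action selection (the function name and tie handling are mine). Actions that have never been tried are treated as maximizing, corresponding to the convention that $N_t(a) = 0$ makes the bound infinite:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ].

    Q: array of action-value estimates, N: array of selection counts,
    t: current time step (1-indexed), c: exploration strength.
    """
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])            # try every action at least once
    bonus = c * np.sqrt(np.log(t) / N)    # uncertainty term
    return int(np.argmax(Q + bonus))
```

After each step, $N_t(a)$ and $Q_t(a)$ for the chosen action are updated exactly as in the incremental implementation above.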


§1.09: Contextual Bandits

In a contextual bandit scenario, an agent is repeatedly presented with a set of choices or actions, akin to the arms of a multi-armed bandit. However, there is an additional crucial element: each decision is made in the presence of a context, a set of features describing the environment or situation in which the choice is to be made.

Each time the learner faces a decision, it is provided with a context or a feature vector. This context represents the state of the environment and helps the learner make informed choices.
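A rough sketch of this interaction loop (the linear scoring model, the $\epsilon$-greedy choice rule, and the `(context, reward_fn)` interface are all illustrative assumptions, not a prescribed contextual-bandit algorithm):

```python
import numpy as np

def contextual_epsilon_greedy(rounds, k, d, epsilon=0.1, alpha=0.1, seed=0):
    """rounds: iterable of (x, reward_fn) pairs, where x is a length-d
    context vector and reward_fn(a) returns the reward for action a."""
    rng = np.random.default_rng(seed)
    W = np.zeros((k, d))                     # one linear scorer per action
    for x, reward_fn in rounds:
        scores = W @ x                       # estimated value of each action given x
        if rng.random() < epsilon:
            a = int(rng.integers(k))         # explore
        else:
            a = int(np.argmax(scores))       # exploit the current model
        r = reward_fn(a)                     # reward revealed for the chosen action only
        W[a] += alpha * (r - scores[a]) * x  # update only the chosen action's weights
    return W
```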