This chapter revisits the theory of risk minimization, providing a more in-depth analysis of established losses and of the connection between empirical risk minimization and maximum likelihood estimation.
Our goal is to minimise the risk for a given hypothesis $f \in H$ and loss $L(y, f(x))$:
\[R_L(f) = E_{xy}[L(y,f(x))] = \int L(y,f(x))\text{dP}_{xy}\]This risk is simply what I have previously known as “loss”. $f \in H$ is just a possible model (like logistic, linear, etc.) and we integrate over all $x, y$. We simply want to minimize the expectation with respect to the data generating process.
The $f^*$ with minimal risk is called the Risk Minimiser, Population Minimizer, or Bayes Optimal Model, where
\[f^* = \underset{f: X \rightarrow \mathbb{R}^g}{\operatorname{argmin}} R_L(f) = \underset{f: X \rightarrow \mathbb{R}^g}{\operatorname{argmin}} \int L(y, f(x)) \text{dP}_{xy}\]The resulting risk is called the Bayes Risk,
\[R_L^* = \inf_{f: X \rightarrow \mathbb{R}^g} R_L(f)\]In practice, (i) we usually have to restrict the hypothesis space so that we can efficiently search over it, and (ii) we usually also do not know the true data generating process $\text{P}_{xy}$. Instead of $R_L(f)$ (the theoretical risk), we therefore optimise the empirical risk, i.e.
\[\hat{f} = \underset{f \in H}{\operatorname{argmin}}R_{emp} (f) = \underset{f \in H}{\operatorname{argmin}} \Sigma_{i=1}^n L(y^{(i)}, f(x^{(i)}))\]According to the Law of Large Numbers, $\frac{1}{n}R_{emp}(f) \rightarrow R_L(f)$ as $n \rightarrow \infty$. (This is basically the loss as I have always learnt it.)
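A minimal sketch of empirical risk minimisation under these definitions: a restricted hypothesis space of linear models through the origin, searched with a simple grid (the data, the grid, and the helper `emp_risk` are made up for illustration).

```python
import numpy as np

# toy data from a data generating process that is unknown in practice
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.5 * x + rng.normal(0, 0.5, size=200)

def emp_risk(f, x, y):
    """Empirical risk: sum of L2 losses over the training data."""
    return np.sum((y - f(x)) ** 2)

# restricted hypothesis space H: f(x) = theta * x, with theta on a grid
thetas = np.linspace(-5, 5, 1001)
risks = [emp_risk(lambda x, t=t: t * x, x, y) for t in thetas]
theta_hat = thetas[np.argmin(risks)]  # empirical risk minimiser within H
print(theta_hat)                      # close to the true slope 1.5
```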
The goal of learning is therefore to minimise the Bayes Regret, i.e. train a model such that the difference $R_L(\hat{f}) - R_L^*$ is as small as possible. Note that the Bayes Regret can be decomposed as follows:
\[R_L(\hat{f}) - R_L^* = \underbrace{\left(R_L(\hat{f}) - \inf_{f \in H} R_L(f)\right)}_{\text{estimation error}} + \underbrace{\left(\inf_{f \in H} R_L(f) - R_L^*\right)}_{\text{approximation error}}\]
(i) Estimation Error: We fit $\hat{f}$ via empirical risk minimization and usually use approximate optimisation, so we generally do not find the optimal $f \in H$. In simple words: even within the restricted hypothesis space, we can only find the empirical risk minimiser, since the true data generating process is not known. The estimation error captures this delta.
(ii) Approximation Error: Even if we found the optimal model within the restricted hypothesis space (i.e. we knew the data generating process and it were not an empirical estimate; note how it is $f$ and not $\hat{f}$), it is still entirely possible that the Bayes optimal model is not even in our hypothesis space (for example, if the true optimal model is quadratic but the hypothesis space is linear). The approximation error accounts for this.
Suppose we study the risk of our inducing algorithm $I$ for a specific data generating distribution $P_{xy}$: we sample training sets $D_{train}$ of size $n_{train}$ from this distribution, run the algorithm on $D_{train}$, and evaluate the true risk of the induced model. If this risk converges in probability to the Bayes risk $R_L^*$ as $n_{train}$ grows, we say that the learning method $I$ is consistent w.r.t. $P_{xy}$.
\[R_L(I(D_{train}, \lambda)) \overset{p}{\rightarrow} R^*_L\]While the risk minimizer gives us the (theoretical) optimal solution, the optimal constant model (i.e. the featureless predictor) gives us a computable empirical baseline.
The constant model $f(x) = \theta$ is fit by minimising the empirical risk $R_{emp}(\theta)$ over $\theta$.
We further define pseudo-residuals as the negative first derivatives of the loss function w.r.t. $f(x)$:
\[\tilde{r} = -\frac{\partial L(y, f(x)) }{\partial f(x)}\]Consider $L(y, f(x)) = (y-f(x))^2$ i.e. the $L_2$ loss. Then, $\tilde{r} = 2(y-f(x))$
Similarly, if we compute the best point-wise update using gradient descent,
\[f(x) \leftarrow f(x) - \frac{\partial L(y, f(x))}{\partial f(x)} = f(x) + \tilde{r}\]Iteratively stepping in the direction of the pseudo-residuals is the whole idea behind gradient boosting.
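A small sketch of this point-wise update for the (half) $L_2$ loss, where the pseudo-residuals are just the residuals; this only illustrates the core idea behind gradient boosting, not an actual boosting implementation, and the step size is arbitrary.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])
f = np.zeros_like(y)           # current predictions, e.g. from an initial constant model

for _ in range(100):
    r_tilde = y - f            # pseudo-residuals of the 0.5 * (y - f)^2 loss
    f = f + 0.1 * r_tilde      # step in the direction of the pseudo-residuals
print(f)                       # approaches y, the point-wise minimum of the loss
```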
Normally, you don’t do gradient descent on the model itself (since then there is no restriction on the form of the model) as you may end up with an invalid model which does not lie in the hypothesis space. Thus gradient descent is normally performed on the parameters of the model:
\[\theta^{[t+1]} = \theta^{[t]} - \alpha^{[t]} \nabla_{\theta}R_{emp}(\theta)|_{\theta = \theta^{[t]}}\]The $L_2$ loss is given by $L(y, f(x)) = (y - f(x))^2$. Minimising this function is the same as minimising $0.5 (y - f(x))^2$. Considering this as our loss function, we get
\[\tilde{r} = -\frac{\partial L(y, f(x))}{\partial f(x)} = y - f(x) = r\]i.e. the pseudo-residuals are equal to the residuals.
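To connect the two views, here is a sketch of gradient descent on the parameters of a linear model under the half $L_2$ loss; the gradient of $R_{emp}(\theta)$ aggregates the per-observation pseudo-residuals (I scale it by $1/n$ only to get a stable step size; data and step size are made up).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true + rng.normal(0, 0.1, size=100)

theta = np.zeros(2)
alpha = 0.1
for _ in range(1000):
    r_tilde = y - X @ theta            # pseudo-residuals under the half L2 loss
    grad = -(X.T @ r_tilde) / len(y)   # gradient of R_emp(theta), scaled by 1/n
    theta = theta - alpha * grad       # theta^[t+1] = theta^[t] - alpha * grad
print(theta)                           # close to theta_true
```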
Using the law of total expectation we get, \(R_L(f) = E_{xy}[L(y, f(x))] = E_X\left[E_{Y|X}[L(y, f(x)) \mid X = x]\right] = E_X\left[E_{Y|X}[(y - f(x))^2 \mid X = x]\right]\)
In a theoretical risk minimisation problem, we have an unrestricted hypothesis space and can therefore construct $f$ point-wise such that it outputs any value $c$. Hence we no longer need to consider the outer expectation over $X$. The problem then boils down to finding a $c$ such that $E_{Y\mid X}[(y-c)^2 \mid X = x]$ is minimal, i.e.
\[\hat{f}(x) = \underset{c}{\operatorname{argmin}}\, M(c), \quad M(c) = E_{Y|X}[(y-c)^2 \mid X = x]\] \[\begin{aligned} M(c) &= \mathbb{E}_{Y|X}[(y - c)^2 \mid X = x] \\ &= \operatorname{Var}(y - c \mid X = x) + \left(\mathbb{E}_{Y|X}[y - c \mid X = x]\right)^2 \\ &= \operatorname{Var}(y \mid X = x) + \left(\mathbb{E}_{Y|X}[y \mid X = x] - c\right)^2 \end{aligned}\]Here we use $E[Z^2] = \operatorname{Var}(Z) + E[Z]^2$ (with $Z = y - c$) and $\operatorname{Var}(Z + a) = \operatorname{Var}(Z)$. The first term (the variance) is therefore independent of $c$, so the optimization problem boils down to:
\[\underset{c}{\operatorname{argmin}} (E_{Y|X}[Y|X=x] - c)^2\]It is evident that this is minimised when $c = \hat{f}(x) = \mathbb{E}_{Y \mid X}[Y \mid X=x]$, i.e. the $L_2$ risk minimiser is the conditional mean.
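A quick numerical check of this point-wise result, for an assumed toy conditional distribution $y \mid x \sim N(3, 1)$ at a fixed $x$: minimising the (Monte Carlo) expected squared error over $c$ recovers the conditional mean.

```python
import numpy as np

rng = np.random.default_rng(2)
y_samples = rng.normal(loc=3.0, scale=1.0, size=100_000)  # assumed y | x ~ N(3, 1)

cs = np.linspace(0, 6, 601)
M = [np.mean((y_samples - c) ** 2) for c in cs]           # estimate of E[(y - c)^2 | x]
print(cs[np.argmin(M)])                                   # ~3.0, the conditional mean
```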
In the optimal constant model, $f(x) = \theta$, so for the $L_2$ loss, $L(y, f(x)) = (y-\theta)^2$ and $R_{emp}(f; \theta) = \Sigma (y^{(i)} - \theta)^2$. Taking the derivative of this function and setting it to 0 gives us the optimal $\hat{\theta}$.
\[\frac{\partial R_{emp}(f; \theta)}{\partial \theta} = -2 \Sigma (y^{(i)} - \theta) =_{set} 0\] \[\Leftrightarrow \Sigma y^{(i)} - n\theta = 0 \Leftrightarrow \hat{\theta} = \frac{1}{n} \Sigma y^{(i)} = \bar{y}\]Note that with the $L_2$ loss and a constant model we are minimising the squared distances from a single point, in our case the mean. Plugging $\bar{y}$ into $R_{emp}$ gives the sum of squared errors, which, divided by $n$ or $n-1$, is simply the empirical variance.
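A minimal numerical check of both claims (the data vector is arbitrary): the $L_2$-optimal constant equals the sample mean, and the resulting empirical risk divided by $n$ equals the empirical variance.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([2.0, 4.0, 4.0, 5.0, 10.0])

res = minimize_scalar(lambda theta: np.sum((y - theta) ** 2))
print(res.x, y.mean())                        # both give the sample mean 5.0

print(np.sum((y - y.mean()) ** 2) / len(y),   # R_emp(theta_hat) / n ...
      y.var())                                # ... equals the empirical variance 7.2
```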
Consider a discrete classifier $h(x): X \rightarrow Y$. The most natural choice for $L(y, h(x))$ would be the 0-1 loss:
\[L(y, h(x)) = \mathbb{1}({y \neq h(x)})\]For the binary case with $y \in \{-1, +1\}$, we can also express this via the margin $v = yh(x)$. The loss is then $\mathbb{1}(v < 0)$, i.e. $\mathbb{1}(yh(x) < 0)$. Note that $v > 0$ when $x$ is correctly classified $(-1 \cdot -1 = 1,\ 1 \cdot 1 = 1)$, and $v < 0$ when it is incorrectly classified.
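A two-line sanity check (labels and hypothetical predictions are made up) that the indicator and margin formulations of the 0-1 loss coincide for $y, h(x) \in \{-1, +1\}$.

```python
import numpy as np

y = np.array([-1, -1, 1, 1])
h = np.array([-1, 1, 1, -1])            # hypothetical predictions in {-1, +1}

loss_01 = (y != h).astype(int)          # 0-1 loss as an indicator of misclassification
loss_margin = (y * h < 0).astype(int)   # same loss via the margin v = y * h(x)
print(loss_01, loss_margin)             # identical: [0 1 0 1] [0 1 0 1]
```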
We can now derive the risk minimiser for the 0-1 loss, where the last line is just the definition of the expected value in the discrete case (since it is a classification problem). Note that $P(y=k \mid \mathbf{x}=\mathbf{x})$ is the posterior probability for class $k$.
\[\begin{align} \mathcal{R}(f) &= \mathbb{E}_{xy}[L(y,f(\mathbf{x}))] \\ &= \mathbb{E}_{x}\left[\mathbb{E}_{y|x}[L(y,f(\mathbf{x}))]\right] \\ &= \mathbb{E}_{x}\left[\sum_{k \in \mathcal{Y}} L(k,f(\mathbf{x}))P(y=k \mid \mathbf{x}=\mathbf{x})\right] \end{align}\]With an unrestricted hypothesis space, the theoretical risk minimiser can again be constructed point-wise. Thus the problem essentially boils down to:
\[h^*(x) = \underset{l \in Y}{\operatorname{argmin}} \Sigma_{k \in Y} L(k, l) P(y=k | x=x)\]Remember that when $k=l$, the loss is $0$. Therefore we can rewrite the sum as (make note of the set we are summing over):
\[h^*(x) = \underset{l \in Y}{\operatorname{argmin}} \Sigma_{k \neq l} P(y=k | x=x)\]From the basic axioms of probability, the sum of all probabilities apart from the case when $k=l$ is equivalent to:
\[h^*(x) = \underset{l \in Y}{\operatorname{argmin}} 1 - P(y=l | x=x) \\ = \underset{l \in Y}{\operatorname{argmax}} P(y=l|x=x)\]Here we use that the argmin of a negated quantity is the argmax of that quantity, and that the constant $1$ does not affect the argmin/argmax. This is also intuitive: we always choose the class with the maximal posterior probability. $h^*(x) = \underset{l \in Y}{\operatorname{argmax}} P(y=l \mid x=x)$ is also referred to as the Bayes Optimal Classifier. The associated Bayes Risk is given by:
\[E_x[1 - \max_{l \in Y} P(y=l|x=x)] \\ = 1 - E_x[\max_{l \in Y} P(y=l|x=x)]\]For each feature vector $x$, we take the class with the maximal posterior probability and consider its counter-probability; averaging over $x$ gives the Bayes risk. Take for example two points, one with posterior probability 0.9 for the maximal class and the other with 0.8. Then the Bayes risk is $1 - (0.9+0.8)/2 = 0.15$.
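The same two-point example as a tiny sketch (the posterior table is of course made up):

```python
import numpy as np

# rows = feature vectors x, columns = posterior probabilities P(y = k | x)
posteriors = np.array([
    [0.9, 0.1],   # first point: maximal class has posterior 0.9
    [0.8, 0.2],   # second point: maximal class has posterior 0.8
])

h_star = posteriors.argmax(axis=1)                # Bayes optimal class per point
bayes_risk = np.mean(1 - posteriors.max(axis=1))  # 1 - (0.9 + 0.8)/2
print(h_star, bayes_risk)                         # [0 0] 0.15
```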
For the Bernoulli (log) loss, there are two equivalent formulations for the two common label encodings. It is not hard to show that one is equivalent to the other: \(\begin{align} L(y, f(\mathbf{x})) &= \ln(1 + \exp(-y \cdot f(\mathbf{x}))) && \text{for } y \in \{-1, +1\} \\ L(y, f(\mathbf{x})) &= -y \cdot f(\mathbf{x}) + \log(1 + \exp(f(\mathbf{x}))) && \text{for } y \in \{0, 1\} \end{align}\)
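A quick numerical check of this equivalence, mapping $y \in \{0, 1\}$ to $\tilde{y} = 2y - 1 \in \{-1, +1\}$ (scores are arbitrary):

```python
import numpy as np

f = np.linspace(-5, 5, 11)                      # arbitrary scores f(x)
for y01 in (0, 1):
    y_pm = 2 * y01 - 1                          # map {0, 1} -> {-1, +1}
    loss_pm = np.log(1 + np.exp(-y_pm * f))     # encoding y in {-1, +1}
    loss_01 = -y01 * f + np.log(1 + np.exp(f))  # encoding y in {0, 1}
    print(np.allclose(loss_pm, loss_01))        # True for both labels
```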
The scores $f(x)$ can be transformed into probabilities using the logistic function $\pi(x) = \frac{1}{1 + e^{-f(x)}} \leftrightarrow f(x) = \ln(\frac{\pi(x)}{1-\pi(x)})$. This leads to yet another equivalent formulation of the loss (the one I have known):
\[L(y, \pi(x)) = -y\log(\pi(x)) - (1-y)\log(1-\pi(x))\]For $\pi(x)$ and $y \in \{0,1\}$, we can write the risk as follows:
\[\begin{align} R(f) &= \mathbb{E}_x\left[L(1, \pi(x)) \cdot P(y = 1 \mid x) + L(0, \pi(x)) \cdot (1 - P(y = 1 \mid x))\right] \\ &= \mathbb{E}_x\left[-\log(\pi(x)) \cdot P(y = 1 \mid x) - \log(1 - \pi(x)) \cdot (1 - P(y = 1 \mid x))\right] \end{align}\]For a fixed $x$, we can compute the point-wise optimal value $c$ by setting the derivative to 0:
$\frac{\partial}{\partial c}\left(-\log( c)\, P(y=1 \mid x) - \log(1-c)\,(1-P(y=1 \mid x))\right) = -\frac{P(y=1 \mid x)}{c} + \frac{1-P(y=1 \mid x)}{1-c} = 0 \Leftrightarrow c = P(y=1 \mid x)$
Hence the risk minimiser for the Bernoulli loss on probabilities is simply $\pi^*(x) = P(y=1 \mid x)$.
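A numerical check of the point-wise argument: for a fixed $x$ with an assumed $P(y=1 \mid x) = 0.7$, minimising the expected Bernoulli loss over $c$ recovers exactly that probability.

```python
import numpy as np

p = 0.7                                             # assumed P(y = 1 | x)
cs = np.linspace(0.001, 0.999, 999)
risk = -np.log(cs) * p - np.log(1 - cs) * (1 - p)   # point-wise expected loss
print(cs[np.argmin(risk)])                          # ~0.7 = P(y = 1 | x)
```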
For a loss defined on $y \in \{-1, +1\}$ and scores $f(x)$, we can write the risk as follows:
\[\begin{align} R(f) &= \mathbb{E}_x\left[L(1, f(x)) \cdot P(y = 1 \mid x) + L(-1, f(x)) \cdot (1 - P(y = 1 \mid x))\right] \\ &= \mathbb{E}_x\left[\ln(1 + \exp(-f(x))) \cdot P(y = 1 \mid x) + \ln(1 + \exp(f(x))) \cdot (1 - P(y = 1 \mid x))\right] \end{align}\]For a fixed $x$ we can compute the point-wise optimal value $c$ by setting the derivative to 0:
\[\frac{\partial}{\partial c}\Big(\ln(1 + \exp(-c))\, p + \ln(1 + \exp(c))\,(1-p)\Big) = -\frac{\exp(-c)}{1+\exp(-c)}\, p + \frac{\exp(c)}{1+\exp(c)}\,(1-p) = \frac{1}{1+\exp(-c)} - p = 0,\]where $p = P(y = 1 \mid x)$. Thus the risk minimiser is given by:
\[f^*(x) = \ln\left(\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}\right)\]This function is undefined when $P(y = 1 \mid x) \in \{0, 1\}$, but otherwise it is a smooth curve that grows as $P(y = 1 \mid x)$ increases and equals $0$ when $P(y = 1 \mid x) = 0.5$.
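The same check as above, but on the score scale: the point-wise minimiser of the expected loss over $c$ coincides with the log-odds $\ln\frac{p}{1-p}$ (again assuming $p = 0.7$).

```python
import numpy as np

p = 0.7                                                  # assumed P(y = 1 | x)
cs = np.linspace(-5, 5, 10_001)
risk = np.log(1 + np.exp(-cs)) * p + np.log(1 + np.exp(cs)) * (1 - p)
print(cs[np.argmin(risk)], np.log(p / (1 - p)))          # both ~0.847
```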
https://github.com/slds-lmu/lecture_sl/raw/main/slides-pdf/slides-advriskmin-logreg-deepdive.pdf
The Brier score can be thought of as the $L_2$ loss on probabilities $\pi(x)$, i.e.
\[L(y, \pi(x)) = (y - \pi(x))^2\]The risk minimiser of the binary Brier Score is given by:
\[\pi^*(x) = P(Y=1|X=x)\]We now turn to the connection between empirical risk minimisation and maximum likelihood estimation. Given iid data $D = \{(x^{(i)}, y^{(i)}) \}_{i=1}^{n}$, in MLE we maximise the likelihood:
\[\max_\theta \mathcal{L}(\theta) = \Pi_{i=1}^{n} p(y^{(i)} | x^{(i)}, \theta) \quad\Leftrightarrow\quad \min_\theta - \Sigma_{i=1}^{n} \log p(y^{(i)} | x^{(i)}, \theta)\]Simply defining the negative log-likelihood as our loss function:
\[L(y, f(x|\theta)) = -\log p(y|x, \theta)\]Then we get ERM, $R_{emp}(\theta) =$
\[\Sigma_{i=1}^n L(y^{(i)}, f(x^{(i)}|\theta)) = - \Sigma_{i=1}^n \log p(y^{(i)} | x^{(i)}, \theta)\]which is exactly the negative log-likelihood, so ERM with this loss and MLE yield the same $\hat{\theta}$.
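A small sketch of this equivalence for a Gaussian error model (everything here, the data, the model $f(x \mid \theta) = \theta x$, and the fixed $\sigma$, is made up for illustration): minimising the negative log-likelihood gives the same $\hat{\theta}$ as minimising the $L_2$ empirical risk.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 50)
y = 2.0 * x + rng.normal(0, 0.3, 50)

sigma = 0.3                                  # assumed fixed noise level

def neg_log_lik(theta):
    # -sum_i log p(y_i | x_i, theta) for y | x ~ N(theta * x, sigma^2)
    return np.sum(0.5 * ((y - theta * x) / sigma) ** 2
                  + np.log(sigma * np.sqrt(2 * np.pi)))

def emp_risk(theta):
    # ERM with the L2 loss
    return np.sum((y - theta * x) ** 2)

print(minimize_scalar(neg_log_lik).x, minimize_scalar(emp_risk).x)  # same theta_hat
```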
The reverse direction also holds: if we can write the loss as $L(y, f(x)) = L(y - f(x)) = L(r)$, i.e. the loss is a function of the residuals, then minimizing $L(y - f(x))$ is equivalent to maximizing $\log p(y - f(x) \mid \theta)$ if we choose the error density proportional to $\exp(-L(r))$ (e.g. the $L_2$ loss corresponds to Gaussian errors and the $L_1$ loss to Laplace errors).
The generalisation error can be decomposed into three parts: bias, variance, and inherent noise. The expected error of the learning algorithm $I_L$ for an induced model $\hat{f}_{D_n}$, trained on data sets of size $n$ and applied to a fresh unseen observation, is:
\[GE_n(I_L) = E_{D_n \sim P^n_{xy}, (x,y) \sim P_{xy}} [L(y, \hat{f}_{D_n}(x))]\]i.e. we take the expectation over all training sets of size $n$ as well as over the independent test observation.
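A sketch of how $GE_n$ could be estimated by simulation under an assumed data generating process (the process, the inducer, and the loss are all my own choices here; the inducer is the optimal constant model under the $L_2$ loss).

```python
import numpy as np

rng = np.random.default_rng(4)

def sample(n):
    x = rng.uniform(-1, 1, n)
    y = x ** 2 + rng.normal(0, 0.1, n)   # assumed data generating process P_xy
    return x, y

n, losses = 20, []
for _ in range(2000):
    x_tr, y_tr = sample(n)               # training set D_n ~ P^n_xy
    f_hat = y_tr.mean()                  # inducer: optimal constant model under L2
    x_te, y_te = sample(1)               # fresh observation (x, y) ~ P_xy
    losses.append((y_te[0] - f_hat) ** 2)
print(np.mean(losses))                   # Monte Carlo estimate of GE_n(I_L)
```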