This chapter treats the multiclass case of classification. With more than two classes, some techniques from the binary scenario no longer apply, and the loss functions must be adapted.
Multiclass Brier Score is defined on a vector of predicted class probabilities $(\pi_1(x), \dots, \pi_g(x))$:
\[L(y, \pi(x)) = \sum_{k=1}^g (1_{\{y=k\}} - \pi_k(x))^2\]
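As a quick illustration, here is a minimal NumPy sketch of this score for a single observation; the function name `multiclass_brier` and the 0-indexed class encoding are my own choices for the example, not from the text:

```python
import numpy as np

def multiclass_brier(y, pi):
    # y  : true class index in {0, ..., g-1}
    # pi : predicted class probabilities, shape (g,)
    onehot = np.zeros_like(pi)
    onehot[y] = 1.0
    return np.sum((onehot - pi) ** 2)

# Three classes, true class is index 1:
print(multiclass_brier(1, np.array([0.2, 0.7, 0.1])))  # 0.2^2 + 0.3^2 + 0.1^2 = 0.14
```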
Multiclass Log Loss / Cross Entropy Loss is a generalisation of the binary case:
\[L(y, \pi(x)) = - \sum_{k=1}^g 1_{\{y=k\}} \log(\pi_k(x))\]
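A matching sketch for the log loss: since the indicator zeroes out every term except the true class, the sum collapses to a single term. The clipping of `pi` away from zero is an added numerical safeguard, not part of the definition:

```python
import numpy as np

def multiclass_log_loss(y, pi, eps=1e-15):
    # The indicator selects only the true class, so the sum
    # reduces to -log(pi_y). Clipping avoids log(0).
    pi = np.clip(pi, eps, 1.0)
    return -np.log(pi[y])

print(multiclass_log_loss(1, np.array([0.2, 0.7, 0.1])))  # -log(0.7) ≈ 0.357
```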
Softmax regression is a straightforward generalisation of logistic regression to the multiclass case. Instead of a single linear discriminant function, we have $g$ linear discriminant functions, each indicating confidence in class $k$:
\[f_k(x) = \theta_k^T x\]
The $g$ score functions are transformed into probabilities by the softmax function:
\[\pi_k(x) = s(f(x))_k = \frac{\exp(\theta_k^T x)}{\sum_{j=1}^g \exp(\theta_j^T x)}\]
Note that the softmax function is a smooth approximation of the argmax function and is invariant to constant offsets, i.e.:
\[s(f(x) + c) = s(f(x))\]
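This offset invariance is exactly what the standard numerically stable implementation exploits: subtracting $\max_k f_k(x)$ before exponentiating prevents overflow without changing the output. A minimal sketch (the function name is my own):

```python
import numpy as np

def softmax(f):
    # Subtracting max(f) uses the offset invariance s(f + c) = s(f)
    # to keep exp() from overflowing; the result is unchanged.
    z = np.exp(f - np.max(f))
    return z / np.sum(z)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))                                       # ≈ [0.659 0.242 0.099]
print(np.allclose(softmax(scores), softmax(scores + 1000)))  # True: offset invariance
```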
One-vs-Rest: For a $g$-class problem, create $g$ subproblems where in each, one class is encoded as positive and all others as negative. Then output the class with the highest score, i.e.:
\[\hat{y} = \underset{k \in \{1, \dots, g\}}{\operatorname{argmax}} \hat{f}_k(x)\]
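Below is a sketch of this scheme using scikit-learn's `LogisticRegression` as the binary base learner; the choice of base learner, the helper names, and the toy data are assumptions for illustration, since the reduction itself works with any scoring classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_one_vs_rest(X, y, g):
    # One binary model per class: class k is positive, all others negative.
    return [LogisticRegression().fit(X, (y == k).astype(int)) for k in range(g)]

def predict_one_vs_rest(models, X):
    # Score every class with its own model, then take the argmax.
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)

# Toy usage on random data with g = 3 classes:
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = rng.integers(0, 3, size=60)
models = fit_one_vs_rest(X, y, g=3)
print(predict_one_vs_rest(models, X[:5]))
```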