This chapter covers basic information-theoretic concepts and discusses their relation to machine learning.
For a discrete random variable $X$ with pmf $p(x)$, the entropy is given by:
\[\mathrm{H}(X) = -E[\log_2 p(X)] = - \sum_x p(x) \log_2 p(x)\]The negative log probabilities are called surprisal: a more surprising event is a less likely event. Entropy is simply the expected surprisal.
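As a quick numerical illustration (a minimal sketch, not from the text; the helper name `entropy` is my own), entropy in bits can be computed directly as the expected surprisal:

```python
import numpy as np

def entropy(p):
    """Entropy in bits: the expected surprisal -log2 p(x) under p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits, less surprising on average
```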
For a discrete uniform distribution with $g$ outcomes:
\[\mathrm{H}(X) = -\sum_{i=1}^{g} \frac{1}{g}\log_2\left(\frac{1}{g}\right) = \sum_{i=1}^{g} \frac{1}{g}\log_2(g) = g \cdot \frac{1}{g} \log_2(g) = \log_2(g)\]The joint entropy of two discrete random variables $X$ and $Y$ is:
\[\mathrm{H}(X,Y) = -\sum_x \sum_y p(x,y) \log_2 p(x,y)\]Uniqueness Theorem: the only family of functions that is continuous in $p(x)$, unchanged when events with probability $0$ are removed, additive for independent random variables, and maximal for the uniform distribution is of the following form:
\[\mathrm{H}(p) = -\lambda \sum_x p(x) \log_2 p(x)\]where $\lambda$ is a positive constant.
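Both the uniform-distribution result and the joint entropy can be checked numerically; a small sketch (reusing the hypothetical `entropy` helper from above, with the joint pmf table simply flattened):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()   # works for joint pmf tables too
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Uniform distribution over g outcomes: H = log2(g).
g = 8
print(entropy(np.full(g, 1 / g)), np.log2(g))   # 3.0 3.0

# Joint entropy of two independent fair bits: H(X, Y) = 2 bits.
p_xy = np.array([[0.25, 0.25],
                 [0.25, 0.25]])
print(entropy(p_xy))                            # 2.0
```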
For a continuous random variable $X$ with density function $f(x)$, the analogue is the differential entropy:
\[h(X) = h(f) = -E[\log f(X)] = - \int_x f(x) \log f(x)\, dx\]The joint differential entropy is defined as:
\[h(X_1, X_2, \dots, X_n) = h(f) = -\int f(x_1, \dots, x_n) \log f(x_1, \dots, x_n)\, dx_1 \cdots dx_n\]where $f$ is the joint density. We want to establish a measure of “distance”, or more accurately a divergence, between two distributions that have the same support:
\[D_{KL}(p \mid \mid q) = E_{X \sim p}\left[\log\frac{p(X)}{q(X)}\right] = \sum_x p(x) \log\frac{p(x)}{q(x)} \text{ or } \int_x p(x) \log\frac{p(x)}{q(x)}\, dx\]Note the conventions $0 \log(0/0) = 0$, $0\log(0/q) = 0$, and $p\log(p/0) = \infty$. This implies that if there is any $x$ with $p(x) > 0$ and $q(x) = 0$, then $D_{KL}(p \mid \mid q) = \infty$; this is why the support of both distributions must be the same.
Note that this is not a true distance metric, since it is not symmetric, i.e. $D_{KL}(p \mid \mid q) \neq D_{KL}(q \mid \mid p)$ in general.
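A minimal sketch of the discrete KL divergence (the toy distributions are made up for illustration), following the conventions above and showing the asymmetry:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in bits for pmfs p, q on the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # convention: 0 * log(0/q) = 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl_divergence(p, q))   # ~0.737 bits
print(kl_divergence(q, p))   # ~0.531 bits -- not symmetric
```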
KL as Log-Difference: Suppose the data is generated from $p(x)$ and approximated using $q(x)$. We can then view the KL divergence as an expected log-difference; a good approximation should minimise this difference to $p(x)$:
\[D_{KL}(p||q) = E_{X\sim p}[\log p(X) - \log q(X)]\]KL as Likelihood Ratio: In statistics, we usually look at the (log) likelihood ratio to see which distribution fits the data better. If the ratio $\frac{p}{q} > 1$, then $p$ fits better; otherwise $q$ does. If we assume that the data is generated from $p$ and ask “how much better does $p$ fit than $q$, on average?”, this amounts to taking the expectation of the log-likelihood ratio, i.e.
\[E_{X \sim p}\left[\log \frac{p(X)}{q(X)}\right]\]Therefore the expected log-likelihood ratio is simply the KL divergence.
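This reading also suggests a simple Monte Carlo estimator: sample from $p$ and average the log-likelihood ratio. A sketch with toy distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# Exact KL divergence in bits.
exact = np.sum(p * np.log2(p / q))

# Monte Carlo: draw samples from p, average log2 p(x)/q(x).
x = rng.choice(len(p), size=100_000, p=p)
estimate = np.mean(np.log2(p[x] / q[x]))

print(exact, estimate)   # the estimate should be close to the exact value
```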
Cross-Entropy measures the average amount of information required to represent an event from one distribution $p$ using a predictive scheme based on another distribution $q$:
\[H(p||q) = -\sum_x p(x) \log q(x) \text{ or } -\int_x p(x) \log q(x)\, dx = -E_{X\sim p}[\log q(X)]\]Note the following relationship between cross-entropy, entropy and KL divergence: \(H(p||q) = H(p) + D_{KL}(p||q)\)
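The identity $H(p||q) = H(p) + D_{KL}(p||q)$ can be verified numerically; a minimal sketch reusing the hypothetical helpers from above:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float); p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(cross_entropy(p, q))                   # ~1.522 bits
print(entropy(p) + kl_divergence(p, q))      # same value
```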
The conditional entropy quantifies the uncertainty about $Y$ that remains if the outcome of $X$ is given:
\(\begin{align*} \mathrm{H}(Y|X) &= \mathbb{E}_X[\mathrm{H}(Y|X = x)] \\ &= \sum_{x} p(x) \mathrm{H}(Y|X = x) \\ &= - \sum_{x} p(x) \sum_{y} p(y|x) \log_2 p(y|x) \\ &= - \sum_{x} \sum_{y} p(x, y) \log_2 p(y|x) \\ &= -\mathbb{E}_{x,y}[\log_2 p(y|x)] \end{align*}\)
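Computed directly from a joint pmf table, following the derivation above (toy numbers; `entropy` is the same hypothetical helper):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel(); p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint pmf p(x, y): rows index x, columns index y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)

# H(Y|X) = sum_x p(x) * H(Y | X = x)
h_y_given_x = sum(p_x[x] * entropy(p_xy[x] / p_x[x]) for x in range(len(p_x)))
print(h_y_given_x)   # ~0.722 bits, smaller than H(Y) = 1 bit
```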
The mutual information describes the amount of information about one random variable obtained through another random variable, or equivalently how different their joint distribution is from what it would be under independence. Thus the mutual information $I(X;Y)$ is the KL divergence between the joint distribution and the product of the marginals, i.e.:
\[I(X;Y) = D_{KL}(p(X,Y) \mid \mid p(X)p(Y)) = E_{p(x,y)}\left[\log_2 \frac{p(X,Y)}{p(X) p(Y)}\right]\]
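A final sketch computing $I(X;Y)$ as the KL divergence between the joint pmf and the product of its marginals (same toy joint pmf as in the conditional-entropy example):

```python
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float).ravel(), np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(X; Y) = D_KL( p(x, y) || p(x) p(y) )
i_xy = kl_divergence(p_xy, np.outer(p_x, p_y))
print(i_xy)   # ~0.278 bits = H(Y) - H(Y|X) = 1 - 0.722
```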