This chapter covers basic information-theoretic concepts and discusses their relation to machine learning.
For a discrete random variable $X$ with pmf $p(x)$, the entropy is given by:
\[\mathrm{H}(X) = -E[\log_2 p(X)] = - \sum_x p(x) \log_2 p(x)\]The negative log probabilities are called surprisal: a more surprising event is a less likely event. Entropy is simply the expected surprisal.
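As a quick numerical illustration (a minimal sketch, not from the text; the helper name `entropy` is my own), entropy in bits can be computed directly as the expected surprisal:

```python
import numpy as np

def entropy(p):
    """Entropy in bits: the expected surprisal -log2 p(x) under p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits, less surprising on average
```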
For a discrete uniform distribution with $g$ outcomes:
\[\mathrm{H}(X) = -\sum_{i=1}^{g} \frac{1}{g}\log_2\left(\frac{1}{g}\right) = \sum_{i=1}^{g} \frac{1}{g}\log_2(g) = g \cdot \frac{1}{g} \log_2(g) = \log_2(g)\]The joint entropy of two discrete random variables $X$ and $Y$ is:
\[\mathrm{H}(X,Y) = -\sum_x \sum_y p(x,y) \log_2 p(x,y)\]Uniqueness Theorem: the only family of functions that is continuous in $p(x)$, unchanged when events with probability $0$ are removed, additive for independent random variables, and maximal for the uniform distribution is of the following form:
\[\mathrm{H}(p) = -\lambda \sum_x p(x) \log_2 p(x)\]where $\lambda$ is a positive constant.
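Both the uniform-distribution result and the joint entropy can be checked numerically; a small sketch (reusing the hypothetical `entropy` helper from above, with the joint pmf table simply flattened):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()   # works for joint pmf tables too
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Uniform distribution over g outcomes: H = log2(g).
g = 8
print(entropy(np.full(g, 1 / g)), np.log2(g))   # 3.0 3.0

# Joint entropy of two independent fair bits: H(X, Y) = 2 bits.
p_xy = np.array([[0.25, 0.25],
                 [0.25, 0.25]])
print(entropy(p_xy))                            # 2.0
```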
For a continuous random variable $X$ with density function $f(x)$, the analogue is the differential entropy:
\[h(X) = h(f) = -E[\log f(X)] = - \int_x f(x) \log f(x)\, dx\]The joint differential entropy is defined as:
\[h(X_1, X_2, \dots, X_n) = h(f) = -\int f(x_1, \dots, x_n) \log f(x_1, \dots, x_n)\, dx_1 \cdots dx_n\]where $f$ is the joint density. We want to establish a measure of “distance”, or more accurately a divergence, between two distributions that have the same support:
\[D_{KL}(p \mid \mid q) = E_{X \sim p}\left[\log\frac{p(X)}{q(X)}\right] = \sum_x p(x) \log\frac{p(x)}{q(x)} \text{ or } \int_x p(x) \log\frac{p(x)}{q(x)}\, dx\]Note the conventions $0 \log(0/0) = 0$, $0\log(0/q) = 0$, and $p\log(p/0) = \infty$. This implies that if there is any $x$ with $p(x) > 0$ and $q(x) = 0$, then $D_{KL}(p \mid \mid q) = \infty$; this is why the support of both distributions must be the same.
Note that this is not a true distance metric, since it is not symmetric, i.e. $D_{KL}(p \mid \mid q) \neq D_{KL}(q \mid \mid p)$ in general.
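A minimal sketch of the discrete KL divergence (the toy distributions are made up for illustration), following the conventions above and showing the asymmetry:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in bits for pmfs p, q on the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # convention: 0 * log(0/q) = 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl_divergence(p, q))   # ~0.737 bits
print(kl_divergence(q, p))   # ~0.531 bits -- not symmetric
```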
KL as Log-Difference: Suppose the data is generated from $p(x)$ and approximated using $q(x)$. We can then view the KL divergence as an expected log-difference; a good approximation should minimise this difference to $p(x)$:
\[D_{KL}(p||q) = E_{X\sim p}[\log p(X) - \log q(X)]\]KL as Likelihood Ratio: In statistics, we usually look at the (log) likelihood ratio to see which distribution fits the data better. If the ratio $\frac{p}{q} > 1$, then $p$ fits better; otherwise $q$ does. If we assume that the data is generated from $p$ and ask “how much better does $p$ fit than $q$, on average?”, this amounts to taking the expectation of the log-likelihood ratio, i.e.
\[E_{X \sim p}\left[\log \frac{p(X)}{q(X)}\right]\]Therefore the expected log-likelihood ratio is simply the KL divergence.
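This reading also suggests a simple Monte Carlo estimator: sample from $p$ and average the log-likelihood ratio. A sketch with toy distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# Exact KL divergence in bits.
exact = np.sum(p * np.log2(p / q))

# Monte Carlo: draw samples from p, average log2 p(x)/q(x).
x = rng.choice(len(p), size=100_000, p=p)
estimate = np.mean(np.log2(p[x] / q[x]))

print(exact, estimate)   # the estimate should be close to the exact value
```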
Cross-Entropy measures the average amount of information required to represent an event from one distribution $p$ using a predictive scheme based on another distribution $q$:
\[H(p||q) = -\sum_x p(x) \log q(x) \text{ or } -\int_x p(x) \log q(x)\, dx = -E_{X\sim p}[\log q(X)]\]Note the following relationship between cross-entropy, entropy and KL divergence: \(H(p||q) = H(p) + D_{KL}(p||q)\)
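The identity $H(p||q) = H(p) + D_{KL}(p||q)$ can be verified numerically; a minimal sketch reusing the hypothetical helpers from above:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float); p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(cross_entropy(p, q))                   # ~1.522 bits
print(entropy(p) + kl_divergence(p, q))      # same value
```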
The conditional entropy quantifies the uncertainty about $Y$ that remains if the outcome of $X$ is given:
\(\begin{align*} \mathrm{H}(Y|X) &= \mathbb{E}_X[\mathrm{H}(Y|X = x)] \\ &= \sum_{x} p(x) \mathrm{H}(Y|X = x) \\ &= - \sum_{x} p(x) \sum_{y} p(y|x) \log_2 p(y|x) \\ &= - \sum_{x} \sum_{y} p(x, y) \log_2 p(y|x) \\ &= -\mathbb{E}_{x,y}[\log_2 p(y|x)] \end{align*}\)
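Computed directly from a joint pmf table, following the derivation above (toy numbers; `entropy` is the same hypothetical helper):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel(); p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint pmf p(x, y): rows index x, columns index y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)

# H(Y|X) = sum_x p(x) * H(Y | X = x)
h_y_given_x = sum(p_x[x] * entropy(p_xy[x] / p_x[x]) for x in range(len(p_x)))
print(h_y_given_x)   # ~0.722 bits, smaller than H(Y) = 1 bit
```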
The mutual information describes the amount of information about one random variable obtained through another random variable, or equivalently how different their joint distribution is from what it would be under independence. Thus the mutual information $I(X;Y)$ is the KL divergence between the joint distribution and the product of the marginals, i.e.:
\[I(X;Y) = D_{KL}(p(X,Y) \mid \mid p(X)p(Y)) = E_{p(x,y)}\left[\log_2 \frac{p(X,Y)}{p(X) p(Y)}\right]\]
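A final sketch computing $I(X;Y)$ as the KL divergence between the joint pmf and the product of its marginals (same toy joint pmf as in the conditional-entropy example):

```python
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float).ravel(), np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(X; Y) = D_KL( p(x, y) || p(x) p(y) )
i_xy = kl_divergence(p_xy, np.outer(p_x, p_y))
print(i_xy)   # ~0.278 bits = H(Y) - H(Y|X) = 1 - 0.722
```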