> [!quote]
>
> You should call it "entropy", for two reasons: first, the function is already in use in thermodynamics under that name; second, and most importantly, most people don't know what entropy really is, and if you use the word "entropy" in an argument you will win every time!
>
> —John von Neumann

> [!definition]
>
> $
> H(X) = \ev(I(X)) = -\sum_{j = 1}^n p_j \log_2(p_j)
> $
> The [[Entropy|entropy]] $H(X)$ of a discrete [[Random Variable|random variable]] $X$ is its [[Expectation|expected]] [[Information|information]], or a measure of uncertainty[^1]. In cases where the random variable is often unsurprising, the entropy is low. When it is regularly surprising, the entropy is high. Uncertainty represents the *potential information* to be gained by measuring a random variable.
>
> In cases where $p_j = 0$, the expression $p_j \log_2(p_j)$ is not well defined. For the purposes of entropy, $p\log_2(p)$ stands for the function $\phi: [0, 1] \to (-\infty, 0]$ with $\phi(0) = 0$ and $\phi(p) = p\log_2(p)$ for $p \gt 0$. This matches the limit $\lim_{p \to 0^+} p\log_2(p) = 0$, so $\phi$ is continuous.
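
To make the definition concrete, here is a minimal numeric sketch in Python (the `entropy` helper and the example distributions are illustrative, not part of the source):

```python
from math import log2

def entropy(probs):
    """Shannon entropy H(X) in bits, using the convention 0 * log2(0) = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain for two outcomes: H = 1 bit.
print(entropy([0.5, 0.5]))        # 1.0
# A heavily biased coin is usually unsurprising: H is small.
print(entropy([0.99, 0.01]))      # ~0.0808
# An impossible outcome contributes nothing, matching phi(0) = 0.
print(entropy([0.5, 0.5, 0.0]))   # 1.0
```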
> [!theorem]
>
> Let $X$ be a discrete random variable with $n$ values in its range:
> - $H(X) \ge 0$, and $H(X) = 0$ if and only if $X$ takes one of its values with certainty.
>     - A [[Certain Random Variable|certain]] random variable never produces any new information because its outcome is already known.
> - $H(X) \le \log_2(n)$, with equality if and only if $X$ is [[Uniform Distribution|uniformly distributed]].
>     - The uniform distribution corresponds to the greatest uncertainty.
>     - $H_n = \log_2(n)$ denotes the entropy of a uniform distribution with $n$ outcomes.
>
> *Proof*. Begin by expressing $\ln(x)$ as a [[Integral|definite integral]].
> $
> \begin{align*}
> \ln(x) &= \int_{a}^{x}\frac{1}{t}dt + C\\
> \ln(1) &= 0 = \int_1^1\frac{1}{t}dt\\
> \ln(x) &= \int_1^{x}\frac{1}{t}dt
> \end{align*}
> $
> Since $\frac{1}{t} \ge 1$ for $0 \lt t \le 1$ and $\frac{1}{t} \le 1$ for $t \ge 1$,
> $
> \begin{align*}
> \ln(x) = -\int_x^{1}\frac{1}{t}dt &\le -\int_x^{1}dt = x - 1 \quad &\text{for }0 \lt x \le 1 \\
> \ln(x) = \int_1^{x}\frac{1}{t}dt &\le \int_1^{x}dt = x - 1 &\text{for }x \ge 1
> \end{align*}
> $
>
> So $\ln(x) \le x - 1$ for all $x \gt 0$. Solving $\ln(x) = x - 1$ gives $x = 1$, so equality holds exactly when $x = 1$. For the first claim, note that $0 \le p_j \le 1$ implies $\log_2(p_j) \le 0$, so every term $-p_j\log_2(p_j) \ge 0$ and therefore $H(X) = -\sum_j p_j\log_2(p_j) \ge 0$.
>
> If $H(X) = 0$, then every term $p_j \log_2(p_j) = 0$, which forces each $p_j \in \{0, 1\}$. Since the probabilities sum to $1$, there must exist exactly one $k$ ($1 \le k \le n$) with $p_k = 1$ and $p_j = 0$ for $j \ne k$.
>
> For the upper bound, suppose that each $p_j \gt 0$ ($1 \le j \le n$). Since $\sum_{j = 1}^{n}p_j = 1$, the $\ln(n)$ term can be distributed into the sum:
> $
> \begin{align*}
> H(X) - \log_2(n) &= -\frac{1}{\ln(2)} \paren{\sum_{j = 1}^{n} p_j \ln(p_j) + \ln(n)}\\
> &= -\frac{1}{\ln(2)} \paren{\sum_{j = 1}^{n} p_j (\ln(p_j) + \ln(n))}\\
> &= -\frac{1}{\ln(2)} \sum_{j = 1}^{n} p_j \ln(p_j n)\\
> &= \frac{1}{\ln(2)} \sum_{j = 1}^{n} p_j \ln\paren{\frac{1}{p_j n}}\\
> &\le \frac{1}{\ln(2)} \sum_{j = 1}^{n} p_j \paren{\frac{1}{p_j n} - 1} \quad (\ln(x) \le x - 1)\\
> &= \frac{1}{\ln(2)} \sum_{j = 1}^{n} \paren{\frac{1}{n} - p_j}\\
> &= \frac{1}{\ln(2)}(1 - 1) = 0
> \end{align*}
> $
> Hence $H(X) \le \log_2(n)$, with equality if and only if $\frac{1}{p_j n} = 1$ for every $j$, i.e. $p_j = \frac{1}{n}$.
>
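A quick numeric check of both bounds (a sketch using the `entropy` helper from above; the distributions are made up for illustration):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, with the 0 * log2(0) = 0 convention."""
    return -sum(p * log2(p) for p in probs if p > 0)

n = 4
certain = [1.0, 0.0, 0.0, 0.0]
skewed  = [0.7, 0.1, 0.1, 0.1]
uniform = [1 / n] * n

print(entropy(certain))               # 0.0            -> H(X) = 0 for a certain variable
print(entropy(skewed), log2(n))       # ~1.357 < 2.0   -> strictly below log2(n)
print(entropy(uniform), log2(n))      # 2.0 = 2.0      -> equality in the uniform case
```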
> [!theoremb] Theorem
>
> If a [[Probability Distribution|probability distribution]] is made "more uniform" by transferring probability from a more likely outcome to a less likely one, then its entropy increases.
>
> *Proof*. Consider a random variable $X$ with a range of size $n$. Create a random variable $S$ with the same probability distribution as $X$, except that two of the probabilities $p_1$ and $p_2$, where $p_1 \lt p_2$, are replaced with $p_1 + \varepsilon$ and $p_2 - \varepsilon$ respectively, where $p_1 + \varepsilon \lt p_2 - \varepsilon$ ($0 \lt 2\varepsilon \lt p_2 - p_1$).
>
> First calculate the entropy of $X$ and $S$.
> $
> \begin{align*}
> H(X) &= -\sum_{j = 1}^{n}p_j \log_2(p_j)
> \end{align*}
> $
> $
> \begin{align*}
> H(S) &= -(p_1 + \varepsilon)\log_2(p_1 + \varepsilon) -
> (p_2 - \varepsilon)\log_2(p_2 - \varepsilon) - \sum_{j = 3}^{n}p_j\log_2(p_j)
> \end{align*}
> $
>
> The change in entropy $\Delta H$ going from $X$ to $S$ is:
> $
> \begin{align*}
> \Delta H &= H(S) - H(X) \\
> &= -(p_1 + \varepsilon)\log_2(p_1 + \varepsilon) -
> (p_2 - \varepsilon)\log_2(p_2 - \varepsilon) \\
> &\quad - \sum_{j = 3}^{n}p_j\log_2(p_j) + \sum_{j = 1}^{n}p_j\log_2(p_j) \\
> &= p_1\log_2(p_1) + p_2 \log_2(p_2) \\
> &\quad - (p_1 + \varepsilon)\log_2(p_1 + \varepsilon) - (p_2 - \varepsilon)\log_2(p_2 - \varepsilon) \\
> &= p_1 \log_2\paren{\frac{p_1}{p_1 + \varepsilon}} + p_2\log_2\paren{\frac{p_2}{p_2 - \varepsilon}} + \varepsilon\log_2\paren{\frac{p_2 - \varepsilon}{p_1 + \varepsilon}}
> \end{align*}
> $
>
> Consider $\Delta H$ as a [[Function|function]] of $\varepsilon$. Notice that $\Delta H(0) = 0$ (at $\varepsilon = 0$, $S$ and $X$ have the same distribution) and that its derivative is:
> $
> \frac{\partial \Delta H}{\partial \varepsilon} = \log_2\paren{\frac{p_2 - \varepsilon}{p_1 + \varepsilon}}
> $
>
> Since $p_2 - \varepsilon \gt p_1 + \varepsilon \gt 0$, the ratio $\frac{p_2 - \varepsilon}{p_1 + \varepsilon} \gt 1$, so $\frac{\partial \Delta H}{\partial \varepsilon} \gt 0$ for all $0 \lt 2\varepsilon \lt p_2 - p_1$. Therefore $\Delta H$ is an increasing function of $\varepsilon$ on this interval, and since $\Delta H(0) = 0$, $\Delta H \gt 0$ for all $0 \lt 2\varepsilon \lt p_2 - p_1$.
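
The monotonicity of $\Delta H$ in $\varepsilon$ can be checked numerically (a sketch; the starting distribution and the $\varepsilon$ values are arbitrary choices satisfying $0 \lt 2\varepsilon \lt p_2 - p_1$):

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p = [0.1, 0.5, 0.4]                      # p1 = 0.1, p2 = 0.5, so p2 - p1 = 0.4
for eps in [0.05, 0.10, 0.15, 0.199]:    # all satisfy 0 < 2*eps < p2 - p1
    s = [p[0] + eps, p[1] - eps, p[2]]   # transfer eps of probability from p2 to p1
    print(eps, entropy(s) - entropy(p))  # Delta H is positive and increasing in eps
```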
# Uniqueness
> [!definition]
>
> Let $X$ be a random variable with [[Probability Distribution|probability distribution]] $\{p_1, p_2, \cdots, p_n\}$. A real-valued [[Function|function]] $U(X) = U(p_1, p_2, \cdots, p_n)$ is a [[Measure Space|measure]] of uncertainty if it satisfies:
> 1. $U(X)$ is maximum when $X$ has a uniform distribution.
> 2. If $Y$ is another random variable, then $U(X, Y) = U_X(Y) + U(X)$, where $U_X(Y)$ is the conditional uncertainty of $Y$ given $X$ (see the sketch following this definition).
> 3. Uncertainty should not change with the addition of an impossible event: $U(p_1, p_2, \cdots, p_n, 0) = U(p_1, p_2, \cdots, p_n)$
> 4. $U(p_1, p_2, \cdots, p_n)$ is [[Continuity|continuous]] with respect to all its arguments.
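
Condition 2 is the least obvious by inspection. The sketch below checks it numerically for $H$ on a small, made-up joint distribution, computing $U_X(Y)$ as $\sum_x P(X = x)\,H(Y \mid X = x)$:

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Made-up joint distribution P(X = x, Y = y); rows are indexed by x.
joint = [[0.3, 0.1],
         [0.2, 0.4]]

p_x = [sum(row) for row in joint]                    # marginal distribution of X
H_XY = entropy([p for row in joint for p in row])    # U(X, Y)
H_Y_given_X = sum(px * entropy([p / px for p in row])
                  for px, row in zip(p_x, joint))    # U_X(Y)

print(H_XY, H_Y_given_X + entropy(p_x))              # both ~1.846: U(X, Y) = U_X(Y) + U(X)
```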
> [!theorem]
>
> $U(X)$ is a measure of uncertainty if and only if
> $
> U(X) = KH(X) \quad \text{where} \quad K \ge 0
> $
>
> *Proof*. Define $A(n) = U\paren{\frac{1}{n}, \frac{1}{n}, \cdots, \frac{1}{n}}$.
>
> Begin by showing that $A(n) = K\log(n)$. By conditions 1 and 3,
> $
> A(n) = U\paren{\frac{1}{n}, \frac{1}{n}, \cdots, \frac{1}{n}, 0} \le A(n + 1)
> $
> By condition 3 the added impossible event does not change the uncertainty, and by condition 1 the uniform distribution maximizes $U$ among all distributions on $n + 1$ outcomes, which gives the inequality. Therefore, $A$ is a non-decreasing function of $n$.
>
> Let $X_1, X_2, \cdots, X_m$ be [[Probabilistic Independence|independent]] uniformly distributed random variables, each with $r$ values in its range, so that each $U(X_j) = A(r)$. By condition 2,
> $
> U(X_1, X_2) = U_{X_2}(X_1) + U(X_2) = U(X_1) + U(X_2) = 2A(r)
> $
> since, by independence, conditioning on $X_2$ does not change the uncertainty of $X_1$, so $U_{X_2}(X_1) = U(X_1)$. By induction,
> $
> U(X_1, X_2, \cdots, X_m) = mA(r)
> $
> Since the random [[Vector|vector]] $X = (X_1, X_2, \cdots, X_m)$ has $r^m$ equally likely outcomes,
> $
> U(X_1, X_2, \cdots, X_m) = A(r^m)
> $
> Meaning that
> $
> A(r^m) = mA(r)
> $
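>
> A quick numeric check with $U = H$ (a sketch; the values of `r` and `m` are arbitrary): the joint distribution of $m$ independent uniform $r$-valued variables is uniform over $r^m$ outcomes, and its entropy is $m$ times that of a single variable.
>
> ```python
> from math import log2
> from itertools import product
>
> def entropy(probs):
>     return -sum(p * log2(p) for p in probs if p > 0)
>
> r, m = 3, 4
> single = [1 / r] * r
> # Joint distribution of m independent copies: r**m equally likely outcomes.
> joint = [(1 / r) ** m for _ in product(range(r), repeat=m)]
>
> print(entropy(joint), m * entropy(single), log2(r ** m))   # all equal: A(r**m) = m * A(r)
> ```
>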
> The same argument applied to $n$ independent uniformly distributed variables, each with $s$ values in its range, gives $A(s^n) = nA(s)$.
>
> Choose $r$, $s$, and $n$ arbitrarily, and let $m$ be such that:
> $
> r^m \le s^n \le r^{m + 1}
> $
>
> Using the fact that $A$ is a non-decreasing function,
> $
> A(r^m) \le A(s^n) \le A(r^{m + 1})
> $
> $
> mA(r) \le nA(s) \le (m + 1)A(r)
> $
> $
> \frac{m}{n} \le \frac{A(s)}{A(r)} \le \frac{m}{n} + \frac{1}{n}
> $
> $
> \left|\frac{A(s)}{A(r)} - \frac{m}{n}\right| \le \frac{1}{n}
> $
>
> Taking the logarithm of $r^m \le s^n \le r^{m + 1}$ and then dividing by $n\log(r)$ in the same way gives:
> $
> m\log(r) \le n \log(s) \le (m + 1) \log(r)
> $
> $
> \left|\frac{\log(s)}{\log(r)} - \frac{m}{n}\right| \le \frac{1}{n}
> $
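>
> As a numeric illustration of the sandwich (an assumed example with $r = 2$, $s = 3$, $n = 10$, and $m$ chosen as the largest integer with $r^m \le s^n$):
>
> ```python
> from math import log2, floor
>
> r, s, n = 2, 3, 10
> m = floor(n * log2(s) / log2(r))              # m = 15
>
> print(r**m <= s**n < r**(m + 1))              # True: 32768 <= 59049 < 65536
> print(abs(log2(s) / log2(r) - m / n), 1 / n)  # ~0.085 <= 0.1
> ```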
>
> Since $|a + b| \le |a| + |b|$,
> $
> \begin{align*}
> \left|\frac{A(s)}{A(r)} - \frac{\log(s)}{\log(r)}\right|
> &= \left|\paren{\frac{A(s)}{A(r)} - \frac{m}{n}} +
> \paren{\frac{m}{n} - \frac{\log(s)}{\log(r)}}\right|\\
> &\le \left|\frac{A(s)}{A(r)} - \frac{m}{n}\right| +
> \left|\frac{m}{n} - \frac{\log(s)}{\log(r)}\right|\\
> &\le \left|\frac{A(s)}{A(r)} - \frac{m}{n}\right| +
> \left|\frac{\log(s)}{\log(r)} - \frac{m}{n}\right|\\
> &\le \frac{2}{n}
> \end{align*}
> $
>
> Because $r$ and $s$ are fixed while $n$ can be chosen arbitrarily large, letting $n \to \infty$ gives
> $
> \frac{A(s)}{A(r)} = \frac{\log(s)}{\log(r)} \quad \Rightarrow \quad A(s) = K\log(s) \quad \text{where} \quad K = \frac{A(r)}{\log(r)}
> $
> Here $K$ does not depend on $s$, since $r$ was fixed.
>
> Next, show that the theorem holds when all the $p_j$ are rational. Let
> $
> p_j = \frac{m_j}{m} \quad \text{where} \quad \sum_{j = 1}^{n}m_j = m
> $
>
> Introduce another random variable $Y$, and define its sample space as:
> $
> y_{11}, y_{12}, \cdots, y_{1m_1} \quad
> y_{21}, y_{22}, \cdots, y_{2m_2} \quad
> \cdots \quad
> y_{n1}, y_{n2}, \cdots, y_{nm_n}
> $
> And define its conditional distributions given $X$ as:
> $
> \begin{align*}
> P_{X = x_r}(Y = y_{rk}) &= \frac{1}{m_r} &\quad \text{for}\ 1 \le k \le m_r \\
> P_{X = x_r}(Y = y_{sk}) &= 0 &\quad \text{for}\ 1 \le k \le m_s, s \ne r
> \end{align*}
> $
> In effect, $Y$ serves as a list of uniformly distributed random variables indexed by $X$: conditioned on $X = x_r$, $Y$ is uniform over the $m_r$ values $y_{r1}, \cdots, y_{rm_r}$.
>
> This means that the conditional uncertainty $U_{X = x_r}(Y) = A(m_r) = K\log(m_r)$, because given $X = x_r$ the variable $Y$ is uniformly distributed over $m_r$ values. As a result,
> $
> U_X(Y) = \sum_{r = 1}^{n}p_r U_{X = x_r}(Y) = K\sum_{r = 1}^{n}\frac{m_r}{m}\log(m_r)
> $
> Using the [[Joint Distribution|joint probabilities]] of $X$ and $Y$,
> $
> \begin{align*}
> P((X = x_r) \cap (Y = y_{sk})) &= p_r P_{X = x_r}(Y = y_{sk}) = 0 &\text{when}\ s \ne r \\
> P((X = x_r) \cap (Y = y_{rk})) &= \frac{m_r}{m} \cdot \frac{1}{m_r} = \frac{1}{m} &\text{for each}\ 1 \le k \le m_r
> \end{align*}
> $
> The pair $(X, Y)$ is therefore uniformly distributed over its $m$ possible outcomes.
> Deduce that
> $
> U(X, Y) = A(m) = K\log(m)
> $
> Which means that
> $
> \begin{align*}
> U(X) &= U(X, Y) - U_X(Y) \\
> &= K\log(m) - K\sum_{r = 1}^{n}\frac{m_r}{m}\log(m_r)\\
> &= -K\sum_{r = 1}^{n}\paren{\frac{m_r}{m}\log(m_r) - \frac{m_r}{m}\log(m)}\\
> &= -K\sum_{r = 1}^{n}\frac{m_r}{m}\log\paren{\frac{m_r}{m}}
> \end{align*}
> $
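>
> The grouping construction can be checked numerically with $U = H$ and $K = 1$ (a sketch; the counts in `ms` are an arbitrary choice): since $(X, Y)$ is uniform over $m$ outcomes and $Y$ is uniform over $m_r$ values given $X = x_r$, the difference $\log(m) - \sum_r \frac{m_r}{m}\log(m_r)$ recovers $H(X)$.
>
> ```python
> from math import log2
>
> def entropy(probs):
>     return -sum(p * log2(p) for p in probs if p > 0)
>
> ms = [1, 2, 5]                      # m_1, m_2, m_3
> m = sum(ms)                         # m = 8
> p = [mj / m for mj in ms]           # rational distribution of X
>
> joint = [1 / m] * m                 # (X, Y) is uniform over m outcomes
> U_XY = entropy(joint)               # = log2(m)
> U_Y_given_X = sum(pj * log2(mj) for pj, mj in zip(p, ms))   # U_{X=x_r}(Y) = log2(m_r)
>
> print(U_XY - U_Y_given_X, entropy(p))   # both ~1.2988: U(X) = U(X, Y) - U_X(Y)
> ```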
>
> Now let the probabilities be arbitrary real numbers. Each $p_j$ can be approximated by a [[Sequence|sequence]] of rational numbers $p_j^{(N)}$, where each $p_j^{(N)}$ can be written in the form $\frac{m_j}{m}$. Let $H^{(N)}(X)$ be the corresponding sequence of entropies and define
> $
> H(X) = -\sum_{j = 1}^{n}p_j \log(p_j)
> $
> Since each $p_j^{(N)} \to p_j$ and, by condition 4, $U$ is continuous in its arguments, $U(X) = \limv{N}U\paren{p_1^{(N)}, \cdots, p_n^{(N)}} = \limv{N}KH^{(N)}(X) = KH(X)$.
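
The limiting step can also be illustrated numerically (a sketch; the target distribution and the approximation scheme are arbitrary choices):

```python
from math import log2, sqrt

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# A target distribution with irrational probabilities.
p = [sqrt(2) - 1, 2 - sqrt(2)]
for N in (1, 2, 4, 8):
    # Rational approximation p_j^(N) with denominator 10**N.
    pN = [round(pj * 10**N) / 10**N for pj in p]
    print(N, entropy(pN))            # H^(N)(X) approaches H(X)
print(entropy(p))                    # ~0.9787
```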
[^1]: In physics (statistical thermodynamics), the entropy of a macrostate (say, a given temperature and pressure) is the logarithm of the number of its microstates: assuming the principle of symmetry, each of the $n$ microstates has probability $\frac{1}{n}$, and the sum $-\sum \frac{1}{n}\log\frac{1}{n}$ conveniently collapses to $\log(n)$. A macrostate has low entropy if only a few microstates are associated with it, and high entropy if many microstates are associated with it. Translating this to information: a macrostate has low entropy (uncertainty) if it has few possibilities; while observing it may be surprising, once it is *known*, there is little left to learn. A macrostate has high entropy if it has many possibilities; while observing it is unsurprising, much remains to be learned about which individual microstate it is in.