> [!quote]
>
> You should call it "entropy", for two reasons: first, the function is already in use in thermodynamics under that name; second, and most importantly, most people don't know what entropy really is, and if you use the word "entropy" in an argument you will win every time!
>
> —John von Neumann

> [!definition]
>
> $
> H(X) = \ev(I(X)) = -\sum_{j = 1}^n p_j \log_2(p_j)
> $
> The [[Entropy|entropy]] $H(X)$ of a discrete [[Random Variable|random variable]] $X$ is its [[Expectation|expected]] [[Information|information]], or a measure of uncertainty[^1]. In cases where the random variable is often unsurprising, the entropy is low. When it is regularly surprising, the entropy is high. Uncertainty represents the *potential information* to be gained by measuring a random variable.
>
> In cases where $p_j = 0$, the expression $p_j \log_2(p_j)$ is not well defined. For the purposes of entropy, $p\log_2(p)$ stands for the function $\phi: [0, 1] \to (-\infty, 0]$ with $\phi(0) = 0$ and $\phi(p) = p\log_2(p)$ for $p \gt 0$. This matches the limit $\lim_{p \to 0^+} p\log_2(p) = 0$, so $\phi$ is continuous.
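
To make the definition concrete, here is a minimal numeric sketch in Python (the `entropy` helper and the example distributions are illustrative, not part of the source):

```python
from math import log2

def entropy(probs):
    """Shannon entropy H(X) in bits, using the convention 0 * log2(0) = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain for two outcomes: H = 1 bit.
print(entropy([0.5, 0.5]))        # 1.0
# A heavily biased coin is usually unsurprising: H is small.
print(entropy([0.99, 0.01]))      # ~0.0808
# An impossible outcome contributes nothing, matching phi(0) = 0.
print(entropy([0.5, 0.5, 0.0]))   # 1.0
```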
> [!theorem]
>
> Let $X$ be a discrete random variable with $n$ values in its range:
> - $H(X) \ge 0$, and $H(X) = 0$ if and only if $X$ takes one of its values with certainty.
>     - A [[Certain Random Variable|certain]] random variable never produces any new information because its outcome is already known.
> - $H(X) \le \log_2(n)$, with equality if and only if $X$ is [[Uniform Distribution|uniformly distributed]].
>     - The uniform distribution corresponds to the greatest uncertainty.
>     - $H_n = \log_2(n)$ denotes the entropy of a uniform distribution with $n$ outcomes.
>
> *Proof*. Begin by expressing $\ln(x)$ as a [[Integral|definite integral]].
> $
> \begin{align*}
> \ln(x) &= \int_{a}^{x}\frac{1}{t}dt + C\\
> \ln(1) &= 0 = \int_1^1\frac{1}{t}dt\\
> \ln(x) &= \int_1^{x}\frac{1}{t}dt
> \end{align*}
> $
> Since $\frac{1}{t} \ge 1$ for $0 \lt t \le 1$ and $\frac{1}{t} \le 1$ for $t \ge 1$,
> $
> \begin{align*}
> \ln(x) = -\int_x^{1}\frac{1}{t}dt &\le -\int_x^{1}dt = x - 1 \quad &\text{for }0 \lt x \le 1 \\
> \ln(x) = \int_1^{x}\frac{1}{t}dt &\le \int_1^{x}dt = x - 1 &\text{for }x \ge 1
> \end{align*}
> $
>
> So $\ln(x) \le x - 1$ for all $x \gt 0$. Solving $\ln(x) = x - 1$ gives $x = 1$, so equality holds exactly when $x = 1$. For the first claim, note that $0 \le p_j \le 1$ implies $\log_2(p_j) \le 0$, so every term $-p_j\log_2(p_j) \ge 0$ and therefore $H(X) = -\sum_j p_j\log_2(p_j) \ge 0$.
>
> If $H(X) = 0$, then every term $p_j \log_2(p_j) = 0$, which forces each $p_j \in \{0, 1\}$. Since the probabilities sum to $1$, there must exist exactly one $k$ ($1 \le k \le n$) with $p_k = 1$ and $p_j = 0$ for $j \ne k$.
>
> For the upper bound, suppose that each $p_j \gt 0$ ($1 \le j \le n$). Since $\sum_{j = 1}^{n}p_j = 1$, the $\ln(n)$ term can be distributed into the sum:
> $
> \begin{align*}
> H(X) - \log_2(n) &= -\frac{1}{\ln(2)} \paren{\sum_{j = 1}^{n} p_j \ln(p_j) + \ln(n)}\\
> &= -\frac{1}{\ln(2)} \paren{\sum_{j = 1}^{n} p_j (\ln(p_j) + \ln(n))}\\
> &= -\frac{1}{\ln(2)} \sum_{j = 1}^{n} p_j \ln(p_j n)\\
> &= \frac{1}{\ln(2)} \sum_{j = 1}^{n} p_j \ln\paren{\frac{1}{p_j n}}\\
> &\le \frac{1}{\ln(2)} \sum_{j = 1}^{n} p_j \paren{\frac{1}{p_j n} - 1} \quad (\ln(x) \le x - 1)\\
> &= \frac{1}{\ln(2)} \sum_{j = 1}^{n} \paren{\frac{1}{n} - p_j}\\
> &= \frac{1}{\ln(2)}(1 - 1) = 0
> \end{align*}
> $
> Hence $H(X) \le \log_2(n)$, with equality if and only if $\frac{1}{p_j n} = 1$ for every $j$, i.e. $p_j = \frac{1}{n}$.
>
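A quick numeric check of both bounds (a sketch using the `entropy` helper from above; the distributions are made up for illustration):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, with the 0 * log2(0) = 0 convention."""
    return -sum(p * log2(p) for p in probs if p > 0)

n = 4
certain = [1.0, 0.0, 0.0, 0.0]
skewed  = [0.7, 0.1, 0.1, 0.1]
uniform = [1 / n] * n

print(entropy(certain))               # 0.0            -> H(X) = 0 for a certain variable
print(entropy(skewed), log2(n))       # ~1.357 < 2.0   -> strictly below log2(n)
print(entropy(uniform), log2(n))      # 2.0 = 2.0      -> equality in the uniform case
```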
> [!theoremb] Theorem
>
> If a [[Probability Distribution|probability distribution]] is made "more uniform" by transferring probability from a more likely outcome to a less likely one, then its entropy increases.
>
> *Proof*. Consider a random variable $X$ with a range of size $n$. Create a random variable $S$ with the same probability distribution as $X$, except that two of the probabilities $p_1$ and $p_2$, where $p_1 \lt p_2$, are replaced with $p_1 + \varepsilon$ and $p_2 - \varepsilon$ respectively, where $p_1 + \varepsilon \lt p_2 - \varepsilon$ ($0 \lt 2\varepsilon \lt p_2 - p_1$).
>
> First calculate the entropy of $X$ and $S$.
> $
> \begin{align*}
> H(X) &= -\sum_{j = 1}^{n}p_j \log_2(p_j)
> \end{align*}
> $
> $
> \begin{align*}
> H(S) &= -(p_1 + \varepsilon)\log_2(p_1 + \varepsilon) -
> (p_2 - \varepsilon)\log_2(p_2 - \varepsilon) - \sum_{j = 3}^{n}p_j\log_2(p_j)
> \end{align*}
> $
>
> The change in entropy $\Delta H$ going from $X$ to $S$ is:
> $
> \begin{align*}
> \Delta H &= H(S) - H(X) \\
> &= -(p_1 + \varepsilon)\log_2(p_1 + \varepsilon) -
> (p_2 - \varepsilon)\log_2(p_2 - \varepsilon) \\
> &\quad - \sum_{j = 3}^{n}p_j\log_2(p_j) + \sum_{j = 1}^{n}p_j\log_2(p_j) \\
> &= p_1\log_2(p_1) + p_2 \log_2(p_2) \\
> &\quad - (p_1 + \varepsilon)\log_2(p_1 + \varepsilon) - (p_2 - \varepsilon)\log_2(p_2 - \varepsilon) \\
> &= p_1 \log_2\paren{\frac{p_1}{p_1 + \varepsilon}} + p_2\log_2\paren{\frac{p_2}{p_2 - \varepsilon}} + \varepsilon\log_2\paren{\frac{p_2 - \varepsilon}{p_1 + \varepsilon}}
> \end{align*}
> $
>
> Consider $\Delta H$ as a [[Function|function]] of $\varepsilon$. Notice that $\Delta H(0) = 0$ (at $\varepsilon = 0$, $S$ and $X$ have the same distribution) and that its derivative is:
> $
> \frac{\partial \Delta H}{\partial \varepsilon} = \log_2\paren{\frac{p_2 - \varepsilon}{p_1 + \varepsilon}}
> $
>
> Since $p_2 - \varepsilon \gt p_1 + \varepsilon \gt 0$, the ratio $\frac{p_2 - \varepsilon}{p_1 + \varepsilon} \gt 1$, so $\frac{\partial \Delta H}{\partial \varepsilon} \gt 0$ for all $0 \lt 2\varepsilon \lt p_2 - p_1$. Therefore $\Delta H$ is an increasing function of $\varepsilon$ on this interval, and since $\Delta H(0) = 0$, $\Delta H \gt 0$ for all $0 \lt 2\varepsilon \lt p_2 - p_1$.
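
The monotonicity of $\Delta H$ in $\varepsilon$ can be checked numerically (a sketch; the starting distribution and the $\varepsilon$ values are arbitrary choices satisfying $0 \lt 2\varepsilon \lt p_2 - p_1$):

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p = [0.1, 0.5, 0.4]                      # p1 = 0.1, p2 = 0.5, so p2 - p1 = 0.4
for eps in [0.05, 0.10, 0.15, 0.199]:    # all satisfy 0 < 2*eps < p2 - p1
    s = [p[0] + eps, p[1] - eps, p[2]]   # transfer eps of probability from p2 to p1
    print(eps, entropy(s) - entropy(p))  # Delta H is positive and increasing in eps
```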
# Uniqueness
> [!definition]
>
> Let $X$ be a random variable with [[Probability Distribution|probability distribution]] $\{p_1, p_2, \cdots, p_n\}$. A real-valued [[Function|function]] $U(X) = U(p_1, p_2, \cdots, p_n)$ is a [[Measure Space|measure]] of uncertainty if it satisfies:
> 1. $U(X)$ is maximum when $X$ has a uniform distribution.
> 2. If $Y$ is another random variable, then $U(X, Y) = U_X(Y) + U(X)$, where $U_X(Y)$ is the conditional uncertainty of $Y$ given $X$ (see the sketch following this definition).
> 3. Uncertainty should not change with the addition of an impossible event: $U(p_1, p_2, \cdots, p_n, 0) = U(p_1, p_2, \cdots, p_n)$
> 4. $U(p_1, p_2, \cdots, p_n)$ is [[Continuity|continuous]] with respect to all its arguments.
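
Condition 2 is the least obvious by inspection. The sketch below checks it numerically for $H$ on a small, made-up joint distribution, computing $U_X(Y)$ as $\sum_x P(X = x)\,H(Y \mid X = x)$:

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Made-up joint distribution P(X = x, Y = y); rows are indexed by x.
joint = [[0.3, 0.1],
         [0.2, 0.4]]

p_x = [sum(row) for row in joint]                    # marginal distribution of X
H_XY = entropy([p for row in joint for p in row])    # U(X, Y)
H_Y_given_X = sum(px * entropy([p / px for p in row])
                  for px, row in zip(p_x, joint))    # U_X(Y)

print(H_XY, H_Y_given_X + entropy(p_x))              # both ~1.846: U(X, Y) = U_X(Y) + U(X)
```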
> [!theorem]
>
> $U(X)$ is a measure of uncertainty if and only if
> $
> U(X) = KH(X) \quad \text{where} \quad K \ge 0
> $
>
> *Proof*. Define $A(n) = U\paren{\frac{1}{n}, \frac{1}{n}, \cdots, \frac{1}{n}}$.
>
> Begin by showing that $A(n) = K\log(n)$. By conditions 1 and 3,
> $
> A(n) = U\paren{\frac{1}{n}, \frac{1}{n}, \cdots, \frac{1}{n}, 0} \le A(n + 1)
> $
> By condition 3 the added impossible event does not change the uncertainty, and by condition 1 the uniform distribution maximizes $U$ among all distributions on $n + 1$ outcomes, which gives the inequality. Therefore, $A$ is a non-decreasing function of $n$.
>
> Let $X_1, X_2, \cdots, X_m$ be [[Probabilistic Independence|independent]] uniformly distributed random variables, each with $r$ values in its range, so that each $U(X_j) = A(r)$. By condition 2,
> $
> U(X_1, X_2) = U_{X_2}(X_1) + U(X_2) = U(X_1) + U(X_2) = 2A(r)
> $
> since, by independence, conditioning on $X_2$ does not change the uncertainty of $X_1$, so $U_{X_2}(X_1) = U(X_1)$. By induction,
> $
> U(X_1, X_2, \cdots, X_m) = mA(r)
> $
> Since the random [[Vector|vector]] $X = (X_1, X_2, \cdots, X_m)$ has $r^m$ equally likely outcomes,
> $
> U(X_1, X_2, \cdots, X_m) = A(r^m)
> $
> Meaning that
> $
> A(r^m) = mA(r)
> $
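>
> A quick numeric check with $U = H$ (a sketch; the values of `r` and `m` are arbitrary): the joint distribution of $m$ independent uniform $r$-valued variables is uniform over $r^m$ outcomes, and its entropy is $m$ times that of a single variable.
>
> ```python
> from math import log2
> from itertools import product
>
> def entropy(probs):
>     return -sum(p * log2(p) for p in probs if p > 0)
>
> r, m = 3, 4
> single = [1 / r] * r
> # Joint distribution of m independent copies: r**m equally likely outcomes.
> joint = [(1 / r) ** m for _ in product(range(r), repeat=m)]
>
> print(entropy(joint), m * entropy(single), log2(r ** m))   # all equal: A(r**m) = m * A(r)
> ```
>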
> The same argument applied to $n$ independent uniformly distributed variables, each with $s$ values in its range, gives $A(s^n) = nA(s)$.
>
> Choose $r$, $s$, and $n$ arbitrarily, and let $m$ be such that:
> $
> r^m \le s^n \le r^{m + 1}
> $
>
> Using the fact that $A$ is a non-decreasing function,
> $
> A(r^m) \le A(s^n) \le A(r^{m + 1})
> $
> $
> mA(r) \le nA(s) \le (m + 1)A(r)
> $
> $
> \frac{m}{n} \le \frac{A(s)}{A(r)} \le \frac{m}{n} + \frac{1}{n}
> $
> $
> \left|\frac{A(s)}{A(r)} - \frac{m}{n}\right| \le \frac{1}{n}
> $
>
> Taking the logarithm of $r^m \le s^n \le r^{m + 1}$ and then dividing by $n\log(r)$ in the same way gives:
> $
> m\log(r) \le n \log(s) \le (m + 1) \log(r)
> $
> $
> \left|\frac{\log(s)}{\log(r)} - \frac{m}{n}\right| \le \frac{1}{n}
> $
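>
> As a numeric illustration of the sandwich (an assumed example with $r = 2$, $s = 3$, $n = 10$, and $m$ chosen as the largest integer with $r^m \le s^n$):
>
> ```python
> from math import log2, floor
>
> r, s, n = 2, 3, 10
> m = floor(n * log2(s) / log2(r))              # m = 15
>
> print(r**m <= s**n < r**(m + 1))              # True: 32768 <= 59049 < 65536
> print(abs(log2(s) / log2(r) - m / n), 1 / n)  # ~0.085 <= 0.1
> ```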
>
> Since $|a + b| \le |a| + |b|$,
> $
> \begin{align*}
> \left|\frac{A(s)}{A(r)} - \frac{\log(s)}{\log(r)}\right|
> &= \left|\paren{\frac{A(s)}{A(r)} - \frac{m}{n}} +
> \paren{\frac{m}{n} - \frac{\log(s)}{\log(r)}}\right|\\
> &\le \left|\frac{A(s)}{A(r)} - \frac{m}{n}\right| +
> \left|\frac{m}{n} - \frac{\log(s)}{\log(r)}\right|\\
> &\le \left|\frac{A(s)}{A(r)} - \frac{m}{n}\right| +
> \left|\frac{\log(s)}{\log(r)} - \frac{m}{n}\right|\\
> &\le \frac{2}{n}
> \end{align*}
> $
>
> Because $r$ and $s$ are fixed while $n$ can be chosen arbitrarily large, letting $n \to \infty$ gives
> $
> \frac{A(s)}{A(r)} = \frac{\log(s)}{\log(r)} \quad \Rightarrow \quad A(s) = K\log(s) \quad \text{where} \quad K = \frac{A(r)}{\log(r)}
> $
> Here $K$ does not depend on $s$, since $r$ was fixed.
>
> Next, show that the theorem holds when all the $p_j$ are rational. Let
> $
> p_j = \frac{m_j}{m} \quad \text{where} \quad \sum_{j = 1}^{n}m_j = m
> $
>
> Introduce another random variable $Y$, and define its sample space as:
> $
> y_{11}, y_{12}, \cdots, y_{1m_1} \quad
> y_{21}, y_{22}, \cdots, y_{2m_2} \quad
> \cdots \quad
> y_{n1}, y_{n2}, \cdots, y_{nm_n}
> $
> And define its conditional distributions given $X$ as:
> $
> \begin{align*}
> P_{X = x_r}(Y = y_{rk}) &= \frac{1}{m_r} &\quad \text{for}\ 1 \le k \le m_r \\
> P_{X = x_r}(Y = y_{sk}) &= 0 &\quad \text{for}\ 1 \le k \le m_s, s \ne r
> \end{align*}
> $
> In effect, $Y$ serves as a list of uniformly distributed random variables indexed by $X$: conditioned on $X = x_r$, $Y$ is uniform over the $m_r$ values $y_{r1}, \cdots, y_{rm_r}$.
>
> This means that the conditional uncertainty $U_{X = x_r}(Y) = A(m_r) = K\log(m_r)$, because given $X = x_r$ the variable $Y$ is uniformly distributed over $m_r$ values. As a result,
> $
> U_X(Y) = \sum_{r = 1}^{n}p_r U_{X = x_r}(Y) = K\sum_{r = 1}^{n}\frac{m_r}{m}\log(m_r)
> $
> Using the [[Joint Distribution|joint probabilities]] of $X$ and $Y$,
> $
> \begin{align*}
> P((X = x_r) \cap (Y = y_{sk})) &= p_r P_{X = x_r}(Y = y_{sk}) = 0 &\text{when}\ s \ne r \\
> P((X = x_r) \cap (Y = y_{rk})) &= \frac{m_r}{m} \cdot \frac{1}{m_r} = \frac{1}{m} &\text{for each}\ 1 \le k \le m_r
> \end{align*}
> $
> The pair $(X, Y)$ is therefore uniformly distributed over its $m$ possible outcomes.
> Deduce that
> $
> U(X, Y) = A(m) = K\log(m)
> $
> Which means that
> $
> \begin{align*}
> U(X) &= U(X, Y) - U_X(Y) \\
> &= K\log(m) - K\sum_{r = 1}^{n}\frac{m_r}{m}\log(m_r)\\
> &= -K\sum_{r = 1}^{n}\paren{\frac{m_r}{m}\log(m_r) - \frac{m_r}{m}\log(m)}\\
> &= -K\sum_{r = 1}^{n}\frac{m_r}{m}\log\paren{\frac{m_r}{m}}
> \end{align*}
> $
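>
> The grouping construction can be checked numerically with $U = H$ and $K = 1$ (a sketch; the counts in `ms` are an arbitrary choice): since $(X, Y)$ is uniform over $m$ outcomes and $Y$ is uniform over $m_r$ values given $X = x_r$, the difference $\log(m) - \sum_r \frac{m_r}{m}\log(m_r)$ recovers $H(X)$.
>
> ```python
> from math import log2
>
> def entropy(probs):
>     return -sum(p * log2(p) for p in probs if p > 0)
>
> ms = [1, 2, 5]                      # m_1, m_2, m_3
> m = sum(ms)                         # m = 8
> p = [mj / m for mj in ms]           # rational distribution of X
>
> joint = [1 / m] * m                 # (X, Y) is uniform over m outcomes
> U_XY = entropy(joint)               # = log2(m)
> U_Y_given_X = sum(pj * log2(mj) for pj, mj in zip(p, ms))   # U_{X=x_r}(Y) = log2(m_r)
>
> print(U_XY - U_Y_given_X, entropy(p))   # both ~1.2988: U(X) = U(X, Y) - U_X(Y)
> ```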
>
> Now let the probabilities be arbitrary real numbers. Each $p_j$ can be approximated by a [[Sequence|sequence]] of rational numbers $p_j^{(N)}$, where each $p_j^{(N)}$ can be written in the form $\frac{m_j}{m}$. Let $H^{(N)}(X)$ be the corresponding sequence of entropies and define
> $
> H(X) = -\sum_{j = 1}^{n}p_j \log(p_j)
> $
> Since each $p_j^{(N)} \to p_j$ and, by condition 4, $U$ is continuous in its arguments, $U(X) = \limv{N}U\paren{p_1^{(N)}, \cdots, p_n^{(N)}} = \limv{N}KH^{(N)}(X) = KH(X)$.
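
The limiting step can also be illustrated numerically (a sketch; the target distribution and the approximation scheme are arbitrary choices):

```python
from math import log2, sqrt

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# A target distribution with irrational probabilities.
p = [sqrt(2) - 1, 2 - sqrt(2)]
for N in (1, 2, 4, 8):
    # Rational approximation p_j^(N) with denominator 10**N.
    pN = [round(pj * 10**N) / 10**N for pj in p]
    print(N, entropy(pN))            # H^(N)(X) approaches H(X)
print(entropy(p))                    # ~0.9787
```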
[^1]: In physics (statistical thermodynamics), the entropy of a macrostate (say, a given temperature and pressure) is the logarithm of the number of its microstates: assuming the principle of symmetry, each of the $n$ microstates has probability $\frac{1}{n}$, and the sum $-\sum \frac{1}{n}\log\frac{1}{n}$ conveniently collapses to $\log(n)$. A macrostate has low entropy if only a few microstates are associated with it, and high entropy if many microstates are associated with it. Translating this to information: a macrostate has low entropy (uncertainty) if it has few possibilities; while observing it may be surprising, once it is *known*, there is little left to learn. A macrostate has high entropy if it has many possibilities; while observing it is unsurprising, much remains to be learned about which individual microstate it is in.