> [!abstract] Introduction
>
> Given a source alphabet of size $n$ and a [[Code|code]] alphabet of size $r$, create a [[Set|set]] of $n$ codewords, one for each source symbol, with lengths $\list{l}{n}$ and with the [[Frequency|frequencies]] of the source symbols given by a [[Probability Distribution|probability distribution]] $\list{p}{n}$. The [[Random Variable|random variable]] $L$ representing the length of a codeword has the [[Expectation|expected value]]
> $
> \ev(L) = \sum_{j = 1}^{n}p_j l_j
> $
> which represents the average length of a codeword.
>
> A prefix-free code is *optimal* if it [[Extrema|minimises]] $\ev(L)$.
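>
> As a minimal numeric sketch (the three-symbol source and the binary codewords below are assumptions chosen purely for illustration), $\ev(L)$ can be computed directly:
>
> ```python
> # Sketch: expected codeword length E(L) for an assumed three-symbol source
> # encoded with an assumed prefix-free binary code (r = 2).
> probabilities = [0.5, 0.25, 0.25]      # p_1, ..., p_n
> codewords = ["0", "10", "11"]          # no codeword is a prefix of another
> lengths = [len(w) for w in codewords]  # l_1, ..., l_n
>
> expected_length = sum(p * l for p, l in zip(probabilities, lengths))
> print(expected_length)                 # 0.5*1 + 0.25*2 + 0.25*2 = 1.5
> ```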

> [!theoremb] Noiseless Coding Theorem
>
> Define the [[Entropy|entropy]] of the source alphabet $S$ as $H(S) = -\sum_{j = 1}^{n}p_j \log_2(p_j)$. For any prefix-free code with $r$ code symbols,
> $
> \ev(L) \ge \frac{H(S)}{\log_2(r)}
> $
> This means that the maximum average [[Information|information]] transmitted by one code symbol is $\log_2(r)$, the entropy of a [[Uniform Distribution|uniform distribution]] with $r$ possible values:
> $
> \frac{H(S)}{\ev(L)} \le \log_2(r)
> $
>
> There exists a prefix-free code that satisfies
> $
> \frac{H(S)}{\log_2(r)} \le \ev(L) \lt \frac{H(S)}{\log_2(r)} + 1
> $
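>
> Both bounds can be spot-checked numerically; the sketch below reuses the same assumed dyadic three-symbol source as above, for which an optimal binary code meets the lower bound exactly:
>
> ```python
> import math
>
> probabilities = [0.5, 0.25, 0.25]   # assumed dyadic source: p_j = 2^(-l_j)
> lengths = [1, 2, 2]                 # lengths of the assumed codewords 0, 10, 11
> r = 2
>
> entropy = -sum(p * math.log2(p) for p in probabilities)               # H(S) = 1.5
> expected_length = sum(p * l for p, l in zip(probabilities, lengths))  # E(L) = 1.5
> lower_bound = entropy / math.log2(r)
>
> assert lower_bound <= expected_length < lower_bound + 1   # the theorem's bounds
> ```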
>
> *Proof.* Define an auxiliary probability distribution $\list{q}{n}$ on the codewords, corresponding to each of the $r$ code symbols appearing with equal probability:
> $
> q_j = \frac{r^{-l_j}}{\sum_{i = 1}^{n}r^{-l_i}}
> $
> By the [[Gibbs Inequality|Gibbs inequality]],
> $
> \begin{align*}
> H(S) = -\sum_{j = 1}^{n}p_j \log_2(p_j) &\le -\sum_{j = 1}^{n}p_j \log_2(q_j) \\
> &= -\sum_{j = 1}^{n}p_j \log_2\paren{r^{-l_j}} + \sum_{j = 1}^{n}p_j\log_2\paren{\sum_{i = 1}^{n}r^{-l_i}}
> \end{align*}
> $
>
> By the [[Kraft-McMillan Inequality|Kraft-McMillan inequality]], $\sum_{i = 1}^{n}r^{-l_i} \le 1$, so $\log_2\paren{\sum_{i = 1}^{n}r^{-l_i}} \le 0$, meaning that:
> $
> \begin{align*}
> H(S) &\le -\sum_{j = 1}^{n}p_j \log_2\paren{r^{-l_j}} \\
> H(S) &\le \sum_{j = 1}^{n}p_j l_j\log_2(r) \\
> H(S) &\le \ev(L)\log_2(r)
> \end{align*}
> $
>
> Equality holds only if both the Gibbs and Kraft-McMillan inequalities are tight, i.e. only if $\sum_{i = 1}^{n}r^{-l_i} = 1$ and $p_j = q_j = r^{-l_j}$ for every $j$, which is exactly when the probability that each code symbol appears is uniform.
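>
> The chain of inequalities can be spot-checked numerically; in this sketch the source distribution and codeword lengths are assumptions chosen for illustration:
>
> ```python
> import math
>
> p = [0.4, 0.3, 0.2, 0.1]   # assumed source distribution
> lengths = [1, 2, 3, 3]     # assumed lengths of a binary prefix-free code
> r = 2
>
> # Auxiliary distribution q_j = r^(-l_j) / sum_i r^(-l_i)
> kraft_sum = sum(r ** -l for l in lengths)
> q = [r ** -l / kraft_sum for l in lengths]
>
> entropy = -sum(pj * math.log2(pj) for pj in p)                    # H(S)
> cross_entropy = -sum(pj * math.log2(qj) for pj, qj in zip(p, q))  # Gibbs bound
> expected_length = sum(pj * l for pj, l in zip(p, lengths))        # E(L)
>
> # H(S) <= -sum_j p_j log2(q_j) <= E(L) log2(r)
> assert entropy <= cross_entropy <= expected_length * math.log2(r)
> ```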
>
> For the right-hand side, construct codewords whose lengths satisfy
> $
> r^{-l_j} \le p_j \lt r^{-l_j + 1}
> $
> This means each codeword conveys at least as much information as its source symbol carries, while a codeword with one fewer symbol would convey too little (so no code symbol is wasted). Summing the left-hand inequality over $j$ gives $\sum_{j = 1}^{n}r^{-l_j} \le \sum_{j = 1}^{n}p_j = 1$, so these lengths satisfy the [[Kraft-McMillan Inequality]] and a prefix-free code with them exists. Taking logarithms (base 2) throughout,
> $
> -l_j \log_2(r) \le \log_2(p_j) \lt (-l_j + 1)\log_2(r)
> $
>
> Then multiply the (strict) right-hand inequality by $p_j$, sum over $j$, and negate:
> $
> \begin{align*}
> H(S) = -\sum_{j = 1}^{n}p_j \log_2(p_j) &\gt \sum_{j = 1}^{n}p_j(l_j - 1)\log_2(r) \\
> H(S) &\gt (\ev(L) - 1)\log_2(r) \\
> \frac{H(S)}{\log_2(r)} &\gt \ev(L) - 1 \\
> \ev(L) &\lt \frac{H(S)}{\log_2(r)} + 1 \\
> \end{align*}
> $
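>
> The construction in the second half of the proof amounts to choosing $l_j = \lceil \log_r(1/p_j) \rceil$. Below is a sketch of that construction; the five-symbol source is an assumption chosen for illustration:
>
> ```python
> import math
>
> def shannon_lengths(probabilities, r=2):
>     # l_j = ceil(-log_r p_j), so that r^(-l_j) <= p_j < r^(-l_j + 1)
>     return [math.ceil(-math.log(p, r)) for p in probabilities]
>
> p = [0.4, 0.3, 0.15, 0.1, 0.05]   # assumed source distribution
> r = 2
> lengths = shannon_lengths(p, r)
>
> assert sum(r ** -l for l in lengths) <= 1   # Kraft-McMillan: a prefix-free code exists
> entropy = -sum(pj * math.log2(pj) for pj in p)
> expected_length = sum(pj * l for pj, l in zip(p, lengths))
> lower_bound = entropy / math.log2(r)
>
> assert lower_bound <= expected_length < lower_bound + 1   # the theorem's bounds
> print(lengths, expected_length)                           # e.g. [2, 2, 3, 4, 5]  ≈ 2.5
> ```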

> [!theorem] Corollary
>
> Given any $\varepsilon \gt 0$ (no matter how small), there exists a prefix-free code such that
> $
> \frac{H(S)}{\log_2(r)} \le \frac{\ev(L^{(m)})}{m} \lt \frac{H(S)}{\log_2(r)} + \varepsilon
> $
>
> *Proof.* Since $\varepsilon \gt 0$, there exists an $m \in \nat$ such that $\frac{1}{m} \lt \varepsilon$. Choose one such $m$ and use a [[Block Coding|block code]] on the $m$-th extension of $S$, which emits blocks of $m$ source symbols and (for a memoryless source) has entropy $H(S^{(m)}) = mH(S)$. By the noiseless coding theorem,
> $
> \begin{align*}
> \frac{H(S^{(m)})}{\log_2(r)} &\le \ev(L^{(m)}) \lt \frac{H(S^{(m)})}{\log_2(r)} + 1 \\
> \frac{mH(S)}{\log_2(r)} &\le \ev(L^{(m)}) \lt \frac{mH(S)}{\log_2(r)} + 1\\
> \frac{H(S)}{\log_2(r)} &\le \frac{\ev(L^{(m)})}{m} \lt \frac{H(S)}{\log_2(r)} + \frac{1}{m}\\
> \frac{H(S)}{\log_2(r)} &\le \frac{\ev(L^{(m)})}{m} \lt \frac{H(S)}{\log_2(r)} + \varepsilon
> \end{align*}
> $
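>
> As a sketch of the corollary in action (the two-symbol source is an assumption for illustration), code the $m$-th extension with the same ceiling construction and check that the per-symbol length stays within $\frac{1}{m}$ of the entropy bound:
>
> ```python
> import math
> from itertools import product
>
> p = {"a": 0.7, "b": 0.3}   # assumed source distribution
> r = 2
> bound = -sum(x * math.log2(x) for x in p.values()) / math.log2(r)   # H(S) / log2(r)
>
> for m in (1, 2, 4, 8):
>     # Probabilities of the m-th extension S^(m): one block per m-symbol string
>     block_probs = [math.prod(p[s] for s in block) for block in product(p, repeat=m)]
>     lengths = [math.ceil(-math.log(q, r)) for q in block_probs]
>     per_symbol = sum(q * l for q, l in zip(block_probs, lengths)) / m
>
>     assert bound <= per_symbol < bound + 1 / m   # the corollary's bounds
>     print(m, per_symbol)                         # approaches bound ≈ 0.881 as m grows
> ```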