The Notebook
İstanbul, Turkey

# Asymptotic Equipartition Property

This blog post introduces the concept of a typical sequence and some properties of typical sequences, which are collectively known as the Asymptotic Equipartition Property (AEP).

## Typicality

Consider sequences of L = 20 bits emitted by a discrete memoryless source (DMS) with: P_{U}(0) = \frac{3}{4},\:\: P_{U}(1) = \frac{1}{4}

1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

1,0,1,0,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

The first sequence is the all-one sequence, the second has 14 zeros and 6 ones, and the third is the all-zero sequence.

The probabilities of occurrence of these sequences are

\left(\frac{1}{4}\right)^{20}, \:\: \frac{3^{14}}{4^{20}}, \:\: \frac{3^{20}}{4^{20}}

respectively. Both the all-one and the all-zero sequence are very unlikely to come out of this source, because we have been given that P_{U}(0) = \frac{3}{4} and P_{U}(1) = \frac{1}{4}: if we take a large enough block, we should get 0 about three fourths of the time and 1 about one fourth of the time. (If instead P_{U}(0) = 0.99 and P_{U}(1) = 0.01, then the all-zero sequence would be the single most likely one.) The second sequence, with its 14 zeros and 6 ones, is close to these expected proportions: roughly \frac{3}{4} of the bits are 0 and the rest are 1.
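Because the source is memoryless, each sequence's probability is just the product of per-bit probabilities. A quick sanity check of the three values above (the function name and exact-arithmetic setup are my own):

```python
from fractions import Fraction

# DMS from the example: P(0) = 3/4, P(1) = 1/4
p = {0: Fraction(3, 4), 1: Fraction(1, 4)}

def seq_prob(bits):
    """Probability of a sequence from a memoryless source:
    the product of the per-symbol probabilities."""
    prob = Fraction(1)
    for b in bits:
        prob *= p[b]
    return prob

all_ones = [1] * 20
mixed = [1,0,1,0,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0]  # 14 zeros, 6 ones
all_zeros = [0] * 20

print(seq_prob(all_ones))   # (1/4)^20
print(seq_prob(mixed))      # 3^14 / 4^20
print(seq_prob(all_zeros))  # 3^20 / 4^20
```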

Now we will formally define a typical sequence and then show some properties of typical sequences.

1. Let \textbf{U} denote an output sequence of length L emitted by a K-ary DMS, and let P_{U}(u) be the output probability distribution.
2. Let n_{a_{i}}(\textbf{u}) denote the number of occurrences of the letter a_{i} in the sequence \textbf{u}.
3. Let \textbf{u} = [u_{1}, u_{2}, \dots, u_{L}] denote a possible value of \textbf{U}, i.e. u_{j} \in \{a_{1}, a_{2}, \dots, a_{K}\} for 1 \leq j \leq L.
Then \textbf{u} is an \epsilon-typical sequence of length \textbf{L} for this K-ary DMS if:

(1 - \epsilon)P_{U}(a_{i}) \leq \frac{n_{a_{i}}(\textbf{u})}{L} \leq (1 + \epsilon) P_{U}(a_{i}), \quad 1 \leq i \leq K

Please note that \epsilon is a small positive number. Consider again the earlier example: for \epsilon = 0.2, the second sequence is \epsilon-typical (14/20 = 0.7 \in [0.6, 0.9] and 6/20 = 0.3 \in [0.2, 0.3]), while the all-one and all-zero sequences are not.
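The definition can be checked directly on the three example sequences; a minimal sketch (the function name and chosen \epsilon values are my own):

```python
def is_typical(u, P, eps):
    """Check the epsilon-typicality condition
    (1-eps)*P(a) <= n_a(u)/L <= (1+eps)*P(a) for every letter a."""
    L = len(u)
    return all((1 - eps) * P[a] <= u.count(a) / L <= (1 + eps) * P[a] for a in P)

P = {0: 0.75, 1: 0.25}
all_ones = [1] * 20
mixed = [1,0,1,0,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0]  # 14 zeros, 6 ones
all_zeros = [0] * 20

for eps in (0.1, 0.2):
    print(eps, [is_typical(u, P, eps) for u in (all_ones, mixed, all_zeros)])
```

Only the mixed sequence is typical, and only once \epsilon is large enough (here \epsilon = 0.2) to cover the gap between 6/20 = 0.3 and P_{U}(1) = 0.25.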

## Properties

Property 1: If \textbf{u} is an \epsilon-typical output sequence of length \textbf{L} from a K-ary DMS with entropy H(U) in bits, then:

2^{-(1+\epsilon)LH(U)} \leq P_{U}(\textbf{u}) \leq 2^{-(1-\epsilon)LH(U)}

Proof: From the definition of a DMS, we have

P_{U}(\textbf{u}) = \prod\limits_{j=1}^L P_{U}(u_{j})

Since this is a K-ary source, each $u_{j}$ is one of the letters $a_{1}, a_{2}, \dots, a_{K}$. Say $a_{1}$ is emitted $n_{a_{1}}(\textbf{u})$ times, $a_{2}$ is emitted $n_{a_{2}}(\textbf{u})$ times, and so on, with $a_{K}$ emitted $n_{a_{K}}(\textbf{u})$ times.

Consequently, we can write:

P_{U}(\textbf{u}) = \prod_{j=1}^LP_{U}(u_{j}) = \prod_{i=1}^K[P_{U}(a_{i})]^{n_{ai}(\textbf{u})}

Here, we can say that:

P_{U}(\textbf{u}) = \prod_{i=1}^K[P_{U}(a_{i})]^{n_{a_{i}}(\textbf{u})} \geq \prod_{i=1}^K[P_{U}(a_{i})]^{(1+\epsilon)LP_{U}(a_{i})}

since typicality gives $n_{a_{i}}(\textbf{u}) \leq (1+\epsilon)LP_{U}(a_{i})$, and raising a number between 0 and 1 (which $P_{U}(a_{i})$ is) to a larger exponent decreases its value.

Equivalently (see logarithmic exchange part below),

P_{U}(\textbf{u}) \geq \prod_{i=1}^K2^{(1+\epsilon)LP_{U}(a_{i})\log_{2}P_{U}(a_{i})}

Simplifying we get,

P_{U}(\textbf{u}) \geq 2^{(1+\epsilon) L \sum\limits_{i=1}^KP_{U}(a_{i})\log_{2}P_{U}(a_{i})}

Hence,

P_{U}(\textbf{u}) \geq 2^{-(1+\epsilon)LH(U)}

Similar arguments can be used to prove

P_{U}(\textbf{u}) \leq 2^{-(1-\epsilon)LH(U)}
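To make Property 1 concrete, here is a small numeric check using the binary DMS from the example (the chosen sequence and \epsilon are my own illustration):

```python
import math

P = {0: 0.75, 1: 0.25}
# entropy in bits: H(U) = -sum P(a) log2 P(a)  (about 0.8113 here)
H = -sum(p * math.log2(p) for p in P.values())

L, eps = 20, 0.2
u = [1,0,1,0,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0]  # 0.2-typical: 14 zeros, 6 ones

prob = math.prod(P[b] for b in u)    # P_U(u) for a memoryless source
lower = 2 ** (-(1 + eps) * L * H)    # Property 1 lower bound
upper = 2 ** (-(1 - eps) * L * H)    # Property 1 upper bound
print(lower, prob, upper)            # lower <= prob <= upper
```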

Property 2: The probability 1 - P(F) that the length-L output sequence \textbf{U} from a K-ary DMS is \epsilon-typical satisfies:

1 - P(F) > 1 - \frac{K}{L\epsilon^{2}P_{min}}

where P_{min} is the smallest positive value of P_{U}(u).

Proof: We want to show that for large \textbf{L}, the output sequence \textbf{U} of the DMS is virtually certain to be \epsilon-typical.

Here, we will use the Tchebycheff inequality:

P(|\frac{n_{A}}{n} - P(A)| \geq \epsilon) \leq \frac{P(A)[1-P(A)]}{n\epsilon^2}

Let B_{i} denote the event that \textbf{U} takes on a value \textbf{u} for which the \epsilon-typicality condition is violated for the letter a_{i}.

Then,

P(B_i) = P\left(\left|\frac{n_{a_{i}}(\textbf{U})}{L} - P_{U}(a_i)\right| > \epsilon P_{U}(a_i)\right)

here, we can say that

P(B_i) \leq \frac{P_{U}(a_i)[1 - P_{U}(a_i)]}{L[\epsilon P_{U}(a_i)]^2}

Simplifying, we have

P(B_i) \leq \frac{1 - P_{U}(a_i)}{L \epsilon^{2}P_U(a_i)}

Letting P_{min} be the smallest non-zero value of P_{U}(u), and using 1 - P_{U}(a_i) < 1 together with P_{U}(a_i) \geq P_{min}, we get

P(B_{i}) < \frac{1}{L \epsilon^{2}P_{min}}

Let F be the failure event that \textbf{U} is not \epsilon-typical.

Since F occurs only if at least one of the events B_i, 1 \leq i \leq K, occurs, the union bound gives:

P(F) \leq \sum_{i=1}^{K}P(B_i) < \frac{K}{L \epsilon^2 P_{min}}
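The bound is loose for small L, but it is easy to compare against simulation; a sketch, where the sampling setup and parameter choices are my own:

```python
import random

random.seed(0)

P = {0: 0.75, 1: 0.25}
L, eps = 200, 0.2
P_min = min(P.values())

def is_typical(u):
    """Epsilon-typicality condition from the definition above."""
    return all((1 - eps) * P[a] <= u.count(a) / L <= (1 + eps) * P[a] for a in P)

# estimate P(F) by sampling length-L sequences from the DMS
trials = 10000
failures = sum(1 for _ in range(trials)
               if not is_typical(random.choices([0, 1], weights=[0.75, 0.25], k=L)))

bound = len(P) / (L * eps ** 2 * P_min)  # K / (L eps^2 P_min)
print(failures / trials, "<", bound)     # empirical P(F) vs. the union bound
```

At these parameters the bound equals 1 and is vacuous; the empirical failure rate is far smaller, and the bound tightens as L grows.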

Property 3: The number M of \epsilon-typical sequences \textbf{u} of length L from a K-ary DMS with entropy H(U) in bits satisfies:

(1 - \frac{K}{L \epsilon^2 P_{min}}) . 2^{(1-\epsilon)LH(U)} < M \leq 2^{(1 + \epsilon)LH(U)}

Proof: We know that

1 = \sum_{\forall \textbf{u}} P_{U}(\textbf{u}) \geq \sum_{\epsilon\text{-typical } \textbf{u}} P_{U}(\textbf{u}) \geq M \cdot 2^{-(1+\epsilon)LH(U)}

since, by Property 1, each \epsilon-typical sequence has P_{U}(\textbf{u}) \geq 2^{-(1+\epsilon)LH(U)}.

This gives the upper bound

M \leq 2 ^ {(1 + \epsilon)LH(U)}

Now we prove the lower bound. The total probability of the \epsilon-typical sequences is 1 - P(F), and by Property 1 each of them has probability at most 2^{-(1-\epsilon)LH(U)}, so

1 - P(F) \leq M \cdot 2^{-(1-\epsilon)LH(U)}

Using Property 2, we know that P(F) < \frac{K}{L \epsilon ^2 P_{min}}.

This gives the lower bound.

M > (1 - \frac{K}{L \epsilon^{2}P_{min}}) . 2 ^ {(1 - \epsilon)LH(U)}
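For small L, both bounds on M can be checked by brute-force enumeration; a sketch using the binary DMS from the example (parameters are my own, and at this small L the lower bound is negative and therefore vacuous):

```python
import math
from itertools import product

P = {0: 0.75, 1: 0.25}
H = -sum(p * math.log2(p) for p in P.values())  # entropy in bits
L, eps = 16, 0.2
P_min, K = min(P.values()), len(P)

def is_typical(u):
    """Epsilon-typicality condition from the definition above."""
    return all((1 - eps) * P[a] <= u.count(a) / L <= (1 + eps) * P[a] for a in P)

# count M by enumerating all 2^L binary sequences
M = sum(1 for u in map(list, product([0, 1], repeat=L)) if is_typical(u))

lower = (1 - K / (L * eps ** 2 * P_min)) * 2 ** ((1 - eps) * L * H)  # negative here
upper = 2 ** ((1 + eps) * L * H)
print(lower, "<", M, "<=", upper)
```

Here only sequences with exactly 4 ones qualify, so M = \binom{16}{4} = 1820, comfortably inside the bounds.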

Now, let's look at these results when \textbf{L} is very large and \epsilon is small. We have shown three properties of typical sequences:

1. each \epsilon-typical sequence has probability approximately equal to 2^{-LH(U)};
2. the total probability of the \epsilon-typical sequences is very nearly 1;
3. there are roughly 2^{LH(U)} \epsilon-typical sequences \textbf{u}.

## Appendix

Logarithmic Exchange:

$b^{\log_{b}(x)} = x$

Proof: $\log_{b}(x)$ answers "b to what power equals x?" via the following property:

\log_{b}(x) = y \Leftrightarrow b^{y} = x

Variance and Mean of a Binary Random Variable:

Proof: Let's define a binary random variable first:

P(X=0) = 1 - a \quad \text{and} \quad P(X=1) = a

then, we can calculate the mean as following:

\begin{aligned} E[X] &= (1 - P(X=1)) \times 0 + P(X=1) \times 1 = P(X=1) \\ &= (1 - a) \times 0 + a \times 1 \\ &= a \end{aligned}

likewise, we can calculate the variance as following:

\begin{aligned} \mathrm{Var}[X] &= E[(X-\mu)^2] \\ &= (1 - P(X=1)) \times (0 - P(X=1))^2 + P(X=1) \times (1 - P(X=1))^2 \\ &= (1 - a) \times (0 - a)^2 + a \times (1 - a)^2 \\ &= (1 - a) \times (a^2 + a - a^2) \\ &= (1 - a) \times a \end{aligned}
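A quick empirical check of these two formulas (the sampling setup and chosen a are my own):

```python
import random

random.seed(1)

a = 0.25  # P(X=1) = a, matching the derivation above
n = 200000
samples = [1 if random.random() < a else 0 for _ in range(n)]

mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n
print(mean, var)  # should be close to a = 0.25 and a(1-a) = 0.1875
```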
