Analytic Information Theory and the Redundancy Rate Problem

Wojciech Szpankowski

Department of Computer Sciences, Purdue University

Algorithms Seminar

February 13, 2000

[summary by Philippe Flajolet]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

1   Information, Entropy, and Codes

One of the most basic problems of information theory [1] is that of source coding. A source is by definition a mechanism that produces messages over a finite alphabet  A, a message of length n being conventionally denoted by x1n=(x1,...,xn). A code C is a translation mechanism (an injective function, an algorithm) that, for each n, takes as input a message from  An and transforms it into a binary sequence. Such a translation is thus a fixed-length to variable-length encoding.

Messages have some structure. For the English language source, the sequence `Rzqxwa gkvzzxq wzd aaaaaaa rxbleurp' is much less likely than the sequence `It rained yesterday over England'. Indeed, some letters are more frequent than others, certain letter combinations are impossible, etc. It is then customary to try and capture the principal features of the source by some probability distribution of sorts over  An. The main models considered in the talk are the following. As discovered by Shannon around 1949, information is measured by entropy. The entropy of a probability distribution P={ps}s S over any finite set S is defined as
H(P):=-
 
s S
ps lg ps,
where lg x=log2 x. (Roughly, the definition extends the fact that an element in a set of cardinality m needs to be encoded by about lg m bits in order to be distinguished from its companions elements.) Most ``reasonable'' source models have an entropy rate h; namely, if x1n is randomly drawn according to the source model P, then the following limit exists,
h=
 
lim
n
-
1
n
 
x1n An
P(x1n)lg P(x1n).
For instance, the entropy rate of a sequence drawn according to the memoryless model equals the entropy of the distribution of individual characters. For a Markov chain with transition probabilities pi,j, the entropy rate is
h=
 
i
pi
 
j
pi,jlg pi,j,
with pi the stationary probability distribution of the chain. The entropy rate of written English is estimated to be about 1.3 bits per character.

Sources produce messages which are not uniformly random and this lies at the basis of data compression---the fact that one may find codes that tend to be shorter than the original message. (E.g., the present summary is compressed by gzip at a rate of about 3.5 bits per character.) We cannot compress arbitrarily however. The most fundamental theorem of information is due to Shannon and asserts the following: You cannot beat entropy. In other words, any code has an expected length per character that is at least as large as the entropy rate of the source.

Another famous theorem of Shannon goes the other direction and asserts: The entropy rate is asymptotically achievable. This leaves plenty of room for algorithmic design. As a matter of fact, coding algorithms separate into two groups: (i) codes that are designed for a specific (known) probability distribution over the inputs; (ii) universal codes that do not assume such a probabilistic distribution to be known a priori and do their best to come close to the optimum over an entire class of models. Amongst the first group, we find Huffman codes [3, pp. 402--406] and Shannon--Fano1 codes [1, pp. 101--103]. Amongst the second group, the best known algorithms are the ones due to Lempel and Ziv2 in 1977 and 1978.

2   Redundancy of Classical Codes

The codes normally considered are at least near-optimal with respect to the entropy lower bound. Define first the pointwise redundancy of a code C with respect to a model P as
Rn(C,P;x1n):=L ( C(x1n) ) +lg P(x1n),
where L is length. Two critical parameters are then the average redundancy and the maximal redundancy defined by
_
R
 
n(C,P)=E [ Rn(C,P;x1n) ] ,     Rn*(C,P)=
 
max
x1n
[ Rn(C,P;x1n) ] .     (1)
where both average and maximum are meant with respect to x1n. In other words, the question asked is: How far are we from the information theoretic optimum, either on average or in the worst case? There, we assume the source distribution to be known and the code to be fixed, and analyse the redundancy parameters of the given code.

In this perspective, the talk first reviews results relative to the classical Huffman code and to a version of Fano--Shannon codes, this in the case of a memoryless source. Redundancy is then O(1) but with fluctuations that depend on the fine arithmetic structure of the parameters of the model under consideration; see Figure 1. The methods use Fourier analysis and Gleichverteilung mod 1.



Figure 1: Huffman code redundancy for a memoryless source with control parameter a=lg(1/p-1): (a) irrational case (p=1/p); (b) rational case (p=1/9).


Louchard and Szpankowski (1997), Savari (1997), Wyner (1998), and Jacquet--Szpankowski (1995) proved that the Lempel--Ziv algorithms under either a memoryless or a Markov model have rates that are Q(n/log n) for LZ'783 and Q(nloglog n/log n) for LZ'77. The proofs provide detailed asymptotic information on the redundancy. The results again involve subtle fluctuations. The analysis is close to that of digital tries, with Mellin transforms playing a prominent rle.

3   Minimax Redundancy for Classes of Source Models

The strong redundancy-rate problem asks what can be achieved when the source model ranges over a whole class of sources  S. Thus, the source model is a bit constrained but basically unknown and the question becomes information-theoretic rather than algorithmic (no coding algorithm is fixed any more). Consider redundancies in the sense of (1) and define the minimax redundancies,
_
R
 
n( S )=
 
min
C
 
max
P S
_
R
 
n(C,P),      Rn*( S )=
 
min
C
 
max
P S
_
R
 
n*(C,P),     (2)
corresponding to an average-case or a worst-case scenario, respectively. By their definitions, these quantities represent the additional cost on top of entropy incurred (at least) by any code (this is minC) in order to be able to cope with all sources (this is maxP S).

It would seem that the minimax problem of estimating the quantities in (2) is intractable. However, Shtarkov proved in 1978 that the (worst-case) minimax redundancy is narrowly bounded by the (Shtarkov) inequalities
lg
 
x1n An
 
sup
P S
P(x1n) Rn*( S) 1+lg
 
x1n An
 
sup
P S
P(x1n).     (3)
There the quantity sup P(x1n) could be termed a ``maximum likelihood coefficient'' since it describes the probability of any individual realization x1n under the model P S that assigns to it the highest probability. Take for instance a binary word x1n{0,1}n comprising k letters 0 and n-k letters 1, and S the class of all memoryless models with P(0)=p and P(1)=1-p. Clearly, the maximum likelihood coefficient is given by the Bernoulli distribution whose parameter is p=k/n (maximum likelihood probabilities equal frequencies), and its value is (k/n)k((n-k)/n)n-k. The sum appearing in (3) then evaluates to
An :=
 
x1n An
 
sup
P S
P(x1n) =
n
k=0

n
k




k
n



k



 



n-k
n



n-k



 
.
This has the same flavour as Abel's identities. Indeed, we have
An=
n!
nn
 [zn]
1
( 1-T(z) ) 2
  where   T(z)=zeT(z)
is the tree function. It is then a simple matter, by singularity analysis of the tree function, to get
An~
1
2
 
n!en
nn
~(
p n
2
)1/2   and   lg An=
1
2
lg n+lg(
p
2
)1/2+o(1).
The quantity lg An is at most 1 from the minimax redundancy as results from inequalities (3).

Renewal sources

Another topic of the talk is to analyse redundancy for the class of renewal sources defined as follows. This class of sources makes for an interesting study since minimax redundancy turns out to be O((n)1/2 ); see [2] for a complete analysis.

The maximal likelihood approach leads to the consideration of the sum
rn=
 
k
 
P(n,k)

k
k0,..., kn-1




k0
k



k0



 



k1
k



k1



 
...


kn-1
k



kn-1



 
.
There, the summation condition P(n,k) is n=k0+2k1+, k=k0+k1+ . The computation heavily involves the tree function T(z) and proceeds in several steps.

First, one disposes of the normalizing factor of k!/kk~ e-k(2p k)1/2 by introducing as an artefact a random variable Kn and relating rn to E[(2p Kn)1/2 ]. Second, the distribution of Kn is described by the bivariate generating function
S(z,u):=
i=1
b(ziu)  where   b(z)=
1
1-T(ze-1)
.
This has roughly the character of (the square root of) a partition generating function with u marking the number of parts. Third, the saddle-point method is applied to extract coefficients. Fourth, the saddle-point analysis conduces to a local analysis near 1 that is solved by Mellin transform techniques. The eventual result is that
lg rn =
2
log 2
(


p2
6
-1


n)1/2 -
5
8
lg n +
1
2
lg log n +O(1),
and this quantity closely approximates the minimax redundancy of renewal sources by Shtarkov's inequalities. Note the asymptotic form rn e(n)1/2 that is typical of partition estimates.

Conclusion

The redundancy problem is typical of situations where second-order asymptotics are essential. Such problems of information theory are thus candidates par excellence for the methods of analytic information theory. By this, it is meant the study of randomness in words and codes by means of the classical methods of analytic combinatorics. The reader interested in these questions will be well-advised to consult the forthcoming book by Szpankowski [4] and references therein.

References

[1]
Cover (Thomas M.) and Thomas (Joy A.). -- Elements of information theory. -- John Wiley & Sons Inc., New York, 1991, xxiv+542p. A Wiley-Interscience Publication.

[2]
Flajolet (Philippe) and Szpankowski (Wojtek). -- Analytic Variations on Redundancy Rates of Renewal Processes. -- Research Report n°3553, Institut National de Recherche en Informatique et en Automatique, 1998. 10 pages. Submitted to IEEE Transactions on Information Theory.

[3]
Knuth (Donald E.). -- The art of computer programming. -- Addison-Wesley Publishing Co., Reading, Mass.-London-Amsterdam, 1997, third edition, xx+650 pp.p. Volume 1: Fundamental algorithms.

[4]
Szpankowski (Wojciech). -- Average-case analysis of algorithms on sequences. -- John Wiley & Sons Inc. To appear, preliminary version available from http://www.cs.purdue.edu/people/spa.

1
To design a Shannon--Fano code for the distribution P on S, partition S as S=S0 S1 in such a way that the probabilities of S0 and S1 differ by as little as possible from 1/2. All elements of Sj are assigned a code that starts with j. Proceed recursively.
2
Roughly, the LZ algorithms recognize, as characters flow, the frequently repeated blocks of letter and avoid copying these over and over again, but instead output pointers to the location of the first occurrence of such a block.
3
LZ'78 parses a sequence into ``phrases'' and outputs a pointer to the longest phrase already encountered; LZ'77 outputs a pointer to the longest factor already encountered.

This document was translated from LATEX by HEVEA.