Towards Analytical Information Theory: Recent Results on Lempel-Ziv Data Compression Schemes

Towards Analytical Information Theory: Recent Results on
Lempel-Ziv Data Compression Schemes

Wojciech Szpankowski

Purdue University

Algorithms Seminar

September 23, 1996

[summary by Philippe Flajolet]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

Abstract

The Lempel-Ziv algorithms are well-known dynamic dictionary algorithms of use in data compression. The talk shows how analytic models originally developed for the analysis of tries and digital search trees may be used to characterize their compression characteristics.

1 The Lempel-Ziv algorithms

A text to be compressed is always a message to be transmitted (to your friend, to your laser printer) that has a ``meaning'', hence a certain structure that goes along with some sort of ``redundancy''. In a natural language like English, trigrams like `ted' or `ing' are much more likely to be encountered, and much more frequently so, than `qrm' or `bzw', and a text on data compression is likely to contain more repetitions of `algorithms' and fewer of `smoking' than a text on public health. Compression algorithms precisely try to capture such regularities.

These observations have given rise to a first generation of methods. For a text in English, list all the conceivable trigrams (say!) in the language and transmit trigrams codes instead of individual letters. In addition, shorter codes may be assigned to more frequent trigrams, yielding further gains---this can be done efficiently by Huffman's algorithm. These methods are known as ``static dictionary algorithms''. They have the obvious drawback of not being adaptive; a scheme originally designed for English text is not likely to accommodate well Sanskrit epics, postscript code, or image bitmaps.

Such was essentially the state of the art before the appearance of the celebrated Lempel-Ziv (LZ) papers in 1977 and 1978; see [10, 11]. The LZ algorithms---there are two basic ones and a denumerable collection of variants---can be viewed as building adaptively a ``dynamic dictionary'' that is dependent upon the particular text subjected to compression. Their common basis is:

The ``deja vu'' principle.

(Pronounce as ``day-jah voo''!) As the text proceeds, it is parsed into segments also called phrases. Instead of transmitting the letters of the text itself, just transmit references to places in the text where each segment was encountered before.

More precisely, the LZ77 and LZ78 algorithms, as considered here, are defined by:

LZ77. Scan the text. Starting a new segment, search for the longest matching factor already encountered in the past. Transmit its reference (position and length) and the next letter.

LZ78. Scan the text. Starting a new segment, search for the longest matching segment already encountered in the past. Transmit its reference (rank) and the next letter.

For instance, given a text that is a long sequence of a's, the two parsings start like

LZ77: | a | aa | aaaa | aaaaaaaa | ···
LZ78: | a | aa | aaa | aaaa | aaaaa | ···

In other words, LZ77 defines the new segment as anything (of maximal length) that has already been seen before, possibly across boundaries of previously defined segments, while LZ78 respects the boundaries of previously defined segments.

For LZ77, the transmission of each segment is of the type <position, length, letter> (position of the previous occurrence and length of the factor, plus new letter). For LZ78, it is of the type <rank, letter> (rank of the already encountered segment, plus new letter). So, in the case of our long string of a's, the next segment formed is encoded as

LZ77: <1, 15, a>, LZ78: <5, a>.

For LZ77, this says: ``repeat the 15 characters starting at position 1 in the text and append an a''. For LZ78: ``repeat the 5th segment already parsed and append an a''. Reconstruction of the source text at the receiving end is then particularly easy as it suffices to ``expand'' the references.

Another example is provided by the two sequences

LZ77: <0,0,a> <0,0,b> <0,0,r> <1,1,c> <1,1,d> <1,4,#>
LZ78: <0,a> <0,b> <0,r> <1,c> <1,d> <1,b> <3,a> <0,#>

that encode `abracadabra# (with a terminator symbol). As these examples demonstrate, the LZ77 algorithm ``learns'' faster but, the implied dictionary being larger, references are more costly as their encodings require more bits. Fine analysis is thus needed in order to characterize the various tradeoffs involved.

Variants and implementations

There exist a great many variants of the LZ77 and LZ78 algorithms.

LZ77: One can consider that the entire alphabet sequence is prepended to the text. In this case, the new segments need not include a new-letter field, as each letter at least has been already encountered before. Also, one can limit the past fraction of the text to which the ``deja vu'' principle is applied (historically, to 8192 characters). One obtains in this way an algorithm close to the original LZ77 specification.
LZ78: Similarly, one can update the dictionary by deleting old segments (say, on a least recently used basis), and thus maintain a dictionary of an a priori bounded size.

In LZ77, when a new segment is started at position (n+1) in the text, one must look for a longest factor that has occurred before. This corresponds to a virtual dictionary D that should contain all the 1/2n(n+1) factors of the text, as seen so far. There are two immediate solutions to this problem: (i) don't store the dictionary and use naïve string searching in the text itself; (ii) build a digital tree (also known as trie) of all suffixes of the part of the text seen so far. The time/space complexity pairs of these solutions are á O(n²),O(n)ñ and á O(n²),O(n²)ñ, respectively. The trie solution is interesting since a construction (the suffix tree) is known in order to eliminate duplication of informations; the suffix tree implementation has an á O(n),O(n)ñ complexity but is somewhat intricate to implement.

In contrast, LZ78 is much easier to implement. It suffices to maintain a digital search tree of all the segments encountered so far. A new segment is detected by following a path in the tree. Each internal node of the tree is associated with such a segment seen in the past and insertion takes place at an external node, which corresponds to extending a previously encountered segment.

These algorithms are famous. They are used in the Unix compress and zip commands (based on LZ78 and LZ77, resp.), in data transmission (the V42bis standard for modems), in image compression (the gif format), etc. Jump to your favourite web search engine for details.

2 Models and analysis

Classical information theory teaches us that the best rate at which a text can be compressed is given by the entropy function. Here, we concentrate on a binary memoryless channel, where each character in the source text has probability p of being a 0 and probability q=1-p of being a 1. In that case, the (binary) entropy is

h=plog₂

+qlog₂

. (1)

A random text of n characters can then be compressed into no less than

h· n (1+o(1)) (2)

bits, on average. The redundancy of a compression scheme is a measure of its distance to the information-theoretic lower bound (2).

It is already known that the two Lempel-Ziv algorithms achieve asymptotically the lower bound, so that the redundancy per character is o(1) for both schemes. In other words, what is needed for further comparison is a second order asymptotic analysis of these algorithms. The talk centers on the LZ78 algorithm that has a mathematically pleasant decomposable structure closely related to digital search trees. Digital search tree intervene both as a data structure in the implementation of LZ78 and as a probabilistic model of random trees.

Digital search trees

Recall that a digital search tree or DST is a hybrid of the digital trie and the binary search tree defined as follows. A sequence S=(s₁,...,s_m) of m binary strings is given. The digital search tree dst(s) is recursively defined by: (i) the root contains the first string s₁; (ii) the left subtree is formed by taking the subsequence S^[0] of (s₂,...,s_m) of strings starting with a 0 and stripped of this initial 0; (iii) the right subtree is formed similarly from strings starting with a 1. In other words,

dst(S)=á s₁, dst(S^[0]\ 0), dst(S^[1]\ 1) ñ,

where S\ j means the sequence S with all its elements stripped of their initial letter j, and with dst(s)=á s,Ø,Øñ for a one element sequence.

Assume that strings obey the Bernoulli model where each component bit b satisfies

Pr{b=0}=p, Pr{b=1}=q=1-p,

independently of the others. This model implies the random DST model [9], where a tree of size m has a left subtree of size K and a right subtree of size m-1-K satisfying

Pr{K=k}=

æ
è

m-1

ö
ø

p^k q^m-1-k. (3)

In the unbiased case, p=q=1/2, this model has been analysed by Knuth, Coffman and Eve, as well as Konheim and Newman; see the references in [9]. It is known for instance that the expected path length, under the biased model, is of the form

mlog₂ m+ m

æ
ç
ç
è

g-1

log 2

-a +d(log₂ m)

ö
÷
÷
ø

+O(m).

There a:=å_k³1 (2^k-1)^-1 and d is a periodic function with magnitude less than 10^-6. Height and other parameters are studied by Aldous and Shields in [1].

Data compression

Consider an execution of the LZ78 algorithm when n characters of the text have been scanned, assuming that a new segment starts at position (n+1). Under the binary memoryless model, the number of segments formed so far is a random variable M_n that is also the number of internal nodes of the companion dst built by the algorithm. By design, the dst is built of keys that obey Eq. (3), with M_n=m. It is also realized easily that the internal path length of the corresponding tree is equal to n.

Thus, there are two closely related models that are in a way ``inverse'' of each other:

the random dst model where the number of strings m is fixed, and path length of the tree is to be analysed;
the compression model where a dst is built by successive insertions until path length attains a fixed bound n, and the size of the tree (corresponding to the size of the compressed text) is to be analysed.

Analysis starts with the random dst model. Inversion will then be realized by an application of renewal theory.

3 The random digital search tree model

From an analytical standpoint, this model is the most natural one, since the random tree process given by (3) is a decomposable one. The novelty comes from the fact that, till recently, only the unbiased case p=q=1/2 had been analyzed. Jacquet and Szpankowski [7] have proved the following.

Theorem 1 Let L_m be the path length of a digital search tree built on m random strings according to the Bernoulli model. Then, the mean and variance of L_m satisfy

E L_m=

æ
ç
ç
è

log₂ m+

h₂

+g-1-a+d₀(log m)

ö
÷
÷
ø

+O(log m),

Var L_m = c₂ mlog m+O(m),

with d₀ a periodic function of mean value 0 and small amplitude, and

h=plog₂

+qlog₂

, h₂=plog₂²p+qlog

₂²q, a=-

k=1

p^k+1log₂ p+q^q+1log₂ q

1-p^k+1-q^k+1

, c₂=

h₂-h²

h³

Furthermore, the distribution of L_m is asymptotically normal,

L_m-EL_m

(Var L_m)^1/2

®^d N(0,1).

As is well-known, there are three ways to conduct such an analysis, with Poisson, exponential, or ordinary generating functions (PGF, EGF, OGF). The proof given in [7] is of the Poisson type. The bivariate PGF satisfies then the nonlinear difference-differential equation,

¶ L(z,u)

¶ z

=L(pzu,u)L(qzu,u),

with L(z,0)=1. The approach starts with a quasi-linearization technique of [5], by considering l(z,u)=log L(z,u). By bootstrapping, the functional equation is solved in larger and larger pseudo-cones. Solutions are estimated asymptotically by means of the Mellin transform technology summarized in [3]. Then, mean, variance, and limit distribution follow for the Poisson model. Translation to the Bernoulli model (i.e., the binary memoryless channel) is then achieved by ``analytic depoissonization'', i.e., an adequate use of the saddle-point method originating in Jacquet and Régnier's analysis of path length in tries. The technical difficulties are rather formidable, but the results obtained are extremely precise.

4 The compression model

It is now possible to return to the LZ78 algorithm. We recall that M_n is the number of segments (or phrases) built on a random text of length n. We have:

Theorem 2 Let M_n be the number of phrases built on n characters of the text by LZ78, according to the binary memoryless model. Then, the mean and variance of M_n satisfy

E M_n ~

log₂ n

, Var M_n =

c₂h³n

log₂² n

Furthermore, the distribution of M_n is asymptotically normal,

M_n-EM_n

(Var M_n)^1/2

®^d N(0,1).

The proof of this result is in [7]. The ideas have been further refined by Louchard and Szpankowski in [8], and this leads to a second order asymptotic analysis, hence a characterization of redundancy of LZ78.

The ``inverse'' relations alluded to in Section 2 are expressed precisely by the equality,

Pr{M_n>m}=Pr{L_m£ n}.

This is known as the renewal equation. Theorem 1 gives good estimates of the right hand side. Roughly speaking, we have, by Theorem 1, a known dependency between L (path length) and M (size of the dst),

L_M »

Mlog₂ M+ (c₂Mlog₂ M)^1/2 X,

where X is a unit Gaussian variate that represents random fluctuations. This can be formally inverted, leading to

M »

log₂ L

+ (

c₂h³L

log₂² L

)^1/2 X.

This is exactly what Theorem 2 expresses and the process is made valid by an appeal to the general theory of renewal equations; see [2, Thm. 17.3].

From there, it is possible to solve the redundancy problem of Section 1; see [8] for details. Define the redundancy of LZ78 as

_n=

(

E{M_n(log₂ M_n+1)}-nh

)

The idea is that there are M_n phrases and each of them costs about log₂ M_n bits. (The term +1, there, is implementation specific but not essential.)

Then moment bounds and inequalities justify a precise version of the approximation

E{M_nlog₂ M_n}» E{M_n}log₂E{M_n}.

Also, the speed of convergence to the normal limit in Thm. 1 is seen to be O(m^-1/2). This gives all the ingredients for a second-order asymptotic analysis of LZ78.

Theorem 3 The global redundancy of Lempel-Ziv's LZ78 algorithm is O(1/log n). More precisely,

_n~

log₂ n

æ
ç
ç
è

2h-hg-

h₂+ha-hd₀(log₂ n)

ö
÷
÷
ø

Note that the corresponding redundancy of LZ77 is known to be

_n^* =O(

loglog n

log n

and this is conjectured to be the right order. Therefore, assuming this conjecture, the LZ78 algorithm---based on digital search trees and segment boundaries---is less ``redundant'' and deviates less than LZ77 from the information-theoretic optimum.

5 Related problems

The talk mentions two other types of results:

an extension to Markovian models, where systems of difference-differential equations need to be considered;
a modified LZ78 algorithm [8] based on the b-digital search tree, a data structure that had been analysed in the average-case by Flajolet and Richmond [4].

One of the major open problems in the area is the precise analysis of redundancy for the LZ77 algorithm. This is harder as the basic digital tree decomposition is no longer available, and ``overlaps'' in strings must be taken into account. Perhaps the approach of [6] could be useful.

References

[1]: Aldous (David) and Shields (Paul). -- A diffusion limit for a class of randomly-growing binary trees. Probability Theory and Related Fields, vol. 79, n°4, 1988, pp. 509--542.
[2]: Billingsley (Patrick). -- Probability and Measure. -- John Wiley & Sons Inc., New York, 1986, second edition, Wiley Series in Probability and Mathematical Statistics, xiv+622p.
[3]: Flajolet (Philippe), Gourdon (Xavier), and Dumas (Philippe). -- Mellin transforms and asymptotics: harmonic sums. Theoretical Computer Science, vol. 144, n°1-2, 1995, pp. 3--58. -- Special volume on mathematical analysis of algorithms.
[4]: Flajolet (Philippe) and Richmond (Bruce). -- Generalized digital trees and their difference-differential equations. Random Structures and Algorithms, vol. 3, n°3, 1992, pp. 305--320.
[5]: Jacquet (Philippe) and Régnier (Mireille). -- Trie partitioning process: Limiting distributions. In Franchi-Zanetacchi (P.) (editor), CAAP'86, Lecture Notes in Computer Science, vol. 214, pp. 196--210. -- 1986. Proceedings of the 11th Colloquium on Trees in Algebra and Programming, Nice, France, March 1986.
[6]: Jacquet (Philippe) and Szpankowski (Wojciech). -- Autocorrelation on words and its applications: Analysis of suffix trees by string-ruler approach. Journal of Combinatorial Theory, Series A, vol. 66, n°2, 1994, pp. 237--269.
[7]: Jacquet (Philippe) and Szpankowski (Wojciech). -- Asymptotic behavior of the Lempel-Ziv parsing scheme and digital search trees. Theoretical Computer Science, vol. 144, n°1--2, 1995, pp. 161--197.
[8]: Louchard (Guy) and Szpankowski (Wojciech). -- On the average redundancy rate of the Lempel--Ziv algorithm. -- Preprint, 1996. 15 pages. Preminary version presented at the Data Compression Conference, Snowbird, 1996.
[9]: Mahmoud (Hosam M.). -- Evolution of Random Search Trees. -- John Wiley & Sons Inc., New York, 1992, Wiley-Interscience Series in Discrete Mathematics and Optimization, xii+324p.
[10]: Ziv (Jacob) and Lempel (Abraham). -- A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, vol. 23, n°3, May 1977, pp. 337--343.
[11]: Ziv (Jacob) and Lempel (Abraham). -- Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, vol. 24, n°5, September 1978, pp. 530--536.

This document was translated from L^AT_EX by H^EV^EA.