The Analysis of Hybrid Trie Structures

Julien Clment

Algorithms project, INRIA Rocquencourt

Algorithms Seminar

October 20, 1997

[summary by Frdric Cazals]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

1   Introduction

Tries are a general-purpose data structure of the dictionary type, that is supporting the three main operations Insert, Delete and Query. To see how they are defined, let A={aj}j=1r be an alphabet and S be a set of strings defined over A. The trie associated to S is recursively defined by the rule
trie(S)=trie(S\ a1),trie(S\ a2),...,trie(S\ ar)
where S \ a refers to the contents of S consisting of strings that start with a and stripped of their initial letter, and the recursion stops as soon as S contains one element.

Searching a trie T for a key w just requires tracing a path down the trie as follows: at depth i, the ith digit of w is used to orientate the branching. (Insertions and deletions are handled in the same way.) To complete the description, we need to specify which search structure is used to choose the correct sub-trie within a node. The main possibilities are:
  1. the ``array-trie'' which uses an array of pointers to sub-tries. This solution is relevant for small alphabets only, otherwise too many empty pointers are created;
  2. the ``list-trie'' that remedies the high-storage requirement of the ``array-trie'' by using a linked list of sub-tries instead. The drawback is a higher cost for the traversal;
  3. the ``bst-trie'' which uses binary-search trees (bst) as a trade-off between the time efficiency of arrays and the space efficiency of lists.
In particular, the bst-trie can be represented as a ternary tree where the search on letters is conducted like in a standard binary search tree, while the tree descent is performed by following an escape pointer upon equality of letters. We shall refer to this data structure as a ternary search trie or tst. An example trie with its basic representation and the equivalent ternary search trie over the alphabet A={a,b,c} is represented on fig. 1 and 2. As is well known, the performances of tries depend upon the probabilistic properties of the strings processed. More precisely, we shall work with two types of models: The quantities we are interested in to capture the tries performances are defined as follows:
Definition 1   The comparison path length of a tst t is defined as the sum of the distances from the root to the external nodes, expressed in number of comparison pointers. Similarly, for a string s, the search cost R(s,t) is defined as the number of comparison pointers followed when accessing s in t. More precisely,
L(t)=l(root (t)) +
r
i=1
L(ti)   and   R(ai s, t)=ri(root(t))+R(s,ti)     (1)
with l() the number of external nodes in the sub-tries pointed at by comparison pointers and ri() the cost of searching ai in the bst present at the root of t.

2   Tools Used to Perform the Analysis

2.1   Left-to-Right Maxima, Shuffle Product and Formal Laplace Transform

Let w be a word of A*. The ith letter of w denoted wi is called a left-to-right maximum if wi wj, j=1,...,i-1. For example, the permutation a2 a3 a1 a5 a4 has three left-to-right maxima, respectively a2, a3 and a5. If one builds a bst from a permutation, the left-to-right maxima are naturally in bijection with the nodes located on the rightmost branch. For the analysis to be performed in section 3.1, we therefore investigate left-to-right maxima.

Clearly, all the possible decompositions of words by sets of left-to-right maxima are encoded by the regular expression
A
*
 
=
r
j=1
(e + aj(a1+...+aj)
*
 
).
Marking ai by the two variables zxi together with u if it is a left-to-right maximum yields the generating function
N
 
max
(z,u,x1,x2,...,xr)=
r
j=1



1 +
z u xj
1-z(x1+...+xj)



whose coefficient [znukx1n1... xrnr] is the number of words of length n that have k maxima and nj occurrences of the letter aj. A similar formula holds for Nmin, the generating function of minima.

Another tool we shall use is the shuffle product W from which a word can be decomposed into words contained certain letters only:
( A \ {a})
*
 
= (a1+...+a
 
a-1
)
*
 
W (a
 
a+1
+...+ar)
*
 
.
This product corresponds to the following operation on generating function



 
n
fn zn


W


 
n
gn zn


=
 
n



n
k=0

n
k

fkgn-k


zn.
A nice way to handle it is through the formal Laplace transform defined by L[n fnzn/n!] = n fnzn. In particular we have f(z) W g(z) = L[ L-1[f(z)] L-1[g(z)]].

3   Main Results

3.1   Search Costs in Bst

We analyze the cost of searching a letter aa in the bst bst(w) built from the letters of a word w from A*. More precisely: given a letter aa and a word w, the search cost ca(w) is defined as the number of edges on the branch corresponding to aa in the bst built from the letters of w. In particular, we are interested in condensing the cost related informations in the formal sum
C
 
a
:=
 
w A
*
 
u
c
 
a
(w)
 
w     (2)
whose exponent of u refers to the search cost ca(w). To see how ca(w) can be evaluated, observe that w=prefix(w)   aa?   suffix(w) where aa? means that the letter aa may be absent. The interest of this decomposition is to show that the cost of searching aa in the bst built from w is chargeable to prefix(w). And since prefix(w) can be expressed as the shuffle product on the sets of letters a1,...,aa-1 with aa+1,...,ar, the formal sum (2) yields the value of Ca(z,u,x1,...,xr):
(N
 
max
(z,u,x1,...,x
 
a-1
) W N
 
min
(z,u,x
 
a+1
,...,xr))


1+
z x
 
a
1-z(x1+...+xr)



.
This form condenses all the information on costs. For example, the generating function of average costs is obtained by differentiating with respect to u and setting u=1. For example:
Theorem 1   The mean search cost of the letter aa in a bst built from the Poisson model is
E [C
 
a
]=
 
min(u,v) jmax(u,v)
j a
nj
P
 
[j,a]


1-e
-z P
 
[j,a]
 


,    with    P
 
[u,v]
=
 
j
pj.

3.2   Exact Analysis

In this section, we outline the analysis of the statistics introduced in def. 1. The crux of this analysis consists in using quantities that are independent from the source model the keys are generated by. To see how this works, consider the Poisson process of parameter z. The number Nh of strings that have a given prefix h obeys a Poisson law of parameter phz, with ph the source dependent probability for a random string to start with the prefix h. Then, the probabilistic behavior of the tst that corresponds to the prefix h is described by a Poisson model of parameter {zph} with individual letter probabilities {pi | h}, with pi | h the conditional probability to have the prefix h followed by the letter ai. Applying theorem 1 locally and unwinding the recurrences (1) yield
Theorem 2   The comparison path length and the comparison cost of a random search in a ternary search trie made of n keys have expectations given by
E [L]
 
n
=2
 
h A
*
 
 
i<j
p
 
h i
p
 
h j
P
 
h
[i,j]2
[n P
 
h
[i,j] -1 + (1-P
 
h
[i,j])n],
E [R]
 
n
=2
 
h A
*
 
 
i<j
p
 
h i
p
 
h j
P
 
h
[i,j]
[1-(1-P
 
h
[i,j])n]
with Ph[i,j]=k=ij pk h and ph a=ph pa | h.
A noteworthy feature of this theorem is its independence from the source model since its derivation uses solely the independence of the digits processed. It can therefore be instantiated for the three models mentioned in section 1.

3.3   Asymptotic Analysis

We aim at finding asymptotic equivalents to the quantities of theorem 2. These quantities are harmonic sums amenable to a treatment with the Mellin transform [2].

The Mellin machinery applied to the formulae of theorem 2 requires evaluating the pi | j probabilities. This is done under two models: a memoryless (a.k.a. Bernoulli) source outputting infinite strings where the letter ai has probability to appear independently of past history; and a Markov one producing letters with an initial distribution and with transition probabilities pi | j. Singularity analysis on the Mellin transforms (combined with the so-called Dirichlet depoissonization) yields
Theorem 3   The comparison external path length and random search cost for a ternary search tree built on n keys produced by a source S, either memoryless (m) or Markovian (M), have averages that satisfy
E [L]
 
n
=
CS
HS
n log n +O(n)     and     E [R]
 
n
=
CS
HS
log n +O(1)
where the entropy HS and the quantity CS are source-dependent constants.

3.4   Comparative Studies

Theorems 3 and 2 quantify precisely the access costs for tst. The same analysis can be carried out for the list-trie variant, while the parameters describing standard array-trie stem from Knuth's books. The results are summarized in table 1---with Cm* and CM* constants known in closed forms---and show that the three structures have logarithmic costs and require linear space.

 
  array-trie (standard) list-trie bst-trie (tst)
 
Pointers r/H S n 2/H S n 3/H S n
Path length 1/H S nlog n C S*/H S nlog n C S/H S nlog n
Search 1/H S log n C S*/H S log n C S/H S log n

Table 1:


In order to assess the relevance of these theoretical analyses, a simulation campaign was undertaken on Herman Melville's novel, Moby Dick. Its conclusions are [1]:
Ternary search tries are an efficient data structure from the information theoretic point of view since a search costs typically about log n comparisons. List-tries require about 3 times as many comparisons. For an alphabet of cardinality 26, the storage cost of ternary search tries is about 9 times smaller than standard array-tries.

References

[1]
Clment (Julien), Flajolet (Philippe), and Valle (Brigitte). -- The analysis of hybrid trie structures. In Proceedings of the Ninth Annual ACM--SIAM Symposium on Discrete Algorithms. pp. 531--539. -- Philadelphia, 1998.

[2]
Flajolet (Philippe), Gourdon (Xavier), and Dumas (Philippe). -- Mellin transforms and asymptotics: harmonic sums. Theoretical Computer Science, vol. 144, n°1-2, 1995, pp. 3--58. -- Special volume on mathematical analysis of algorithms.

This document was translated from LATEX by HEVEA.