Patterns in Random Binary Search Trees

Philippe Flajolet

Algorithms Project, INRIA Rocquencourt

Algorithms Seminar

October 7, 1996

[summary by Michèle Soria]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

Abstract
In a randomly grown binary search tree (BST) of size n, any fixed pattern occurs with a frequency that is on average proportional to n. Deviations from the average case are highly unlikely and well quantified by a Gaussian law. Trees with forbidden patterns occur with an exponentially small probability that is characterized in terms of Bessel functions. The results obtained extend to BST's a type of property otherwise known for strings and combinatorial tree models. They apply to paged trees or to Quicksort with halting on short subfiles. As a consequence, various pointer saving strategies for maintaining trees obeying the random BST model can be precisely quantified. The methods used are based on analytic models, especially bivariate generating functions subjected to singularity perturbation asymptotics.

The binary search tree (BST) model is a model of randomness that applies to binary search trees and heap-ordered trees on random data, Quicksort, multidimensional trees, randomised search trees and treaps, as well as to syntax trees that occur naturally in software engineering. Basic properties of random BST have been studied in detail, for example it is well-known that the average value of depth nodes or height is asymptotically logarithmic in the size of the tree (see e.g. [7, 8, 9, 1, 2]). This talk, based on [4], is devoted to characteristics of the shape produced by the BST model.

1   Bivariate generating function

The bivariate generating function F(z,y) for the probability fn,k that a random BST of size n has k occurrences of a given pattern u is defined by
F(z,y):=
 å n,k
fn,kykxn =
 å t
l(t) y
 w[t]
z|t|,
where l(t) is the probability that a given unlabelled tree t is the shape of a random BST of size |t| (l(t)=Õv < t1/|v|, where the product is over all subtrees v of t), and w[t]ºwu[t] denotes the number of occurrences of pattern u as a subtree of the BST t.

The recurrence on w[t] is obvious: w[t]=[[t=u]]+w[t0]+w[t1] , where t0,t1 denote the left and right subtrees of t and where the bracket notation [[P]] is the indicator of P with value 1 if the predicate P is true and 0 otherwise. From this recurrence F(z,y) can be easily shown to satisfy a linear second order differential equation.
Lemma 1   The bivariate generating function F(z,y) satisfies the Riccati equation
z
F(z,y)= F2(z,y)+(y-1)l(u)|u|z|u|-1,     F(0,y)=1     (1)
Using equation(1), the mean and variance of the number of occurrences of u are obtained by successive differentiation of F(z,y) with respect to y, upon setting y=1.
Theorem 1  [Moments of occurrences]   The number Wn of occurrences of a pattern u in a random BST of size n has mean and variance which are asymptotically linear µn~ c1(u) n, sn~ c2(u)n, where c1(u) and c2(u) are effectively expressed in terms of l(u) and |u|.
Linearization of the Riccati differential equation (1) gives, in this case, an explicit solution in terms of Bessel functions.
Lemma 2   The bivariate generating function of the number of occurrences of pattern u is
F(z,y)=-
w'(z,y)
w(z,y)
,     w(z,y)=Am(L zm+1)-zBm(L zm+1),
with w'(z,y)=w'z(z,y), L=|u|l(u)(1-y), m=|u|³2, and Am,Bm are normalized Bessel functions of orders -1/(m+1) and 1/(m+1) respectively.

2   Asymptotic analysis

The asymptotic behaviour of the bivariate generating function F(z,y) is dictated by the singularities in the main variable z (which are poles since F(z,y) is the quotient of analytic functions), with the auxiliary variable y entering as a parameter. Different regions of values of the auxiliary variable provide different types of information: probability for excluded patterns (y=0) or rare occurrences (y» 0), limiting distribution for the number of patterns, central (y» 1) as well as local (|y|=1). The techniques of proof belong to the class of singularity analysis methods: it is well-known, after Cauchy's inversion formula that the location of polar singularities of a function f(z)=å fnzn drives the asymptotic form of its coefficients. For bivariate generating function F(z,y), we first perform one level of inversion, resulting in estimates of fn(y)=[zn] F(z,y), with an extra property of uniform error bounds for some values of y. One more inversion is required in order to recover the individual probabilities fn,k. The most direct approach consists in applying Levy's continuity theorem for characteristic functions; this implies estimating fn(y) for y=eiq, but only q near 0 is required because of scaling. Thus, we have a perturbation of the univariate problem at y=1. When fn(y) is a ``quasi-power'', meaning that it behaves very nearly like the powers of a fixed function, a Central Limit Theorem can be derived. A local limit law of the Gaussian type also holds when one can estimate globally fn(y) by quasi-powers in larger regions, like |y|=1, and not merely locally near y=1. In that case, the recovery of fn,k from fn(y) is achieved by using a saddle-point method.

2.1   Trees with excluded patterns

Asymptotic analysis of univariate and bivariate generating functions derived from F(z,y) depends on locating the zeros of w(z,y) where y is a parameter. We thus define the function au(y) to be the root of smallest modulus of the Bessel type equation
Am(Lam+1)-zBm(Lam+1)=0,      L=(1-y)|u|l(u).     (2)
This definition specifies au(y) unambiguously, for |1-y| not too large (|1-y|<5/2 as is shown in lemma 3 of [4]). The probability that a random BST of size n does not contain the pattern u, is by definition [zn]F(z,0). The following result comes from singularity analysis of meromorphic functions.

Theorem 2  [Excluded patterns]   The probability eu,n that a random BST of size n does not contain the pattern u satisfies
eu,n=au(0)-n-1(1+O(K-n))
where K=K(u) is a constant strictly larger than 1, and au(0) is the smallest positive root of Equation (2) with y=0.
The same argument proves that, for small enough |y|, [zn] F(z,y)~ au(y)-n-1, with a uniform exponentially small error term. The probability that a random BST of size n has k occurrences of a pattern u is obtained by differentiating k times this asymptotic expansion.
Theorem 3  [Poisson law for rare occurrences]   Given a fixed pattern u, for each fixed k, one has as n®¥:
Pr { wu[t]=k||t|=n } =au(0)-n-1·
n)k
k!
æ
ç
ç
è
1+O(
1
n
) ö
÷
÷
ø
,     µ=-
au'(0)
au(0)
.

2.2   Gaussian limit laws

The application of Levy's continuity theorem that leads to a Central Limit Theorem, relies on evaluating fn(y) when y=ei t/(n)1/2. By the implicit function theorem and the preparation theorem of Weierstrass (see e.g. [6]), there is a small complex neighbourhood of 1 such that the function au(y) is analytic and by the analysis of meromorphic functions, fn(y) is closely approximated by a large power of a fixed function: fn(y)=au(y)-n-1(1+O(K-n)), a situation that conduces to normal laws. The speed of convergence is bounded by means of the Berry-Esseen inequality (see e.g. [3]) that relates the distance between distribution functions and characteristic functions.
Theorem 4  [Central law for occurrences]   Given a pattern u, the number of occurrences Wn of u in a random BST of size n obeys a central limit law with linear mean µn and variance sn2 (see theorem 1), and the speed of convergence is O(1/(n)1/2):
 sup xÎ R
½
½
½
½
½
½
Pr ì
í
î
Wnn
sn
£ x ü
ý
þ
-
1
(2p)1/2
ó
õ
 x -¥
e
 -w2/2
dw ½
½
½
½
½
½
<
1
(n)1/2
.
The obtention of a local limit law starts again with the quasi-powers form of fn(y) and uses a saddle-point derivation.
Theorem 5  [Local law for occurrences]   Given a pattern u, if |au(y)|¹1 for all y with |y|=1, y¹1, then the random variable Wn satisfies a local limit law:
 sup x
½
½
½
½
sn Pr{Wn=ëµn+xsnû} -
1
(2p)1/2
e
 -x2/2
½
½
½
½
® 0,
for x in any fixed compact subset of ]-¥,+¥[.

3   Other applications

Suitable adaptations of the technique also lead to a distributional analysis of paging and factored representation of BST as a DAG.

3.1   Paged trees

Given a tree t, its b-index is a tree that is constructed by retaining only those internal nodes of t which correspond to subtrees of size > b. Such an index is well-suited to ``paging'', where one has a two-level hierarchical memory structure: the index resides in main memory and the rest of the tree is kept in pages of capacity b on peripheral storage. Let i[t]=ib[t] denote the size of the b-index of t. The analysis is then clearly equivalent to determining the total number of occurrences of all patterns of size £ b. The bivariate generating function G(z,y):=åt l(t)yi[t]z|t| satisfies a Riccati equation
z
G(z,y) =yG2(z,y)+(1-y
d
dz
æ
ç
ç
è
1-zb+1
1-z
ö
÷
÷
ø
,
which is transformed by linearization. This equation has no explicit solution, but the use of Weierstrass Preparation theorem shows that G(z,y) has a unique dominant simple pole for y in a small neighbourhood of 1. Thus [zn]G(z,y) reduces to quasi-powers, and a central limit law follows:
Theorem 6  [Paging distribution]   For fixed b³2, the size In of the b-index constructed on a random BST of size n has mean µn and variance sn2 that satisfy
µn=
2(n+1)
b+2
-1,      sn2=
2
3
(b-1)b(b+1)
(b+2)2
(n+1).
The random variable In obeys a central limit law with speed of convergence O(1/(n)1/2).

3.2   Factored representations of trees

We consider finally a global parameter k[t] of trees that represents the number of structurally different subtrees (i.e., number of different subtree shapes) that occur in t. This parameter is of intrinsic interest as an indicator of the structural ``richness'' of t. It also measures the optimal storage complexity of tree t when all common subtrees are factored and represented only once. Then, k[t] measures the number of nodes of the maximally factored DAG (directed acyclic graph) corresponding to t, a quantity that intervenes in parsing and data compression applications [5].

The quantity that is actually analysed is the size of a DAG representation which is partly redundant (all trees of size less than b are represented once, irrespective of their possible nonoccurrence in t) and partially factored (nodes commanding subtrees of size ³ b are each represented irrespective of the fact that they may be associated to repeated subtrees).
Theorem 7   The average value of the DAG size of a random BST of size n satisfies the upper bound,
Kn £4log 2
n
log n
+O æ
ç
ç
è
nloglog n
(log n)2
ö
÷
÷
ø
.
This upper bound on Kn is of the right order, since L. Devroye has shown a lower bound in O(n/log n) (unpublished paper, May 97); but the constant is still unknown.

References

[1]
Devroye (Luc). -- A note on the expected height of binary search trees. Journal of the ACM, vol. 33, n°3, 1986, pp. 489--498.

[2]
Devroye (Luc). -- Limit laws for local counters in random binary search trees. Random Structures and Algorithms, vol. 2, n°3, 1991, pp. 303--315.

[3]
Feller (William). -- An Introduction to Probability Theory and Its Applications. -- John Wiley & Sons, New York, 1971, 2nd edition, vol. II.

[4]
Flajolet (Philippe), Gourdon (Xavier), and Martínez (Conrado). -- Patterns in random binary search trees. -- Research Report n°2997, Institut National de Recherche en Informatique et en Automatique, October 1996. 23 pages. To appear in Random Structures & Algorithms.

[5]
Flajolet (Philippe), Sipala (Paolo), and Steyaert (Jean-Marc). -- Analytic variations on the common subexpression problem. In Paterson (M. S.) (editor), Automata, Languages and Programming, Lecture Notes in Computer Science, vol. 443, pp. 220--234. -- 1990. Proceedings of the 17th ICALP Conference, Coventry, July 1990.

[6]
Hille (Einar). -- Analytic function theory. Vol. 1. -- Ginn and Company, Boston, 1959, xi+308p. Introduction to Higher Mathematics.

[7]
Knuth (Donald E.). -- The Art of Computer Programming. -- Addison-Wesley, 1973, vol. 3: Sorting and Searching.

[8]
Mahmoud (Hosam M.). -- Evolution of Random Search Trees. -- John Wiley & Sons Inc., New York, 1992, Wiley-Interscience Series in Discrete Mathematics and Optimization, xii+324p.

[9]
Sedgewick (Robert) and Flajolet (Philippe). -- An Introduction to the Analysis of Algorithms. -- Addison-Wesley Publishing Company, 1996, 512p.

This document was translated from LATEX by HEVEA.