Patterns in Random Binary Search Trees

Patterns in Random Binary Search Trees

Philippe Flajolet

Algorithms Project, INRIA Rocquencourt

Algorithms Seminar

October 7, 1996

[summary by Michèle Soria]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

Abstract

In a randomly grown binary search tree (BST) of size n, any fixed pattern occurs with a frequency that is on average proportional to n. Deviations from the average case are highly unlikely and well quantified by a Gaussian law. Trees with forbidden patterns occur with an exponentially small probability that is characterized in terms of Bessel functions. The results obtained extend to BST's a type of property otherwise known for strings and combinatorial tree models. They apply to paged trees or to Quicksort with halting on short subfiles. As a consequence, various pointer saving strategies for maintaining trees obeying the random BST model can be precisely quantified. The methods used are based on analytic models, especially bivariate generating functions subjected to singularity perturbation asymptotics.

The binary search tree (BST) model is a model of randomness that applies to binary search trees and heap-ordered trees on random data, Quicksort, multidimensional trees, randomised search trees and treaps, as well as to syntax trees that occur naturally in software engineering. Basic properties of random BST have been studied in detail, for example it is well-known that the average value of depth nodes or height is asymptotically logarithmic in the size of the tree (see e.g. [7, 8, 9, 1, 2]). This talk, based on [4], is devoted to characteristics of the shape produced by the BST model.

1 Bivariate generating function

The bivariate generating function F(z,y) for the probability f_n,k that a random BST of size n has k occurrences of a given pattern u is defined by

F(z,y):=

n,k

f_n,ky^kxⁿ =

l(t) y

w[t]

z^|t|,

where l(t) is the probability that a given unlabelled tree t is the shape of a random BST of size |t| (l(t)=Õ_{v < t}1/|v|, where the product is over all subtrees v of t), and w[t]ºw_u[t] denotes the number of occurrences of pattern u as a subtree of the BST t.

The recurrence on w[t] is obvious: w[t]=[[t=u]]+w[t₀]+w[t₁] , where t₀,t₁ denote the left and right subtrees of t and where the bracket notation [[P]] is the indicator of P with value 1 if the predicate P is true and 0 otherwise. From this recurrence F(z,y) can be easily shown to satisfy a linear second order differential equation.

Lemma 1 The bivariate generating function F(z,y) satisfies the Riccati equation

¶ z

F(z,y)= F²(z,y)+(y-1)l(u)|u|z^|u|-1, F(0,y)=1 (1)

Using equation(1), the mean and variance of the number of occurrences of u are obtained by successive differentiation of F(z,y) with respect to y, upon setting y=1.

Theorem 1 [Moments of occurrences] The number W_n of occurrences of a pattern u in a random BST of size n has mean and variance which are asymptotically linear µ_n~ c₁(u) n, s_n~ c₂(u)n, where c₁(u) and c₂(u) are effectively expressed in terms of l(u) and |u|.

Linearization of the Riccati differential equation (1) gives, in this case, an explicit solution in terms of Bessel functions.

Lemma 2 The bivariate generating function of the number of occurrences of pattern u is

F(z,y)=-

w'(z,y)

w(z,y)

, w(z,y)=A_m(L z^m+1)-zB_m(L z^m+1),

with w'(z,y)=w'_z(z,y), L=|u|l(u)(1-y), m=|u|³2, and A_m,B_m are normalized Bessel functions of orders -1/(m+1) and 1/(m+1) respectively.

2 Asymptotic analysis

The asymptotic behaviour of the bivariate generating function F(z,y) is dictated by the singularities in the main variable z (which are poles since F(z,y) is the quotient of analytic functions), with the auxiliary variable y entering as a parameter. Different regions of values of the auxiliary variable provide different types of information: probability for excluded patterns (y=0) or rare occurrences (y» 0), limiting distribution for the number of patterns, central (y» 1) as well as local (|y|=1). The techniques of proof belong to the class of singularity analysis methods: it is well-known, after Cauchy's inversion formula that the location of polar singularities of a function f(z)=å f_nzⁿ drives the asymptotic form of its coefficients. For bivariate generating function F(z,y), we first perform one level of inversion, resulting in estimates of f_n(y)=[zⁿ] F(z,y), with an extra property of uniform error bounds for some values of y. One more inversion is required in order to recover the individual probabilities f_n,k. The most direct approach consists in applying Levy's continuity theorem for characteristic functions; this implies estimating f_n(y) for y=e^iq, but only q near 0 is required because of scaling. Thus, we have a perturbation of the univariate problem at y=1. When f_n(y) is a ``quasi-power'', meaning that it behaves very nearly like the powers of a fixed function, a Central Limit Theorem can be derived. A local limit law of the Gaussian type also holds when one can estimate globally f_n(y) by quasi-powers in larger regions, like |y|=1, and not merely locally near y=1. In that case, the recovery of f_n,k from f_n(y) is achieved by using a saddle-point method.

2.1 Trees with excluded patterns

Asymptotic analysis of univariate and bivariate generating functions derived from F(z,y) depends on locating the zeros of w(z,y) where y is a parameter. We thus define the function a_u(y) to be the root of smallest modulus of the Bessel type equation

A_m(La^m+1)-zB_m(La^m+1)=0, L=(1-y)|u|l(u). (2)

This definition specifies a_u(y) unambiguously, for |1-y| not too large (|1-y|<5/2 as is shown in lemma 3 of [4]). The probability that a random BST of size n does not contain the pattern u, is by definition [zⁿ]F(z,0). The following result comes from singularity analysis of meromorphic functions.

Theorem 2 [Excluded patterns] The probability e_u,n that a random BST of size n does not contain the pattern u satisfies

e_u,n=a_u(0)^-n-1(1+O(K^-n))

where K=K(u) is a constant strictly larger than 1, and a_u(0) is the smallest positive root of Equation (2) with y=0.

The same argument proves that, for small enough |y|, [zⁿ] F(z,y)~ a_u(y)^-n-1, with a uniform exponentially small error term. The probability that a random BST of size n has k occurrences of a pattern u is obtained by differentiating k times this asymptotic expansion.

Theorem 3 [Poisson law for rare occurrences] Given a fixed pattern u, for each fixed k, one has as n®¥:

{

w_u[t]=k||t|=n

}

=a_u(0)^-n-1·

(µ n)^k

æ
ç
ç
è

1+O(

)

ö
÷
÷
ø

, µ=-

a_u'(0)

a_u(0)

2.2 Gaussian limit laws

The application of Levy's continuity theorem that leads to a Central Limit Theorem, relies on evaluating f_n(y) when y=e^i
t/(n)^{^1/2}. By the implicit function theorem and the preparation theorem of Weierstrass (see e.g. [6]), there is a small complex neighbourhood of 1 such that the function a_u(y) is analytic and by the analysis of meromorphic functions, f_n(y) is closely approximated by a large power of a fixed function: f_n(y)=a_u(y)^-n-1(1+O(K^-n)), a situation that conduces to normal laws. The speed of convergence is bounded by means of the Berry-Esseen inequality (see e.g. [3]) that relates the distance between distribution functions and characteristic functions.

Theorem 4 [Central law for occurrences] Given a pattern u, the number of occurrences W_n of u in a random BST of size n obeys a central limit law with linear mean µ_n and variance s_n² (see theorem 1), and the speed of convergence is O(1/(n)^1/2):

sup

xÎ R

½
½
½
½
½
½

ì
í
î

W_n-µ_n

s_n

£ x

ü
ý
þ

(2p)^1/2

ó
õ

-¥

-w²/2

½
½
½
½
½
½

(n)^1/2

The obtention of a local limit law starts again with the quasi-powers form of f_n(y) and uses a saddle-point derivation.

Theorem 5 [Local law for occurrences] Given a pattern u, if |a_u(y)|¹1 for all y with |y|=1, y¹1, then the random variable W_n satisfies a local limit law:

sup

½
½
½
½

s_n Pr{W_n=ëµ_n+xs_nû} -

(2p)^1/2

-x²/2

½
½
½
½

® 0,

for x in any fixed compact subset of ]-¥,+¥[.

3 Other applications

Suitable adaptations of the technique also lead to a distributional analysis of paging and factored representation of BST as a DAG.

3.1 Paged trees

Given a tree t, its b-index is a tree that is constructed by retaining only those internal nodes of t which correspond to subtrees of size > b. Such an index is well-suited to ``paging'', where one has a two-level hierarchical memory structure: the index resides in main memory and the rest of the tree is kept in pages of capacity b on peripheral storage. Let i[t]=i_b[t] denote the size of the b-index of t. The analysis is then clearly equivalent to determining the total number of occurrences of all patterns of size £ b. The bivariate generating function G(z,y):=å_t l(t)y^i[t]z^|t| satisfies a Riccati equation

¶ z

G(z,y) =yG²(z,y)+(1-y)

æ
ç
ç
è

1-z^b+1

1-z

ö
÷
÷
ø

which is transformed by linearization. This equation has no explicit solution, but the use of Weierstrass Preparation theorem shows that G(z,y) has a unique dominant simple pole for y in a small neighbourhood of 1. Thus [zⁿ]G(z,y) reduces to quasi-powers, and a central limit law follows:

Theorem 6 [Paging distribution] For fixed b³2, the size I_n of the b-index constructed on a random BST of size n has mean µ_n and variance s_n² that satisfy

µ_n=

2(n+1)

b+2

-1, s_n²=

(b-1)b(b+1)

(b+2)²

(n+1).

The random variable I_n obeys a central limit law with speed of convergence O(1/(n)^1/2).

3.2 Factored representations of trees

We consider finally a global parameter k[t] of trees that represents the number of structurally different subtrees (i.e., number of different subtree shapes) that occur in t. This parameter is of intrinsic interest as an indicator of the structural ``richness'' of t. It also measures the optimal storage complexity of tree t when all common subtrees are factored and represented only once. Then, k[t] measures the number of nodes of the maximally factored DAG (directed acyclic graph) corresponding to t, a quantity that intervenes in parsing and data compression applications [5].

The quantity that is actually analysed is the size of a DAG representation which is partly redundant (all trees of size less than b are represented once, irrespective of their possible nonoccurrence in t) and partially factored (nodes commanding subtrees of size ³ b are each represented irrespective of the fact that they may be associated to repeated subtrees).

Theorem 7 The average value of the DAG size of a random BST of size n satisfies the upper bound,

K_n £4log 2

log n

æ
ç
ç
è

nloglog n

(log n)²

ö
÷
÷
ø

This upper bound on K_n is of the right order, since L. Devroye has shown a lower bound in O(n/log n) (unpublished paper, May 97); but the constant is still unknown.

References

[1]: Devroye (Luc). -- A note on the expected height of binary search trees. Journal of the ACM, vol. 33, n°3, 1986, pp. 489--498.
[2]: Devroye (Luc). -- Limit laws for local counters in random binary search trees. Random Structures and Algorithms, vol. 2, n°3, 1991, pp. 303--315.
[3]: Feller (William). -- An Introduction to Probability Theory and Its Applications. -- John Wiley & Sons, New York, 1971, 2nd edition, vol. II.
[4]: Flajolet (Philippe), Gourdon (Xavier), and Martínez (Conrado). -- Patterns in random binary search trees. -- Research Report n°2997, Institut National de Recherche en Informatique et en Automatique, October 1996. 23 pages. To appear in Random Structures & Algorithms.
[5]: Flajolet (Philippe), Sipala (Paolo), and Steyaert (Jean-Marc). -- Analytic variations on the common subexpression problem. In Paterson (M. S.) (editor), Automata, Languages and Programming, Lecture Notes in Computer Science, vol. 443, pp. 220--234. -- 1990. Proceedings of the 17th ICALP Conference, Coventry, July 1990.
[6]: Hille (Einar). -- Analytic function theory. Vol. 1. -- Ginn and Company, Boston, 1959, xi+308p. Introduction to Higher Mathematics.
[7]: Knuth (Donald E.). -- The Art of Computer Programming. -- Addison-Wesley, 1973, vol. 3: Sorting and Searching.
[8]: Mahmoud (Hosam M.). -- Evolution of Random Search Trees. -- John Wiley & Sons Inc., New York, 1992, Wiley-Interscience Series in Discrete Mathematics and Optimization, xii+324p.
[9]: Sedgewick (Robert) and Flajolet (Philippe). -- An Introduction to the Analysis of Algorithms. -- Addison-Wesley Publishing Company, 1996, 512p.

This document was translated from L^AT_EX by H^EV^EA.