Patterns in Random Binary Search Trees
Algorithms Project, INRIA Rocquencourt
October 7, 1996
[summary by Michèle Soria]
A properly typeset version of this document is available in postscript
and in pdf
If some fonts do not look right on your screen, this might be
fixed by configuring your browser (see the documentation here).
In a randomly grown binary search tree (BST) of size n, any fixed
pattern occurs with a frequency that is
on average proportional to n.
Deviations from the average case are highly unlikely and
well quantified by a Gaussian law. Trees with forbidden patterns
occur with an exponentially small probability
that is characterized in terms of Bessel functions.
The results obtained
extend to BST's a type of
property otherwise known for strings and combinatorial
tree models. They apply to paged
trees or to Quicksort with halting on short subfiles.
As a consequence, various pointer saving strategies for maintaining
trees obeying the random BST model can be precisely quantified.
The methods used are based on analytic models,
especially bivariate generating functions subjected to
singularity perturbation asymptotics.
The binary search tree (BST) model is a model of randomness that applies to
binary search trees and heap-ordered trees on random data, Quicksort,
multidimensional trees, randomised search trees and treaps,
as well as to syntax trees
that occur naturally in software engineering. Basic properties of
random BST have been studied in detail, for example it is well-known
that the average value of depth nodes or height is asymptotically
logarithmic in the size of the tree (see
e.g. [7, 8, 9, 1, 2]).
This talk, based on , is devoted to characteristics
of the shape produced by the BST model.
1 Bivariate generating function
The bivariate generating function F(z,y) for the probability fn,k
that a random BST of size n has k
occurrences of a given pattern u is defined by
where l(t) is the probability that
a given unlabelled tree t is the shape of a random BST of
size |t| (l(t)=Õv < t1/|v|,
where the product is over all subtrees v of t), and
w[t]ºwu[t] denotes the number of occurrences
of pattern u as a subtree of the BST t.
|| l(t) y
The recurrence on w[t] is obvious:
where t0,t1 denote the left and right subtrees of t and where the
bracket notation [[P]]
is the indicator of P with
value 1 if the predicate P is true and 0 otherwise.
From this recurrence F(z,y) can be easily shown to satisfy
a linear second order differential equation.
The bivariate generating function F(z,y) satisfies
the Riccati equation
Using equation(1), the mean and variance
of the number of occurrences of u
are obtained by successive differentiation of F(z,y)
with respect to y, upon setting y=1.
Theorem 1 [Moments of occurrences]
The number Wn of occurrences of a pattern u
in a random BST
of size n has mean and variance which are asymptotically linear
µn~ c1(u) n,
where c1(u) and c2(u) are effectively expressed in terms of
l(u) and |u|.
Linearization of the Riccati differential
equation (1) gives, in this case, an explicit solution
in terms of Bessel functions.
The bivariate generating function of the number of occurrences of
pattern u is
L=|u|l(u)(1-y), m=|u|³2, and Am,Bm
are normalized Bessel functions of orders -1/(m+1) and
w(z,y)=Am(L zm+1)-zBm(L zm+1),
2 Asymptotic analysis
The asymptotic behaviour of the bivariate generating function
F(z,y) is dictated by the
singularities in the main variable z (which are poles since F(z,y) is the
quotient of analytic functions),
with the auxiliary variable y
entering as a parameter. Different regions of values of the auxiliary variable
provide different types of information: probability for excluded
patterns (y=0) or rare occurrences (y» 0), limiting
distribution for the number of patterns, central (y» 1)
as well as local (|y|=1). The techniques of proof belong to the class
of singularity analysis methods: it is well-known, after Cauchy's inversion
the location of polar singularities of a function f(z)=å fnzn
drives the asymptotic form of
its coefficients. For bivariate generating function F(z,y),
we first perform one level of inversion, resulting in
fn(y)=[zn] F(z,y), with an extra property of uniform error bounds
for some values of y. One more inversion is required in order to
individual probabilities fn,k.
The most direct approach consists in applying Levy's continuity
theorem for characteristic functions; this implies estimating fn(y)
for y=eiq, but only q near 0
is required because of scaling.
Thus, we have a perturbation of the univariate problem at
When fn(y) is a ``quasi-power'', meaning that it
behaves very nearly like the powers of a fixed function, a Central
Limit Theorem can be derived. A local limit law of the Gaussian type
also holds when one can estimate
globally fn(y) by quasi-powers in larger
regions, like |y|=1, and not merely locally near y=1.
In that case, the recovery of fn,k from
fn(y) is achieved by using a saddle-point method.
2.1 Trees with excluded patterns
Asymptotic analysis of univariate and bivariate generating functions
derived from F(z,y) depends on locating the zeros of
w(z,y) where y is a parameter.
We thus define the function au(y) to be
the root of smallest modulus of the Bessel type equation
This definition specifies au(y) unambiguously,
for |1-y| not too large (|1-y|<5/2
as is shown in lemma 3 of ).
The probability that a random BST of size n does not contain
the pattern u, is by definition [zn]F(z,0). The following result
comes from singularity analysis of meromorphic
Theorem 2 [Excluded patterns]
The probability eu,n that a random BST of size n does not contain
the pattern u satisfies
The same argument proves that, for small enough |y|,
[zn] F(z,y)~ au(y)-n-1,
with a uniform exponentially small error term.
The probability that a random BST of size n has k occurrences
of a pattern u is obtained by differentiating k times this
where K=K(u) is a constant strictly larger than 1, and au(0)
is the smallest positive root of Equation (2)
Theorem 3 [Poisson law for rare occurrences]
Given a fixed pattern u, for each
one has as n®¥:
2.2 Gaussian limit laws
The application of Levy's continuity theorem that leads to a Central
Limit Theorem, relies on evaluating fn(y) when y=ei
implicit function theorem
and the preparation theorem of Weierstrass
(see e.g. ), there is
a small complex neighbourhood of 1 such that the function
au(y) is analytic and by the analysis of meromorphic
functions, fn(y) is closely approximated
by a large power of a fixed function:
fn(y)=au(y)-n-1(1+O(K-n)), a situation that
conduces to normal laws. The speed of convergence is bounded
by means of the Berry-Esseen
inequality (see e.g. )
that relates the distance between
distribution functions and characteristic functions.
Theorem 4 [Central law for occurrences]
Given a pattern u, the number of occurrences
Wn of u in a random BST of size n obeys a central
limit law with linear mean µn and variance sn2
(see theorem 1),
and the speed of
convergence is O(1/(n)1/2):
The obtention of a local limit law starts again with the quasi-powers
form of fn(y) and uses a saddle-point derivation.
|Theorem 5 [Local law for occurrences]
Given a pattern u, if |au(y)|¹1 for all y with |y|=1,
y¹1, then the random variable Wn satisfies a local limit law:
for x in any fixed compact
subset of ]-¥,+¥[.
3 Other applications
Suitable adaptations of the technique also lead to
a distributional analysis of paging and factored representation of
BST as a DAG.
3.1 Paged trees
Given a tree t,
its b-index is a tree that is constructed
by retaining only those internal nodes of t
which correspond to subtrees of size > b.
Such an index is well-suited to ``paging'',
where one has a two-level hierarchical memory
structure: the index resides in main memory
and the rest of the tree is kept in pages
of capacity b on peripheral storage. Let
i[t]=ib[t] denote the size of the b-index of t. The analysis
is then clearly equivalent to determining
the total number of occurrences of all patterns of size £ b.
The bivariate generating function
satisfies a Riccati equation
which is transformed by linearization. This equation has no explicit
solution, but the use of Weierstrass Preparation theorem shows that
G(z,y) has a unique dominant simple pole for y in a small
neighbourhood of 1. Thus [zn]G(z,y) reduces to quasi-powers, and
a central limit law follows:
|Theorem 6 [Paging distribution]
For fixed b³2, the size In
of the b-index constructed on
a random BST of size n has mean µn
and variance sn2 that satisfy
The random variable In obeys a central limit law
with speed of convergence O(1/(n)1/2).
3.2 Factored representations of trees
We consider finally a global parameter k[t]
of trees that represents the number of structurally different subtrees
(i.e., number of different subtree shapes)
that occur in t. This parameter
is of intrinsic interest
as an indicator of
the structural ``richness'' of t.
It also measures the optimal storage complexity of tree t
when all common subtrees are factored and represented only once.
Then, k[t] measures the number of nodes
of the maximally factored DAG (directed acyclic graph)
corresponding to t,
a quantity that intervenes in parsing and data compression
The quantity that is actually analysed is the size
of a DAG representation which is partly redundant
(all trees of size less than b
are represented once, irrespective of their possible nonoccurrence
in t) and partially factored (nodes commanding subtrees of size ³ b are
each represented irrespective of the fact that they may be associated
to repeated subtrees).
The average value of the DAG size of a random BST
of size n satisfies the upper bound,
This upper bound on Kn is of the right order, since L. Devroye has
shown a lower bound in O(n/log n) (unpublished
paper, May 97); but the constant is still unknown.
|Kn £4log 2
Devroye (Luc). --
A note on the expected height of binary search trees. Journal of
the ACM, vol. 33, n°3, 1986, pp. 489--498.
Devroye (Luc). --
Limit laws for local counters in random binary search trees. Random Structures and Algorithms, vol. 2, n°3, 1991,
Feller (William). --
An Introduction to Probability Theory and Its Applications. --
John Wiley & Sons, New York, 1971, 2nd edition,
Flajolet (Philippe), Gourdon (Xavier), and Martínez (Conrado). --
Patterns in random binary search trees. --
Research Report n°2997, Institut National de Recherche en
Informatique et en Automatique, October 1996. 23 pages. To
appear in Random Structures & Algorithms.
Flajolet (Philippe), Sipala (Paolo), and Steyaert (Jean-Marc). --
Analytic variations on the common subexpression problem. In Paterson
(M. S.) (editor), Automata, Languages and Programming, Lecture
Notes in Computer Science, vol. 443, pp. 220--234. --
1990. Proceedings of the 17th ICALP Conference,
Coventry, July 1990.
Hille (Einar). --
Analytic function theory. Vol. 1. --
Ginn and Company, Boston, 1959, xi+308p. Introduction
to Higher Mathematics.
Knuth (Donald E.). --
The Art of Computer Programming. --
Addison-Wesley, 1973, vol. 3: Sorting and Searching.
Mahmoud (Hosam M.). --
Evolution of Random Search Trees. --
John Wiley & Sons Inc., New York, 1992, Wiley-Interscience Series in Discrete Mathematics and Optimization,
Sedgewick (Robert) and Flajolet (Philippe). --
An Introduction to the Analysis of Algorithms. --
Addison-Wesley Publishing Company, 1996, 512p.
This document was translated from LATEX by