On TreeGrowing Search Strategies
Hosam M. Mahmoud
Algorithms Project, INRIA Rocquencourt
and The George Washington
University
Algorithms Seminar
January 5, 1998
[summary by Michèle Soria]
A properly typeset version of this document is available in
postscript and in
pdf.
If some fonts do not look right on your screen, this might be
fixed by configuring your browser (see the documentation here).
Abstract
A search algorithm that adds a key to a sorted file can be represented
by a deterministic tree whose external nodes are equally likely
targets for insertion. The collection of algorithms that one uses
throughout the stages of insertion sort is called a search
strategy. Using the concept of ``treegrowing'' strategy, we
demonstrate that most practical algorithm have a normal behavior.
We present a sufficient condition for normality of treegrowing
strategies. The sufficient condition specifies a relationship between
the overall variance and the rate of growth in height of the sequence
of trees that the search strategy ``grows''.
Insertion sort is a wellknown online sorting algorithm: at each stage
of the sorting, the elements obtained so far make up a sorted array; when
reading a new element, the algorithm searches for its proper position in
the array and inserts it. The searching may be done by any method (linear,
binary, etc) and the methods may be different from one stage to another.
At each stage, given a searching
strategy, the positions of probes
(positions for comparisons between the new element
to be inserted and elements of the current array) is represented by a
binary decision tree. The first probe is the root of the
decision tree, the two positions of the second probe, at most one on each side
of the first probe, become the children of the root and all internal nodes
are constructed so forth. The leaves of
the tree correspond to the places where new elements are to be inserted. The
roottoleaf paths of the tree thus represent the possible probe sequences of
the searching algorithm.
1 Treegrowing and normal strategies
Let S_{i} denote the searching algorithm at stage i, with corresponding
decision tree T_{i}; the collection of searching algorithms
S = {S_{i}}_{i=1}^{¥}, or equivalently the collection of
corresponding decision trees T = {T_{i}}_{i=1}^{¥},
will be called
a search strategy. A search strategy is treegrowing if, for each
positive integer i, the shape of T_{i+1} is obtained by replacing a leaf
of T_{i} by an internal node (with two hanging leaves).
We analyse treegrowing search strategies under the assumption of uniform
distribution of the leaves at each stage. Let the random variable C_{n} be the
total number of times S compares a new element
to a probe during the sorting
of the first n elements. The class of normal search strategies is
composed of the strategies for which C_{n}, once normalized,
converges in distribution to the normal law, i.e.
C_{n}E[C_{n}] 

(Var[C_{n}])^{1/2} 

® 

N(0,1). 
The following lemma gives a sufficient condition for a treegrowing strategy
to be normal. This condition relates the variance of C_{n} to the height of the
decision trees of the search strategy. We denote by h_{n}
the height of tree T_{n},
and by s_{n}^{2} the variance of C_{n}.
Lemma 1
If h_{n}=o(s_{n}) then S is a normal strategy.
Proof.
C_{n} is the sum of n random variables (X_{i})_{i=1,...,n}, where X_{i}
denotes the number of comparisons made by S_{i}. Since each insertion is
performed independently of all others, we assume the X_{i}'s to be independent
random variables. The proof of Lemma 1 is a technical verification of
Lindeberg's condition, which ensures normality:
"e>0,





ó õ 

X_{i}^{2} dF 

= 0. 
Practically for a given strategy, the difficulty lies in computing the
variance of C_{n}, which is the sum of the variances of the X_{i}'s.
The most commonly used strategies, linear search and binary search, satisfy
the condition of Lemma 1. When linear search is used at every stage, it is easy
to show that C_{n} has average value asymptotic to n^{2}/4, and variance
asymptotic to n^{3}/36, whereas h_{n} equals n1. For binary repeated search
strategies, one can easily show that h_{n} is asymptotic to log n and
the average value of C_{n}
is equivalent to nlog_{2} n, but the computation of the variance is more
intricate and finally leads to s_{n}^{2}=n A(n) + O(log n), where A(n)
is an oscillating function of bounded magnitude.
There exists treegrowing search strategies which are not normal.
In [2], the
authors exhibit a strategy that does not satisfy the condition of Lemma 1,
and can be shown to be non normal by FellerLindeberg condition
(see [1, vol. 2, §XV.6]). To ensure normality, some further conditions,
which are presented in the next
section,
are required
on the decision trees of the search strategy.
2 Normality of consistent strategies
This section identifies a subclass of treegrowing strategies, the
consistent strategies, which are proved to be normal.
Let T = {T_{i}}_{i=1}^{¥} be the collection of
decision trees corresponding to a search strategy S. For
each T_{i}, we denote by T_{L}_{i} its left subtree (with size n_{L}_{i}) and T_{R}_{i} its right subtree (with size n_{R}_{i}). The size of the smaller subtree is noted by
g(i)=min(n_{L}_{i}, n_{R}_{i}) .
A search strategy S is said to be selfsimilar if
for each decision tree, its left and right subtrees belong to T, that is T_{L}_{i}=T_{n}_{L}_{i} and
T_{R}_{i}=T_{n}_{R}_{i} (where trees are considered as equal if they have
the same shape).
And S is said to be wellproportioned if the proportion
of nodes belonging to the smaller subtree approaches a limit as i
tends to infinity, that is lim_{i® ¥} g(i)/i exists.
Finally we call consistent a strategy which is treegrowing,
selfsimilar and well proportioned.
Many usual strategies, such as linear search (g(i)/i® 0) or
binary search (g(i)/i® 1/2), are
consistent.
Theorem 1
All consistent strategies are normal.
The proof of this theorem relies on three properties of search
strategies decision
trees that
ensure the sufficient
condition for normality stated in Lemma 1.
Property 1
For each positive integer i, the decision tree T_{i}
has at least one external node on
each unsaturated level.
Property 2
Let m_{k} be the number of decision trees with height k, the
sequence {m_{k}}_{k=1}^{¥} is nondecreasing in k.
For consistent strategies, these two properties result from
selfsimilarity and treegrowing of the decision trees.
Property 3
The variance of C_{n} satisfies s_{n}^{2}=W (n).
This property holds true for any decision tree, the intuition being
that the tree with smallest variance is the complete binary tree (all
levels saturated, except possibly the last one), which is the decision
tree associated with binary search.
The proof of Theorem 1 considers two cases, g(i)/i® 1/2 and
lim g(i)/iÎ [0, 1/2). In the first case the strategy is similar
to binary search, h_{n}=o(s_{n}) and the result follows from
Property 3.
In the second case Properties 1 and 2 are used to show that
h_{n}=o(s_{n}).
3 Relaxed conditions for normality
The sufficient condition for normality stated in Lemma 1 holds true
for some families of treegrowing search strategies that are not
consistent. For example, Fibonaccian search, where Fibonacci numbers
are used to indicate the next probe, is not consistent since lim
g(i)/i does not exist; but it can be shown to be normal with a proof
similar to the one of Theorem 1, since liminf_{n®¥} g(n)/n
as well as limsup_{n®¥} g(n)/n stand in (0,1/2).
More generally, one can exhibit different conditions on search
strategies, that lead to normality by showing that the heights of the
decision trees grow at a steady rate. For example
Proposition 1
Any treegrowing search strategy S for which
liminf_{n®¥} g(n)/n belongs to (0,1/2) is normal.
Proposition 2
If m_{k}=W (k^{1+e}) for some e >0, and
Var[X_{n}]= W (1), then the corresponding
treegrowing search
strategy is normal.
References
 [1]

Feller (William). 
An introduction to probability theory and its applications. 
John Wiley & Sons Inc., New York, 1971, second
edition, vol. II, xxiv+669p.
 [2]

Lent (Janice) and Mahmoud (Hosam M.). 
On treegrowing search strategies. The Annals of Applied
Probability, vol. 6, n°4, 1996, pp. 12841302.
This document was translated from L^{A}T_{E}X by
H^{E}V^{E}A.