Distributional Analysis of Recursive Algorithms by the Contraction Method

Ralph Neininger

University of Freiburg

Algorithms Seminar

November 22, 1999

[summary by Elchanan Mossel]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

## 1   Basic Algorithms

We consider records which belong to a k-dimensional region D = D1 × ... × Dk Ì Rk. A file is a finite subset F of D. Given a query q Î (D1 È {*}) × ... × (Dk È {*}), the objective is to find all the records r Î F such that ri = qi when qi ¹ *. The probabilistic assumption is that all the coordinates of the records and the queries (which are not *) are independent uniform random variables. For the discussion below, it is easy to see that this assumption may be replaced by a weaker assumption that all variables are independent with the same continuous distribution. The specification pattern consists of the configuration in {*,S}k of specified and unspecified variables.

There exist several comparison-based trees:
• Quadtrees. Each record x has 2k subtrees which correspond to all possible elements of {<,>}k. Thus (y0,y1,y2) will belong to the (<,>,<) subtree of (x0,x1,x2) if y0 < x0, y1 > x1, y2 < x2. See Figure 1.

Figure 1: A quadtree: the data partition the unit-cube recursively into quadrants; the quadtree corresponds to this partitioning.

• kD trees. Each record x at level l has two subtrees corresponding to xl mod k > yl mod k and xl mod k < yl mod k, respectively.

• Randomised kD trees. Each record x at level l has two children corresponding to xl(x) > yl(x) and xl(x) < yl(x) where l(x) are i.i.d. uniform variables in the range 0,...,k-1.

• Squarish kD trees. This is another version in which every node has two subtrees, but the coordinate with respect to which we split depends more strongly on the tree structure.
The basic quantity we are after is the limit law of Cn where Cn is the random number of nodes we traverse when finding all records which match the query. Here n is the size of F.

## 2   Quadtrees in Two Dimensions

We let W = (U,V) be the first key to be inserted and q = (Y,*) be the query. So U, V, and Y are uniform i.i.d. variables. We also let In be the vector of cardinalities for the subtrees of the root. We thus derive the following recursive distributional equation:
Cn = 1Y < U æ
ç
ç
è
C
 1 I1n
+ C
 2 I2n
ö
÷
÷
ø
+ 1Y > U æ
ç
ç
è
C
 3 I3n
+ C
 4 I4n
ö
÷
÷
ø
+ 1,
wherein the variables Y, U, V, and Cij are independent and all the Cij have the distribution of Ci. Given (U,V) the variable In is multi-monomial with parameters (U,V) and n.

By previous works [1, 2, 6] there are known constants a, b, and g for which
E [Cn] ~ g n
 a - 1
,     Var [Cn] ~ b n
 2 a - 2
.
Looking for a limit, we consider the variable: Xn = (Cn - E[Cn])/na-1.

In this way we obtain the equation:
Xn = 1Y<U æ
ç
ç
ç
ç
ç
è
æ
ç
ç
è
I1n
n
ö
÷
÷
ø
 a-1
æ
ç
ç
è
X
 1 I1n
+g ö
÷
÷
ø
ö
÷
÷
÷
÷
÷
ø
+ 1Y<U æ
ç
ç
ç
ç
ç
è
æ
ç
ç
è
I2n
n
ö
÷
÷
ø
 a-1
æ
ç
ç
è
X
 2 I2n
+g ö
÷
÷
ø
ö
÷
÷
÷
÷
÷
ø
+1Y>U æ
ç
ç
ç
ç
ç
è
æ
ç
ç
è
I3n
n
ö
÷
÷
ø
 a-1
æ
ç
ç
è
X
 3 I3n
+g ö
÷
÷
ø
ö
÷
÷
÷
÷
÷
ø
+ 1Y>U æ
ç
ç
ç
ç
ç
è
æ
ç
ç
è
I4n
n
ö
÷
÷
ø
 a-1
æ
ç
ç
è
X
 4 I4n
+g ö
÷
÷
ø
ö
÷
÷
÷
÷
÷
ø
-g+o(1).
(1)
By the law of large numbers we have
 In/n ® W = ( UV,U(1-V),(1-U)V,(1-U)(1-V) )
in probability. We thus obtain the following limiting equation:
X = 1Y < U W
 a-1 1
(X1 + g) + 1Y < U W
 a-1 2
(X2 + g)
+ 1Y > U W
 a-1 3
(X3 + g) + 1Y > U W
 a-1 4
(X4 + g) - g,
(2)
where the Xi are independent copies of X.

This suggests that we should consider the following operator on random variables Z:
T(Z) = 1Y < U W
 a-1 1
( Z1 + g ) + 1Y < U W
 a-1 2
( Z2 + g )
+ 1Y > U W
 a-1 3
( Z3 + g ) + 1Y > U W
 a-1 4
( Z4 + g ) - g,
where the Zi's are independent copies of Z.

We now work with the following metric (on the space of variables with zero mean and finite variance): l2(Z,Z') = inf (E[Z-Z']2)1/2 where the infimum is taken over all couplings of Z and Z'. It turns out that this space equipped with this metric is a Banach space. Moreover, using the representation of T one can see that T is a contraction on this space. It therefore follows that there exists a unique random variable Z which satisfies T(Z) = Z.

The main technical part of the proof is showing that we obtain the same limit if we work with the exact equations (1) instead of the approximate equations (2). This essentially uses the known estimates that E[Cn] = g na-1 (1 + o(1)). In this way we obtain the following theorem.
Theorem 1   Let Xn be the normalised number of traversed nodes and X the variable such that T(X) = X, then l2(Xn,X) ® 0.

## 3   Other Trees

In a similar manner one can prove the same kind of result for multidimensional quadtree. One of the differences is that in this case the variance Var[Cn] is not known beforehand. Instead, we guess that the right normalisation should be
Xn =
Cn - E[Cn]
n
 a-1
.
In this way we obtain again a limit law similar to the above: the limit X depends only on the number of *'s in the query. Given this limit law we can now compute a constant which depends only on the number of *'s in the query such that Var[Cn] = b n2a-2.

### 3.2   kD Trees

Vaguely speaking, the difference between quadtrees and kD trees, is that for kD trees different levels behave differently. Thus, in order to obtain a theorem similar to the above, a single recursion step should go k levels forward instead of just one. Doing that, we obtain a result similar to the above.

### 3.3   Randomised kD tree

The randomisation allows one to use one-level recursion, therefore obtaining a theorem and a proof similar to the case of quadtrees.

### 3.4   Squarish kD tree

It seems like the above methods do not work in this case. This is because the coordinate with respect to which we split depends on the structure of the tree and on the data stored in it.

## 4   Internal Path Length in Random Trees

In the previous sections we studied the cost of a query. In this section we consider the cost of building the tree which is nothing but the sum of depths of nodes in the tree. For the quadtrees we obtain the following recursive equation:
Yn =
 2d-1 å k=0
Y
 k Ikn
+ n.
The article [3] gives the expectation E[Yn]=(2/dnln n+udn+o(n), but the variance was not derived there. We guess the normalisation: Xn=(Yn-E[Yn])/n. We therefore obtain the equation:
Xn =
 2d-1 å i=0
Ikn
n
X
 k Ikn
+ Cn(In)
where
Cn(i0,...,i
 2d-1
) = 1 +
1
n
 2d-1 å i=0
E [Y
 ik
] - E[Yn].
Using the expectation formula we obtain:
Cn(i) = 1 +
2
d
 2d-1 å i=0
ik
n
ln
ik
n
+ o(1).
We now continue in the same route as before to obtain the l2 limit and the asymptotic variance.

## 5   Find Algorithm

We consider the following version of quicksort. We want to sort the values {1,...,n} which are given in a random uniform permutation. In order to perform the sort we pick a pivotal element p and continue sorting the elements larger than this element, and the elements smaller than this element. The way to pick p is by taking three independent uniform keys k1, k2, k3 and taking p to be their median. We thus obtain the following recursion equation:
Cn = 1
 Zn > Mn
C'
 Zn-1
+ 1
 Zn < Mn
C''
 n-Zn
+ n - 1
where Mn is uniform in {1,...,n} and Zn is a median of three uniform variables in {1,...,n}. We now continue in a similar way: it is known [4, 5] that E[Cn]=5n/2+O(ln n), we guess that the normalisation is: Yn=(Cn-E[Cn])/n to obtain a limit law. This limit law enables us to give asymptotic form for all the moments: E[Cnk] ~ mk nk where we have a closed formula for mk.

## References

[1]
Flajolet (Philippe), Gonnet (Gaston), Puech (Claude), and Robson (J. M.). -- The analysis of multidimensional searching in quad-trees. In Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, CA, 1991). pp. 100--109. -- ACM, New York, 1991.

[2]
Flajolet (Philippe), Gonnet (Gaston), Puech (Claude), and Robson (J. M.). -- Analytic variations on quadtrees. Algorithmica, vol. 10, n°6, 1993, pp. 473--500.

[3]
Flajolet (Philippe), Labelle (Gilbert), Laforest (Louise), and Salvy (Bruno). -- Hypergeometrics and the cost structure of quadtrees. Random Structures & Algorithms, vol. 7, n°2, 1995, pp. 117--144.

[4]
Kirschenhofer (P.), Prodinger (H.), and Martínez (C.). -- Analysis of Hoare's FIND algorithm with median-of-three partition. Random Structures & Algorithms, vol. 10, n°1-2, 1997, pp. 143--156. -- Average-case analysis of algorithms (Dagstuhl, 1995).

[5]
Kirschenhofer (Peter), Martínez (Conrado), and Prodinger (Helmut). -- Analysis of an optimized search algorithm for skip lists. Theoretical Computer Science, vol. 144, n°1-2, 1995, pp. 199--220. -- Special volume on mathematical analysis of algorithms.

[6]
Martínez (Conrado), Panholzer (Alois), and Prodinger (Helmut). -- On the number of descendants and ascendants in random search trees. Electronic Journal of Combinatorics, vol. 5, n°1, 1998, pp. Research Paper 20, 36 pp.

[7]
Neininger (Ralph). -- Asymptotic distributions for partial match queries in kD trees. -- 1999. Preprint.

[8]
Neininger (Ralph) and Rüschendorf (Ludger). -- On the internal path length of d-dimensional quad trees. Random Structures & Algorithms, vol. 15, n°1, 1999, pp. 25--41.

This document was translated from LATEX by HEVEA.