Distributional Analysis of Recursive Algorithms by the Contraction Method

Distributional Analysis of Recursive Algorithms by the Contraction Method

Ralph Neininger

University of Freiburg

Algorithms Seminar

November 22, 1999

[summary by Elchanan Mossel]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

1 Basic Algorithms

We consider records which belong to a k-dimensional region D = D₁ × ... × D_k Ì R^k. A file is a finite subset F of D. Given a query q Î (D₁ È {*}) × ... × (D_k È {*}), the objective is to find all the records r Î F such that r_i = q_i when q_i ¹ *. The probabilistic assumption is that all the coordinates of the records and the queries (which are not *) are independent uniform random variables. For the discussion below, it is easy to see that this assumption may be replaced by a weaker assumption that all variables are independent with the same continuous distribution. The specification pattern consists of the configuration in {*,S}^k of specified and unspecified variables.

There exist several comparison-based trees:

Quadtrees. Each record x has 2^k subtrees which correspond to all possible elements of {<,>}^k. Thus (y₀,y₁,y₂) will belong to the (<,>,<) subtree of (x₀,x₁,x₂) if y₀ < x₀, y₁ > x₁, y₂ < x₂. See Figure 1.

Figure 1: A quadtree: the data partition the unit-cube recursively into quadrants; the quadtree corresponds to this partitioning.
kD trees. Each record x at level l has two subtrees corresponding to x_l_mod_k > y_l_mod_k and x_l_mod_k < y_l_mod_k, respectively.
Randomised kD trees. Each record x at level l has two children corresponding to x_l(x) > y_l(x) and x_l(x) < y_l(x) where l(x) are i.i.d. uniform variables in the range 0,...,k-1.
Squarish kD trees. This is another version in which every node has two subtrees, but the coordinate with respect to which we split depends more strongly on the tree structure.

The basic quantity we are after is the limit law of C_n where C_n is the random number of nodes we traverse when finding all records which match the query. Here n is the size of F.

2 Quadtrees in Two Dimensions

We let W = (U,V) be the first key to be inserted and q = (Y,*) be the query. So U, V, and Y are uniform i.i.d. variables. We also let Iⁿ be the vector of cardinalities for the subtrees of the root. We thus derive the following recursive distributional equation:

C_n = 1_{Y < U}

æ
ç
ç
è

I₁ⁿ

+ C

I₂ⁿ

ö
÷
÷
ø

+ 1_{Y > U}

æ
ç
ç
è

I₃ⁿ

+ C

I₄ⁿ

ö
÷
÷
ø

+ 1,

wherein the variables Y, U, V, and C_i^j are independent and all the C_i^j have the distribution of C_i. Given (U,V) the variable Iⁿ is multi-monomial with parameters (U,V) and n.

By previous works [1, 2, 6] there are known constants a, b, and g for which

[C_n] ~ g n

a - 1

, Var

[C_n] ~ b n

2 a - 2

Looking for a limit, we consider the variable: X_n = (C_n - E[C_n])/n^a-1.

In this way we obtain the equation:

X_n = 1_Y<U

æ
ç
ç
ç
ç
ç
è

æ
ç
ç
è

I₁ⁿ

ö
÷
÷
ø

a-1

æ
ç
ç
è

I₁ⁿ

ö
÷
÷
ø

ö
÷
÷
÷
÷
÷
ø

+ 1_Y<U

æ
ç
ç
ç
ç
ç
è

æ
ç
ç
è

I₂ⁿ

ö
÷
÷
ø

a-1

æ
ç
ç
è

I₂ⁿ

ö
÷
÷
ø

ö
÷
÷
÷
÷
÷
ø

+1_Y>U

æ
ç
ç
ç
ç
ç
è

æ
ç
ç
è

I₃ⁿ

ö
÷
÷
ø

a-1

æ
ç
ç
è

I₃ⁿ

ö
÷
÷
ø

ö
÷
÷
÷
÷
÷
ø

+ 1_Y>U

æ
ç
ç
ç
ç
ç
è

æ
ç
ç
è

I₄ⁿ

ö
÷
÷
ø

a-1

æ
ç
ç
è

I₄ⁿ

ö
÷
÷
ø

ö
÷
÷
÷
÷
÷
ø

-g+o(1).

(1)

By the law of large numbers we have

Iⁿ/n ® W =

(

UV,U(1-V),(1-U)V,(1-U)(1-V)

)

in probability. We thus obtain the following limiting equation:

X = 1_{Y < U} W

a-1

(X¹ + g) + 1_{Y < U} W

a-1

(X² + g)

+ 1_{Y > U} W

a-1

(X³ + g) + 1_{Y > U} W

a-1

(X⁴ + g) - g,

(2)

where the Xⁱ are independent copies of X.

This suggests that we should consider the following operator on random variables Z:

T(Z) = 1_{Y < U} W

a-1

(

Z¹ + g

)

+ 1_{Y < U} W

a-1

(

Z² + g

)

+ 1_{Y > U} W

a-1

(

Z³ + g

)

+ 1_{Y > U} W

a-1

(

Z⁴ + g

)

- g,

where the Zⁱ's are independent copies of Z.

We now work with the following metric (on the space of variables with zero mean and finite variance): l₂(Z,Z') = inf (E[Z-Z']²)^1/2 where the infimum is taken over all couplings of Z and Z'. It turns out that this space equipped with this metric is a Banach space. Moreover, using the representation of T one can see that T is a contraction on this space. It therefore follows that there exists a unique random variable Z which satisfies T(Z) = Z.

The main technical part of the proof is showing that we obtain the same limit if we work with the exact equations (1) instead of the approximate equations (2). This essentially uses the known estimates that E[C_n] = g n^a-1 (1 + o(1)). In this way we obtain the following theorem.

Theorem 1 Let X_n be the normalised number of traversed nodes and X the variable such that T(X) = X, then l₂(X_n,X) ® 0.

3 Other Trees

3.1 Multidimensional quadtree

In a similar manner one can prove the same kind of result for multidimensional quadtree. One of the differences is that in this case the variance Var[C_n] is not known beforehand. Instead, we guess that the right normalisation should be

X_n =

C_n - E[C_n]

a-1

In this way we obtain again a limit law similar to the above: the limit X depends only on the number of *'s in the query. Given this limit law we can now compute a constant which depends only on the number of *'s in the query such that Var[C_n] = b n^2a-2.

3.2 kD Trees

Vaguely speaking, the difference between quadtrees and kD trees, is that for kD trees different levels behave differently. Thus, in order to obtain a theorem similar to the above, a single recursion step should go k levels forward instead of just one. Doing that, we obtain a result similar to the above.

3.3 Randomised kD tree

The randomisation allows one to use one-level recursion, therefore obtaining a theorem and a proof similar to the case of quadtrees.

3.4 Squarish kD tree

It seems like the above methods do not work in this case. This is because the coordinate with respect to which we split depends on the structure of the tree and on the data stored in it.

4 Internal Path Length in Random Trees

In the previous sections we studied the cost of a query. In this section we consider the cost of building the tree which is nothing but the sum of depths of nodes in the tree. For the quadtrees we obtain the following recursive equation:

Y_n =

2^d-1

k=0

I_kⁿ

+ n.

The article [3] gives the expectation E[Y_n]=(2/d) nln n+u_dn+o(n), but the variance was not derived there. We guess the normalisation: X_n=(Y_n-E[Y_n])/n. We therefore obtain the equation:

X_n =

2^d-1

i=0

I_kⁿ

+ C_n(Iⁿ)

where

C_n(i₀,...,i

2^d-1

) = 1 +

2^d-1

i=0

i_k

] - E[Y_n].

Using the expectation formula we obtain:

C_n(i) = 1 +

2^d-1

i=0

i_k

+ o(1).

We now continue in the same route as before to obtain the l₂ limit and the asymptotic variance.

5 Find Algorithm

We consider the following version of quicksort. We want to sort the values {1,...,n} which are given in a random uniform permutation. In order to perform the sort we pick a pivotal element p and continue sorting the elements larger than this element, and the elements smaller than this element. The way to pick p is by taking three independent uniform keys k₁, k₂, k₃ and taking p to be their median. We thus obtain the following recursion equation:

C_n = 1

Z_n > M_n

Z_n-1

+ 1

Z_n < M_n

C''

n-Z_n

+ n - 1

where M_n is uniform in {1,...,n} and Z_n is a median of three uniform variables in {1,...,n}. We now continue in a similar way: it is known [4, 5] that E[C_n]=5n/2+O(ln n), we guess that the normalisation is: Y_n=(C_n-E[C_n])/n to obtain a limit law. This limit law enables us to give asymptotic form for all the moments: E[C_n^k] ~ m_k n^k where we have a closed formula for m_k.

References

[1]: Flajolet (Philippe), Gonnet (Gaston), Puech (Claude), and Robson (J. M.). -- The analysis of multidimensional searching in quad-trees. In Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, CA, 1991). pp. 100--109. -- ACM, New York, 1991.
[2]: Flajolet (Philippe), Gonnet (Gaston), Puech (Claude), and Robson (J. M.). -- Analytic variations on quadtrees. Algorithmica, vol. 10, n°6, 1993, pp. 473--500.
[3]: Flajolet (Philippe), Labelle (Gilbert), Laforest (Louise), and Salvy (Bruno). -- Hypergeometrics and the cost structure of quadtrees. Random Structures & Algorithms, vol. 7, n°2, 1995, pp. 117--144.
[4]: Kirschenhofer (P.), Prodinger (H.), and Martínez (C.). -- Analysis of Hoare's FIND algorithm with median-of-three partition. Random Structures & Algorithms, vol. 10, n°1-2, 1997, pp. 143--156. -- Average-case analysis of algorithms (Dagstuhl, 1995).
[5]: Kirschenhofer (Peter), Martínez (Conrado), and Prodinger (Helmut). -- Analysis of an optimized search algorithm for skip lists. Theoretical Computer Science, vol. 144, n°1-2, 1995, pp. 199--220. -- Special volume on mathematical analysis of algorithms.
[6]: Martínez (Conrado), Panholzer (Alois), and Prodinger (Helmut). -- On the number of descendants and ascendants in random search trees. Electronic Journal of Combinatorics, vol. 5, n°1, 1998, pp. Research Paper 20, 36 pp.
[7]: Neininger (Ralph). -- Asymptotic distributions for partial match queries in kD trees. -- 1999. Preprint.
[8]: Neininger (Ralph) and Rüschendorf (Ludger). -- On the internal path length of d-dimensional quad trees. Random Structures & Algorithms, vol. 15, n°1, 1999, pp. 25--41.

This document was translated from L^AT_EX by H^EV^EA.