The Analysis of Multiple Quickselect

Helmut Prodinger

Technical University of Vienna

Algorithms Seminar

June 30, 1997

[summary by Philippe Flajolet]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

Abstract
Quickselect is a version of Quicksort that makes it possible to find efficiently any element in an unsorted file given its rank. ``Multiple Quickselect'' is designed to find simultaneously a collection of elements, also specified by their ranks. This talk shows how to analyse Multiple Quickselect when the underlying permutation is random and the collection of ranks is a random p-subset (p fixed). The analysis provides a nice illustration of the use of multivariate generating functions.



1   Algorithms

The principle of Quicksort is of course extremely well known. Given an array T[1... n] of numbers to be sorted, choose a ``pivot'', say T[1], then partition (by a linear scan) the original array into the two subarrays, T< and T>, formed by elements that are respectively smaller and larger than the pivot, and proceed recursively. The algorithm was first proposed by Hoare in 1962. Its characteristics are fairly well understood: the algorithms sorts in place (it uses O(1) auxiliary memory in addition to the recursion stack), its average number of comparisons is ~ 2nlog n, while the implementation constants are especially low. For these reasons, Quicksort is a method of choice amongst sorting algorithms, and it has been adopted as the basis of the Unix sort command. Note however that it is still unknown whether the limit distribution of the number of comparisons that has been proved to exist can be characterized in terms of standard special functions of analysis. (Existence proofs comprise the martingale argument of Rgnier, the moment method of Hennequin, and the contraction metrics approach of Rsler; see [4] and references therein.)

Quickselect is a simplified version of Quicksort adapted so as to locate the jth ranked element in a file. In that case, it suffices to effect one recursive descent in one of the two subfiles T<,T> created by the basic partitioning stage of Quicksort. Knuth's book [2] provides the analysis of the two parameters of interest, the average number of passes (recursive calls) P[n;j] and the average number of comparisons C[n;j],
P[n;j]=Hj+Hn+1-j-1,
C[n;j]=2(n+3+(n+1)Hn-(j+2)Hj-(n+3-j)Hn+1-j)
where Hn=1+1/2+...+1/n denotes the nth harmonic number. In particular, the mean number of comparisons is O(n), uniformly for all j. From the complexity standpoint, the algorithm thus has the same type of cost as a fixed number of scans of the input file---a highly appealing feature. (Contrast the simplicity of Quickselect with algorithms like median-finding that are constructed specifically to achieve linear complexity in the worst-case!)

Multiple Quickselect is an extension of Hoare's idea that works as follows: Assume that we simultaneously search for p elements with ranks J=(j1,j2,...,jp) where 1 j1<j2<...<jp n is a fixed set of p values. Then, depending on the interval determined by j1,...,jp in which the pivot element falls, we continue recursively on both subfiles T<,T>, but with smaller J-index sets. The principle is quite simple and a nice description of the algorithm can be found in [3]. Notice that Multiple Quickselect can be used for instance to determine quantiles of distributions efficiently.

In a previous work, Prodinger [7] has given explicit formul for both the average number of passes P[n;j1,...,jp] and the average number of comparisons, C[n;j1,...,jp]. As a consequence, the so-called ``grand averages'',
Pn,p=
1

n
p

 
1 j1<j2<...<jp n
P[n;j1,...,jp],
Cn,p=
1

n
p

 
1 j1<j2<...<jp n
C[n;j1,...,jp]
have been determined (see also [3]). However, the approach used in [7] was a mixture of guessing and proofs by induction.

The present paper by Panholzer and Prodinger [6] develops a direct generating function approach to the analysis of the grand averages. This gives access to variances that were out of reach with the old method.

2   Moments

In both instances, namely number of passes and of comparisons, the problem is modelled by a trivariate generating function F(z,u,v), where the coefficient n! [znupvm]F is the number of permutations of {1,2,...,n} and of subsets of p elements of {1,2,...,n} such that the parameter of interest has value m. Under the random permutation model, the splitting probabilities of Quicksort and Quickselect are given by
Pr{K=k}=
1
n
,
where the random variable K[1,n] denotes the rank of the pivot. The model is thus isomorphic to that of binary search trees (BST's) or heap-ordered trees [2, 4].

Number of passes.
The trivariate generating function (GF) where z records the size n of the input permutation, v records the number of passes (recursive calls), and u records the number of elements being simultaneously selected satisfies
F'(z,u,v)=v(1+u)F2(z,u,v)+
1-v
(1-z)2
    (1)
with F(0,u,v)=1 and F' denotes a partial derivative with respect to z. The relation (1) reads off almost directly from the problem. We may also view it as expressing the size of a generalized common ancestor tree in binary search trees.

As is usual in distributional analyses of additive parameters on BST's, we have a Riccati equation; see for instance, the analysis of patterns in BST's in [1]. Such equations are systematically linearized by a change of variable of the form F(z)=a(z)W'(z)/W(z). Here,
F(z,u,v)=
W+1-2v+ (1-z )
W
 
(W- 1+2v )
(W+1-2v(1+u)+ (1-z )
W
 
(W- 1+ 2v(1+u) ) ) (1-z )
    (2)
W=(1-4v (1+u ) (1-v ))1/2.
(This is well within the automatic capabilities of Maple.) By differentiating with respect to v and expanding, we obtain a result of [7].

Theorem 1   For p1 the average number Pn,p of passes employed by Multiple Quickselect in order to search for p random elements is
P n,p= ( Hn+1-Hp )
2p(n+1)2
(n+2-p)(n+1-p)
-
n(2p-1)+p
n+2-p
.     (3)
A particular case is the basic Quickselect algorithm (p=1), for which
Pn,1=2


1+
1
n



Hn-3.

By taking second derivatives, we have access to the variance. A careful expansion guided by educated guesses and patience then gives explicit expressions. We state the result from [6] as it is a good indication of the intricacies involved.

Theorem 2   The variance of the number of passes when searching for a random set of p2 elements with Multiple Quickselect is given by
Vn,p
=-
4p(n+1)2Y1
(n+4-p)(n+3-p)(n+2-p)2(n+1-p)2
(Hn+1 -Hp)2
 
+
2(n+1)2Y2
(n+4-p)(n+3-p)(n+2-p)2(n+1-p)
(Hn+1 -Hp)
 
-
4p(n+2)(n+1)2(pn+p+2)
(n+4-p)(n+3-p)(n+2-p)(n+1-p)
(Hn+1(2) -Hp(2))
 
+
2(n+1)2Y3
(n+4-p)(n+3-p)(n+2-p)2
Y1 =(3p-2)n3-2(p2-9p+5)n2-(p3+5p2-33p+16)n-p3-5p2+20p-8
Y2 =pn3+(11p2-13p+8)n2-(9p3-54p2+62p-32)n-3p4-p3+38p2-56p+32
Y3 =(2p-1)n2-2(5p2-15p+8)n+8p3-45p2+72p-32.

For p=1, the variance simplifies to
Vn,1=-4
n+1
n2
Hn+12+
2(n+4)(n+1)
n2
Hn+1 -4
n+1
n
Hn+1(2)+
2(2n2-n-2)
n2
.

Number of comparisons.
The process for moment calculations is similar, only the expressions become a little more complicated. The fundamental functional equation is now of the difference-differential type,
F'(z,u,v)=(1+u)F2(zv,u,v)+
1
(1-z)2
-
1
(1-zv)2
,
with F(0,u,v)=1. Contrary to the previous case, this equation is not known to have a closed form solution. It is however still possible to ``pump'' GF's of moments by differentiation. We must omit details here and simply state the average-case result while referring to [6] for a daunting variance computation that involves the dilogarithm Li2(z):= zn/n2.

Theorem 3   For p1 the average number Cn,p of comparisons to search for p random elements with Multiple Quickselect is given by
Cn,p =-
2p(n+1)(4n-p+5)
(n+2-p)(n+1-p)
Hn+1 +
2(n+1)2(n+2p+2)
(n+2-p)(n+1-p)
Hp +
n2+(5p-1)n+3p
n+2-p
.
In particular for a basic (single element) Quickselect and a full sort, we get back the classical results,
Cn,1=3n-8


1+
1
n



Hn+13,      Cn,n=2(n+1)Hn-4n.

3   Distributions

The limiting distribution of Quicksort has been under attack for about 20 years. For basic Quickselect, the problem is in a way simpler since the recursion structure is a linear one. Here is what is known at present: The second result suggests that there is an interesting class of limit distributions for comparison costs when p is a fixed integer p 2.

References

[1]
Flajolet (Philippe), Gourdon (Xavier), and Martnez (Conrado). -- Patterns in random binary search trees. -- Research Report n°2997, Institut National de Recherche en Informatique et en Automatique, October 1996. 23 pages. To appear in Random Structures & Algorithms.

[2]
Knuth (Donald E.). -- The Art of Computer Programming. -- Addison-Wesley, 1973, vol. 3: Sorting and Searching.

[3]
Lent (Janice) and Mahmoud (Hosam M.). -- Average-case analysis of multiple Quickselect: an algorithm for finding order statistics. Statistics & Probability Letters, vol. 28, n°4, 1996, pp. 299--310.

[4]
Mahmoud (Hosam M.). -- Evolution of Random Search Trees. -- John Wiley & Sons Inc., New York, 1992, Wiley-Interscience Series in Discrete Mathematics and Optimization, xii+324p.

[5]
Mahmoud (Hosam M.), Modarres (Reza), and Smythe (Robert T.). -- Analysis of QUICKSELECT: an algorithm for order statistics. RAIRO Theoretical Informatics and Applications, vol. 29, n°4, 1995, pp. 255--276.

[6]
Panholzer (Alois) and Prodinger (Helmut). -- A generating functions approach for the analysis of grand averages for multiple quickselect. -- Preprint, 1997.

[7]
Prodinger (H.). -- Multiple quickselect --- Hoare's find algorithm for several elements. Information Processing Letters, vol. 56, 1995, pp. 123--129.

This document was translated from LATEX by HEVEA.