Research News, May 1999


Previous News Next News

Return to Research News Index


These pages are edited by Philippe.Flajolet@inria.fr.

The Barcelona Meeting on Average-Case Analysis of Algorithms


In June, Barcelona hosts the Fifth Seminar on Analysis of Algorithms. The programme features some 35 talks or so. Informations about the place, the schedule, and the abstracts of talks are given under the web site nicely maintained by Conrado Martinez who is the chief organizer for 1999.

In 1999, a number of original results should be presented. They deal to a large extent with some of the most basic algorithms--the ones that do deserve deep investigation. Examples are hashing, sorting, retrieval of simple or multidimensional information, algorithms from computational number theory, random generation, data compression, circuits, to name a few. Naturally, these algorithms often lead to what one might term ``combinatorial processes'' that are close in spirit to stochastic processes, safe that they have a strong algorithmic and combinatorial content. Sometimes the new combinatorial processes that arise (think of digital tries or binary search trees) are of considerable interest for their own sake; they are often found to intersect other models from classical probability theory (e.g., the connection between simple families of trees and branching processes) and combinatorics (e.g., binary search trees, heap-ordered trees, recursive trees against permutations); some of them even show up somewhat unexpectedly in areas like statistical physics, dynamical systems theory, or other exotic fields.

Sorts of sorts.

The opening talk will be given by Bob Sedgewick. I know a little bit what to expect as Bob was giving a similar lecture at the Polytechnique in Paris on May 26. Quicksort again! What happens is that a new general purpose sort had to be implemented for Unix systems in the 1990's. As we know from reading the Bell Technical Journal, Quicksort and Unices had engaged in a long flirtation. Well, some special files would still take quadratic time in reality not just on paper! A few years ago, Bentley engaged in redesigning a truly universal sort that would cope equally well with repeated keys and with distinct keys. It appears that, when suitably implemented, Quicksort can at the same time exhibit its familiar (O(n log(n) ))behaviour on distinct key while it agreably switches to linear time complexity (O(n)) on small key universes. The talk by Sedgewick should illustrate elegant algorithmic ideas together with the theoretical questions that they pose. An interesting question is: how does the new brand of Quicksort behave on small data universes? How close is it to various information-theoretic optima?

Quicksort with equal keys also belongs to a circle of ideas that lead to the Ternary Search Trie (TST) structure proposed by Bentley and Sedgewick at the SODA Conference in 1997. It is striking that TST's overperform hashing in OCR dictionary applications typically by a factor of 5. They have also been tested by Julien Clément on large text corpuses, like the novel Moby Dick (!!), where they gain quite a bit in terms of either time or storage requirements and Julien has designed an efficient spell checker based on them. The paper by Clément, Flajolet, and Vallée will present some of the corresponding analyses cast into a framework, based on fundamental intervals and modelling by transfer operators, that gives access to a number of parameters of trie structures and to a number of source models that include Bernoulli and Markov models as well as sources with unbounded correlations like continued frcations. Through this an earlier studies, Quicksort is once more confirmed as a method of choice for general purpose sorting.

For files that are not too large, common wisdom is that a good easy-to-programme choice is shellsort. Knuth's volume 3 of TAOCP discusses shellsort based on various ``increment sequences''. The worst case and the average case each pose challenging problems. The average-case analysis of (h,k,1)-shellsort (three increments) was done by Yao in a paper of 19?? that still deserves to be fully exploited I think. Now, what's under attack is the distribution. Bob Smythe will precisely present a distributional analysis of (h,1)-shellsort, or what amounts to the same thing, of the number of inversions in h-sorted permutations. Furthermore, there are interesting connections with area parameters of Brownian motion that surface again below in seemingly unrelated contexts.

Trees, and trees, and trees . . .

The binary search tree (BST) model is one of those ``new'' models generated by theoretical computer science. We know that there is complete probabilistic equivalence (under permutation models) between basic heap-ordered trees, binary search trees, so-called ``recursive'' trees, and the tree of recursive calls in quicksort executions. The results obtained in this area are thus applicable to comparison-based sorting, search strategies, and the selection problem (find the median, find quartiles, etc).

The height of a randomly built BST is known to be asymptotic to sqrt(n). The variance is O(1) as has been very recently announced in Spring 1999 by Devroye & Reed. Tsukiji had shown earlier that Devroye's original methods of 1986 designed to analyse height of a BST extend to random circuits. This year, he will present refined analyses of the depth of nodes and the number of output gates in such circuits.

Various specialized models are also of interest. Mahmoud and Pittel initiated the study of d-ary search trees. Chen will consider recurrences that arise from balanced versions of d-ary heaps and prove that the cost of a construction by means of Floyd's algorithm is asymptotically normal. Trees built with local reorganization rules are also of great interest as one easily approaches perfect balancing. The discussion is then related to median quicksort and sample-sort. Hofri will show how to come up with a very fast approximate median finding algorithm while Rösler will provide an overview of FIND methods and the new light that the contraction method on function spaces brings to selction algorithms. In this range of problems, expectations lead to probabilistic divide-and-conquer recurrences that are full-history recurrences; ``master theorems'' to be exposed by Martinez then provide a direct and simple access to first order asymptotics. This approach constitutes the average-case analysis counterpart of the (worst-case) divide-and-conquer paradigm that we all teach to our students.

Multidimensional data.

Basic search trees have been known for about 25 years to lead to data structures that can cope with multidimensional data. The subject itself is sometimes close to computational geometry or to data bases. See for instance Samet's books on spatial data structures. We have seen in the past approaches by probabilistic methods and by singularity analysis of differential systems. This year Devroye will show a general approach that takes care of orthogonal and convex range search, nearest neighbours and so on--presumably using probabilistic methods. Rueschendorf and Neininger will show the existence of limit laws for path length and the cost of partial match queries in multidimensional search trees; both studies relies on the contraction method applied in suitable probability spaces.

Operating with operators.

Richard Brent proposed in 1976 a heuristic analysis of the binary GCD algorithm. This algorithm operates by repeated substractions and elimination of powers of two, so that it is normally the fastest one in practice. Richard's heuristics were put on solid ground by Brigitte Vallée in 1998 and a first complete analysis was offered by her. Richard will present us with a panaroma of twenty years of research on the binary GCD algorithm. Brigitte has now assembled the methods that are instrumental in attacking GCD algorithms and shaped them into a coherent body of knowledge. It is interesting that the theory appeals to transfer operators on function spaces, a ``highjack'' of methods first introduced by Ruelle in the 1970's in connection with the thermodynamic formalism. Operator methods also turn out to be adequate for an increasingly wide class of information-theoretic problems, as shown by Brigitte and her collaborators: refer to the earlier discussion of TST's.

Hashing.

This is an orbit of problems where analysis is guaranteed to provide correct predictions of performances since we are precisely dealing with a class of randomized algorithms. We've already heard a bit about recent developments regarding linear probing by Poblete, Viola, Flajolet. Knuth even returned to the subject after 35 years (see Algorithmica, 1998) and he discovered deep connections with random graphs, random trees, and the area under random excursions, a subject examined this year from a purely analytic standpoint by Louchard.

A drawback of linear probing hashing (LPH) is that it degenerates into sequential search when the table fills up. Chassaing and Janson appeal to recent probabilistic tools in order to find the cut-off point where LPH becomes ``unreasonable''. This is when n becomes about m-sqrt(m), with n the number of data items and m the number of cells. The phase transition region, as physicists would call it, can be in fact very precisely quantified. Poblete will consider another type of dynamics and study the effect of long sequences of insertions and deletions: the behaviours corresponding to different deletion strategies can be indeed quite varied!

An indirect consequence of the recent surge of interest in hashing is Marckert's solution of a conjecture of Odlyzko and Wilf going back to 1987. In a nutshell, LPH (full) tables are labelled trees in disguise and the probabilistic tools that connect them to Brownian excursion can be put to use to analyse the width of trees. The net result is that the width of a random labelled tree is O(sqrt(n), with precise constants and full characterization of the distribution being now available.

Combinatorial tree models.

The uniform tree models of combinatorics where all trees of a given size are taken as equally likely have a long history. Catalan numbers were discovered by Euler and Segner around 1753, developed by Catalan and further investigated around 1850 by Schroeder, for reasons considered at the time as being of a logical nature.

Sometimes uniform tree models apply directly, as a sort of ``default model''; such is the case in computer algebra and symbolic manipulation where we can start by considering all expression trees as equally likely. Often, they are useful as a first approach to the analysis of more complex data models (like BST's and digital trees) since they replace difference or integro-differential relations by plain algebraic ones.

Drmota will discuss the general singularity structure of uniform tree models. Given a finite set of operands, we're talking here about systems of polynomial equations, algebraic functions, algebraic singularities, and the like. Normal laws are then ubiquitous by singularity perturbation techniques. Particular cases of this general situation are the uniform (aka ``static'') model of tries reviewed by Nebel and for which height estimates are derived, and also leftist trees where Kemp, who had solved earlier Knuth's long-standing open problem of enumerating shapes, will present new statistics on leaves.

Information theory and such.

Patterns and pattern-matching algorithm have always been a subject of interest for this community. Salvy has contributed a general framework for analysing the number of patterns in a random string that obeys a Bernoulli or Markov model. Here the term ``pattern'' is to be taken in the wide sense of an arbitrary regular expression; Salvy proves that the number of occurrences is almost invariably normal. It is of interest that this study was initially motivated by computational biology questions and that a computer algebra based library accompanies the paper. Related information theoretic issues are discussed by Jacquet. The specialization of the discussion to a finite set of substring patterns and a Markovian source model is solved in explicit terms by Régnier.

Szpankowski and Jacquet coined the term of ``analytic information theory''. By this is meant the application of analytic combinatorics methods to information theory questions. The new field has had for instance spectacular success with a very complete characterization of the behaviour of data compression algorithms of the Lempel-Ziv variety. Szpankowski will present here new applications to very precise redundancy estimates for a large class of sources.

A closed world?

There are a number of lectures that I haven't reviewed so far. They serve to illustrate the fact that the field is moving and that, healthily enough, it does not limit itself to a few classical problems.

Arratia will discuss probabilistic issues arising from the problem of reconstructing a long sequence given a large number of its small fragments--a key problem in DNA sequencing. Methods of analytic combinatorics that were, as we know, largely triggered by the demands from analysis of algorithms, are put to use by Hwang in the study of consecutive records, a problem closely related to classical statistics. Karonski proposes a matching algorithm that is particularly efficient in a distributed environment context. Lugosi develops new concentration inequalities that have a lot of applications in random combinatorics including entropy estimates. Urn models pose challenging problems and a complete solution to the 2x2 case will be presented by Mahmoud. Analytic methods apply to a number of ``seminumerical'' algorithms in the sense of Vol. 2 of TAOCP. We've already seen the case of GCD algorithms; other examples are the structure of random polynomlials and polynomial factorization algorithms as well as a number of related algorithms from computational number theory. This provides the motivation of Panario's lecture who offers a general discussion of smallest and largest components under a combinatorial schema of broad applicability. Prodinger has devised a complete analysis of runs, records, inversions, etc, in sequences of geometrically distributed random variables. This is closely related to probabilistic counting and to skip lists, themselves a highly practical alternative to balanced trees. A branching tree model for the analysis of effective speedup from pipelining is the subject of Shachnai's lecture.

Last, this year will benefit from the presence of Herb Wilf, the pioneer and founding father of random generation in combinatorics. See the book by Nijenhuis and Wilf from 1975 and the follow-ups by Wilf along the years. Wilf will present a panorama of methods and a set of corresponding open problems in this area. Liebehenschel will focus on the ranking problem for general parenthesis systems while Krattenthaler will discuss the tough problem of generating plane partitions uniformly at random.

I hope I have made mysellf clear: 1999 should be a great vintage for the AofA meetings!


What next??

The Barcelona seminar fits into a series of similar events that have taken place in 1993, 1995, 1997 (Dagstuhl), before moving to a yearly basis: Princeton for 1998 and Barcelona for 1999. The next event will take place in late June or early July 2000 near the city of Gdansk in Poland, organized by Wojtek Szpankowski; the meeting of 2001 will be organized by Brigitte Vallée, perhaps in the Tatihou Island in Normandy.


The City of Gdansk

The Island of Tatihou

[A warning to Wojtek: Tatihou is not Tahiti! :-) :-)]