Nearest-Neighbour Search in High Dimension and Molecular Clustering

Frédéric Cazals

Algorithms project, INRIA Rocquencourt

Algorithms Seminar

June 30, 1997

[summary by Frédéric Cazals]

1 Introduction and prerequisites

1.1 Problem statement

Given a set of points P = {p₁,...,p_n} ⊂ R^d, the nearest-neighbour (NN) and k-nearest-neighbours (k-NN) problems can be stated as follows: pre-process P in order to return as fast as possible the nearest or the k nearest neighbour(s) of an arbitrary query point q according to the Euclidean metric d(p,q) = (Σ_{i=1}^d (p_i - q_i)²)^{1/2}. A weakened version of the NN problem consists in returning a point p' which ε-approximates the NN p of q, in the sense that d(p',q)/d(p,q) ≤ 1+ε for a given ε > 0. Similarly, if p_{i_1},...,p_{i_n} denote the points of P sorted by increasing distance to q, the weakened k-NN problem consists in returning a subset S = {s₁,...,s_k} with d(q,s_j) ≤ (1+ε) d(q,p_{i_j}) for j = 1,...,k.

The naive algorithm to compute the NN of a point q consists in checking all the points of P and returning the closest, which has complexity O(dn). On the other hand, the most sophisticated algorithms known until recently had complexities in O(exp(d) log n), with exp(d) a function growing at least as quickly as 2^d (see e.g. [1]). Consequently, whenever d ≥ log n, nothing better than the brute-force method was known!
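
As a point of reference, the brute-force method and the ε-approximation criterion of section 1.1 can be sketched as follows. This is a minimal illustration only; the function names are hypothetical and points are assumed to be tuples of floats.

```python
import math

def nearest_neighbour(points, q):
    """Index of the point of `points` closest to q (Euclidean metric), in O(dn) time."""
    best_index, best_dist = -1, math.inf
    for index, p in enumerate(points):      # n iterations ...
        dist = math.dist(p, q)              # ... each costing O(d)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def is_eps_approximate(points, q, candidate, eps):
    """True if points[candidate] eps-approximates the nearest neighbour of q."""
    exact = math.dist(points[nearest_neighbour(points, q)], q)
    return math.dist(points[candidate], q) <= (1 + eps) * exact
```

Any algorithm discussed below has to beat this linear scan to be of interest when d is large.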

Kleinberg's breakthrough [4] has been to get around the exponential difficulty by a heavy use of random sampling: the points of P are ``compared'' through their projections on random lines passing through the origin, rather than by decomposing the d-dimensional space containing them. The first result is an algorithm returning an approximation of the k-NN in a deterministic way, but with an exponential time and space pre-processing. The second algorithm returns an approximation of the NN in a randomised way, but with a polynomial pre-processing only. This talk presents these two algorithms and discusses their potential application to a clustering problem arising in chemistry (see section 4).

1.2 Prerequisites

1.2.1 A geometric lemma

The core idea of Kleinberg's method lies in the following property:

Lemma 1 Let x and y be two vectors of R^d such that ||y||/||x|| ≥ 1+γ with γ ≤ 1/2. Then, if v is a random vector on the unit sphere S^{d-1}, we have Pr[|x · v| ≥ |y · v|] ≤ 1/2 - γ/3.

Intuitively, short vectors ``should not defeat too often'' longer ones when their projections on a random line, determined by a vector of S^{d-1}, are compared. In order to compare two vectors from their projections, the key point is therefore to use a set of lines large enough to capture the probabilistic property stated in the above lemma.
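
The intuition can be explored numerically. The sketch below is not part of the talk: it is a hypothetical Monte-Carlo experiment that draws random directions (normalised Gaussian vectors, which are uniform on the sphere) and counts how often the shorter vector has the longer projection. It only illustrates the qualitative statement that the shorter vector wins less than half of the time.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform point on S^{d-1}: normalise a vector of independent Gaussians."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0.0:
            return [c / norm for c in g]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def short_vector_win_rate(x, y, trials, rng):
    """Fraction of random lines on which |x . v| >= |y . v|."""
    wins = 0
    for _ in range(trials):
        v = random_unit_vector(len(x), rng)
        if abs(dot(x, v)) >= abs(dot(y, v)):
            wins += 1
    return wins / trials
```

For instance, `short_vector_win_rate([1, 0, 0], [0, 1.5, 0], 20000, random.Random(0))` estimates the measure of the set of directions on which x defeats y.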

1.2.2 Empirical measures and Vapnik-Chervonenkis bounds

Let (Ω, F, µ) be a probability space, S ⊂ F a set of events, and X₁, X₂, ..., X_n be n independent random variables following the law µ. If one calls the empirical measure of an event s the fraction of the X_i's falling into s, the quantity

D_S^n = sup_{s ∈ S} | (1_s(X₁) + ... + 1_s(X_n))/n - µ(s) |

measures the maximum difference over the class S between the empirical measure and the probability. It is a random variable, and the contribution of Vapnik and Chervonenkis [5] has been to elucidate the conditions under which it converges in probability to zero, that is, the conditions under which lim_{n → ∞} Pr[D_S^n > ε] = 0 for any ε > 0. To sketch this contribution, let a range-space be a couple (P, R) = ((Ω, F, µ), R ⊂ F). We shall say that a set A of finite cardinality is shattered by R if ∀ a ∈ 2^A, ∃ r ∈ R such that a = r ∩ A. The Vapnik-Chervonenkis dimension of (P, R) is the cardinality of the biggest A ⊂ Ω shattered by R.
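
To make the notion of shattering concrete, here is a small brute-force sketch on a toy range space that is not taken from the talk: the ranges are closed intervals of the real line, whose Vapnik-Chervonenkis dimension is 2. The helper names and the choice of ranges are assumptions made for the illustration.

```python
from itertools import combinations

def is_shattered(points, ranges):
    """True if every subset of `points` is of the form r ∩ points for some range r.

    Each range is given as a predicate point -> bool; points are assumed distinct.
    """
    points = list(points)
    traces = {frozenset(p for p in points if r(p)) for r in ranges}
    return len(traces) == 2 ** len(points)

def vc_dimension(ground_set, ranges):
    """Size of the biggest subset of `ground_set` shattered by `ranges` (brute force)."""
    best = 0
    for size in range(1, len(ground_set) + 1):
        if any(is_shattered(a, ranges) for a in combinations(ground_set, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so no larger set can be
    return best

def intervals(grid):
    """Closed intervals [a, b] with end-points on `grid` (empty when a > b)."""
    return [lambda p, a=a, b=b: a <= p <= b for a in grid for b in grid]
```

With these ranges, a pair of points is shattered but a triple is not, since no interval contains the two outer points without the middle one.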

Definition 1 A γ-sample for (P, R) is a finite set A ⊂ Ω such that |µ(r) - |r ∩ A|/|A|| ≤ γ, ∀ r ∈ R.

Theorem 1 [5] For a range space of dimension k, a random sample of size

l ≥ (c₁/γ²) (k log(16k/γ²) + log(c₂/δ)),

with c₁ and c₂ suitable absolute constants, is a γ-sample with probability at least 1-δ.

1.2.3 Exceptional and r-distinguishing sets

As pointed out above, we are interested in comparing points with respect to their projections on vectors from S^{d-1}. For two vectors x and y with ||y||/||x|| ≥ 1+γ, we call their exceptional set

W_{x,y} = { v ∈ S^{d-1} such that |x · v| ≥ |y · v| }.

A random set of vectors V from S^{d-1} is called r-distinguishing if

∀ W_{x,y}, µ(W_{x,y}) < r ⇒ |V ∩ W_{x,y}| / |V| < 1/2.

More prosaically, a set V is r-distinguishing if, for every exceptional set of measure smaller than r, fewer than half of the vectors of V fall into this set.

1.2.4 Hyper-plane arrangements

An arrangement of n hyper-planes in R^d is said to be in general position if any d hyper-planes have a unique point in common and no d+1 hyper-planes share a point. Given a hyper-plane h and a point p, p is either above, on, or below h, which is called its position with respect to h. A face of an arrangement is a maximal set of points having the same position with respect to all the hyper-planes. The dimension of a face is its affine dimension. It is known from [2] that

Theorem 2 The number of d-faces of an arrangement of n hyper-planes in general position is f_d(n) = Σ_{i=0}^d binom(n, i), where binom(n, i) denotes the binomial coefficient.
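
As a quick sanity check, and assuming the count of Theorem 2 is the classical sum of binomial coefficients binom(n, 0) + ... + binom(n, d), the number of d-faces can be evaluated directly. The function name is hypothetical.

```python
from math import comb

def num_d_faces(n, d):
    """Number of d-faces of n hyper-planes in general position in R^d."""
    return sum(comb(n, i) for i in range(d + 1))
```

For example, n points on a line (d = 1) cut it into n + 1 intervals, and three lines in general position in the plane define 7 regions.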

1.2.5 Digraphs

A complete digraph G on n vertices 1,2,...,n is a directed graph which contains, for any pair of vertices {i,j}, either the edge (i,j) or the edge (j,i). An apex of G is a vertex with a directed path of length at most two to any other vertex. Finally, an apex ordering of G is an ordering i₁,...,i_n of its vertices such that i_k is an apex of the sub-digraph G[i_k,i_{k+1},...,i_n]. The following is straightforward:

Theorem 3 Every n-node complete digraph has an apex ordering computable in O(n²) time.
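
One possible way to obtain such an ordering, sketched below, relies on the classical fact that a vertex of maximum out-degree in a complete digraph reaches every other vertex in at most two steps. The talk does not specify the construction, so this is an assumption; the digraph is given as a hypothetical boolean matrix `beats`, with `beats[i][j]` true when the edge (i,j) is present, and vertices are numbered from 0.

```python
def apex_ordering(beats):
    """Apex ordering of a complete digraph in O(n^2) time.

    Repeatedly extracts a vertex of maximum out-degree among the remaining
    vertices (ties broken by smallest index), then updates the out-degrees.
    """
    n = len(beats)
    remaining = set(range(n))
    out_degree = [sum(1 for j in range(n) if j != i and beats[i][j]) for i in range(n)]
    order = []
    while remaining:
        apex = min(remaining, key=lambda i: (-out_degree[i], i))
        order.append(apex)
        remaining.remove(apex)
        for i in remaining:
            if beats[i][apex]:          # the edge (i, apex) leaves the sub-digraph
                out_degree[i] -= 1
    return order

def is_apex_ordering(beats, order):
    """Check that order[k] reaches every later vertex by a path of length <= 2."""
    for k, apex in enumerate(order):
        rest = order[k + 1:]
        direct = {j for j in rest if beats[apex][j]}
        for j in rest:
            if j not in direct and not any(beats[m][j] for m in direct):
                return False
    return True
```

The checker is only there to validate the output on small inputs; it is not needed by the construction itself.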

1.3 First results

The following theorems can be proved:

Lemma 2 Let µ be the uniform measure on S^{d-1}. The dimension of the range-space

((S^{d-1}, F, µ), { W_{x,y} | µ(W_{x,y}) ≤ r })

is less than d' = 8(d+1) log(4d+4).

Lemma 3 There exists c₀ such that a random sample of S^{d-1} of size f(d,γ) is, with probability at least 1-δ, a γ/2-sample for the range-space ((S^{d-1}, F, µ), { W_{x,y} | µ(W_{x,y}) ≤ r }), with f(d,γ) = (c₀/γ²) (d' log(d'/γ²) + log(1/δ)) = Θ(d log² d) for fixed γ and δ.

Corollary 1 A random set V of f(d,γ) vectors from S^{d-1} is (1/2-γ)-distinguishing with probability at least 1-δ.

2 First algorithm

2.1 Construction of the data structure

This algorithm returns an approximation of the k-NN of a point q. To build the data structure from which it does so, we first draw uniformly at random a set V of L = f(d,ε/3) = Θ(d log² d) vectors from S^{d-1}. Then, for each vector v_l ∈ V, the following is done: 1. compute v_l · p_ij with p_ij = (p_i+p_j)/2, 1 ≤ i,j ≤ n; 2. sort the p_ij according to the values of v_l · p_ij, and denote by S_l the list obtained. The list of lists S₁,...,S_L is denoted S.
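
The two steps above can be sketched as follows. This is an illustration under assumptions: the number of vectors is passed as a parameter rather than computed from f, only the pairs i ≤ j are stored since p_ij = p_ji, and all names are hypothetical.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform point on S^{d-1}: normalise a vector of independent Gaussians."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0.0:
            return [c / norm for c in g]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_projection_lists(points, num_vectors, rng):
    """Return (vectors, lists) where lists[l] holds the triples
    (v_l . p_ij, i, j) for all midpoints p_ij, sorted by projection."""
    n, d = len(points), len(points[0])
    vectors = [random_unit_vector(d, rng) for _ in range(num_vectors)]
    lists = []
    for v in vectors:
        proj = [dot(v, p) for p in points]          # v . p_i for all i, O(dn)
        # By linearity, v . p_ij = (v . p_i + v . p_j) / 2.
        entries = [((proj[i] + proj[j]) / 2.0, i, j)
                   for i in range(n) for j in range(i, n)]
        entries.sort()
        lists.append(entries)
    return vectors, lists
```

Each list costs O(dn + n² log n) time to build, so the sorted lists themselves are cheap; the expensive part of the pre-processing is the enumeration of traces described next.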

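For each such list, a pair of consecutive entries is called a primitive interval, and a sequence of L primitive intervals, one per list, is called a trace (see Figure 1). The number of traces is upper-bounded by (n²)^L = n^{O(d log² d)}. However, only realizable traces matter, a trace being realizable if

∃ q ∈ R^d, ∀ k = 1,...,L, v_k · p^{(k)}_{i₁i₂} < v_k · q < v_k · p^{(k)}_{i₃i₄},

where p^{(k)}_{i₁i₂} and p^{(k)}_{i₃i₄} are the two consecutive entries of S_k delimiting the k-th primitive interval of the trace. Realizable traces are therefore defined with respect to the Ln² hyper-planes v_k · (p^{(k)}_{i₁i₂} - x) = 0, and from Theorem 2 their number is at most Σ_{i=0}^d binom(Ln², i) = O(n log d)^{2d}, where binom denotes the binomial coefficient.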

Definition 2 For a realizable trace s=s₁ ··· s_k ··· s_L, p_i is said to s-dominate p_j in S_k if p_ij lies in the interval (s_k,...,p_jj).

For each realizable trace s, the construction is as follows: (1.) build a complete digraph G_s on {1,2,...,n} with the edge (i,j) if p_i s-dominates p_j in at least half of the lists of S; (2.) build an apex ordering (s,G_s) of the nodes of G_s.

The idea behind the domination definition is depicted in Figure 2: if p_ij falls in the desired interval, denoted I₁, then p_i ∈ I₂ and we have |v_k · (p_i-q)| < |v_k · (p_j-q)|.
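
The underlying one-dimensional fact is that, on a line, a point a is closer to t than b is exactly when t lies strictly on the side of a with respect to the midpoint (a+b)/2. The following sketch, with hypothetical names, states both formulations so that they can be compared; here a, b and t stand for the projections v_k · p_i, v_k · p_j and v_k · q.

```python
def closer_by_distance(a, b, t):
    """Direct test: is a strictly closer to t than b is?"""
    return abs(a - t) < abs(b - t)

def closer_by_midpoint(a, b, t):
    """Same test, using only the position of t relative to the midpoint of a and b."""
    m = (a + b) / 2.0
    if a < b:
        return t < m
    if a > b:
        return t > m
    return False   # a == b: neither projection is strictly closer
```

This is why the position of v_k · q among the sorted values v_k · p_ij, that is the trace of q, is enough to decide every comparison without knowing q itself.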

2.2 Algorithm

To process a query associated to a point q: 1. compute s(q) = s₁(q) ··· s_L(q), with s_l(q) the primitive interval of S_l containing v_l · q; 2. retrieve the apex ordering associated to s(q) and return its first k entries.
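
A minimal sketch of the query follows, assuming the pre-processing has produced, for each vector, the sorted list of projected midpoints, together with a hypothetical dictionary `apex_orderings` mapping each realizable trace to its apex ordering. A primitive interval is identified here by the rank returned by a binary search, which is a representation chosen for the illustration.

```python
from bisect import bisect_left

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def trace_of(q, vectors, sorted_values):
    """For each list l, rank of the primitive interval containing v_l . q.

    Costs O(d) for the projection plus O(log n) for the binary search, per list.
    """
    return tuple(bisect_left(values, dot(v, q))
                 for v, values in zip(vectors, sorted_values))

def query(q, k, vectors, sorted_values, apex_orderings):
    """First k entries of the apex ordering stored for the trace of q."""
    return apex_orderings[trace_of(q, vectors, sorted_values)][:k]
```

Under these assumptions the cost is O(L(d + log n)) for the trace, plus the dictionary look-up and O(k) for the output.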

This algorithm therefore returns an ε-approximation of the k nearest neighbours of a point q in a deterministic fashion: the answer is guaranteed to be correct provided the random sample V is actually (1/2-ε/3)-distinguishing (see §1.2.3 and Corollary 1). Roughly speaking, the correctness of the algorithm comes from the fact that if a non-desired point p₁ has been returned instead of a desired point p₂, then p₁ dominated p₂ in more than half of the lists of S, and the sample V was not distinguishing enough with respect to the exceptional set W_{q-p₁,q-p₂}.

The pre-processing requires O (Ln² (n log d)^(2d) ) time and O (n (n log d)^(2d) ) space. The cost of a query is O(k+L (d+log n))=O(k + (d log² d) (d + log n)).

3 Second algorithm

As opposed to the first algorithm, the second one does not try to compute a partition of the query space, and it proceeds in two steps. The first one consists in drawing a random sample from P of the appropriate size and returning the point of this sample closest to q. The second one iteratively compares q to pairs of points of P in a tournament fashion, as depicted in Figure 3. Assuming that n = 2^m, the overview of this second stage is the following:

1. a random sample V of Θ(d log² n (log² d + log d log log n)) vectors is drawn uniformly on S^{d-1}, and a multi-set G_v of vectors of V is assigned to each internal node v of the binary tree whose leaves are a permutation of the points of P. For a query point q, two points p_i and p_j of P, and a multi-set G_v, p_i is said to dominate p_j if |v_k · (p_i-q)| < |v_k · (p_j-q)| holds for a majority of the vectors v_k in G_v. Otherwise p_j dominates p_i;
2. each internal node v of the tree is assigned its dominating child for the multi-set G_v.

The point eventually returned is the better of the two candidates returned by the two steps. It can be shown that an ε-approximation is returned with probability greater than 1-δ, in O(n + d log³ n) time and with a space requirement of n|V|.
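
The tournament stage can be sketched as follows. This is an illustration under assumptions, not Kleinberg's exact procedure: the size of the multi-sets is a free parameter, the multi-sets are drawn with replacement from V, and the projections v_k · p_i are stored at pre-processing time, which accounts for the n|V| space. All names are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def preprocess(points, vectors, multiset_size, rng):
    """Build the tournament tree; len(points) is assumed to be a power of two."""
    n = len(points)
    leaves = list(range(n))
    rng.shuffle(leaves)                                   # random permutation of P
    projections = [[dot(v, p) for v in vectors] for p in points]   # n|V| numbers
    levels, width = [], n // 2
    while width >= 1:                                     # one multi-set per internal node
        levels.append([rng.choices(range(len(vectors)), k=multiset_size)
                       for _ in range(width)])
        width //= 2
    return leaves, projections, levels

def dominates(i, j, q_proj, projections, multiset):
    """p_i dominates p_j if it is closer to q on a majority of the sampled lines."""
    votes = sum(1 for k in multiset
                if abs(projections[i][k] - q_proj[k]) < abs(projections[j][k] - q_proj[k]))
    return 2 * votes > len(multiset)

def tournament_winner(q, vectors, leaves, projections, levels):
    """Index of the point reaching the root of the tournament tree for query q."""
    q_proj = [dot(v, q) for v in vectors]
    current = leaves
    for level in levels:                                  # bottom-up, one round per level
        winners = []
        for node, multiset in enumerate(level):
            a, b = current[2 * node], current[2 * node + 1]
            winners.append(a if dominates(a, b, q_proj, projections, multiset) else b)
        current = winners
    return current[0]
```

The final answer of the second algorithm would then be whichever of this winner and the candidate of the first step is closer to q.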

Figure 3: Tournament tree

Figure 4: Molecule and fragments

4 Application to molecular clustering

Suppose we are given a set of d molecular fragments, say F = {f₁,...,f_d}, and a set M = {m₁,...,m_{N_m}} of N_m molecules, each described by a set of fragments of F (see Figure 4). We shall represent a molecule m_i by a sequence of d boolean values m_i = b_{i,1} b_{i,2} ··· b_{i,d}, with b_{i,j} = 1 if m_i contains f_j and b_{i,j} = 0 otherwise. This formulation fails to report multiple occurrences of a given fragment in a molecule, but it has the nice property that a molecule is represented by a point of the hyper-cube H^d = {0,1}^d. Given two molecules, we call their similarity the number of fragments they have in common, that is, the quantity sim(m_i,m_j) = Σ_{k=1}^d b_{i,k} · b_{j,k}.
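
This encoding maps naturally to machine integers used as bit-vectors. The sketch below is an illustration with hypothetical fragment names; it also uses the fact that, on the hyper-cube, the squared Euclidean distance between two points equals the number of coordinates where they differ.

```python
def encode(fragments_of_molecule, fragment_index):
    """Bit-vector of a molecule: bit j is set if the molecule contains fragment j."""
    bits = 0
    for fragment in fragments_of_molecule:
        bits |= 1 << fragment_index[fragment]
    return bits

def similarity(a, b):
    """sim(m_i, m_j): number of fragments common to both molecules."""
    return bin(a & b).count("1")

def squared_distance(a, b):
    """Squared Euclidean distance on {0,1}^d, i.e. the Hamming distance."""
    return bin(a ^ b).count("1")
```

With d between 500 and 2000, each molecule fits in a few machine words, so both quantities are cheap to evaluate.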

Given the set M, we are interested in the following problem: find a partition of M into subsets of neighbours, or clusters. To do so, one way to proceed (see [3]) consists in first building a Minimum Spanning Tree (MST) on the input data set, then removing the ``too long'' edges from the MST, and finally computing the connected components we are left with. The key point lies in the MST computation, and it is shown in [6] that
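
The three steps can be sketched as follows, with the quadratic algorithm of Prim standing in for the MST computation (it is not Yao's algorithm) and a hypothetical threshold `max_edge_length` deciding which edges are ``too long''.

```python
import math

def mst_edges(points, dist):
    """Prim's algorithm on the complete graph: O(n^2) distance evaluations."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n        # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    edges = []
    if n > 0:
        best[0] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for w in range(n):
            if not in_tree[w]:
                d_uw = dist(points[u], points[w])
                if d_uw < best[w]:
                    best[w], parent[w] = d_uw, u
    return edges

def clusters(points, dist, max_edge_length):
    """Connected components of the MST once edges longer than the threshold are removed."""
    n = len(points)
    root = list(range(n))        # union-find forest

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for length, u, w in mst_edges(points, dist):
        if length <= max_edge_length:
            root[find(u)] = find(w)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For molecules, `dist` could be any dissimilarity derived from the boolean encoding; `math.dist` is used in the usual Euclidean setting.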

Theorem 4 An MST on n points in dimension d can be found in
O(2^d n^{2-1/2^{d+1}} (log n)^{1-1/2^{d+1}}) time.

Unfortunately for our concern, where d can range from 500 to 2000, the 2^d factor is prohibitive. Kleinberg's algorithms, customised to the hyper-cube setting, could make Yao's algorithm interesting in practice.

References

[1]: Arya (S.), Mount (D. M.), Netanyahu (N. S.), Silverman (R.), and Wu (A. Y.). -- An optimal algorithm for approximate nearest neighbour searching. In Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 573--582. -- New York, 1994.
[2]: Edelsbrunner (Herbert). -- Algorithms in combinatorial geometry. -- Springer-Verlag, Berlin, 1987, EATCS Monographs on Theoretical Computer Science, vol. 10, xvi+423p.
[3]: Jain (Anil K.) and Dubes (Richard C.). -- Algorithms for clustering data. -- Prentice-Hall Inc., Englewood Cliffs, NJ, 1988, Prentice-Hall Advanced Reference Series, xiv+320p.
[4]: Kleinberg (J.). -- Two algorithms for nearest-neighbour search in high dimension. In ACM STOC. -- El Paso, Texas, USA, 1997.
[5]: Vapnik (V. N.) and Chervonenkis (A.). -- On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, vol. 16, n°2, 1971, pp. 264--280.
[6]: Yao (Andrew Chi Chih). -- On constructing minimum spanning trees in k-dimensional spaces and related problems. SIAM Journal on Computing, vol. 11, n°4, 1982, pp. 721--736.
