Nearest-Neighbour Search in High Dimension and Molecular Clustering
Frédéric Cazals
Algorithms project, INRIA Rocquencourt
Algorithms Seminar
June 30, 1997
[summary by Frédéric Cazals]
1 Introduction and prerequisites
1.1 Problem statement
Given a set of points P = {p_{1},...,p_{n}} ⊂ R^{d}, the
nearest-neighbour (NN) and k nearest neighbours (k-NN) problems
can be stated as follows: preprocess P in order to return as fast
as possible the nearest or k nearest neighbour(s) of an arbitrary
point q according to the Euclidean metric d(p,q) = (∑_{i=1}^{d}
(p_{i} - q_{i})^{2})^{1/2}. A weakened version of the NN problem consists
in returning a point p' which ε-approximates the NN p of
q in the sense d(p',q)/d(p,q) ≤ 1+ε for any ε > 0.
If one denotes by p_{i_1},...,p_{i_n} the points of P sorted
by increasing distance to q, an equivalent formulation of the
k-NN problem consists in returning a subset S = {s_{1},...,s_{k}}
with d(q,s_{j}) ≤ (1+ε) d(q,p_{i_j}) for j = 1,...,k.
The naive algorithm to compute the NN of a point q consists in
checking all the points of P and returning the closest, which has
complexity O(dn). On the other hand, the most sophisticated
algorithms known until recently had complexity O(exp(d) log n),
with exp(d) a function growing at least as quickly as 2^{d} (see
e.g. [1]), so that whenever d ≥ log n nothing better
than the brute-force method was known!
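The naive method is straightforward to implement; here is a minimal Python sketch (the function name is ours):

```python
import math

def knn_brute_force(points, q, k):
    """Naive k-NN: scan all n points, computing each Euclidean
    distance in O(d) time, for O(dn) total (plus a sort)."""
    return sorted(points, key=lambda p: math.dist(p, q))[:k]
```

For n points in dimension d, this is the O(dn) baseline that the algorithms below try to beat after preprocessing.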
Kleinberg's breakthrough [4] has been to get around this
exponential difficulty by a heavy use of random sampling, aiming at
``comparing'' the points of P through their projections on random
lines passing through the origin rather than decomposing the
d-dimensional space containing them. The first result is an
algorithm returning an approximation of the k-NN in a deterministic
way, but with exponential time/space preprocessing. The second
algorithm returns an approximation of the NN in a randomised way, but
with only polynomial preprocessing. This talk presents these two
algorithms and discusses their potential use for a clustering problem
arising in chemistry (see Section 4).
1.2 Prerequisites
1.2.1 A geometric lemma
The core idea of Kleinberg's method lies in the following property:
Lemma 1
Let x and y be two vectors of R^{d} such that ||y||/||x||
≥ 1+γ with γ ≤ 1/2. Then, if v is a random
vector on the unit sphere S^{d-1}, we have
Pr[ |x · v| ≥ |y · v| ] ≤ 1/2 - γ/3.
Intuitively, short vectors ``should not defeat too often'' longer ones
when comparing their projections on a random line determined by a
vector of S^{d-1}. In order to compare two vectors from their
projections, the key point is therefore to use a large enough set of
lines to capture the probabilistic property contained in the above
lemma.
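The property is easy to probe numerically. The sketch below (a Monte Carlo experiment of ours, not part of the original talk) draws random unit vectors and estimates how often a shorter vector x ``defeats'' a longer vector y on a random projection; by Lemma 1 the frequency stays below 1/2.

```python
import math
import random

def random_unit_vector(d):
    # A normalised Gaussian vector is uniformly distributed on S^{d-1}
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def defeat_frequency(x, y, trials=5000):
    """Estimate Pr[|x.v| >= |y.v|] for v uniform on the unit sphere,
    i.e. how often the shorter vector x defeats the longer y."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    wins = sum(abs(dot(x, v)) >= abs(dot(y, v))
               for v in (random_unit_vector(len(x)) for _ in range(trials)))
    return wins / trials
```

For instance, with x = (1,0,0) and y = (0,1.4,0) (so ||y||/||x|| = 1.4), the estimated defeat frequency is clearly below 1/2.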
1.2.2 Empirical measures and Vapnik-Chervonenkis bounds
Let (Ω, F, μ) be a probability space, S ⊂ F a set of events,
and X_{1},X_{2},...,X_{n} n random variables following the law μ.
If one calls the empirical measure of an event s the fraction of
the X_{i}'s falling into s, the quantity

D_{S}^{n} = sup_{s ∈ S} | (1_{s}(X_{1}) + ... + 1_{s}(X_{n}))/n - μ(s) |

measures the maximum difference over the class S between the
empirical measure and the probability. It is a random variable, and
Vapnik and Chervonenkis's contribution [5] has been to
elucidate the conditions under which it converges in probability to
zero, that is, the conditions under which

lim_{n → ∞} Pr[ D_{S}^{n} > ε ] = 0

for any ε > 0.
To sketch this contribution, let a range space be a couple (P, R) =
((Ω, F, μ), R ⊂ F).
We shall say that a set A of finite cardinality is shattered by
R if ∀ a ∈ 2^{A}, ∃ r ∈ R such that a = r ∩ A. The
Vapnik-Chervonenkis dimension of (P, R) is the cardinality of the
biggest A ⊂ Ω shattered by R.

Definition 1
A γ-sample for (P, R) is a finite set A ⊂ Ω such that

| μ(r) - |r ∩ A| / |A| | ≤ γ, ∀ r ∈ R.
Theorem 1 [5]
For a range space of dimension k, a random sample of size
(c_{0}/γ^{2}) ( k log(k/γ^{2}) + log(1/δ) )
is a γ-sample with probability at least 1-δ.
1.2.3 Exceptional and r-distinguishing sets
As pointed out above, we are interested in comparing points with
respect to their projections on vectors of S^{d-1}. For two
vectors x and y with ||y|| / ||x|| ≥ 1+γ,
we call their exceptional set

Ω_{x,y} = { v ∈ S^{d-1} such that |x · v| ≥ |y · v| }.

And a random set of vectors V from S^{d-1} is called
r-distinguishing if

∀ Ω_{x,y}, μ(Ω_{x,y}) < r ⇒ |V ∩ Ω_{x,y}| / |V| < 1/2.

More prosaically, a set V is r-distinguishing if a majority of
its points do not fall into any exceptional set of measure smaller
than r.
1.2.4 Hyperplane arrangements
An arrangement of n hyperplanes in R^{d} is said to be in general
position if any d hyperplanes have a unique point in common, and
no d+1 hyperplanes share a point. Given a hyperplane h
and a point p, p is either above, on, or below h, which is
called its position. A face of an arrangement is a set of points
having the same position with respect to all the hyperplanes. The
dimension of a face is its affine dimension. It is known from [2] that

Theorem 2
The number of d-faces of an arrangement of n hyperplanes in
general position is f_{d}(n) = ∑_{i=0}^{d} C(n,i).
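The count of Theorem 2 is a simple sum of binomial coefficients; for instance, 3 lines in general position cut the plane into f_{2}(3) = 1 + 3 + 3 = 7 cells. A one-function sketch:

```python
from math import comb

def f_d(n, d):
    """Number of d-faces (full-dimensional cells) of an arrangement
    of n hyperplanes in general position in R^d: sum of C(n, i)
    for i = 0..d."""
    return sum(comb(n, i) for i in range(d + 1))
```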
1.2.5 Digraphs
A complete digraph G on n vertices 1,2,...,n is a directed
graph which contains, for any pair of vertices {i,j}, either the
edge (i,j) or the edge (j,i). An apex of G is a vertex with a directed
path of length at most two to every vertex. Finally, an apex ordering
of G is an ordering i_{1},...,i_{n} of its vertices such that
i_{k} is an apex of the sub-digraph G[i_{k},i_{k+1},...,i_{n}]. The
following is straightforward:

Theorem 3
Every n-node complete digraph has an apex ordering computable in
O(n^{2}) time.
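One way to realise Theorem 3 (a sketch of ours, not necessarily the construction behind the theorem) uses the classical ``king'' property of tournaments: a vertex of maximum out-degree reaches every other vertex by a directed path of length at most two, hence is an apex. Repeatedly extracting such a vertex while maintaining out-degrees costs O(n) per step, O(n^{2}) overall.

```python
def apex_ordering(n, beats):
    """Apex ordering of a complete digraph on vertices 0..n-1.
    beats[i][j] is True iff the edge (i, j) is present. At each
    step, a maximum out-degree vertex of the remaining sub-digraph
    is an apex; extract it and update out-degrees in O(n)."""
    remaining = set(range(n))
    outdeg = {i: sum(beats[i][j] for j in remaining if j != i)
              for i in remaining}
    order = []
    while remaining:
        apex = max(remaining, key=lambda i: outdeg[i])
        order.append(apex)
        remaining.remove(apex)
        for i in remaining:       # the apex no longer counts as defeated
            if beats[i][apex]:
                outdeg[i] -= 1
    return order
```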
1.3 First results
The following results can be proved:
Lemma 2
Let μ be the uniform measure on S^{d-1}. The dimension of the
range space
((S^{d-1}, F, μ), { Ω_{x,y} : μ(Ω_{x,y}) ≤ r })
is less than d' = 8(d+1) log(4d+4).

Lemma 3
There exists c_{0} such that a random sample of S^{d-1} of size
f(d,γ) is a γ/2-sample for the range space
((S^{d-1}, F, μ), { Ω_{x,y} : μ(Ω_{x,y}) ≤ r })
with
f(d,γ) = (c_{0}/γ^{2}) ( d' log(d'/γ^{2}) + log(1/δ) ) = Θ(d log^{2} d).

Corollary 1
A set V of f(d,γ) vectors of S^{d-1}
is (1/2 - γ)-distinguishing with probability
at least 1-δ.
2 First algorithm
2.1 Construction of the data structure
This algorithm returns an approximation of the k-NN of a point q.
To build the data structure from which it does so, we first draw
uniformly at random a set V of L = f(d, ε/3) =
Θ(d log^{2} d) vectors from S^{d-1}. Then for each vector v_{l}
∈ V, the following is done: 1. compute v_{l} · p_{ij} with
p_{ij} = (p_{i}+p_{j})/2, 1 ≤ i ≤ j ≤ n; 2. sort the
p_{ij} according to the values of v_{l} · p_{ij} and denote by
S_{l} the list obtained. The list of lists S_{1},...,S_{L} is denoted
S.
For each such list, a pair of consecutive entries is called a
primitive interval, and a sequence of primitive intervals, one per
list, is called a trace (see Figure 1). The maximum number of
traces is upper-bounded by (n^{2})^{L} = n^{O(d log^{2} d)}. But a trace
s_{1} ··· s_{L} is realizable if

∃ q ∈ R^{d}, ∀ k = 1,...,L:
v_{k} · p_{k}^{-} < v_{k} · q < v_{k} · p_{k}^{+},

where p_{k}^{-} and p_{k}^{+} denote the two midpoints bounding the
primitive interval s_{k}. So realizable traces are defined with
respect to the Ln^{2} hyperplanes v_{k} · (p_{ij} - x) = 0, and from
Theorem 2 the number of such traces is
∑_{i=0}^{d} C(Ln^{2}, i) = O(n log d)^{2d}.
Definition 2
For a realizable trace s = s_{1} ··· s_{k} ···
s_{L}, p_{i} is said to s-dominate p_{j} in S_{k} if
the midpoint p_{ij} lies in the interval (s_{k},...,p_{jj}) of S_{k}.

For each realizable trace s, the construction is as follows:
(1.) build a complete digraph G_{s} on {1,2,...,n} with
the edge (i,j) if p_{i} s-dominates p_{j} in half of the
lists of S; (2.) build an apex ordering (s, G_{s}) of
the nodes of G_{s}.

The idea behind the domination definition is depicted in Figure
2: if p_{ij} falls in the desired interval
denoted I_{1}, then p_{i} ∈ I_{2} and we have
|v_{k} · (p_{i}-q)| < |v_{k} · (p_{j}-q)|.
2.2 Algorithm
To process a query associated with a point q: 1. compute
s(q) = s_{1}(q) ··· s_{L}(q), with s_{k}(q) the
primitive interval of S_{k} containing v_{k} · q; 2. retrieve
the apex ordering associated with s(q) and return its k first
entries.

This algorithm returns an ε-approximation of the k
nearest neighbours of a point q in a deterministic fashion: the
answer is guaranteed to be correct if the random sample V is
actually (1/2 - ε/3)-distinguishing (see §1.2.3
and Corollary 1).

Roughly speaking, the correctness of the algorithm comes from the fact
that if a non-desired point p_{1} has been returned instead of a
desired point p_{2}, then p_{1} dominated p_{2} in more than half of
the lists of S, and the sample V was not distinguishing
enough with respect to the exceptional set Ω_{q-p_{1},q-p_{2}}.

The preprocessing requires O(Ln^{2} (n log d)^{2d})
time and O(n (n log d)^{2d}) space. The cost of a
query is O(k + L(d + log n)) = O(k + (d log^{2} d)(d + log n)).
3 Second algorithm
As opposed to the first algorithm, the second one does not try to
compute a partition of the query space, and it proceeds in two
steps. The first one consists in drawing a random sample from P of
the appropriate size and returning its point closest to q. The
second one compares q iteratively to pairs of points of P in the
tournament fashion depicted in Figure 3. Assuming that
n = 2^{m}, the overview of this second stage is the following:

- a random sample V of Θ(d log^{2} n (log^{2} d + log d
log log n)) vectors is drawn uniformly on S^{d-1}, and a
multiset G_{v} of V is assigned to each internal node v of
the binary tree whose leaves are a permutation of the points of P.
For a query point q, two points p_{i} and p_{j} of P, and a
multiset G_{v}, p_{i} is said to dominate p_{j} if
|v_{k} · (p_{i}-q)| < |v_{k} · (p_{j}-q)|
holds for a majority of vectors v_{k} in G_{v}; otherwise p_{j}
dominates p_{i};
- each internal node v of the tree is assigned the dominating one
of its two children for the multiset G_{v}.

The point eventually returned is the better of the two candidates
returned by the two steps. It can be shown that an
ε-approximation is returned with probability greater than
1-δ in O(n + d log^{3} n) time with a space requirement of
n |V|.
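A much-simplified sketch of this tournament stage (our own illustration: a single shared sample of vectors instead of one multiset G_{v} per node, and the first-stage sample of P omitted):

```python
import math
import random

def unit_vector(d):
    # A normalised Gaussian vector is uniform on S^{d-1}
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def dominates(p_i, p_j, q, vectors):
    """p_i dominates p_j iff |v.(p_i - q)| < |v.(p_j - q)| holds
    for a majority of the sampled vectors v."""
    votes = 0
    for v in vectors:
        proj_i = abs(sum(vc * (a - b) for vc, a, b in zip(v, p_i, q)))
        proj_j = abs(sum(vc * (a - b) for vc, a, b in zip(v, p_j, q)))
        if proj_i < proj_j:
            votes += 1
    return 2 * votes > len(vectors)

def tournament_nn(points, q, n_vectors=64):
    """Single-elimination tournament: pair the surviving points up
    and keep the dominating point of each pair, until one remains."""
    vectors = [unit_vector(len(q)) for _ in range(n_vectors)]
    layer = list(points)
    while len(layer) > 1:
        nxt = [a if dominates(a, b, q, vectors) else b
               for a, b in zip(layer[::2], layer[1::2])]
        if len(layer) % 2:          # odd point out gets a bye
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]
```

By Lemma 1, the point closer to q wins each individual vote with probability bounded above 1/2, so with enough vectors the true nearest neighbour survives the whole tournament with high probability.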
Figure 3: Tournament tree
Figure 4: Molecule and fragments
4 Application to molecular clustering
Suppose we are given a set of d molecular fragments, say
F = {f_{1},...,f_{d}}, and a set M = {m_{1},...,m_{N_m}} of N_{m}
molecules, each described by a set of fragments of F (see
Figure 4). We shall represent a molecule m_{i} by a sequence
of d boolean values m_{i} = b_{i,1} b_{i,2} ··· b_{i,d}, with
b_{i,j} = 1 if m_{i} contains f_{j} and b_{i,j} = 0 otherwise. This
formulation fails to report multiple occurrences of a given fragment
in a molecule, but it has the nice property that a molecule is
represented by a point of the hypercube H^{d} = {0,1}^{d}. Given two
molecules, we call their similarity the number of their common
fragments, that is, the quantity sim(m_{i},m_{j}) =
∑_{k=1}^{d} b_{i,k} · b_{j,k}.
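With the hypercube encoding, sim reduces to a bitwise AND followed by a population count; a one-liner, assuming (our choice) that each molecule is packed into an integer whose d low-order bits are b_{i,1},...,b_{i,d}:

```python
def similarity(m_i, m_j):
    """sim(m_i, m_j) for bit-packed molecules: the 1-bits of
    (m_i & m_j) are exactly the fragments present in both."""
    return bin(m_i & m_j).count("1")
```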
Given the set M, we are interested in the following problem: find a
partition of M into subsets of neighbours, or clusters. To do so,
one way to proceed (see [3]) consists in first building a
minimum spanning tree (MST) on the input data set, second removing the
``too long'' edges from the MST, and third computing the connected
components we are left with. The key point lies in the MST
computation, and it is shown in [6] that

Theorem 4
An MST on n points in dimension d can be found in
O(2^{d} n^{2 - 1/2^{d+1}} (log n)^{1 - 1/2^{d+1}}) time.

Unfortunately for our concern, where d can range from 500 to 2000,
the 2^{d} constant is prohibitive. Kleinberg's algorithms, customised
to the hypercube setting, could make Yao's algorithm interesting in
practice.
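The MST-then-cut pipeline described above is easy to sketch for small inputs (Prim's O(n^{2}) algorithm, a distance cutoff for the ``too long'' edges, then union-find for the components; the function names and the cutoff parameter are ours):

```python
import math

def mst_clusters(points, cutoff):
    """Build a Euclidean MST with Prim's O(n^2) algorithm, drop the
    edges longer than `cutoff`, and return the index sets of the
    resulting connected components."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n          # cheapest attachment cost to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for w in range(n):
            d = math.dist(points[u], points[w])
            if not in_tree[w] and d < best[w]:
                best[w], parent[w] = d, u
    # keep only the short MST edges, then extract components (union-find)
    comp = list(range(n))
    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]   # path halving
            x = comp[x]
        return x
    for a, b, w in edges:
        if w <= cutoff:
            comp[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

On two well-separated pairs of points, cutting the single long MST edge yields the two expected clusters.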
References
[1] Arya (S.), Mount (D. M.), Netanyahu (N. S.), Silverman (R.), and Wu (A. Y.).
An optimal algorithm for approximate nearest neighbour searching. In
Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete
Algorithms, pp. 573-582. New York, 1994.
[2] Edelsbrunner (Herbert).
Algorithms in combinatorial geometry.
Springer-Verlag, Berlin, 1987, EATCS Monographs
on Theoretical Computer Science, vol. 10, xvi+423p.
[3] Jain (Anil K.) and Dubes (Richard C.).
Algorithms for clustering data.
Prentice-Hall Inc., Englewood Cliffs, NJ, 1988, Prentice-Hall Advanced Reference Series, xiv+320p.
[4] Kleinberg (J.).
Two algorithms for nearest-neighbour search in high dimension. In
Proceedings of the ACM Symposium on Theory of Computing (STOC).
El Paso, Texas, USA, 1997.
[5] Vapnik (V. N.) and Chervonenkis (A. Ya.).
On the uniform convergence of relative frequencies of events to their
probabilities. Theory of Probability and its Applications, vol. 16,
n°2, 1971, pp. 264-280.
[6] Yao (Andrew Chi-Chih).
On constructing minimum spanning trees in k-dimensional spaces and
related problems. SIAM Journal on Computing, vol. 11, n°4,
1982, pp. 721-736.