A Probabilistic Algorithm for Molecular Clustering

A Probabilistic Algorithm for Molecular Clustering

Frédéric Cazals

Algorithm Project, Inria

Algorithms Seminar

December 15, 1997

[summary by Bruno Salvy]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

In order to design a drug curing a given pathology, an approach which is commonly used consists in first selecting those that work best among molecules known to treat similar symptoms. In view of the number of known molecules, an exhaustive approach is often impossible. Instead, it is a common strategy to pick at random a number of molecules in a large database; and then concentrate on those that are chemically close to the ones that performed well. It is therefore important to be able to find those molecules in such a database. The aim of this work is to present a probabilistic algorithm for this task.

1 Molecules and similarity

The database is represented as an array of n~10,000 molecules, each molecule being characterized by the presence or absence of d~1,500 molecular fragments. Molecules are close when they differ by few molecular fragments. More precisely, one defines the size s(m) of a molecule m as the number of its fragments and the similarity sim(m,M) between two molecules as the number of common fragments. Finally, two molecules m and M are called (a,b)-similar for aÎ[0,1] and b³1 when

sim(m,M)³amin(s(m),s(M)), max(s(m),s(M))£bmin(s(m),s(M)). (1)

Note that this is not an equivalence relation. Other measures of similarity might also be of interest in practice.

2 Algorithm

The aim of this work is to find efficiently as many (a,b)-similar pairs in the database as possible. Obviously, an exhaustive search in n(n-1)/2 operations is possible, but expensive. The idea instead is to use a divide-and-conquer partitioning process. A fragment is selected at random and the database is partitioned into two subsets according to the presence or absence of this fragment. When such a subset has less than a fixed number K of elements an exhaustive search is performed; otherwise, the same process is applied recursively.

This technique finds a proportion t N of the number N of (a,b)-similar pairs. Heuristically, repeating the same process from scratch yields t(1-t)^i-1N new pairs at the ith iteration. Thus a few iterations of this idea yield a very large proportion of N.

3 Implementation

The parameter K plays an important part in the efficiency of the algorithm. When K is small the search is faster but finds less pairs, so that the number of times it has to be repeated to obtain the same number of pairs can be larger than for higher values of K. The optimal value of K also depends on the efficiency of the different stages of the algorithm. Any improvement on the partitioning and exhaustive search part shift the optimum to larger values of K, while a good data-structure for checking whether a pair has already been found shifts it in the other direction. Here are implementation ideas that lead to an efficient program:

The database is stored as an array of bits;
the entries in the database are accessed by chunks (bytes or words), a constant array making it fast to count the number of bits equal to 1 in a chunk;
computing the similarity between to molecules is then performed by bitwise and;
the sizes of the molecules are computed once at the beginning;
the partitioning is done like in Quicksort, on an array of pointers to the molecules;
the set of pairs is stored as an array of binary search trees (a hash table would also do).

In practice, with this implementation and K around 150, then 4 or 5 runs of the partitioning yield more than 90% of the pairs in a matter of minutes. The exhaustive search would take several hours.

Conclusion

It would be nice to find the optimal value of K by a complexity analysis. However, the Bernoulli distribution for the bits in the database does not give a good model. It is necessary to take into account the fact that the database was arrived at by a historical process where many of the molecules are variants of each other.

References

[1]: Cazals (Frédéric). -- Effective nearest neighbours searching on the hyper-cube, with applications to molecular clustering. In ACM Symposium on Computational Geometry. pp. 222--230. -- ACM Press, 1998. Also available at the url http://www.inria.fr/prisme/personnel/cazals/xfc_research.html.

This document was translated from L^AT_EX by H^EV^EA.