Ricardo Baeza-Yates, Univ. of Chile and Univ. of Politecnica de Catalunya

Fast Approximate Pattern Matching

We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a non-deterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length $O(\log n)$, being $n$ the maximum size of the text. The running time achieved is $O(n)$ for patterns of small size $m$ (i.e. $m = O(\sqrt{\log n}))$, independently of the maximum number of errors allowed, $k$. This algorithm is then used to design two general algorithms. One of them partitions the problem into subproblems, while the other partitions the automaton into subautomata. These algorithms are combined to obtain a hybrid algorithm which on average is linear for moderate $k/m$ ratios, $O(\sqrt{mk/\log n}~n)$ for medium ratios, and $O((m-k)kn/\log n)$ for large ratios. We show experimentally that this hybrid algorithm is faster than previous ones for moderate size patterns, which is the case in text searching. This works is also a good example of the use of analysis of algorithms for algorithmic design.