Generalization of a Suffix Tree for RNA Structural Pattern Matching

In molecular biology, it is said that two biological sequences tend to have similar properties if they have similar 3-D structures.Hence, it is very important to find not only similar sequences in the string sense, but also structurally similar sequences from databases. In this paper, we propose a new data structure that is a generalization of a parameterized suffix tree (p-suffix tree for short) introduced by Baker. This data structure can be used for finding structurally related patterns of RNA or single-stranded DNA. Furthermore, we propose an $O(n(\log|\Sigma|+\log|\Pi|))$ on-line algorithm for constructing it, where $n$ is the sequence length, $|\Sigma|$ is the size of the normal alphabet, and $|\Pi|$ is that of the alphabet called ``parameter,'' which is related to the structure of the sequence. Our algorithm achieves a linear time when it is used to analyze RNA and DNA sequences. Furthermore, as an algorithm for constructing the p-suffix tree,it is the first on-line algorithm, though the computing bound of our algorithm is same as that of Kosaraju's best-known algorithm. The results of computational experiments using actual RNA and DNA sequences are also given to demonstrate our algorithm's practicality.

By: Tetsuo Shibuya

Published in: Algorithmica , volume 39, (no 1), pages 1-19 in 2004

Please obtain a copy of this paper from your local library. IBM cannot distribute this paper externally.

Questions about this service can be mailed to reports@us.ibm.com .