Dictionary-Driven Microbial Gene Finding

Gene identification is one of the most important problems in molecular biology and has received increasing attention with the advent of large-scale sequencing projects. Previous strategies for solving this problem fall into essentially two schools of thought: statistical approaches such as hidden Markov models (HMMs), and methods based on database similarity searches. In this paper, we propose a new approach to the gene identification problem. The approach employs the Bio-Dictionary [26,28], a database of patterns that cover essentially all of the currently available sample of natural protein sequence space, to determine gene candidates among the ORFs that can be identified in a given DNA strand; in doing so, the method combines the best characteristics of both schools of thought. We additionally associate the patterns in the Bio-Dictionary with appropriately computed weights, which further improves our gene identification ability. We demonstrate the method's improved capabilities through an analysis and discussion of the results we obtain by processing 17 whole archaeal and bacterial genomes.
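
The general idea of scoring ORFs against a weighted pattern collection can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example patterns, their weights, the regular-expression pattern syntax (with "." as a wildcard), and the additive score are all assumptions made for illustration, and the abstract does not specify how the real weights are computed or combined. The sketch enumerates ORFs (ATG to stop codon) in all six reading frames, translates them with the standard genetic code, and sums the weights of the patterns that occur in each translation.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    # Unknown codons (e.g. containing N) become "X".
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_orfs(dna, min_codons=3):
    """Yield (strand, start, protein) for ATG..stop ORFs in all six frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, start, translate(seq[start:i])
                    start = None


def score_orf(protein, weighted_patterns):
    # Hypothetical scoring rule: add the weight of every pattern that
    # occurs at least once in the translated ORF.
    return sum(weight for pattern, weight in weighted_patterns.items()
               if re.search(pattern, protein))


# Toy patterns and weights, invented for this example only.
TOY_PATTERNS = {"A.G": 2.0, "KG": 1.5, "WW": 4.0}

orfs = list(find_orfs("ATGGCTAAAGGTTAA"))
scores = [score_orf(protein, TOY_PATTERNS) for _, _, protein in orfs]
```

In such a scheme, an ORF would be reported as a gene candidate when its score exceeds a chosen threshold; the threshold, like the weights, would have to be calibrated on real data.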

By: Tetsuo Shibuya, Isidore Rigoutsos

Published in: Nucleic Acids Research, volume 30, pages 2710-25, 2002

Please obtain a copy of this paper from your local library. IBM cannot distribute this paper externally.
