Unsupervised Hierarchical Motif Discovery In the Sequence Space of Natural Proteins

        Using Teiresias, a novel pattern discovery method that identifies all motifs present in any given set of protein sequences without requiring alignment or explicit enumeration of the solution space, we have explored protein sequence databases and discovered the most frequently occurring sequence patterns. This deterministic identification of patterns has provided us with vistas of the so-called "sequence space", a much larger but ill-defined set. The observed patterns, henceforth named seqlets, form a finite number of descriptors for this complex space and can be effectively used to describe almost every naturally occurring protein. Seqlets can be considered as building blocks of protein molecules that are a sufficient but not necessary condition for function or family equivalence memberships; thus, seqlets can either define conserved family signatures or cut across molecular families and reveal undetected sequence signals deriving from functional convergence. Examples of vicinity in the pattern space are shown for well-known and some interesting newly discovered cases. This approach delineates the full extent of sequence space and explores the limits of architectural constraints for functional and designed proteins. Motif discovery results are presented for: (a) a database of six genomes, (b) Swiss-Prot Release 34.0, and (c) NCBI's non-redundant database of proteins. The coverage obtained by the discovered seqlets ranges from 74.0% for the six-genome database to 98.3% for NCBI's non-redundant database. The availability of seqlets that have been derived in such an unsupervised, hierarchical manner provides new opportunities for tackling a variety of problems, including reliable classification, correlation of fragments with functional categories, faster engines for homology searches, and comparative genomic studies, among others.
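
The coverage figures quoted in the abstract can be illustrated with a small sketch. The abstract does not say how coverage is defined, so the code below assumes it means the fraction of sequences that contain at least one seqlet, and it assumes patterns are written with '.' as a single-residue wildcard. The function names `matches` and `coverage` and the toy sequences and patterns are hypothetical; this is not the Teiresias algorithm itself, only a way of scoring patterns that have already been discovered.

```python
import re


def matches(pattern: str, sequence: str) -> bool:
    """Return True if the pattern occurs anywhere in the sequence.

    Assumption: '.' in a pattern stands for any single residue;
    every other character must match literally.
    """
    # Escape literals so that only '.' behaves as a wildcard.
    regex = "".join("." if ch == "." else re.escape(ch) for ch in pattern)
    return re.search(regex, sequence) is not None


def coverage(patterns, sequences) -> float:
    """Fraction of sequences containing at least one pattern.

    Assumption: per-sequence coverage; the report may define it differently.
    """
    if not sequences:
        return 0.0
    covered = sum(1 for s in sequences if any(matches(p, s) for p in patterns))
    return covered / len(sequences)


# Toy data, invented for illustration only.
toy_sequences = ["MKTAYIAKQR", "GSAGKTLLAR", "PLVWQERTYH", "AGKSTAAAGK"]
toy_patterns = ["AGK", "K.A"]

print(f"coverage = {coverage(toy_patterns, toy_sequences):.1%}")
```

A per-residue definition (the fraction of amino acid positions that fall inside some seqlet occurrence) would be an equally plausible reading and would need only a change to `coverage`.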

By: Isidore Rigoutsos, Aris Floratos, Christos Ouzounis, Laxmi Parida, Gustavo Stolovitzky, Yuan Gao

Published in: RC21218 in 1998

This Research Report is not available electronically. Please request a copy from the contact listed below. IBM employees should contact ITIRC for a copy.
