Dictionary-driven Protein Annotation

For many years, computational methods that seek to determine the properties (functional, structural, physicochemical, etc.) of a protein directly from its sequence have been the focus of numerous research groups, including ours. This is generally acknowledged to be a difficult problem, and the methods proposed over the years have typically concentrated on the analysis of individual genes. With the advent of advanced sequencing methods and systems, the number of amino acid sequences and fragments deposited in the public databases has been increasing steadily. This in turn has generated renewed demand for automated approaches that can quickly, exhaustively and objectively annotate individual sequences as well as complete genomes. In this paper, we present one such approach. The approach is centered on the Bio-Dictionary, an exhaustive collection of amino acid patterns (referred to as seqlets) that completely covers the natural sequence space of proteins to the extent that this space is sampled by the currently available public databases. We first advocated this approach several years ago, and in our earlier studies we showed that the seqlets contained in this collection can capture both functional and structural signals that have been reused during evolution, both within and across families of related proteins [58,59,60]; in this capacity, seqlets are ideal elements for use in protein annotation. In this presentation, we describe and discuss a protein annotation method that is based on the Bio-Dictionary and employs a weighted, position-specific scoring scheme.
The method uses the complete collection of seqlets and can determine, in a single pass, the following: all local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of the secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; and the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. The scoring scheme we use is not affected by the overrepresentation of well-conserved proteins and protein fragments in the public databases. Many carefully selected annotation examples are presented and discussed in the experimental results section, and comparisons are given with manually annotated sequences. The method we propose is very fast, objective and exhaustive, and allows one to quickly annotate complete genomes. We have applied our method to generate annotations for 20 complete genomes, including S. solfataricus, P. falciparum, Y. pestis, M. musculus and H. sapiens. All of these annotations can be explored interactively using either a sequence accession number or a regular-expression-based string search. The annotations are accessible through the World Wide Web beginning at http://cbcsrv.watson.ibm.com/Annotations/; a facility for cross-genomic comparisons and searches is also accessible through the same site.
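
To make the idea of seqlet-based, position-specific scoring more concrete, the following is a minimal sketch and not the scoring scheme of the paper, whose details are not given in this abstract. It assumes that each seqlet can be written as a regular-expression-style pattern in which "." matches any residue, and that each seqlet carries one annotation label and one weight. The patterns, labels, weights and the function names annotate and best_labels are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical toy "dictionary": (pattern, label, weight) triples.
# These are NOT entries from the Bio-Dictionary; they only illustrate the idea.
TOY_SEQLETS = [
    ("G....GK[ST]", "binding", 2.0),
    ("L..LL", "helix", 1.0),
]

def annotate(query, seqlets):
    """Accumulate a per-position score for every label.

    Each occurrence of a seqlet in the query adds the seqlet's weight to
    its label at every query position that the occurrence covers.
    """
    scores = [defaultdict(float) for _ in query]
    for pattern, label, weight in seqlets:
        # The lookahead lets overlapping occurrences be found as well.
        for match in re.finditer(f"(?=({pattern}))", query):
            start = match.start(1)
            for offset in range(len(match.group(1))):
                scores[start + offset][label] += weight
    return scores

def best_labels(scores):
    """Return the top-scoring label per position, or None if nothing matched."""
    return [max(s, key=s.get) if s else None for s in scores]

if __name__ == "__main__":
    query = "AGESGAGKSTALAALL"  # made-up sequence
    for position, label in enumerate(best_labels(annotate(query, TOY_SEQLETS))):
        print(position, query[position], label)
```

In this toy run, positions 1 through 8 of the made-up query receive the "binding" label and positions 11 through 15 receive "helix", while the remaining positions stay unlabeled. A realistic scheme would also need weights that discount redundant database entries, which this sketch does not attempt.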

By: Isidore Rigoutsos, Tien Huynh, Laxmi P. Parida, Daniel E. Platt, Aris Floratos

Published in: Nucleic Acids Research, volume 30 (no. 17), pages 3901-3916, 2002
