Accurate Phylogenetic Classification of DNA Fragments Based on Sequence Composition

Metagenome studies have retrieved vast amounts of sequence from a variety of environments, leading to novel discoveries and great insights into the uncultured microbial world. Except for very simple communities, diversity makes sequence assembly and analysis a very challenging problem. To understand the structure and function of microbial communities, a taxonomic characterization of the obtained sequence fragments is highly desirable, yet it is currently limited mostly to those sequences that contain phylogenetic marker genes. We show that for clades at the rank of domain down to genus, sequence composition allows very accurate phylogenetic characterization of genomic sequence. We developed a composition-based classifier, PhyloPythia, for de novo phylogenetic sequence characterization and trained it on a data set of 340 genomes. Through extensive evaluation experiments we show that the method is accurate across all taxonomic ranks considered, even for sequences that originate from novel organisms and are as short as 1 kb. Application to two metagenome datasets obtained from samples of phosphorus-removing sludge showed that the method allows accurate genus-level classification of most sequence fragments from the dominant populations, while at the same time correctly characterizing even parts of the samples at higher taxonomic levels.

By: Alice C. McHardy; Hector Garcia Martin; Aristotelis Tsirigos; Philip Hugenholtz; Isidore Rigoutsos

Published in: RC23930 in 2006

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

