Effect of Sampling on the Extent and Accuracy of the Inferred Genetic History of Recombining Genome

Background: Accessible biotechnology is enabling the cataloging of genetic variants in individuals within populations at unprecedented scales. Using the phylogeny of the individuals within populations allows a model-based approach to studying these variations, which is important for understanding relationships between and across populations. For the somatic genome, however, the phylogeny must take recombinations (and other genetic mixing events) into account, so the resulting topology is more complex than a tree. Unlike in a tree topology, it is not readily apparent which events are visible from the extant samples. An earlier work presented a mathematical model (called the minimal descriptor) for teasing apart the inherent visible information from that which any specific algorithm might see. We use this framework to study the effect of sample size on the overall inferred genetic history. In this paper, we seek to understand the extent, characteristics (in terms of recent versus ancient genetic events), and reliability of what is resolvable within field samples drawn from modern populations.
Results: We observed that most of the visible ancient events are recoverable from relatively small sample sizes. However, without identification of this relatively small minority of ancient genetic events, most of the signal will appear to reflect modern events and admixtures. We also found that the more ancient events are likely to be reproduced with higher fidelity between multiple samplings, and that the identified older events are less likely to yield false-positive discrimination between populations.
Conclusions: We conclude that a recombinant phylogenetic reconstruction is necessary to identify which markers are most likely to discriminate ancient events, and to discriminate between populations with lower risk of false positives. More broadly, this study also provides a general methodology for critically assessing the inferred common genetic history of populations (say, in plant cultivars or animal populations).

By: Daniel E. Platt, Filippo Utro, Laxmi Parida

Published in: RC25407 in 2013


This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

