Asymptotic Distribution for Sequence Alignment Scores

Alignment score distributions are essential for computing p-values of alignments obtained from database searches that use adjustable measures of similarity. Such scores offer flexibility, accommodating different measures of similarity as well as searches for candidate RNA binding sites. However, computing the probability distribution over its full range has proven complex, and much of the interest has focused on how the probabilities scale with alignment sequence length. This paper presents a distribution function, derived from first principles, that describes the sequence-length dependence and extreme-value behavior of an arbitrary matrix of alignment scores given their alignment species frequencies. The derivation is based on the asymptotic Stirling's formula applied in the continuum limit. The results are compared with simulations for alignment sequence lengths ranging from 10 to 10,000, with observed binned frequencies down to the order of 10^-7, and are seen to give good approximations to the probabilities for alignment lengths of practical interest. This distribution function provides a practical and fairly easy-to-compute baseline against which the behavior of real and simulated data may be measured.
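As a rough illustration of why an asymptotic factorial approximation becomes usable at practical lengths, the sketch below compares the exact logarithm of a multinomial coefficient with its Stirling approximation over the same range of lengths. It is not the derivation used in the report: the function names, the four equally frequent species, and the choice of error measure are all assumptions made for this example only.

```python
import math

def log_multinomial_exact(counts):
    """Exact ln( n! / (k_1! ... k_m!) ), computed with the log-gamma function."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)

def log_factorial_stirling(n):
    """Stirling's approximation: ln n! ~ n ln n - n + 0.5 ln(2 pi n)."""
    if n == 0:
        return 0.0  # 0! = 1; the formula itself is undefined at n = 0
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)

def log_multinomial_stirling(counts):
    """The same multinomial coefficient with every factorial replaced by Stirling's form."""
    n = sum(counts)
    return log_factorial_stirling(n) - sum(log_factorial_stirling(k) for k in counts)

def stirling_error(n, freqs):
    """Absolute error in the log multinomial coefficient for length n.

    `freqs` are hypothetical species frequencies (an assumption of this
    sketch); counts are rounded and the last one is adjusted to sum to n.
    """
    counts = [round(n * f) for f in freqs]
    counts[-1] += n - sum(counts)
    return abs(log_multinomial_exact(counts) - log_multinomial_stirling(counts))

if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]  # assumed: four equally frequent species
    for n in (10, 100, 1000, 10000):
        print(n, stirling_error(n, uniform))
```

Under these assumptions the error in the log coefficient shrinks as the length grows, which is consistent with an asymptotic formula giving good approximations at longer lengths. The report's own comparison is against simulated alignment scores, which this sketch does not attempt.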

By: Daniel E. Platt

Published in: RC24969 in 2010

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
