Genomic Regions Tools for High-throughput Analytics in Genomics

Following the dramatic reduction in sequencing costs, research laboratories have been producing huge amounts of data, measuring DNA variations, RNA abundances, protein-DNA interactions, DNA methylation levels, and even chromosomal conformations. Making sense of terabytes of data requires reliable data management, computational resources, and, ultimately, efficient computational methods for preprocessing, quality control, analysis, and meta-analysis. In this work, we present a flexible computational platform for accomplishing such tasks. Genomic data is represented in our platform as genomic regions, i.e. sets of intervals, which effectively cover the most popular types of genome-wide data, such as transcripts/genes, exons/introns, promoter sites, sequences, multiple sequence alignments, transcription factor binding sites, intergenic regions, repeat elements, microarray probes (expression, SNP, CNV, etc.), sequencing data (RNA-seq, ChIP-seq, DNA-seq, etc.), chromosomal conformations (3C-seq, 4C-seq, etc.), or inter-chromosomal associations. The platform implements a variety of elementary and composite mathematical operations between sets of regions, enabling the prototyping of computational pipelines that address a wide spectrum of tasks, from preprocessing and quality control to meta-analyses. More specifically, the user can easily create average read profiles across transcriptional start sites or enhancer sites, quickly prototype customized peak discovery methods for ChIP-seq experiments, perform genome-wide statistical tests such as enrichment analyses, and design controls via user-defined randomization schemes, among other applications.
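
To make the idea of operations between sets of regions concrete, the following is a minimal sketch in Python of two elementary operations, union and intersection. It is not the platform's own interface: the names Region, merge and intersect, the half-open (chromosome, start, end) coordinate convention, and the example coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (chromosome, start, end), half-open [start, end).
Region = Tuple[str, int, int]


def merge(regions: List[Region]) -> List[Region]:
    """Union: collapse overlapping or adjacent regions on each chromosome."""
    merged: List[Region] = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def intersect(a: List[Region], b: List[Region]) -> List[Region]:
    """Intersection: the parts of the genome covered by both region sets."""
    a, b = merge(a), merge(b)
    out: List[Region] = []
    i = j = 0
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a != chrom_b:
            # Advance whichever list is behind in chromosome sort order.
            if chrom_a < chrom_b:
                i += 1
            else:
                j += 1
            continue
        lo, hi = max(start_a, start_b), min(end_a, end_b)
        if lo < hi:
            out.append((chrom_a, lo, hi))
        # Advance the region that ends first; the other may still overlap more.
        if end_a <= end_b:
            i += 1
        else:
            j += 1
    return out


# Illustrative (made-up) coordinates: ChIP-seq peaks and promoter sites.
peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
promoters = [("chr1", 180, 250), ("chr2", 10, 60)]
peaks_in_promoters = intersect(peaks, promoters)
```

Composite operations, such as restricting peaks to promoter sites before an enrichment test, can then be expressed by chaining elementary ones like these.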

By: A. Tsirigos; N. Haiminen; E. Bilal

Published in: RC25125 in 2011

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
