

We introduce a new algorithm based on data-adaptive shrinkage and semi-Nonnegative Matrix Factorization (semi-NMF) for the detection of unknown batch effects. We tested the algorithm on three datasets: 1) Sequencing Quality Control (SEQC), 2) Topotecan RNA-Seq, and 3) single-cell RNA-Seq (scRNA-Seq) on Glioblastoma Multiforme (GBM). In all three datasets, it identified hidden batch effects better than existing batch-detection algorithms. In the Topotecan study, we identified a new batch factor that the original study had missed, an omission that led to under-representation of differentially expressed genes. For scRNA-Seq, we demonstrated the power of our method in detecting subtle batch effects.


We present CRISPRcloud, a user-friendly, cloud-based data analysis pipeline for the deconvolution of pooled screening data. The tool serves a dual purpose: it extracts, clusters, and analyzes raw next-generation sequencing files derived from pooled screening experiments, and it presents the results in a user-friendly way on a secure web-based platform. CRISPRcloud can also be used to reanalyze existing pooled CRISPR screening datasets. Taken together, the framework described in this study is expected to accelerate the development of web-based bioinformatics tools for studies that include next-generation sequencing data.


MARRVEL (Model organism Aggregated Resources for Rare Variant ExpLoration) aims to facilitate the use of public genetic resources to prioritize rare human gene variants for study in model organisms. To automate the search process and gather all the data in a simple display, we extract data from human databases (OMIM, ExAC, Geno2MP, DGV, and DECIPHER) for efficient variant prioritization. The protein sequences for six organisms (S. cerevisiae, C. elegans, D. melanogaster, D. rerio, M. musculus, and H. sapiens) are aligned with highlighted protein domain information via collaboration with DIOPT. The key biological and genetic features are then extracted from existing model organism databases (SGD, PomBase, WormBase, FlyBase, ZFIN, and MGI).


Alternative splicing of RNA is the key mechanism by which a single gene codes for multiple functionally diverse proteins. Recent studies identified a previously unknown class of exons, ‘cryptic’ exons, in RNA transcripts. These cryptic exons are often associated with various human cancers and neurological disorders. Genome-wide detection of cryptic splice sites can facilitate a comprehensive understanding of the underlying disease mechanisms and help develop strategies to resolve cryptic splicing, with the ultimate goal of therapeutic applications. CrypSplice is a novel cryptic splice site detection method. It uses the beta-binomial distribution to model junction count data. Every junction is subjected to a beta-binomial test across conditions and classified to aid molecular inference.
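To illustrate how a beta-binomial model can be used to compare junction counts between two conditions, here is a minimal Python sketch. It is not the tool's implementation: representing each sample as a (junction reads, total reads) pair, fixing the overdispersion `rho`, searching the mean on a grid, and using a likelihood-ratio test with a chi-square approximation are all assumptions made for this example.

```python
import math

def betabinom_logpmf(k, n, alpha, beta):
    """Log-probability of k successes out of n under a beta-binomial(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_beta_num = (math.lgamma(k + alpha) + math.lgamma(n - k + beta)
                    - math.lgamma(n + alpha + beta))
    log_beta_den = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return log_choose + log_beta_num - log_beta_den

def group_loglik(samples, mu, rho):
    """Log-likelihood of (k, n) samples for mean mu and overdispersion rho in (0, 1)."""
    s = (1.0 - rho) / rho  # alpha + beta
    return sum(betabinom_logpmf(k, n, mu * s, (1.0 - mu) * s) for k, n in samples)

def max_loglik(samples, rho, grid=999):
    """Maximize the log-likelihood over the mean using a simple grid search."""
    return max(group_loglik(samples, i / (grid + 1), rho) for i in range(1, grid + 1))

def lrt_pvalue(cond_a, cond_b, rho=0.05):
    """Likelihood-ratio test: shared mean (null) vs. one mean per condition."""
    ll_null = max_loglik(cond_a + cond_b, rho)
    ll_alt = max_loglik(cond_a, rho) + max_loglik(cond_b, rho)
    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical junction data: (reads supporting the junction, total reads at the site).
control = [(2, 50), (3, 60), (1, 40)]
treated = [(30, 50), (35, 60), (25, 40)]
p_value = lrt_pvalue(control, treated)
```

A small `p_value` suggests that junction usage differs between the two conditions. A real analysis would also estimate the overdispersion from the data and correct for multiple testing across junctions.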

TCGA2STAT: A TCGA data widget for statistical analysis in R.

Large amounts of high-throughput data profiled from tumor patients have been made publicly available by national projects. However, obtaining these data and preparing them for analysis is intricate for computational researchers, such as statisticians and mathematicians, which hinders them from fully utilizing these abundant resources. We present an open-source package, TCGA2STAT, to obtain TCGA data and prepare them in a format ready for statistical analysis in the R environment. This package can be seamlessly integrated into computational analyses.

XMRF: An R Package to Fit Markov Networks to High-Throughput Genomics Data

Technological advances in medicine have led to a rapid proliferation of high-throughput “omics” data. Tools to mine these data and discover disrupted disease networks are needed, as they hold the key to understanding complicated interactions between genes, mutations and aberrations, and epigenetic markers.

Therefore, we developed an R software package, XMRF, that can be used to fit Markov Networks to various types of high-throughput genomics data. Encoding the models and estimation techniques of the recently proposed exponential family Markov Random Fields, our software can be used to learn genetic networks from RNA-sequencing data (counts via Poisson graphical models), mutation and copy number variation data (categorical via Ising models), and methylation data (continuous via Gaussian graphical models).

Combinatorial Therapy Discovery using Mixed Integer Linear Programming

Combinatorial therapies play increasingly important roles in combating complex diseases. Because identifying optimal drug combinations experimentally is very costly, computational approaches can guide the search by limiting the search space and reducing cost. However, few computational approaches have been developed for this purpose, and thus there is a great need for new algorithms for drug combination prediction.

Here we propose to formulate the optimal combinatorial therapy problem as two complementary mathematical formulations, Balanced Target Set Cover (BTSC) and Minimum Off-Target Set Cover (MOTSC). Given a disease gene set, BTSC seeks a balanced solution that maximizes the coverage of the disease genes and minimizes the off-target hits at the same time. MOTSC seeks full coverage of the disease gene set while minimizing the off-target set. In simulations, both BTSC and MOTSC ran much faster than exhaustive search with the same accuracy. When applied to real disease gene sets, our algorithms not only identified known drug combinations, but also predicted novel drug combinations that are worth further testing. In addition, we developed a web-based tool that allows users to iteratively search for optimal drug combinations given a user-defined gene set.
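The MOTSC objective can be made concrete on a toy example. The Python sketch below is an assumption-laden illustration, not the method described above: it uses brute-force enumeration (the exhaustive baseline) rather than mixed integer linear programming, and the drug names, gene names, and the tie-breaking rule (prefer fewer drugs) are hypothetical.

```python
from itertools import combinations

def motsc_exhaustive(drug_targets, disease_genes):
    """Find a drug combination that covers every disease gene with the fewest
    off-target genes. Returns (combo, off_targets), or None if no combination
    covers the whole disease gene set."""
    names = sorted(drug_targets)
    best_key, best_combo, best_off = None, None, None
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # All genes hit by at least one drug in this combination.
            hit = set().union(*(drug_targets[d] for d in combo))
            if not disease_genes <= hit:
                continue  # full coverage is a hard constraint in MOTSC
            off = hit - disease_genes
            key = (len(off), size)  # fewest off-targets, then fewest drugs
            if best_key is None or key < best_key:
                best_key, best_combo, best_off = key, combo, off
    if best_combo is None:
        return None
    return best_combo, best_off

# Hypothetical toy data: g* are disease genes, x* are off-target genes.
drug_targets = {
    "drugA": {"g1", "g2", "x1"},
    "drugB": {"g3", "x1", "x4"},
    "drugC": {"g1", "g2", "g3", "x2", "x3"},
    "drugD": {"g3"},
}
disease_genes = {"g1", "g2", "g3"}
combo, off_targets = motsc_exhaustive(drug_targets, disease_genes)
```

Here `drugC` alone covers all three disease genes but hits two off-target genes, whereas `drugA` plus `drugD` covers them with a single off-target gene. Enumeration grows exponentially with the number of drugs, which is why an optimization formulation is needed at realistic scale.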


Selecting genes and pathways indicative of disease is a central problem in computational biology. This problem is especially challenging when parsing multi-dimensional genomic data. A number of tools, such as L1-norm based regularization and its extensions elastic net and fused lasso, have been introduced to deal with this challenge. However, these approaches tend to ignore the vast amount of a priori biological network information curated in the literature.

We propose the use of graph Laplacian regularized logistic regression to integrate biological networks into disease classification and pathway association problems. Simulation studies demonstrate that the performance of the proposed algorithm is superior to elastic net and lasso analyses. The utility of this algorithm is also validated by its ability to reliably differentiate breast cancer subtypes using a large breast cancer dataset recently generated by The Cancer Genome Atlas (TCGA) consortium. Many of the protein-protein interaction modules identified by our approach are further supported by evidence published in the literature. Source code of the proposed algorithm is freely available on GitHub.
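The idea of a graph Laplacian penalty can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the released source code: the objective used here is the mean logistic loss plus `lam * beta^T L beta` (no intercept and no additional sparsity term), it is minimized with plain gradient descent, and the three-gene network and the data are made up.

```python
import math

def graph_laplacian(n_genes, edges):
    """Unnormalized Laplacian L = D - A of an undirected gene network."""
    L = [[0.0] * n_genes for _ in range(n_genes)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L

def fit(X, y, L, lam=1.0, lr=0.1, steps=2000):
    """Gradient descent on mean logistic loss + lam * beta^T L beta."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        preds = [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row))))
                 for row in X]
        grad = [sum((preds[i] - y[i]) * X[i][j] for i in range(n)) / n
                for j in range(p)]
        for j in range(p):
            # Gradient of the penalty is 2 * lam * (L @ beta).
            grad[j] += 2.0 * lam * sum(L[j][k] * beta[k] for k in range(p))
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

# Hypothetical data: gene 0 separates the classes (with two noisy labels),
# genes 1 and 2 carry little signal; genes 0 and 1 are linked in the network.
X = [[1.0, 0.1, -0.2], [1.2, -0.1, 0.3], [0.8, 0.2, 0.1], [0.9, -0.2, -0.1],
     [1.1, 0.0, 0.2], [-1.0, 0.1, 0.2], [-1.2, -0.1, -0.3], [-0.8, 0.2, -0.1],
     [-0.9, -0.2, 0.1], [-1.1, 0.0, -0.2]]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
L = graph_laplacian(3, [(0, 1)])

beta_plain = fit(X, y, L, lam=0.0)   # ordinary logistic regression
beta_graph = fit(X, y, L, lam=1.0)   # network-smoothed coefficients
```

For a single edge between genes 0 and 1, the penalty equals `lam * (beta[0] - beta[1]) ** 2`, so the fitted coefficients of the two connected genes are pulled toward each other. This is how prior network information encourages connected genes to be selected together.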

DSA: Digital Sorting Algorithm for heterogeneous samples

Cellular heterogeneity is present in almost all gene expression profiles. However, transcriptome analysis of tissue specimens often ignores the cellular heterogeneity present in these samples. Standard deconvolution algorithms require prior knowledge of the cell type frequencies within a tissue or their in vitro expression profiles. Furthermore, these algorithms tend to report biased estimates.
Here, we describe a Digital Sorting Algorithm (DSA) for extracting cell-type specific gene expression profiles from mixed tissue samples that is unbiased and does not require prior knowledge of cell type frequencies.
Source code of the proposed algorithm is freely available on GitHub.