Posts by Collection




Statistical and computational methods for the meta-analysis and resemblance analysis of transcriptomic studies


Advances in high-throughput technologies have generated large amounts of “-omics” data, which have become an indispensable component of modern biomedical and public health research. Practical statistical and computational methods are needed to meta-analyze and compare “-omics” data from different studies or experiments. In this talk, I will introduce two problem-driven methods and one software suite for the meta-analysis and resemblance analysis of multiple transcriptomic studies. In the first part, we proposed a Bayesian hierarchical model for RNA-seq meta-analysis that models count data directly, integrates information across genes and across studies, and models differential signals across studies via latent variables. In the second part, motivated by two PNAS papers that reached contradictory conclusions about how well mouse models resemble human studies, we proposed a novel method to quantify resemblance across model organisms on a continuous scale and to characterize the pathways in which they most agree or disagree. I will also briefly introduce “MetaOmics”, a modularized R-Shiny-based software suite for meta-analyzing multiple transcriptomic studies for seven biological purposes.

High-dimensional variable screening: from single study to multiple studies


Advances in technology have generated abundant high-dimensional data from many studies. Owing to their large computational advantage, variable screening methods based on marginal association have become promising alternatives to the popular regularization methods. So far, however, all screening methods have been limited to a single study. We consider a general framework for variable screening with multiple related studies, and further propose a novel two-step screening procedure for high-dimensional regression analysis under this framework. Compared to the one-step procedure, our procedure greatly reduces false negative errors while keeping a low false positive rate. Theoretically, we show that our procedure possesses the sure screening property under weaker assumptions on signal strengths and allows the number of features to grow at an exponential rate of the sample size. After screening, the dimension is greatly reduced, so common regularization methods such as group lasso can be applied to identify the final set of variables. Under the same framework, we also extend the screening procedure to the Cox proportional hazards model to detect survival-associated biomarkers from multiple studies, while allowing censoring proportions and baseline hazard rates to vary across studies. Simulations and an application to cancer transcriptomic data illustrate the advantages of our proposed methods.
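For readers unfamiliar with screening by marginal association, the single-study baseline mentioned above can be sketched as follows. This is a minimal illustration in the spirit of sure independence screening, not the two-step multi-study procedure from the talk. The function name `marginal_screen`, the simulated data, and the cutoff of n / log(n) retained features are all assumptions made for the example.

```python
import numpy as np

def marginal_screen(X, y, keep):
    """Rank features by absolute marginal Pearson correlation with y.

    Returns the indices of the `keep` top-ranked features and the full
    vector of correlations. Hypothetical helper for illustration only.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom = np.where(denom == 0, np.inf, denom)  # constant columns get correlation 0
    corr = Xc.T @ yc / denom
    order = np.argsort(-np.abs(corr))  # largest |correlation| first
    return order[:keep], corr

# Simulated single study: 100 samples, 2000 features, 3 true signals.
rng = np.random.default_rng(0)
n, p = 100, 2000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 3.0
y = X @ beta + rng.standard_normal(n)

# Assumed cutoff: keep about n / log(n) features, a common rule of thumb.
keep = int(n / np.log(n))
selected, corr = marginal_screen(X, y, keep)
```

Each feature is scored independently, so the cost is a single matrix-vector product; this is the computational advantage over fitting a penalized joint model on all features at once.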

Congruence evaluation for model organisms in transcriptomic response


Model organisms are instrumental substitutes for human studies that expedite basic and clinical research. Despite their indispensable role in mechanistic investigation and drug development, the resemblance of animal models to humans has long been questioned and debated. Little effort has been made toward an objective and quantitative congruence evaluation system for model organisms. We hereby propose a framework, Congruence Analysis for Model Organisms (CAMO), for transcriptomic response analysis. CAMO combines threshold-free differential expression analysis, a quantitative resemblance score that controls for data variability, pathway-centric downstream investigation, and knowledge retrieval by text mining. Instead of giving a genome-wide dichotomous answer of “poorly” or “greatly” mimicking, CAMO helps researchers quantify and visually identify the biological functions that are best or least mimicked by model organisms, providing a foundation for hypothesis generation and subsequent translational decisions.

Novel variable screening methods for omics data integration


Sure screening methods are a family of simple and effective dimension reduction methods that reduce noise accumulation for variable selection in high-dimensional regression and classification problems. Since the first method was proposed by Fan and Lv (2008), numerous sure screening methods have been developed for various model settings and have shown their advantages for big data analysis, with desirable scalability and theoretical guarantees. However, none of these methods is directly applicable to dimension reduction and variable selection in omics data integration problems. In this talk, I will introduce two novel variable screening methods recently developed in our group for horizontal and vertical omics data integration. In the first project, we proposed a general framework and a two-step procedure to perform variable screening when combining the same type of omics data from multiple related studies, and showed that including multiple studies provides more evidence for reducing the dimension. In the second project, we developed a fast and robust variable screening method to detect epigenetic regulators of gene expression over the whole genome by combining epigenomic and transcriptomic data, where both the predictor and response spaces are high-dimensional. We used extensive simulations and real data to demonstrate the strengths of our methods compared to existing screening methods.
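To make the idea of horizontal integration concrete, here is a naive sketch of how marginal evidence from several studies might be pooled before screening. It is not the two-step procedure described in the talk. The pooling rule (summing squared Fisher z-statistics across studies), the helper names `fisher_z` and `pooled_screen`, and the simulated data are all assumptions chosen for illustration.

```python
import numpy as np

def fisher_z(X, y):
    """Per-feature Fisher z-statistic of the marginal correlation with y.

    Assumes no column of X is constant. Hypothetical helper.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = Xc.T @ yc / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.clip(r, -0.999999, 0.999999)  # keep arctanh finite
    return np.sqrt(n - 3) * np.arctanh(r)

def pooled_screen(studies, keep):
    """Sum squared z-statistics over studies and keep the top features."""
    stat = sum(fisher_z(X, y) ** 2 for X, y in studies)
    return np.argsort(-stat)[:keep], stat

# Three simulated studies sharing the same 1000 features and 3 true signals.
rng = np.random.default_rng(1)
p, n_per_study = 1000, 80
beta = np.zeros(p)
beta[:3] = 1.5
studies = []
for _ in range(3):
    X = rng.standard_normal((n_per_study, p))
    y = X @ beta + rng.standard_normal(n_per_study)
    studies.append((X, y))

selected, stat = pooled_screen(studies, keep=20)
```

Summing squared statistics ignores the sign of the effect, so a feature can rank highly even if its direction differs between studies; whether that is desirable depends on the application.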