Posts by Collection




Statistical and computational methods for the meta-analysis and resemblance analysis of transcriptomic studies


Advances in high-throughput technologies have generated large amounts of “-omics” data, which have become an integral component of modern biomedical and public health research. Practical statistical and computational methods are needed to meta-analyze and compare “-omics” data from different studies or experiments. In this talk, I will introduce two problem-driven methods and one software suite for the meta-analysis and resemblance analysis of multiple transcriptomic studies. In the first part, we propose a Bayesian hierarchical model for RNA-seq meta-analysis that models count data directly, integrates information across genes and across studies, and captures differential signals across studies via latent variables. In the second part, motivated by two PNAS papers that reached contradictory conclusions about how well mouse models resemble human studies, we propose a novel method that quantifies resemblance across model organisms on a continuous scale and characterizes the pathways in which they most agree or disagree. In addition, I will briefly introduce “MetaOmics”, an R-Shiny-based modular software suite for meta-analyzing multiple transcriptomic studies for seven biological purposes.

High-dimensional variable screening: from single study to multiple studies


Advances in technology have generated abundant high-dimensional data from many studies. Owing to their large computational advantage, variable screening methods based on marginal association have become promising alternatives to the popular regularization methods. However, screening methods have so far been limited to a single study. We consider a general framework for variable screening with multiple related studies, and further propose a novel two-step screening procedure for high-dimensional regression analysis under this framework. Compared to the one-step procedure, our procedure greatly reduces false negative errors while keeping a low false positive rate. Theoretically, we show that our procedure possesses the sure screening property under weaker assumptions on signal strengths and allows the number of features to grow exponentially with the sample size. After screening, the dimension is greatly reduced, so common regularization methods such as group lasso can be applied to identify the final set of variables. Under the same framework, we also extend the screening procedure to the Cox proportional hazards model to detect survival-associated biomarkers from multiple studies, while allowing censoring proportions and baseline hazard rates to vary across studies. Simulations and an application to cancer transcriptomic data illustrate the advantages of our proposed methods.
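
To make the idea of marginal-association screening across several studies concrete, here is a minimal Python sketch. It is not the two-step procedure described in the talk, whose details are not given here. It is a generic one-step screen that, as an assumption made purely for illustration, combines Fisher-transformed marginal correlations across studies into a single score per feature and keeps the top-ranked features. The function names, the combined score, and the simulated data are all hypothetical.

```python
import numpy as np


def marginal_correlations(X, y):
    """Absolute Pearson correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / denom)


def screen_multi_study(studies, keep):
    """Rank features by a combined marginal score and keep the top `keep`.

    Illustrative assumption: the score is the sum over studies of squared
    Fisher z-statistics, z = atanh(r) * sqrt(n - 3). This is a generic
    one-step screen, not the procedure proposed in the talk.
    """
    score = 0.0
    for X, y in studies:
        r = np.clip(marginal_correlations(X, y), 0.0, 0.999999)
        z = np.arctanh(r) * np.sqrt(len(y) - 3)
        score = score + z ** 2
    # Indices of features sorted by decreasing combined score.
    return np.argsort(score)[::-1][:keep]


# Simulated example (hypothetical data): 3 studies, 100 samples each,
# 2000 features, of which the first 5 are truly associated with y.
rng = np.random.default_rng(0)
n_studies, n, p, n_true = 3, 100, 2000, 5
beta = np.zeros(p)
beta[:n_true] = 1.0

studies = []
for _ in range(n_studies):
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    studies.append((X, y))

selected = screen_multi_study(studies, keep=50)
print("true features retained:", sorted(set(range(n_true)) & set(selected.tolist())))
```

Because each feature is scored independently, the cost grows linearly in the number of features, which is the computational advantage over fitting a joint regularized model. A regularized method such as group lasso could then be fitted on the retained columns only.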