High-dimensional variable screening: from single study to multiple studies


Advances in technology have generated abundant high-dimensional data from many studies. Owing to their large computational advantage, variable screening methods based on marginal association have become promising alternatives to the popular regularization methods. However, existing screening methods have so far been limited to a single study. We consider a general framework for variable screening with multiple related studies, and further propose a novel two-step screening procedure for high-dimensional regression analysis under this framework. Compared to the one-step procedure, our procedure greatly reduces false negative errors while keeping a low false positive rate. Theoretically, we show that our procedure possesses the sure screening property under weaker assumptions on signal strengths and allows the number of features to grow at an exponential rate of the sample size. After screening, the dimension is greatly reduced, so that common regularization methods such as group lasso can be applied to identify the final set of variables. Under the same framework, we also extend the screening procedure to the Cox proportional hazards model to detect survival-associated biomarkers from multiple studies, while allowing censoring proportions and baseline hazard rates to vary across studies. Simulations and an application to cancer transcriptomic data illustrate the advantage of our proposed methods.
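To make the idea of screening by marginal association across several studies concrete, the following is a minimal sketch in Python. It is not the two-step procedure proposed here, whose details are not given in this abstract. It is only an illustrative one-step rule, assumed for this example, that ranks each feature by its smallest absolute marginal correlation across studies and keeps the top d. The function names, the simulated data, and the choice d = 20 are all hypothetical.

```python
import numpy as np


def marginal_correlations(X, y):
    """Absolute Pearson correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc) / denom


def screen_across_studies(studies, d):
    """Illustrative one-step rule (an assumption, not the paper's method):
    keep the d features whose smallest |correlation| across studies is largest."""
    stats = np.vstack([marginal_correlations(X, y) for X, y in studies])
    min_stat = stats.min(axis=0)          # weakest evidence across studies
    top = np.argsort(min_stat)[::-1][:d]  # indices of the d largest minima
    return np.sort(top)


def simulate_studies(n_studies=3, n=100, p=500, signal=(0, 1, 2), seed=0):
    """Hypothetical toy data: the same few features carry signal in every study."""
    rng = np.random.default_rng(seed)
    studies = []
    for _ in range(n_studies):
        X = rng.standard_normal((n, p))
        beta = np.zeros(p)
        beta[list(signal)] = 2.0
        y = X @ beta + rng.standard_normal(n)
        studies.append((X, y))
    return studies


studies = simulate_studies()
selected = screen_across_studies(studies, d=20)
print(selected)
```

Each study contributes only p marginal statistics, so the cost grows linearly in the number of features, which is the computational advantage of screening over fitting a regularized model on all p features. The reduced set in `selected` could then be passed to a regularization method such as group lasso.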