47èmes Journées de Statistique de la SFdS (JdS), Lille University, Lille (France)
In recent years, data analysis has struggled with statistical issues related to the “curse of high dimensionality”. For instance, in genomics, next-generation sequencing technologies produce ever larger data sets, in which the number of genomic units (e.g. genes) is huge compared to the sample size. In this setting, where the number of variables far exceeds the number of observations, standard methods, especially for classification, are inappropriate. Indeed, high dimensionality is often accompanied by dependencies between variables, which lead to singularities in the optimization procedures used for estimation, so that the solution is neither unique nor stable.
This challenge calls for dedicated statistical tools based on dimension reduction, such as compression and variable selection. Sparse Partial Least Squares (SPLS) combines both: it introduces a Lasso-based selection step into the Partial Least Squares (PLS) framework, so that the new components are sparse linear combinations of the predictors. We develop an adaptive version of Sparse PLS, based on component-wise penalization (following the adaptive Lasso principle), in order to improve both the compression step and the accuracy of variable selection.
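As an illustration of the selection step, the following minimal sketch computes a first sparse PLS weight vector for a univariate response by soft-thresholding the empirical covariances between predictors and response. It is only one possible reading of the principle described above: the function names, the relative sparsity level `lam`, and the particular choice of adaptive weights (inversely proportional to the absolute covariances) are assumptions made for the example, not the exact formulation used in our method.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Soft-thresholding operator, the closed-form solution of a Lasso step."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def sparse_pls_weights(X, y, lam, adaptive=False):
    """First sparse PLS weight vector for a univariate response (illustrative).

    lam in [0, 1) is a sparsity level relative to the largest |covariance|.
    """
    Xc = X - X.mean(axis=0)            # center predictors
    yc = y - y.mean()                  # center response
    c = Xc.T @ yc                      # covariances with y, up to a constant factor
    c_max = np.abs(c).max()
    if c_max == 0.0:                   # no signal at all: return the zero vector
        return np.zeros_like(c)
    if adaptive:
        # Assumed adaptive weights: weaker covariances receive a larger penalty.
        a = c_max / np.maximum(np.abs(c), 1e-12 * c_max)
    else:
        a = np.ones_like(c)            # plain Lasso: same penalty for every predictor
    w = soft_threshold(c, lam * c_max * a)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

# Hypothetical toy data with far more predictors than observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))         # n = 20 observations, p = 200 predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.normal(size=20)
w_lasso = sparse_pls_weights(X, y, lam=0.5)
w_adapt = sparse_pls_weights(X, y, lam=0.5, adaptive=True)
print(np.count_nonzero(w_lasso), np.count_nonzero(w_adapt))
```

With these weights, every component-wise threshold in the adaptive variant is at least as large as the uniform one, so the adaptive variant never selects more predictors than the plain Lasso variant at the same `lam`.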
We propose a classification method that performs compression and variable selection with adaptive Sparse PLS in a logistic regression framework.
Because Sparse PLS estimates regression coefficients efficiently in the high-dimensional case, we aim to adapt it to classification, where the response is categorical rather than continuous. We focus on logistic regression, a classification method derived from generalized linear models (GLMs) that handles a binary response through maximum likelihood estimation. The likelihood is maximized by the Iteratively Reweighted Least Squares (IRLS) algorithm. To ensure that IRLS converges, we use a Ridge-penalized version, called Ridge IRLS (RIRLS). We then regress the pseudo-response produced by RIRLS (treated as continuous) on the predictors with adaptive Sparse PLS, which yields the predictor coefficients of the logistic regression model. Our method therefore consists of RIRLS followed by adaptive SPLS.
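The RIRLS step can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: the function name `ridge_irls`, the unpenalized intercept, the default ridge parameter, and the stopping rule are all choices made for the example. The function returns the coefficients together with the pseudo-response and the weights, which a subsequent (weighted) adaptive Sparse PLS regression would take as input.

```python
import numpy as np

def ridge_irls(X, y, ridge=1.0, n_iter=50, tol=1e-8):
    """Ridge-penalized IRLS for logistic regression (illustrative sketch).

    X: (n, p) predictors, y: (n,) binary response coded 0/1.
    Returns (beta, pseudo, w): coefficients with the intercept first, and the
    pseudo-response and weights computed in the last iteration performed.
    """
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])       # design matrix with intercept
    R = ridge * np.eye(p + 1)
    R[0, 0] = 0.0                              # assumption: intercept is not penalized
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        eta = Z @ beta                         # linear predictor
        prob = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
        w = np.clip(prob * (1.0 - prob), 1e-10, None)   # IRLS weights
        pseudo = eta + (y - prob) / w          # pseudo-response (working response)
        # Ridge-penalized weighted least squares update.
        beta_new = np.linalg.solve(Z.T @ (w[:, None] * Z) + R, Z.T @ (w * pseudo))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, pseudo, w

# Hypothetical toy data with p > n, where unpenalized IRLS would be ill-posed.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
y = (X[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(float)
beta, pseudo, w = ridge_irls(X, y, ridge=1.0)
print(beta.shape, pseudo.shape)
```

The Ridge term makes the linear system solved at each iteration positive definite even when the number of predictors exceeds the number of observations, which is what keeps the update well defined in the high-dimensional case.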
We evaluate our approach on simulated and real data (breast cancer relapse linked to gene expression levels) and compare it to state-of-the-art procedures, in order to highlight the benefit of pairing compression and variable selection for classification in the context of GLMs. In terms of prediction performance, selection accuracy, convergence and cross-validation stability, our method turns out to be appropriate for high-dimensional data with a binary response.