Adaptive Sparse PLS for Logistic Regression
Keywords: Statistics, Dimension reduction, Sparse PLS, Logistic regression, High-dimensional data, Classification, Gene expression, RNA-seq
Summary
For several years, data analysis has struggled with statistical issues related to the “curse of dimensionality”. In this context, i.e. when the number of variables is far larger than the number of observations in the sample, standard classification methods are inappropriate, which calls for the development of new methodologies. I will present a new method suitable for classification in the high-dimensional case. It combines Sparse Partial Least Squares (Sparse PLS), which performs compression and variable selection, with Ridge-penalized logistic regression. In particular, we have developed an adaptive version of Sparse PLS to improve the dimension reduction process. I will illustrate the benefits of our method with classification results on simulated and real data sets, compared with state-of-the-art approaches. The application focuses on genomics, where dimensions are huge, and especially on the prediction of breast cancer relapse (a binary outcome) from gene expression levels (quantitative variables). Finally, our approach is implemented in the plsgenomics R package.
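To make the two ingredients concrete, here is a minimal Python sketch of the general idea: one sparse PLS direction obtained by soft-thresholding the covariance between predictors and response, followed by a Ridge-penalized logistic fit on the resulting component. This is a simplified illustration, not the adaptive algorithm of the talk and not the plsgenomics code. The function names, the sparsity parameter eta, the penalty lam and the toy data are all assumptions introduced for this example.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink each entry of v toward zero by lam; entries below lam become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pls_direction(X, y, eta=0.5):
    """First sparse PLS weight vector (illustrative): the covariance X'y,
    soft-thresholded at a fraction eta (0 <= eta < 1) of its largest entry."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w = soft_threshold(w, eta * np.max(np.abs(w)))
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

def ridge_logistic(T, y, lam=1.0, n_iter=50):
    """Ridge-penalized logistic regression fitted by Newton-Raphson.
    The intercept (first coefficient) is left unpenalized."""
    n = T.shape[0]
    Z = np.column_stack([np.ones(n), T])
    P = lam * np.eye(Z.shape[1])
    P[0, 0] = 0.0
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        W = prob * (1.0 - prob)
        grad = Z.T @ (y - prob) - P @ beta
        hess = Z.T @ (Z * W[:, None]) + P
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Toy high-dimensional data (assumed): 40 samples, 200 variables,
# only the first 5 variables carry signal about the binary response.
rng = np.random.default_rng(0)
n, p = 40, 200
X = rng.normal(size=(n, p))
y = (X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=n) > 0).astype(float)

w = sparse_pls_direction(X, y, eta=0.5)      # variable selection
t = (X - X.mean(axis=0)) @ w                 # compression to one component
beta = ridge_logistic(t[:, None], y, lam=1.0)
selected = np.flatnonzero(w)                 # indices of retained variables
```

The soft-thresholding step is what makes the direction sparse: variables whose covariance with the response falls below the threshold receive a zero weight and drop out of the component. The method presented in the talk goes further, notably through the adaptive version of the selection step, which this sketch does not attempt to reproduce.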