In the era of large-scale (huge sample size) and/or high-dimensional (numerous variables or features) data, data exploration and representation are central questions. A wide range of frameworks in statistics and machine learning is now available to solve supervised and unsupervised problems despite the dimension and complexity of the data. We will discuss sparsity in the context of dimension reduction, focusing on variable (or feature) selection and latent space projection. The presentation will be illustrated by various sparse methods designed for data visualization, regression or classification of high-dimensional data, and by different examples of genomic data analysis.

The high dimensionality of genomic data calls for the development of specific classification methodologies, especially to prevent over-optimistic predictions. This challenge can be tackled by compression and variable selection, which can be combined into a powerful framework for classification as well as for data visualization and interpretation. However, currently proposed combinations lead to unstable and non-convergent methods due to inappropriate computational frameworks. We propose a computationally stable and convergent approach for classification in high dimension based on sparse Partial Least Squares (sparse PLS).

Motivation: The high dimensionality of genomic data calls for the development of specific classification methodologies, especially to prevent over-optimistic predictions. This challenge can be tackled by compression and variable selection, which …

The statistical analysis of Next-Generation Sequencing (NGS) data has raised many computational challenges regarding modeling and inference. High-throughput technologies now make it possible to monitor the expression of thousands of genes at the single-cell level. Despite the increasing number of observations, genomic data remain characterized by their high dimensionality. Analyzing such data requires dimension reduction approaches. We will introduce hybrid dimension reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection, especially (i) sparse Partial Least Squares (PLS) regression for supervised classification, and (ii) sparse matrix factorization of zero-inflated count data for unsupervised exploration.

The statistical analysis of Next-Generation Sequencing data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid …

For several years, data analysis has struggled with statistical issues related to the “curse of high dimensionality”. In this context, i.e. when the number of variables considered is far larger than the number of observations in the sample, standard classification methods are inappropriate, which calls for the development of new methodologies. I will present a new method suitable for classification in the high-dimensional case. It uses sparse Partial Least Squares (sparse PLS), which performs compression and variable selection, combined with Ridge-penalized logistic regression.

For several years, data analysis has struggled with statistical issues related to the “curse of high dimensionality”. In this context, i.e. when the number of variables considered is far larger than the number of observations in the sample, standard classification methods are inappropriate, which calls for the development of specific methodologies. I will present a new approach suitable for classification in high-dimensional settings. It uses sparse Partial Least Squares (sparse PLS), which performs compression and variable selection, combined with Ridge-penalized logistic regression.
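To make the idea of combining compression, variable selection and a Ridge-penalized logistic model more concrete, here is a minimal toy sketch. It is not the method presented in the talk: it builds a single sparse direction by soft-thresholding the covariances between each variable and the class label (one common way to obtain sparsity in PLS-type methods), projects the data on that direction, and fits a Ridge-penalized logistic regression on the resulting score by plain gradient descent. All function names, the relative threshold, the penalty value and the simulated data are assumptions chosen for illustration.

```python
import math
import random


def soft_threshold(values, threshold):
    """Shrink each entry toward zero; entries with |v| <= threshold become 0."""
    return [math.copysign(max(abs(v) - threshold, 0.0), v) for v in values]


def sparse_direction(X, y, rel_threshold=0.5):
    """One sparse weight vector: soft-thresholded covariances between X columns and y.

    The threshold is a fraction of the largest absolute covariance (an assumption
    made for this sketch, not a tuned or recommended choice).
    """
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    cov = [
        sum((X[i][j] - col_means[j]) * (y[i] - y_mean) for i in range(n)) / n
        for j in range(p)
    ]
    w = soft_threshold(cov, rel_threshold * max(abs(c) for c in cov))
    norm = math.sqrt(sum(v * v for v in w))
    if norm > 0:
        w = [v / norm for v in w]  # unit-norm direction
    return w, col_means


def scores(X, w, col_means):
    """Compression step: project the centered observations on the direction w."""
    return [sum((row[j] - col_means[j]) * w[j] for j in range(len(w))) for row in X]


def ridge_logistic(t, y, ridge=0.1, lr=0.1, steps=500):
    """Gradient descent on the mean logistic loss plus (ridge / 2) * b**2.

    Only the slope b is penalized; the intercept a is left unpenalized.
    """
    a = b = 0.0
    n = len(t)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for ti, yi in zip(t, y):
            z = max(-30.0, min(30.0, a + b * ti))  # clip to avoid overflow in exp
            prob = 1.0 / (1.0 + math.exp(-z))
            grad_a += prob - yi
            grad_b += (prob - yi) * ti
        a -= lr * grad_a / n
        b -= lr * (grad_b / n + ridge * b)
    return a, b


def make_toy_data(n=40, p=50, n_informative=3, shift=2.0, seed=0):
    """Simulated two-class data with more variables than observations (p > n).

    Only the first n_informative variables carry a class signal.
    """
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [
        [rng.gauss(0.0, 1.0) + (shift * (2 * y[i] - 1) if j < n_informative else 0.0)
         for j in range(p)]
        for i in range(n)
    ]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    w, col_means = sparse_direction(X, y)
    t = scores(X, w, col_means)
    a, b = ridge_logistic(t, y)
    selected = [j for j, v in enumerate(w) if v != 0]
    predictions = [1 if a + b * ti > 0 else 0 for ti in t]
    accuracy = sum(int(p == yi) for p, yi in zip(predictions, y)) / len(y)
    print("selected variables:", selected)
    print("training accuracy:", accuracy)
```

The accuracy printed here is computed on the training sample, so it is exactly the kind of over-optimistic estimate that a real analysis would replace with cross-validation or a held-out test set.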

For several years, data analysis has struggled with statistical issues related to the “curse of high dimensionality”. For instance, in genomics, next-generation sequencing technologies provide ever larger datasets, in which the number of genomic units (e.g. genes) is huge compared to the sample size. In this context, i.e. when the number of variables considered is far larger than the number of observations in the sample, standard methods, especially for classification, are inappropriate.

Supervised methods for dimension reduction in classification and regression frameworks (in particular PLS-based routines for genomic data analysis).

© 2019 Ghislain DURIF