Motivation: The high dimensionality of genomic data calls for the development of specific classification methodologies, especially to prevent over-optimistic predictions. This challenge can be tackled by compression and variable selection, which …

The development of high-throughput single-cell technologies now allows the investigation of the genome-wide diversity of transcription. This diversity has two faces. First, the expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. Second, the cell-to-cell variability is high, with a low proportion of cells expressing the same gene at the same time or level. These emerging patterns are very challenging from a statistical point of view, especially when it comes to representing single-cell expression data and providing a summarized view of them.

The statistical analysis of Next-Generation Sequencing (NGS) data has raised many computational challenges regarding modeling and inference. High-throughput technologies now make it possible to monitor the expression of thousands of genes at the single-cell level. Despite the increasing number of observations, genomic data remain characterized by their high dimensionality. Analyzing such data requires dimension reduction approaches. We will introduce hybrid dimension reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection, in particular (i) sparse Partial Least Squares (PLS) regression for supervised classification, and (ii) sparse matrix factorization of zero-inflated count data for unsupervised exploration.

For nearly 20 years, sequencing technologies have been on the rise, producing more and more data that are often characterized by their high dimensionality, meaning that the number $p$ of covariates, such as genes, is far larger than the number $n$ of observations. Analysing such data is a statistical challenge and requires dimension reduction approaches. Compression methods are particularly useful for interpreting data through visualisation or clustering. In particular, projection-based methods such as principal component analysis (PCA) generally solve a matrix factorization problem; for instance, PCA corresponds to a singular value decomposition (SVD) of the centered data matrix.
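The correspondence between PCA and the SVD can be illustrated with a minimal NumPy sketch. The function name `pca_via_svd`, the simulated Gaussian data, and the dimensions are assumptions chosen for illustration only; this is not the sparse or count-based factorization discussed in the talks.

```python
import numpy as np

def pca_via_svd(X, n_components):
    """Project the rows of X onto its top principal components using an SVD."""
    # Center each column (variable) so that the SVD captures variance, not the mean.
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(s) @ Vt, with singular values in decreasing order.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Principal axes (loadings), shape (n_components, p).
    components = Vt[:n_components]
    # Coordinates of the observations in the reduced space, shape (n, n_components).
    scores = U[:, :n_components] * s[:n_components]
    # Variance carried by each retained component.
    explained_variance = s[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance

# Hypothetical high-dimensional setting: n = 20 observations, p = 500 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
scores, components, explained_variance = pca_via_svd(X, n_components=2)
```

Here `scores` gives a two-dimensional representation of the 20 observations that could be used for visualisation or clustering, even though $p$ is far larger than $n$.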

For several years, data analysis has been struggling with statistical issues related to the “curse of high dimensionality”. In this context, i.e. when the number of variables considered is far larger than the number of observations in the sample, standard classification methods are inappropriate, which calls for the development of new methodologies. I will present a new method suitable for classification in the high-dimensional case. It uses Sparse Partial Least Squares (Sparse PLS), which performs compression and variable selection, combined with Ridge-penalized logistic regression.

For several years, data analysis has been struggling with statistical issues related to the “curse of high dimensionality”. For instance, in genomics, next-generation sequencing technologies provide ever larger data sets, where the number of genomic units (e.g. genes) is huge compared to the sample size. In this context, i.e. when the number of variables considered is far larger than the number of observations in the sample, standard methods, especially for classification, are inappropriate.

© 2019 Ghislain DURIF