Matrix Factorization

Probabilistic Count Matrix Factorization for Single Cell Expression Data Analysis

The development of high-throughput single-cell technologies now allows the investigation of the genome-wide diversity of transcription at different scales. First, the gene-to-gene variability (expression dynamics) can be quantified more accurately, thanks to the measurement of lowly expressed genes. Second, the cell-to-cell variability is high, with a low proportion of cells expressing the same gene at the same time or level. These emerging patterns are statistically challenging, especially when the goal is to represent and summarize single-cell expression data such as single-cell RNA-seq data.

Sparsity and dimension reduction

In the era of large-scale (huge sample size) and/or high-dimensional (numerous variables/features) data, the question of data exploration and representation is central. A wide range of frameworks in statistics and machine learning are now available to solve supervised and unsupervised problems despite the data dimension and complexity. In particular, we will discuss sparsity in the context of dimension reduction, focusing on variable or feature selection and latent space projection. The presentation will be illustrated by various sparse methods designed for data visualization, regression or classification of high-dimensional data, and different examples of genomic data analysis.

Probabilistic Count Matrix Factorization for Single Cell Expression Data Analysis

Motivation: The development of high throughput single-cell sequencing technologies now allows the investigation of the population diversity of cellular transcriptomes. The expression dynamics (gene-to-gene variability) can be quantified more …

Probabilistic Count Matrix Factorization for Single Cell Expression Data Analysis

The development of high-throughput single-cell sequencing technologies now allows the investigation of the population-level diversity of cellular transcriptomes. This diversity has shown two faces. First, the expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. Second, the cell-to-cell variability is high, with a low proportion of cells expressing the same gene at the same time or level. These emerging patterns are statistically challenging, especially when the goal is to represent and summarize single-cell expression data.

Probabilistic Count Matrix Factorization for Single Cell Expression Data Analysis

The development of high throughput single-cell sequencing technologies now allows the investigation of the population level diversity of cellular transcriptomes. This diversity has shown two faces. First, the expression dynamics (gene to gene …

Count-based Probabilistic PCA for single-cell data analysis

The development of high-throughput single-cell technologies now allows the investigation of the genome-wide diversity of transcription. This diversity has shown two faces. First, the expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. Second, the cell-to-cell variability is high, with a low proportion of cells expressing the same gene at the same time or level. These emerging patterns are statistically challenging, especially when the goal is to represent and summarize single-cell expression data.

Multivariate statistics for single-cell data analysis: zero-inflated count matrix factorization for data exploration and sparse PLS-based logistic regression for classification

The statistical analysis of Next-Generation Sequencing (NGS) data has raised many computational challenges regarding modeling and inference. High-throughput technologies now make it possible to monitor the expression of thousands of genes at the single-cell level. Despite the increasing number of observations, genomic data remain characterized by their high dimensionality. Analyzing such data requires dimension reduction approaches. We will introduce hybrid dimension reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection, especially (i) sparse Partial Least Squares (PLS) regression for supervised classification, and (ii) sparse matrix factorization of zero-inflated count data for unsupervised exploration.

Count Matrix Factorization and Single Cell Data Analysis

For nearly 20 years, sequencing technologies have been on the rise, producing ever more data, often characterized by high dimensionality, meaning that the number $p$ of covariates (such as genes) is far larger than the number $n$ of observations. Analysing such data is a statistical challenge and requires dimension reduction approaches. Compression methods are particularly useful for interpreting data through visualisation or clustering. In particular, projection-based methods such as principal component analysis (PCA) generally solve a matrix factorization problem; for instance, PCA corresponds to a singular value decomposition (SVD) of the centered data matrix.
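
The link between PCA and the SVD can be made concrete with a short sketch. The code below is only an illustration under stated assumptions: the function name `pca_via_svd`, the simulated Gaussian matrix and its dimensions are hypothetical and do not come from the talk. It centers the columns of an $n \times p$ matrix, takes a thin SVD, and reads the principal axes, the scores and the explained variances off the factors.

```python
import numpy as np


def pca_via_svd(X, n_components):
    """PCA of X (n observations x p variables) computed from a thin SVD."""
    # Center each variable (column); PCA is the SVD of the centered matrix.
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(s) @ Vt, singular values sorted in decreasing order.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rows of Vt are the principal axes (loadings) in variable space.
    axes = Vt[:n_components]
    # Coordinates of the observations in the reduced space.
    scores = U[:, :n_components] * s[:n_components]
    # Variance carried by each retained component.
    explained_variance = s[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, axes, explained_variance


# Hypothetical high-dimensional setting: n = 20 observations, p = 100 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
scores, axes, explained_variance = pca_via_svd(X, n_components=2)
print(scores.shape, axes.shape)
```

Working from the SVD of the centered matrix avoids forming the $p \times p$ covariance matrix, which is convenient when $p$ is far larger than $n$.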

Count Matrix Factorization for Dimension Reduction and Data Visualization

For nearly 20 years, sequencing technologies have been on the rise, producing ever more data, often characterized by high dimensionality, meaning that the number $p$ of covariates (such as genes) is far larger than the number $n$ of observations. Analysing such data is a statistical challenge and requires dimension reduction approaches. Compression methods are particularly useful for interpreting data through visualisation or clustering. In particular, projection-based methods such as principal component analysis (PCA) generally solve a matrix factorization problem; for instance, PCA corresponds to a singular value decomposition (SVD) of the centered data matrix.
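
For count data, the same factorization idea can be written with a Poisson-type loss instead of the squared error that underlies the SVD. The sketch below is not the probabilistic method presented in these talks. It swaps in the classical multiplicative updates for non-negative matrix factorization under the generalized Kullback-Leibler divergence, which is the Poisson deviance up to a constant factor. The names `kl_divergence` and `poisson_nmf`, the simulated counts and all dimensions are assumptions made for illustration.

```python
import numpy as np


def kl_divergence(X, fitted, eps=1e-10):
    """Generalized KL divergence between counts X and fitted non-negative rates."""
    return float(np.sum(X * np.log((X + eps) / (fitted + eps)) - X + fitted))


def poisson_nmf(X, k, n_iter=200, seed=0, eps=1e-10):
    """Factorize X (n x p counts) as U @ V.T with non-negative U (n x k), V (p x k)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    U = rng.uniform(0.1, 1.0, size=(n, k))
    V = rng.uniform(0.1, 1.0, size=(p, k))
    losses = []
    for _ in range(n_iter):
        # Multiplicative update of U; it keeps every entry non-negative.
        ratio = X / (U @ V.T + eps)
        U *= (ratio @ V) / (V.sum(axis=0) + eps)
        # Multiplicative update of V, using the refreshed U.
        ratio = X / (U @ V.T + eps)
        V *= (ratio.T @ U) / (U.sum(axis=0) + eps)
        losses.append(kl_divergence(X, U @ V.T))
    return U, V, losses


# Hypothetical toy data: 30 cells, 50 genes, 3 latent factors, Poisson counts.
rng = np.random.default_rng(1)
U_true = rng.gamma(shape=2.0, scale=1.0, size=(30, 3))
V_true = rng.gamma(shape=2.0, scale=1.0, size=(50, 3))
X = rng.poisson(U_true @ V_true.T).astype(float)

U, V, losses = poisson_nmf(X, k=3)
print(U.shape, V.shape, losses[0] > losses[-1])
```

The rows of `U` can then be used as low-dimensional coordinates of the observations, for example for visualisation. A model tailored to single-cell data would additionally need to handle zero inflation and sparsity, which this sketch does not attempt.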