Multivariate statistics for single-cell data analysis: zero-inflated count matrix factorization for data exploration and sparse PLS-based logistic regression for classification
Keywords: “Statistics”, “Dimension reduction”, “Matrix Factorization”, “Data visualization”, “Single-cell”, “Gene expression”, “RNA-seq”, “Sparse PLS”, “Logistic regression”, “High-dimensional data”, “Classification”
Summary
The statistical analysis of Next-Generation Sequencing (NGS) data has raised many computational challenges regarding modeling and inference. High-throughput technologies now make it possible to monitor the expression of thousands of genes at the single-cell level. Despite the increasing number of observations, genomic data remain characterized by their high dimensionality. Analyzing such data requires the use of dimension reduction approaches. We will introduce hybrid dimension reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection, specifically (i) sparse Partial Least Squares (PLS) regression for supervised classification, and (ii) sparse matrix factorization of zero-inflated count data for unsupervised exploration. In both situations, we will focus on the reconstruction and visualization of the complex organization of the data.
In the first part, we will present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression, i.e. for predicting the label of a discrete outcome. For instance, such a method will be used to predict the specific type of unidentified single cells based on gene expression profiles. The main issue in such a framework is to account for the response when discarding irrelevant variables. We will highlight the direct link between the derivation of the algorithms and the reliability of the results.
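To make the idea of response-aware variable selection concrete, the following is a minimal sketch of how a single sparse PLS direction can be obtained by soft-thresholding the covariance between predictors and response. It is a generic illustration under stated assumptions, not the adaptive penalty or the logistic-regression algorithm developed in this work; the function names and the `lam_frac` parameter are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each entry of z toward zero by lam; entries below lam become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_pls_direction(X, y, lam_frac):
    """First sparse PLS weight vector (illustrative, hypothetical helper).

    lam_frac in [0, 1) sets the threshold as a fraction of the largest
    absolute covariance, so at least one variable is always kept.
    """
    Xc = X - X.mean(axis=0)          # center predictors
    yc = y - y.mean()                # center response
    c = Xc.T @ yc                    # covariance (up to scaling) with the response
    w = soft_threshold(c, lam_frac * np.max(np.abs(c)))
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

# Toy data (assumption): only the first two of 50 variables drive the label.
rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n) > 0).astype(float)

w = sparse_pls_direction(X, y, lam_frac=0.5)
selected = np.flatnonzero(w)         # indices of variables with non-zero weight
print("selected variables:", selected)
```

Because the threshold is applied to the covariance with the response, variables unrelated to the label are the ones set to zero, which is the sense in which selection "accounts for the response".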
In the second part, motivated by questions regarding the exploration of single-cell data, we will focus on the framework of matrix factorization for count data. We propose a flexible model-based approach that accounts for both over-dispersion and zero-inflation, two characteristics of single-cell data. Our matrix factorization method relies on a Gamma-Poisson hierarchical model for which we derive an estimation procedure based on variational inference. In this scheme, we consider variable selection based on a spike-and-slab model suitable for count data. The value of our procedure for data reconstruction, visualization and clustering is illustrated through simulation experiments and an analysis of single-cell transcriptomic data.
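As an illustration of the kind of data such a model describes, the sketch below simulates counts from a zero-inflated Gamma-Poisson factor model: Gamma-distributed factors, Poisson counts around their product, and a Bernoulli mask producing extra zeros. It covers the generative side only, not the variational inference or spike-and-slab selection; the function name, hyperparameter values and the `pi_keep` parameter are assumptions made for this example.

```python
import numpy as np

def simulate_zi_gamma_poisson(n, p, k, pi_keep, rng, shape=2.0, rate=2.0):
    """Simulate an n x p zero-inflated count matrix (hypothetical helper).

    U (n x k) and V (p x k) have Gamma(shape, rate) entries; counts are
    Poisson with mean U V^T; each entry is kept with probability pi_keep
    and replaced by zero otherwise (zero-inflation).
    """
    U = rng.gamma(shape, 1.0 / rate, size=(n, k))   # cell-level factors
    V = rng.gamma(shape, 1.0 / rate, size=(p, k))   # gene-level loadings
    counts = rng.poisson(U @ V.T)                   # Gamma-Poisson counts
    keep = rng.random((n, p)) < pi_keep             # Bernoulli dropout mask
    return counts * keep, U, V

rng = np.random.default_rng(1)
X, U, V = simulate_zi_gamma_poisson(n=200, p=30, k=3, pi_keep=0.7, rng=rng)

zero_fraction = np.mean(X == 0)
print("fraction of zeros:", zero_fraction)
print("mean:", X.mean(), "variance:", X.var())
```

Mixing Poisson counts over Gamma-distributed rates is what produces over-dispersion (variance larger than the mean), while the mask adds zeros beyond those a Poisson model alone would generate.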