Count Matrix Factorization and Single Cell Data Analysis
Keywords: Statistics, Dimension reduction, Matrix Factorization, Data visualization, Single-cell, Gene expression, RNA-seq
Summary
For nearly 20 years, sequencing technologies have been on the rise, producing ever more data that are often high-dimensional, meaning that the number \(p\) of covariates (such as genes) is far larger than the number \(n\) of observations. Analysing such data is a statistical challenge and requires dimension reduction approaches. Compression methods are particularly useful for data interpretation, through visualisation or clustering. In particular, projection-based methods such as principal component analysis (PCA) generally solve a matrix factorization problem; for instance, PCA corresponds to a singular value decomposition (SVD). However, genomic data produced by Next Generation Sequencing (NGS), such as gene expression profiles, are very specific: they take the form of count matrices and call for dedicated compression methods that do not assume the data to be (approximately) Gaussian.
We propose a Gamma-Poisson factor model, based on generalized PCA. In particular, the data matrix \(X_{n\times p}\) is assumed to depend on latent factors or components, and its entries are assumed to follow a Poisson distribution, which is suited to counts. The matrix \(\Lambda_{n\times p}\) of Poisson intensities is factorized into a product of two parameter matrices \(UV^t\), where \(U_{n\times K}\) and \(V_{p\times K}\) respectively quantify the observation and variable contributions to the \(K\) latent factors. To account for the covariance structure within the data matrix, and especially the possible correlation between covariates (e.g. genes), we introduce gamma priors on the entries of the parameter matrices \(U\) and \(V\). This constitutes a more complete and flexible model than, for instance, Non-Negative Matrix Factorization, which is based on a Poisson model and assumes independence between covariates. The Gamma-Poisson distribution also models over-dispersion, which often characterizes NGS data.
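As an illustration of the generative side of this model, the following minimal sketch simulates a count matrix from gamma-distributed factors and Poisson observations. It is not the inference procedure. The function name `simulate_gamma_poisson`, the shape and rate values, and the use of a single shared prior per matrix are assumptions made for the example, with \(V\) taken of shape \(p\times K\) (variables by factors).

```python
import numpy as np

def simulate_gamma_poisson(n=100, p=500, K=3,
                           shape_u=1.0, rate_u=1.0,
                           shape_v=1.0, rate_v=1.0, seed=0):
    """Draw X ~ Poisson(U V^t) with gamma-distributed entries in U and V.

    Hyperparameter values are illustrative assumptions, not fitted values.
    """
    rng = np.random.default_rng(seed)
    # NumPy parameterizes the gamma by shape and scale, with scale = 1 / rate.
    U = rng.gamma(shape=shape_u, scale=1.0 / rate_u, size=(n, K))  # observations x factors
    V = rng.gamma(shape=shape_v, scale=1.0 / rate_v, size=(p, K))  # variables x factors
    Lam = U @ V.T            # n x p matrix of Poisson intensities
    X = rng.poisson(Lam)     # n x p matrix of counts
    return X, U, V, Lam

X, U, V, Lam = simulate_gamma_poisson()

# Marginally, the counts are over-dispersed: their variance exceeds their mean.
print("mean:", X.mean(), "variance:", X.var())
```

Comparing the empirical mean and variance of the simulated counts gives a quick view of the over-dispersion induced by the gamma priors; a plain Poisson sample with a constant intensity would have the two roughly equal.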
Parameter estimation is carried out through variational inference in order to avoid optimization issues (for example, the EM algorithm is intractable). Such an approach appears to be scalable and computationally very efficient, especially with high-dimensional data. Finally, we propose an extension to handle zero-inflated data, that is, data with an excess of zeros, where an unknown proportion of the zeros corresponds to missing values. We illustrate our work with results on data visualization and clustering, and present an application to gene expression profile analysis, specifically single-cell profiles. Such data are notably zero-inflated, as a zero may refer either to an absence of reads or to a failure in the experiment due to the small amount of genetic material available in a single cell.
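The two kinds of zeros described above can be mimicked with a small simulation. This is only a sketch: it assumes that experimental failures act as an independent Bernoulli mask with one constant probability, which is a simplification chosen for the example and not necessarily the mechanism used in the proposed model. The helper name `add_dropout` and all numeric values are hypothetical.

```python
import numpy as np

def add_dropout(X, drop_prob=0.3, seed=0):
    """Set each entry of X to zero independently with probability drop_prob.

    Returns the masked counts and the boolean mask of entries that were kept.
    The constant dropout probability is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(X.shape) >= drop_prob   # True where the measurement succeeded
    return X * keep, keep

# Hypothetical "true" counts: zeros here mean a genuine absence of reads.
rng = np.random.default_rng(1)
counts = rng.poisson(lam=2.0, size=(50, 200))

observed, keep = add_dropout(counts, drop_prob=0.3)

zero_frac_before = np.mean(counts == 0)    # genuine zeros only
zero_frac_after = np.mean(observed == 0)   # genuine zeros plus dropouts
print(zero_frac_before, zero_frac_after)
```

In the observed matrix the two kinds of zeros are indistinguishable, which is why the proportion of zeros that are really missing values has to be treated as unknown.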