Title: | Clustering High-Throughput Transcriptome Sequencing (HTS) Data |
---|---|
Description: | A Poisson mixture model is implemented to cluster genes from high- throughput transcriptome sequencing (RNA-seq) data. Parameter estimation is performed using either the EM or CEM algorithm, and the slope heuristics are used for model selection (i.e., to choose the number of clusters). |
Authors: | Andrea Rau, Gilles Celeux, Marie-Laure Martin-Magniette, Cathy Maugis- Rabusseau |
Maintainer: | Andrea Rau <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.0.11 |
Built: | 2024-10-30 05:16:39 UTC |
Source: | https://github.com/andreamrau/htscluster |
A Poisson mixture model is implemented to cluster genes from high-throughput transcriptome sequencing (RNA-seq) data. Parameter estimation is performed using either the EM or CEM algorithm, and the slope heuristics are used for model selection (i.e., to choose the number of clusters).
Andrea Rau, Gilles Celeux, Marie-Laure Martin-Magniette, Cathy Maugis-Rabusseau
Maintainer: Andrea Rau
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011) Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 2000 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions ## Run the PMM model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g=3, conds=conds, norm="TC") ## Estimates of pi and lambda for the selected model pi.est <- run$pi lambda.est <- run$lambda ## Not run: PMM for 4 total clusters, with one fixed class ## "TC" library size estimate, EM algorithm ## ## run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds, ## fixed.lambda = list(c(1,1,1))) ## ## ## Not run: PMM model for 4 clusters, with equal proportions ## "TC" library size estimate, EM algorithm ## ## run <- PoisMixClus(y, g = 4, norm = "TC", conds = conds, ## equal.proportions = TRUE) ## ## ## Not run: PMM model for g = 1, ..., 10 clusters, Split Small-EM init ## ## run1.10 <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, ## norm = "TC") ## ## ## Not run: PMM model for g = 1, ..., 10 clusters, Small-EM init ## ## run1.10bis <- <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, ## norm = "TC", split.init = FALSE) ## ## ## Not run: previous model equivalent to the following ## ## for(K in 1:10) { ## run <- PoisMixClus(y, g = K, conds = conds, norm = "TC") ## }
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 2000 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions ## Run the PMM model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g=3, conds=conds, norm="TC") ## Estimates of pi and lambda for the selected model pi.est <- run$pi lambda.est <- run$lambda ## Not run: PMM for 4 total clusters, with one fixed class ## "TC" library size estimate, EM algorithm ## ## run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds, ## fixed.lambda = list(c(1,1,1))) ## ## ## Not run: PMM model for 4 clusters, with equal proportions ## "TC" library size estimate, EM algorithm ## ## run <- PoisMixClus(y, g = 4, norm = "TC", conds = conds, ## equal.proportions = TRUE) ## ## ## Not run: PMM model for g = 1, ..., 10 clusters, Split Small-EM init ## ## run1.10 <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, ## norm = "TC") ## ## ## Not run: PMM model for g = 1, ..., 10 clusters, Small-EM init ## ## run1.10bis <- <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, ## norm = "TC", split.init = FALSE) ## ## ## Not run: previous model equivalent to the following ## ## for(K in 1:10) { ## run <- PoisMixClus(y, g = K, conds = conds, norm = "TC") ## }
This function is used to calculate Adjusted Rand Index (ARI) values for high-dimensional data.
highDimensionARI(x, y, splits = 2, verbose = FALSE)
highDimensionARI(x, y, splits = 2, verbose = FALSE)
x |
Vector of classification labels |
y |
Vector of classification labels |
splits |
Number of subsets data should be split into |
verbose |
|
Value of Adjusted Rand Index for samples x
and y
Andrea Rau
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
Finds the location of the HTSCluster User's Guide and optionally opens it.
HTSClusterUsersGuide(view=TRUE)
HTSClusterUsersGuide(view=TRUE)
view |
logical, should the document be opened using the default PDF document reader? |
The function vignette("HTSCluster")
will find the short HTSCluster Vignette which describes how to obtain the HTSCluster User's Guide.
The User's Guide is not itself a true vignette because it is not automatically generated using Sweave
during the package build process.
This means that it cannot be found using vignette
, hence the need for this special function.
If the operating system is other than Windows, then the PDF viewer used is that given by Sys.getenv("R_PDFVIEWER")
.
The PDF viewer can be changed using Sys.putenv(R_PDFVIEWER=)
.
Note that this function was adapted from that defined by Gordon Smyth in the edgeR package.
Character string giving the file location.
If view=TRUE
, the PDF document reader is started and the User's Guide is opened, as a side effect.
Gordon Smyth
# To get the location: HTSClusterUsersGuide(view=FALSE) # To open in pdf viewer: ## Not run: HTSClusterUsersGuide()
# To get the location: HTSClusterUsersGuide(view=FALSE) # To open in pdf viewer: ## Not run: HTSClusterUsersGuide()
These functions implement a variety of initialization methods for the parameters of a Poisson mixture model: the Small EM initialization strategy (emInit
) described in Rau et al. (2011), a K-means initialization strategy (kmeanInit
) that is itself used to initialize the small EM strategy, the splitting small-EM initialization strategy (splitEMInit
) based on that described in Papastamoulis et al. (2014), and a function to initialize a small-EM strategy using the posterior probabilities (probaPostInit
) obtained from a previous run with one fewer cluster following the splitting strategy.
emInit(y, g, conds, norm, alg.type = "EM", init.runs, init.iter, fixed.lambda, equal.proportions, verbose) kmeanInit(y, g, conds, norm, fixed.lambda, equal.proportions) splitEMInit(y, g, conds, norm, alg.type, fixed.lambda, equal.proportions, prev.labels, prev.probaPost, init.runs, init.iter, verbose) probaPostInit(y, g, conds, norm, alg.type = "EM", fixed.lambda, equal.proportions, probaPost.init, init.iter, verbose)
emInit(y, g, conds, norm, alg.type = "EM", init.runs, init.iter, fixed.lambda, equal.proportions, verbose) kmeanInit(y, g, conds, norm, fixed.lambda, equal.proportions) splitEMInit(y, g, conds, norm, alg.type, fixed.lambda, equal.proportions, prev.labels, prev.probaPost, init.runs, init.iter, verbose) probaPostInit(y, g, conds, norm, alg.type = "EM", fixed.lambda, equal.proportions, probaPost.init, init.iter, verbose)
y |
(n x q) matrix of observed counts for n observations and q variables |
g |
Number of clusters. If |
conds |
Vector of length q defining the condition (treatment group) for each variable (column) in |
norm |
The type of estimator to be used to normalize for differences in library size: (“ |
alg.type |
Algorithm to be used for parameter estimation (“ |
init.runs |
In the case of the Small-EM algorithm, the number of independent runs to be performed. In the case of the splitting Small-EM algorithm, the number of cluster splits to be performed in the splitting small-EM initialization. |
init.iter |
The number of iterations to run within each Small-EM algorithm |
fixed.lambda |
If one (or more) clusters with fixed values of lambda is desires, a list containing vectors of length d (the number of conditions). Note that the values of lambda chosen must satisfy the constraint noted in the technical report. |
equal.proportions |
If |
prev.labels |
A vector of length n of cluster labels obtained from the previous run (g-1 clusters) |
prev.probaPost |
An n x (g-1) matrix of the conditional probabilities of each observation belonging to each of the g-1 clusters from the previous run |
probaPost.init |
An n x (g) matrix of the conditional probabilities of each observation belonging to each of the
g clusters following the splitting strategy in the |
verbose |
If |
In practice, the user will not directly call the initialization functions described here; they are indirectly called
for a single number of clusters through the PoisMixClus
function (via init.type
) or via the
PoisMixClusWrapper
function for a sequence of cluster numbers (via gmin.init.type
and split.init
).
To initialize parameter values for the EM and CEM algorithms, for the Small-EM strategy (Biernacki et al., 2003) we use the emInit
function as follows. For a given number of independent runs (given by init.runs
), the following procedure is used to obtain parameter values: first, a K-means algorithm (MacQueen, 1967) is run to partition the data into g
clusters (). Second, initial parameter values
and
are calculated (see Rau et al. (2011) for details). Third, a given number of iterations of an EM algorithm are run (defined by
init.iter
), using and
as initial values. Finally, among the
init.runs
sets of parameter values, we use and
corresponding to the highest log likelihood or completed log likelihood to initialize the subsequent full EM or CEM algorithms, respectively.
For the splitting small EM initialization strategy, we implement an approach similar to that described in Papastamoulis et al. (2014), where the cluster from the previous run (with g-1 clusters) with the largest entropy is chosen to be split into two new clusters, followed by a small EM run as described above.
pi.init |
Vector of length |
lambda.init |
(d x |
lambda |
(d x |
pi |
Vector of length |
log.like |
Log likelihood arising from the splitting initialization and small EM run for a single split. |
Andrea Rau
Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11(R106), 1-28.
Biernacki, C., Celeux, G., Govaert, G. (2003) Choosing starting values for the EM algorithm for getting the highest likelhiood in multivariate Gaussian mixture models. Computational Statistics and Data Analysis, 41(1), 561-575.
MacQueen, J. B. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, number 1, pages 281-297. Berkeley, University of California Press.
Papastamoulis, P., Martin-Magniette, M.-L., and Maugis-Rabusseau, C. (2014). On the estimation of mixtures of Poisson regression models with large number of components. Computational Statistics and Data Analysis: 3rd special Issue on Advances in Mixture Models, DOI: 10.1016/j.csda.2014.07.005.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Robinson, M. D. and Oshlack, A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(R25).
PoisMixClus
for Poisson mixture model estimation for a given number of clusters,
PoisMixClusWrapper
for Poisson mixture model estimation and model selection for a sequence of cluster numbers.
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 500 observations simulate <- PoisMixSim(n = 500, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions ## Calculate initial values for lambda and pi using the Small-EM ## initialization (4 classes, PMM-II model with "TC" library size) ## ## init.values <- emInit(y, g = 4, conds, ## norm = "TC", alg.type = "EM", ## init.runs = 50, init.iter = 10, fixed.lambda = NA, ## equal.proportions = FALSE, verbose = FALSE) ## pi.init <- init.values$pi.init ## lambda.init <- init.values$lambda.init
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 500 observations simulate <- PoisMixSim(n = 500, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions ## Calculate initial values for lambda and pi using the Small-EM ## initialization (4 classes, PMM-II model with "TC" library size) ## ## init.values <- emInit(y, g = 4, conds, ## norm = "TC", alg.type = "EM", ## init.runs = 50, init.iter = 10, fixed.lambda = NA, ## equal.proportions = FALSE, verbose = FALSE) ## pi.init <- init.values$pi.init ## lambda.init <- init.values$lambda.init
Functions to calculate the log likelihood for a Poisson mixture model, the difference in log likelihoods for two different sets of parameters of a Poisson mixture model or the log-likelihood for each observation.
logLikePoisMix(y, mean, pi) logLikePoisMixDiff(y, mean.new, pi.new, mean.old, pi.old) mylogLikePoisMixObs(y, conds, s, lambda, pi)
logLikePoisMix(y, mean, pi) logLikePoisMixDiff(y, mean.new, pi.new, mean.old, pi.old) mylogLikePoisMixObs(y, conds, s, lambda, pi)
y |
(n x q) matrix of observed counts for n observations and q variables |
mean |
List of length g containing the (n x q) matrices of conditional mean expression for all observations, as calculated by the |
mean.new |
List of length g containing the (n x q) matrices of conditional mean expression for all observations for one set of parameters, as calculated by the |
mean.old |
List of length g containing the (n x q) matrices of conditional mean expression for all observations for another set of parameters, as calculated by the |
pi.new |
Vector of length g containing one estimate for |
pi.old |
Vector of length g containing another estimate for |
pi |
Vector of length g containing estimate for |
conds |
Vector of length q defining the condition (treatment group) for each variable (column) in |
s |
Estimate of normalized per-variable library size |
lambda |
(d x |
The logLikePoisMixDiff
function is used to calculate the difference in log likelihood for two different sets of parameters in a Poisson mixture model; it is used to determine convergence in the EM algorithm run by the PoisMixClus
function.
The logLikePoisMix
function (taken largely from the mylogLikePoisMix
function from the poisson.glm.mix
R package) calculates the log likelihood for a given set of parameters in a Poisson mixture model and is used in the PoisMixClus
function for the calculation of the BIC and ICL.
The mylogLikePoisMixObs
function calculates the log likelihood per observation for a given set of parameters in a Poisson mixture model.
ll |
(Depending on the context), the log likelihood, difference in log likelihoods for two different sets of parameters, or per-observation log-likelihood |
In the logLikePoisMixDiff
function, we make use of the alternative mass function for a Poisson density proposed by Loader (2000) to avoid computational difficulties. The logLikePoisMixDiff
function returns a default value of 100 if one or both of the log likelihoods associated with the two parameter sets takes on a value of .
Andrea Rau
Loader, C. (2000) Fast and accurate computation of binomial probabilities. Available at https://lists.gnu.org/archive/html/octave-maintainers/2011-09/pdfK0uKOST642.pdf.
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011) Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
PoisMixClus
for Poisson mixture model estimation and model selection;
PoisMixMean
to calculate the per-cluster conditional mean of each observation
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", low cluster separation ## n = 200 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "low") y <- simulate$y conds <- simulate$conditions w <- rowSums(y) ## Estimate of w r <- table(conds) ## Number of replicates per condition d <- length(unique(conds)) ## Number of conditions s <- colSums(y) / sum(y) ## TC estimate of lib size s.dot <- rep(NA, d) ## Summing lib size within conditions for(j in 1:d) s.dot[j] <- sum(s[which(conds == unique(conds)[j])]); ## Initial guess for pi and lambda g.true <- 4 pi.guess <- simulate$pi ## Recalibrate so that (s.dot * lambda.guess) = 1 lambda.sim <- simulate$lambda lambda.guess <- matrix(NA, nrow = d, ncol = g.true) for(k in 1:g.true) { tmp <- lambda.sim[,k]/sum(lambda.sim[,k]) lambda.guess[,k] <- tmp/s.dot } ## Run the PMM-II model for g = 4 ## with EM algorithm and "TC" library size parameter run <- PoisMixClus(y, g = 4, norm = "TC", conds = conds) pi.est <- run$pi lambda.est <- run$lambda ## Mean values for each of the parameter sets mean.guess <- PoisMixMean(y, 4, conds, s, lambda.guess) mean.est <- PoisMixMean(y, 4, conds, s, lambda.est) ## Difference in log likelihoods LL.diff <- logLikePoisMixDiff(y, mean.guess, pi.guess, mean.est, pi.est) LL.diff ## -12841.11
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", low cluster separation ## n = 200 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "low") y <- simulate$y conds <- simulate$conditions w <- rowSums(y) ## Estimate of w r <- table(conds) ## Number of replicates per condition d <- length(unique(conds)) ## Number of conditions s <- colSums(y) / sum(y) ## TC estimate of lib size s.dot <- rep(NA, d) ## Summing lib size within conditions for(j in 1:d) s.dot[j] <- sum(s[which(conds == unique(conds)[j])]); ## Initial guess for pi and lambda g.true <- 4 pi.guess <- simulate$pi ## Recalibrate so that (s.dot * lambda.guess) = 1 lambda.sim <- simulate$lambda lambda.guess <- matrix(NA, nrow = d, ncol = g.true) for(k in 1:g.true) { tmp <- lambda.sim[,k]/sum(lambda.sim[,k]) lambda.guess[,k] <- tmp/s.dot } ## Run the PMM-II model for g = 4 ## with EM algorithm and "TC" library size parameter run <- PoisMixClus(y, g = 4, norm = "TC", conds = conds) pi.est <- run$pi lambda.est <- run$lambda ## Mean values for each of the parameter sets mean.guess <- PoisMixMean(y, 4, conds, s, lambda.guess) mean.est <- PoisMixMean(y, 4, conds, s, lambda.est) ## Difference in log likelihoods LL.diff <- logLikePoisMixDiff(y, mean.guess, pi.guess, mean.est, pi.est) LL.diff ## -12841.11
A function to visualize the clustering results obtained from a Poisson mixture model.
## S3 method for class 'HTSCluster' plot(x, file.name = FALSE, graphs = c("map", "map.bycluster", "lambda"), data=NA, ...) ## S3 method for class 'HTSClusterWrapper' plot(x, file.name = FALSE, graphs = c("capushe", "ICL", "BIC"), capushe.validation=NA, ...)
## S3 method for class 'HTSCluster' plot(x, file.name = FALSE, graphs = c("map", "map.bycluster", "lambda"), data=NA, ...) ## S3 method for class 'HTSClusterWrapper' plot(x, file.name = FALSE, graphs = c("capushe", "ICL", "BIC"), capushe.validation=NA, ...)
x |
An object of class |
file.name |
Optional file name if plots are to be saved in a PDF file. |
graphs |
Type of graph to be included in plots. May be equal to
|
capushe.validation |
Optional number of clusters to use for capushe validation (should be less than the maximum number of clusters
specificed in the |
data |
(n x q) matrix of observed counts for n observations and q variables (only required for the plotting of weighted histograms) |
... |
Additional arguments (mainly useful for plotting) |
For objects of class "HTSCluster"
, the plotting function provides the possibility for the following visualizations:
1) A histogram of maximum conditional probabilities across all clusters.
2) Per-cluster boxplots of maximum conditional probabilities.
3) Weighted histograms of observation profiles (with weights equal to the corresponding conditional probability for each observation in each cluster), plotted independently for each variable. Fitted densities after fitting the Poisson mixture model are overlaid in red.
4) A global view of and
values for the selected model. When the
number of conditions <= 2, bar heights represent the value of
for each
cluster, and bar width corresponds to the value of
.
For objects of class "HTSClusterWrapper"
, the plotting function provides the possibility for one or all of the following visualizations:
1) ICL plot for all fitted models.
2) BIC plot for all fitted models.
5) Capushe diagnostic plots.
Andrea Rau
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Andrea Rau, Gilles Celeux, Marie-Laure Martin-Magniette, and Cathy Maugis-Rabusseau (2011). Clustering high-throughput sequencing data with Poisson mixture models. Technical report RR-7786, Inria Saclay – Ile-de-France.
PoisMixClus
, PoisMixClusWrapper
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 2000 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions ## Run the PMM-II model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds, init.type = "small-em") ## Visualization of results (not run): ## plot(run)
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 2000 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions ## Run the PMM-II model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds, init.type = "small-em") ## Visualization of results (not run): ## plot(run)
These functions implement the EM and CEM algorithms for parameter estimation in a Poisson mixture model for clustering high throughput sequencing observations (e.g., genes) for a single number of clusters (PoisMixClus
) or a sequence of cluster numbers (PoisMixClusWrapper
). Parameters are initialized using a Small-EM strategy as described in Rau et al. (2011) or the splitting small-EM strategy described in Papastamoulis et al. (2014), and model selection is performed using the ICL criteria. Note that these functions implement the PMM-I and PMM-II models described in Rau et al. (2011).
PoisMixClus(y, g, conds, norm = "TMM", init.type = "small-em", init.runs = 1, init.iter = 10, alg.type = "EM", cutoff = 10e-6, iter = 1000, fixed.lambda = NA, equal.proportions = FALSE, prev.labels = NA, prev.probaPost = NA, verbose = FALSE, interpretation = "sum", EM.verbose = FALSE, wrapper = FALSE, subset.index = NA) PoisMixClusWrapper(y, gmin = 1, gmax, conds, norm = "TMM", gmin.init.type = "small-em", init.runs = 1, init.iter = 10, split.init = TRUE, alg.type = "EM", cutoff = 10e-6, iter = 1000, fixed.lambda = NA, equal.proportions = FALSE, verbose = FALSE, interpretation = "sum", EM.verbose = FALSE, subset.index = NA)
PoisMixClus(y, g, conds, norm = "TMM", init.type = "small-em", init.runs = 1, init.iter = 10, alg.type = "EM", cutoff = 10e-6, iter = 1000, fixed.lambda = NA, equal.proportions = FALSE, prev.labels = NA, prev.probaPost = NA, verbose = FALSE, interpretation = "sum", EM.verbose = FALSE, wrapper = FALSE, subset.index = NA) PoisMixClusWrapper(y, gmin = 1, gmax, conds, norm = "TMM", gmin.init.type = "small-em", init.runs = 1, init.iter = 10, split.init = TRUE, alg.type = "EM", cutoff = 10e-6, iter = 1000, fixed.lambda = NA, equal.proportions = FALSE, verbose = FALSE, interpretation = "sum", EM.verbose = FALSE, subset.index = NA)
y |
(n x q) matrix of observed counts for n observations and q variables |
g |
Number of clusters (a single value). If |
gmin |
The minimum number of clusters in a sequence to be tested. In cases where clusters are included with a fixed value
of lambda, |
gmax |
The maximum number of clusters in a sequence to be tested. In cases where clusters are included with a fixed value
of lambda, |
conds |
Vector of length q defining the condition (treatment group) for each variable (column) in |
norm |
The type of estimator to be used to normalize for differences in library size: (“ |
init.type |
Type of initialization strategy to be used (“ |
gmin.init.type |
Type of initialization strategy to be used for the minimum number of clusters in a sequence ( |
init.runs |
Number of runs to be used for the Small-EM strategy described in Rau et al. (2011), with a default value of 1 |
init.iter |
Number of iterations to be used within each run for the Small-EM strategry, with a default value of 10 |
split.init |
If |
alg.type |
Algorithm to be used for parameter estimation (“ |
cutoff |
Cutoff to declare algorithm convergence (in terms of differences in log likelihoods from one iteration to the next) |
iter |
Maximum number of iterations to be run for the chosen algorithm |
fixed.lambda |
If one (or more) clusters with fixed values of lambda is desired, a list containing vectors of length d (the number of conditions). specifying the fixed values of lambda for each fixed cluster. |
equal.proportions |
If |
prev.labels |
A vector of length n of cluster labels obtained from the previous run (g-1 clusters) to be used with the splitting small-EM strategy described in described in Papastamoulis et al. (2014). For other initialization strategies, this parameter takes the value NA |
prev.probaPost |
An n x (g-1) matrix of the conditional probabilities of each observation belonging to each of the g-1 clusters from the previous run, to be used with the splitting small-EM strategy of described in Papastamoulis et al. (2012). For other initialization strategies, this parameter takes the value NA |
verbose |
If |
interpretation |
If |
EM.verbose |
If |
subset.index |
Optional vector providing the indices of a subset of genes that should be used for the co-expression analysis (i.e., row indices
of the data matrix |
wrapper |
|
Output of PoisMixClus
is an S3 object of class HTSCluster
, and output of PoisMixClusWrapper
is an S3 object
of class HTSClusterWrapper
.
In a Poisson mixture model, the data are assumed to come from g distinct subpopulations (clusters), each of which is modeled separately; the overall population is thus a mixture of these subpopulations. In the case of a Poisson mixture model with g components, the model may be written as
for observations in
replicates of
conditions (treatment groups), where
is the standard Poisson density,
,
contains all of the parameters in
assumed to be distinct, and
are the mixing proportions such that
is in (0,1) for all k and
.
We consider the following parameterization for the mean . We consider
where corresponds to the expression level of observation i,
corresponds to the clustering parameters that define the profiles of the genes in cluster k across all variables, and
is the normalized library size (a fixed constant) for replicate l of condition j.
There are two approaches to estimating the parameters of a finite mixture model and obtaining a clustering of the data: the estimation approach (via the EM algorithm) and the clustering approach (via the CEM algorithm). Parameter initialization is done using a Small-EM strategy as described in Rau et al. (2011) via the emInit
function. Model selection may be performed using the BIC or ICL criteria, or the slope heuristics.
lambda |
(d x g) matrix containing the estimate of |
pi |
Vector of length g containing the estimate of |
labels |
Vector of length n containing the cluster assignments of the n observations |
probaPost |
Matrix containing the conditional probabilities of belonging to each cluster for all observations |
log.like |
Value of log likelihood |
BIC |
Value of BIC criterion |
ICL |
Value of ICL criterion |
alg.type |
Estimation algorithm used; matches the argument |
norm |
Library size normalization factors used |
conds |
Conditions specified by user |
iterations |
Number of iterations run |
logLikeDiff |
Difference in log-likelihood between the last and penultimate iterations of the algorithm |
subset.index |
If provided by the user, the indices of subset of genes used for co-expression analyses |
loglike.all |
Log likelihoods calculated for each of the fitted models for cluster sizes |
capushe |
Results of capushe model selection, an object of class |
ICL.all |
ICL values calculated for each of the fitted models for cluster sizes |
ICL.results |
Object of class |
BIC.results |
Object of class |
DDSE.results |
Object of class |
Djump.results |
Object of class |
all.results |
List of objects of class |
model.selection |
Type of criteria used for model selection, equal to |
Note that the fixed.lambda
argument is primarily intended to be used in the case when a single cluster is fixed to
have equal clustering parameters lambda across all conditions (i.e., ); this is particularly useful
when identifying genes with non-differential expression across all conditions (see the
HTSDiff
R package for more details).
Alternatively, this argument could be used to specify a cluster for which genes are only expressed in a single condition
(e.g., and
for all
). Other possibilities could be considered,
but note that the fixed values of lambda must satisfy the constraint
for all
imposed in the model; if this is not the case, a warning message will be printed.
Andrea Rau
Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11(R106), 1-28.
Papastamoulis, P., Martin-Magniette, M.-L., and Maugis-Rabusseau, C. (2014). On the estimation of mixtures of Poisson regression models with large number of components. Computational Statistics and Data Analysis: 3rd special Issue on Advances in Mixture Models, DOI: 10.1016/j.csda.2014.07.005.
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
probaPost
for the calculation of the conditional probability of belonging to a cluster;
PoisMixMean
for the calculation of the per-cluster conditional mean of each observation;
logLikePoisMixDiff
for the calculation of the log likelihood of a Poisson mixture model;
emInit
and kmeanInit
for the Small-EM parameter initialization strategy
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 200 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions ## Run the PMM model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g = 3, conds = conds, norm = "TC") ## Estimates of pi and lambda for the selected model pi.est <- run$pi lambda.est <- run$lambda ## Not run: PMM for 4 total clusters, with one fixed class ## "TC" library size estimate, EM algorithm ## ## run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds, ## fixed.lambda = list(c(1,1,1))) ## ## ## Not run: PMM model for 4 clusters, with equal proportions ## "TC" library size estimate, EM algorithm ## ## run <- PoisMixClus(y, g = 4, norm = "TC", conds = conds, ## equal.proportions = TRUE) ## ## ## Not run: PMM model for g = 1, ..., 10 clusters, Split Small-EM init ## ## run1.10 <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, ## norm = "TC") ## ## ## Not run: PMM model for g = 1, ..., 10 clusters, Small-EM init ## ## run1.10bis <- <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, ## norm = "TC", split.init = FALSE) ## ## ## Not run: previous model equivalent to the following ## ## for(K in 1:10) { ## run <- PoisMixClus(y, g = K, conds = conds, norm = "TC") ## }
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 200 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions ## Run the PMM model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g = 3, conds = conds, norm = "TC") ## Estimates of pi and lambda for the selected model pi.est <- run$pi lambda.est <- run$lambda ## Not run: PMM for 4 total clusters, with one fixed class ## "TC" library size estimate, EM algorithm ## ## run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds, ## fixed.lambda = list(c(1,1,1))) ## ## ## Not run: PMM model for 4 clusters, with equal proportions ## "TC" library size estimate, EM algorithm ## ## run <- PoisMixClus(y, g = 4, norm = "TC", conds = conds, ## equal.proportions = TRUE) ## ## ## Not run: PMM model for g = 1, ..., 10 clusters, Split Small-EM init ## ## run1.10 <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, ## norm = "TC") ## ## ## Not run: PMM model for g = 1, ..., 10 clusters, Small-EM init ## ## run1.10bis <- <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, ## norm = "TC", split.init = FALSE) ## ## ## Not run: previous model equivalent to the following ## ## for(K in 1:10) { ## run <- PoisMixClus(y, g = K, conds = conds, norm = "TC") ## }
This function is used to calculate the conditional per-cluster mean expression for all observations. This value corresponds to
for the PMM-I model and
for the PMM-II model.
PoisMixMean(y, g, conds, s, lambda)
PoisMixMean(y, g, conds, s, lambda)
y |
(n x q) matrix of observed counts for n observations and q variables |
g |
Number of clusters |
conds |
Vector of length q defining the condition (treatment group) for each variable (column) in |
s |
Estimate of normalized per-variable library size |
lambda |
(d x |
A list of length g
containing the (n x q) matrices of mean expression for all observations, conditioned on each of the g
clusters
Andrea Rau
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
PoisMixClus
for Poisson mixture model estimation and model selection
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 200 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions s <- colSums(y) / sum(y) ## TC estimate of lib size ## Run the PMM-II model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds) pi.est <- run$pi lambda.est <- run$lambda ## Calculate the per-cluster mean for each observation means <- PoisMixMean(y, g = 3, conds, s, lambda.est)
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 200 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions s <- colSums(y) / sum(y) ## TC estimate of lib size ## Run the PMM-II model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds) pi.est <- run$pi lambda.est <- run$lambda ## Calculate the per-cluster mean for each observation means <- PoisMixMean(y, g = 3, conds, s, lambda.est)
This function simulates data from a Poisson mixture model, as described by Rau et al. (2011). Data are simulated with varying expression level () for 4 clusters. Clusters may be simulated with “high” or “low” separation, and three different options are available for the library size setting: “equal”, “A”, and “B”, as described by Rau et al. (2011).
PoisMixSim(n = 2000, libsize, separation)
PoisMixSim(n = 2000, libsize, separation)
n |
Number of observations |
libsize |
The type of library size difference to be simulated (“ |
separation |
Cluster separation (“ |
y |
(n x q) matrix of simulated counts for n observations and q variables |
labels |
Vector of length n defining the true cluster labels of the simulated data |
pi |
Vector of length 4 (the number of clusters) containing the true value of |
lambda |
(d x 4) matrix of |
w |
Row sums of |
conditions |
Vector of length q defining the condition (treatment group) for each variable (column) in |
If one or more observations are simulated such that all variables have a value of 0, those rows are removed from the data matrix; as such, in some cases the simulated data y
may have less than n
rows.
The PMM-I model includes the parameter constraint , where
is the number of replicates in condition (treatment group)
. Similarly, the parameter constraint in the PMM-II model is
, where
is the library size for replicate l of condition j. The value of
lambda
corresponds to that used to generate the simulated data, where the library sizes were set as described in Table 2 of Rau et al. (2011). However, due to variability in the simulation process, the actually library sizes of the data y
are not exactly equal to these values; this means that the value of lambda
may not be directly compared to an estimated value of as obtained from the
PoisMixClus
function.
Andrea Rau
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 200 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 200 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions
This function computes the conditional probabilities that an observation i arises from the
component for the current value of the mixture parameters.
probaPost(y, g, conds, pi, s, lambda)
probaPost(y, g, conds, pi, s, lambda)
y |
(n x q) matrix of observed counts for n observations and q variables |
g |
Number of clusters |
conds |
Vector of length q defining the condition (treatment group) for each variable (column) in |
pi |
Vector of length |
s |
Vector of length q containing the estimates for the normalized library size parameters for each of the q variables in |
lambda |
(d x |
t |
(n x |
If all values of are 0 (or nearly zero), the observation is assigned with probability one to belong to the cluster with the closest mean (in terms of the Euclidean distance from the observation). To avoid calculation difficulties, extreme values of
are smoothed, such that those smaller than 1e-10 or larger than 1-1e-10 are set equal to 1e-10 and 1-1e-10, respectively.
Andrea Rau
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
PoisMixClus
for Poisson mixture model estimation and model selection;
PoisMixMean
to calculate the conditional per-cluster mean of each observation
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 200 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions s <- colSums(y) / sum(y) ## TC estimate of lib size ## Run the PMM-II model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds) pi.est <- run$pi lambda.est <- run$lambda ## Calculate the conditional probability of belonging to each cluster proba <- probaPost(y, g = 3, conds = conds, pi = pi.est, s = s, lambda = lambda.est) ## head(round(proba,2))
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 200 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions s <- colSums(y) / sum(y) ## TC estimate of lib size ## Run the PMM-II model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds) pi.est <- run$pi lambda.est <- run$lambda ## Calculate the conditional probability of belonging to each cluster proba <- probaPost(y, g = 3, conds = conds, pi = pi.est, s = s, lambda = lambda.est) ## head(round(proba,2))
A function to summarize the clustering results obtained from a Poisson mixture model.
## S3 method for class 'HTSCluster' summary(object, ...) ## S3 method for class 'HTSClusterWrapper' summary(object, ...)
## S3 method for class 'HTSCluster' summary(object, ...) ## S3 method for class 'HTSClusterWrapper' summary(object, ...)
object |
An object of class |
... |
Additional arguments |
The summary function for an object of class "HTSCluster"
provides the following summary of results:
1) Number of clusters and model selection criterion used, if applicable.
2) Number of observations across all clusters with a maximum conditional probability greater than 90 model.
3) Number of observations per cluster with a maximum conditional probability greater than 90 selected model.
4) values for the selected model.
5) values for the selected model.
The summary function for an object of class "HTSClusterWrapper"
provides the number of clusters selected for
the BIC, ICL, DDSE, and Djump model selection approaches.
Andrea Rau
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
PoisMixClus
, PoisMixClusWrapper
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 2000 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions ## Run the PMM-II model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds, init.type = "small-em") ## Summary of results: summary(run)
set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 2000 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions ## Run the PMM-II model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds, init.type = "small-em") ## Summary of results: summary(run)