Title: | Bayesian Estimation of Mixtures of Multivariate Bernoulli Distributions |
---|---|
Description: | Fully Bayesian inference for estimating the number of clusters and related parameters to heterogeneous binary data. |
Authors: | Panagiotis Papastamoulis |
Maintainer: | Panagiotis Papastamoulis <[email protected]> |
License: | GPL-2 |
Version: | 1.4.1 |
Built: | 2024-11-16 03:03:12 UTC |
Source: | https://github.com/cran/BayesBinMix |
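This package clusters multivariate binary data (NAs are allowed). The main function of the package is coupledMetropolis.
The input is an $n \times d$ binary array, where $n$ and $d$ denote the number of observations and the dimension of the data. The underlying model is a mixture of independent multivariate Bernoulli distributions with an unknown number of components:
$$x_i \mid K \sim \sum_{k=1}^{K} p_k \prod_{j=1}^{d} f(x_{ij} \mid \theta_{kj}),$$
with $x_i = (x_{i1}, \ldots, x_{id}) \in \{0,1\}^d$, independent for $i = 1, \ldots, n$. The term $f(x_{ij} \mid \theta_{kj})$ denotes the probability density function of the Bernoulli distribution with parameter $\theta_{kj}$. The number of clusters $K$ is a random variable with support $\{1, \ldots, K_{max}\}$, where $K_{max}$ is an upper bound for the number of clusters. The model uses the following prior assumptions:
$$K \sim \mathrm{Discrete}\{1, \ldots, K_{max}\}$$
$$p \mid K \sim \mathrm{Dirichlet}(\gamma_1, \ldots, \gamma_K)$$
$$\theta_{kj} \mid K \sim \mathrm{Beta}(\alpha, \beta), \quad \text{independent for } k = 1, \ldots, K;\ j = 1, \ldots, d.$$
The discrete prior distribution on the number of clusters can be either a truncated Poisson(1) or a Uniform distribution. The model uses data augmentation by also considering the (latent) allocation variable $z = (z_1, \ldots, z_n)$, which a priori assigns observation $i$ to cluster $k$ with probability $p_k$, independently for $i = 1, \ldots, n$.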
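The mixture density described above can be evaluated directly in base R. The following is a minimal sketch, not part of the package: the helper names mixtureLogLik and truncatedPoissonPrior are hypothetical, x is assumed to be a complete (NA-free) binary matrix, p a vector of mixture weights and theta a matrix with one row of Bernoulli success probabilities per cluster.

```r
# Illustrative helpers (hypothetical names, not BayesBinMix functions).

# Observed-data log-likelihood of a mixture of independent multivariate
# Bernoulli distributions.
# x: n x d binary matrix, p: K weights, theta: K x d success probabilities.
mixtureLogLik <- function(x, p, theta) {
  K <- length(p)
  # logDens[i, k] = log p_k + sum_j log Bernoulli(x_ij | theta_kj)
  logDens <- sapply(seq_len(K), function(k) {
    log(p[k]) + x %*% log(theta[k, ]) + (1 - x) %*% log(1 - theta[k, ])
  })
  logDens <- matrix(logDens, nrow = nrow(x))
  # log-sum-exp over the K components for numerical stability
  rowMax <- apply(logDens, 1, max)
  sum(rowMax + log(rowSums(exp(logDens - rowMax))))
}

# Truncated Poisson(1) prior probabilities on the support 1, ..., Kmax.
truncatedPoissonPrior <- function(Kmax) {
  w <- dpois(seq_len(Kmax), lambda = 1)
  w / sum(w)
}

# Small usage example with made-up values
x <- rbind(c(1, 0, 1), c(0, 0, 1))
p <- c(0.3, 0.7)
theta <- rbind(c(0.9, 0.1, 0.8), c(0.2, 0.3, 0.5))
mixtureLogLik(x, p, theta)
truncatedPoissonPrior(5)
```

The log-sum-exp step only guards against underflow when d is large; it does not change the value of the likelihood.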
In order to infer the parameters of the model, a Markov chain Monte Carlo (MCMC) approach is adopted. Given $K$, the component-specific parameters $p$ and $\theta$ are integrated out and a collapsed allocation sampler which also updates the number of clusters (Nobile and Fearnside, 2007) is implemented. If the observed data contain missing values, the algorithm simulates their values from the corresponding full conditional distribution. In order to improve the mixing of the simulated chain, a Metropolis-coupled MCMC sampler (Altekar et al., 2004) is incorporated. In particular, several heated chains are run in parallel and swaps are proposed between pairs of chains. The number of chains should be equal to the number of available cores. Each chain runs in parallel using the packages foreach and doParallel.
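The swap move between heated chains can be sketched as follows. This is the standard Metropolis-coupled MCMC acceptance rule, written here as an illustration rather than as the package's internal code; the function name swapAcceptanceProb and its arguments are hypothetical, and each chain with heat h is assumed to target the posterior raised to the power h.

```r
# Illustrative helper (hypothetical name, not a BayesBinMix function).
# Acceptance probability for swapping the current states of chains i and j.
# logPostI, logPostJ: unnormalised log-posterior of the current state of each chain
# heatI, heatJ: heats (inverse temperatures) of the two chains
swapAcceptanceProb <- function(logPostI, logPostJ, heatI, heatJ) {
  logRatio <- (heatI - heatJ) * (logPostJ - logPostI)
  min(1, exp(logRatio))
}

# Example: four chains with heats decreasing from 1 (cold chain) to 0.3
nChains <- 4
heats <- seq(1, 0.3, length.out = nChains)
swapAcceptanceProb(logPostI = -120, logPostJ = -115,
                   heatI = heats[1], heatJ = heats[2])
```

Under this rule a swap is always accepted when the hotter chain currently sits at a state with higher posterior density than the colder one, which is how the cold chain escapes local modes.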
After inferring the most probable number of clusters, the simulated parameters which correspond to this specific value of $K$ are post-processed in order to undo the label switching problem. For this purpose the label.switching package (Papastamoulis, 2016; see also Papastamoulis and Iliopoulos 2010, 2013 and Papastamoulis, 2014) is used.
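To illustrate what undoing label switching means, the sketch below relabels one simulated allocation vector so that it agrees as closely as possible with a fixed pivot allocation. It is a simplified pivot-based illustration for small K only, not the ECR or STEPHENS implementations of the label.switching package; the function names allPermutations and relabelToPivot are hypothetical.

```r
# Illustrative helpers (hypothetical names, not from label.switching).

# All permutations of 1:K, one per row (recursive; only sensible for small K).
allPermutations <- function(K) {
  if (K == 1) return(matrix(1, 1, 1))
  prev <- allPermutations(K - 1)
  do.call(rbind, lapply(seq_len(K), function(pos) {
    t(apply(prev, 1, function(p) append(p, K, after = pos - 1)))
  }))
}

# Relabel allocation vector z with the permutation of the labels that
# maximises the number of observations agreeing with the pivot allocation.
relabelToPivot <- function(z, pivot, K) {
  perms <- allPermutations(K)
  agreement <- apply(perms, 1, function(perm) sum(perm[z] == pivot))
  best <- perms[which.max(agreement), ]
  best[z]
}

pivot <- c(1, 1, 2, 2, 3)
z     <- c(2, 2, 3, 3, 1)   # same partition as the pivot, labels switched
relabelToPivot(z, pivot, K = 3)
```

Enumerating all K! permutations is only feasible for a handful of clusters; dedicated relabelling algorithms avoid this enumeration.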
Panagiotis Papastamoulis
Maintainer: Panagiotis Papastamoulis
Altekar G, Dwarkadas S, Huelsenbeck JP and Ronquist F (2004). Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics, 20(3): 407-415.
Nobile A and Fearnside A (2007). Bayesian finite mixtures with an unknown number of components: The allocation sampler. Statistics and Computing, 17(2): 147-162.
Papastamoulis P and Iliopoulos G (2010). An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions. Journal of Computational and Graphical Statistics, 19: 313-331.
Papastamoulis P and Iliopoulos G (2013). On the convergence rate of Random Permutation Sampler and ECR algorithm in missing data models. Methodology and Computing in Applied Probability, 15(2): 293-304.
Papastamoulis P (2014). Handling the label switching problem in latent class models via the ECR algorithm. Communications in Statistics, Simulation and Computation, 43(4): 913-927.
Papastamoulis P (2016). label.switching: An R package for dealing with the label switching problem in MCMC outputs. Journal of Statistical Software, 69(1): 1-24.
This function implements the collapsed allocation sampler of Nobile and Fearnside (2007) in the context of mixtures of multivariate Bernoulli distributions.
allocationSamplerBinMix(Kmax, alpha, beta, gamma, m, burn, data, thinning, z.true, ClusterPrior, ejectionAlpha, Kstart, outputDir, metropolisMoves, reorderModels, heat, zStart, LS, rsX, originalX, printProgress)
Kmax | Maximum number of clusters (integer, at least equal to two). |
alpha | First shape parameter of the Beta prior distribution (strictly positive). Defaults to 1. |
beta | Second shape parameter of the Beta prior distribution (strictly positive). Defaults to 1. |
gamma | |
m | Number of MCMC iterations. |
burn | The number of initial MCMC iterations that will be discarded as burn-in period. |
data | Binary data array (NAs not allowed here). |
thinning | Integer that defines a thinning of the reported MCMC sample. Under the default setting, every 5th MCMC iteration is saved. |
z.true | An optional vector of cluster assignments considered as the ground-truth clustering of the observations. Useful for simulations. |
ClusterPrior | Character string specifying the prior distribution of the number of clusters on the set |
ejectionAlpha | Probability of ejecting an empty component. Defaults to 0.2. |
Kstart | Initial value for the number of clusters. Defaults to 1. |
outputDir | The name of the produced output folder. |
metropolisMoves | A vector of character strings with possible values |
reorderModels | Character string specifying whether to post-process the MCMC sample of each distinct generated value of |
heat | The temperature of the simulated chain, that is, a scalar in the set |
zStart | |
LS | Boolean indicating whether to post-process the MCMC sample using the label switching algorithms. |
rsX | Optional vector containing the row-sums of the observed data (NAs are allowed). It is required only in the case of missing values. |
originalX | Optional array containing the observed data (containing NAs). It is required only in the case of missing values. |
printProgress | Logical, indicating whether to print the progress of the sampler or not. Default: FALSE. |
The output is reordered according to the following label-switching solving algorithms: ECR, ECR-ITERATIVE-1 and STEPHENS. In most cases the results of these different algorithms are identical.
This function is recursively called inside the coupledMetropolis function. There is no need to call it separately.
Panagiotis Papastamoulis
This function applies collapsed Gibbs sampling, assuming that the number of mixture components is known.
collapsedGibbsBinMix(alpha, beta, gamma, K, m, burn, data, thinning, z.true, outputDir)
alpha | First shape parameter of the Beta prior distribution (strictly positive). Defaults to 1. |
beta | Second shape parameter of the Beta prior distribution (strictly positive). Defaults to 1. |
gamma | |
K | Number of clusters. |
m | Number of MCMC iterations. |
burn | The number of initial MCMC iterations that will be discarded as burn-in period. |
data | Binary data array. |
thinning | Integer that defines a thinning of the reported MCMC sample. Under the default setting, every 5th MCMC iteration is saved. |
z.true | An optional vector of cluster assignments considered as the ground-truth clustering of the observations. Useful for simulations. |
outputDir | The name of the produced output folder. |
Not really used.
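For a fixed number of clusters, a collapsed Gibbs update of a single allocation can be sketched in base R as below. This is an illustration derived from the conjugate Beta and Dirichlet priors, with the mixture weights and Bernoulli parameters integrated out; it is not the package's source code. The function name collapsedAllocationProbs is hypothetical, and gamma is assumed here to be a vector of K positive Dirichlet parameters.

```r
# Illustrative helper (hypothetical name, not a BayesBinMix function).
# Conditional probabilities P(z_i = k | z_{-i}, x) for k = 1, ..., K, with
# the weights and Bernoulli parameters integrated out.
# x: n x d binary matrix (no NAs), z: current allocations in 1:K.
collapsedAllocationProbs <- function(i, x, z, K, alpha = 1, beta = 1,
                                     gamma = rep(1, K)) {
  xi <- x[i, ]
  xRest <- x[-i, , drop = FALSE]   # data without observation i
  zRest <- z[-i]                   # allocations without observation i
  logProb <- sapply(seq_len(K), function(k) {
    members <- zRest == k
    nk <- sum(members)                               # size of cluster k
    skj <- colSums(xRest[members, , drop = FALSE])   # successes per feature
    log(nk + gamma[k]) +
      sum(xi * log(alpha + skj) + (1 - xi) * log(beta + nk - skj)) -
      length(xi) * log(alpha + beta + nk)
  })
  prob <- exp(logProb - max(logProb))
  prob / sum(prob)
}

x <- rbind(c(1, 1), c(1, 1), c(0, 0), c(0, 0), c(1, 1))
z <- c(1, 1, 2, 2, 1)
collapsedAllocationProbs(i = 5, x = x, z = z, K = 2)
```

A full sweep would draw a new label for each observation in turn from these probabilities, for example with sample.int(K, 1, prob = ...).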
Panagiotis Papastamoulis
Returns the complete log-likelihood of the mixture.
complete.loglikelihood(x, z, pars)
x | Binary data. |
z | Latent allocations vector. |
pars | Parameters of the mixture. |
Complete log-likelihood value.
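As an illustration of the quantity being returned, the complete log-likelihood of a Bernoulli mixture can be written in a few lines of base R. This sketch does not reproduce the layout expected for the pars argument, which is not described here; instead it takes the weights p and the K x d matrix theta as separate, hypothetical arguments, and it assumes data without missing values.

```r
# Illustrative helper (hypothetical name and argument layout).
# Complete log-likelihood: sum_i [ log p_{z_i} + sum_j log f(x_ij | theta_{z_i j}) ]
# x: n x d binary matrix, z: allocations in 1:K,
# p: K weights, theta: K x d success probabilities.
completeLogLikSketch <- function(x, z, p, theta) {
  thetaZ <- theta[z, , drop = FALSE]   # row i holds the parameters of cluster z_i
  sum(log(p[z])) + sum(x * log(thetaZ) + (1 - x) * log(1 - thetaZ))
}

x <- rbind(c(1, 0), c(0, 1), c(1, 1))
z <- c(1, 2, 1)
p <- c(0.6, 0.4)
theta <- rbind(c(0.8, 0.3), c(0.1, 0.7))
completeLogLikSketch(x, z, p, theta)
```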
Panagiotis Papastamoulis
Main function of the package. The algorithm consists of the allocation sampler combined with a Metropolis-coupled MCMC (MC3) scheme.
coupledMetropolis(Kmax, nChains, heats, binaryData, outPrefix, ClusterPrior, m, alpha, beta, gamma, z.true, ejectionAlpha, burn)
Kmax | Maximum number of clusters (integer, at least equal to two). |
nChains | Number of parallel (heated) chains. Ideally, it should be equal to the number of available threads. |
heats | |
binaryData | The observed binary data (array). Missing values are allowed as long as the corresponding entries are denoted as |
outPrefix | The name of the produced output folder. An error is thrown if the directory exists. |
ClusterPrior | Character string specifying the prior distribution of the number of clusters on the set |
m | The number of MCMC cycles. At the end of each cycle a swap between a pair of heated chains is attempted. Each cycle consists of 10 iterations. |
alpha | First shape parameter of the Beta prior distribution (strictly positive). Defaults to 1. |
beta | Second shape parameter of the Beta prior distribution (strictly positive). Defaults to 1. |
gamma | |
z.true | An optional vector of cluster assignments considered as the ground-truth clustering of the observations. Useful for simulations. |
ejectionAlpha | Probability of ejecting an empty component. Defaults to 0.2. |
burn | Optional integer denoting the number of MCMC cycles that will be discarded as burn-in period. |
If the most probable number of clusters is larger than 1, the output is post-processed using the label.switching package. In addition to the objects returned to the user (see value below), the complete output of the sampler is written to the directory outPrefix. It consists of the following files:
K.allChains.txt: m x nChains matrix containing the simulated values of the number of clusters (K) per chain.
K.txt: the m simulated values of the number of clusters (K) of the cold chain (posterior distribution).
p.varK.txt: the simulated values of the mixture weights (not identifiable).
rawMCMC.mapK.KVALUE.txt: the raw MCMC output which corresponds to the most probable model (not identifiable).
reorderedMCMC-ECR-ITERATIVE1.mapK.KVALUE.txt: the MCMC output which corresponds to the most probable model, reordered according to the ECR-ITERATIVE-1 algorithm.
reorderedMCMC-ECR.mapK.KVALUE.txt: the MCMC output which corresponds to the most probable model, reordered according to the ECR algorithm.
reorderedMCMC-STEPHENS.mapK.KVALUE.txt: the MCMC output which corresponds to the most probable model, reordered according to the STEPHENS algorithm.
reorderedSingleBestClusterings.mapK.KVALUE.txt: the most probable allocation of each observation after reordering the MCMC sample which corresponds to the most probable number of clusters.
theta.varK.txt: the simulated values of the Bernoulli parameters (not identifiable).
z-ECR-ITERATIVE1.mapK.KVALUE.txt: the simulated latent allocations which correspond to the most probable model, reordered according to the ECR-ITERATIVE-1 algorithm.
z-ECR.mapK.KVALUE.txt: the simulated latent allocations which correspond to the most probable model, reordered according to the ECR algorithm.
z-KL.mapK.KVALUE.txt: the simulated latent allocations which correspond to the most probable model, reordered according to the STEPHENS algorithm.
z.varK.txt: the simulated latent allocations (not identifiable).
classificationProbabilities.mapK.KVALUE.csv: the classification probabilities per observation, obtained after reordering the MCMC sample of the most probable number of clusters with the ECR algorithm.
xEstimated.txt: observed data with missing values replaced by their posterior mean estimates. This file is produced only if the observed data contain missing values.
In the file names above, KVALUE is equal to the inferred number of clusters. Note that the label switching part is omitted if the most probable number of clusters is equal to 1.
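The most probable number of clusters is simply the mode of the simulated values of K from the cold chain. A minimal base-R sketch follows; the vector Kdraws is made-up example data and the function name posteriorOfK is hypothetical. The layout of K.txt is not described here, so the sketch works on a plain numeric vector rather than reading the file.

```r
# Illustrative helper (hypothetical name, not a BayesBinMix function).
# Estimated posterior probabilities of K = 1, ..., Kmax from simulated values.
posteriorOfK <- function(Kdraws, Kmax) {
  counts <- table(factor(Kdraws, levels = seq_len(Kmax)))
  as.numeric(counts) / length(Kdraws)
}

Kdraws <- c(2, 2, 3, 2, 1, 2, 3, 2)   # made-up draws, for illustration only
post <- posteriorOfK(Kdraws, Kmax = 4)
mapK <- which.max(post)               # most probable number of clusters
post
mapK
```

Using a factor with fixed levels keeps values of K that were never visited in the table with probability zero, so the result always has length Kmax.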
The basic output of the sampler is returned in the following R objects:
K.mcmc | object of class |
parameters.ecr.mcmc | object of class |
allocations.ecr.mcmc | object of class |
classificationProbabilities.ecr | data frame of the reordered classification probabilities per observation after reordering the most probable number of clusters with the |
clusterMembershipPerMethod | data frame of the most probable allocation of each observation after reordering the MCMC sample which corresponds to the most probable number of clusters according to |
K.allChains | |
chainInfo | Number of cycles, burn-in period and acceptance rate of swap moves. |
Panagiotis Papastamoulis
# Generate a dataset from a mixture of 2 ten-dimensional Bernoulli distributions.
library(BayesBinMix)  # provides myDirichlet() and coupledMetropolis()
set.seed(1)
d <- 10  # number of columns
n <- 50  # number of rows (sample size)
K <- 2   # true number of clusters
p.true <- myDirichlet(rep(10, K))  # true weight of each cluster
# true cluster membership
z.true <- sample(K, n, replace = TRUE, prob = p.true)
# true probability of positive responses per cluster:
theta.true <- array(data = NA, dim = c(K, d))
for (j in 1:d) {
  theta.true[, j] <- rbeta(K, shape1 = 1, shape2 = 1)
}
x <- array(data = NA, dim = c(n, d))  # data: n x d array
for (k in 1:K) {
  myIndex <- which(z.true == k)
  for (j in 1:d) {
    x[myIndex, j] <- rbinom(n = length(myIndex), size = 1, prob = theta.true[k, j])
  }
}
# number of heated parallel chains
nChains <- 2
heats <- seq(1, 0.8, length.out = nChains)
## Not run:
cm <- coupledMetropolis(Kmax = 10, nChains = nChains, heats = heats,
                        binaryData = x, outPrefix = 'BayesBinMixExample',
                        ClusterPrior = 'poisson', m = 1100, burn = 100)
# print summary using:
print(cm)
## End(Not run)
# It is also advised to pass z.true = z.true in order to compare directly with
# the true values. In general it is advised to use at least 4 chains with
# heats <- seq(1, 0.3, length.out = nChains)
This is a wrapper for the label.switching package. It is used to post-process the generated MCMC sample in order to undo the label switching problem. This function is called internally by the coupledMetropolis command.
dealWithLabelSwitching(outDir, reorderModels, binaryData, z.true)
outDir | The directory where the output of |
reorderModels | Boolean value indicating whether to reorder the MCMC corresponding to each distinct generated value of number of clusters or not. |
binaryData | The input data. |
z.true | An optional vector of cluster assignments considered as the ground-truth clustering of the observations. Useful for simulations. |
See the label.switching package.
Panagiotis Papastamoulis
This is a wrapper for the label.switching package. It is used to post-process the generated MCMC sample in order to undo the label switching problem. This function is called internally by the coupledMetropolis command.
dealWithLabelSwitchingMissing(outDir, reorderModels, binaryData, z.true)
outDir | The directory where the output of |
reorderModels | Boolean value indicating whether to reorder the MCMC corresponding to each distinct generated value of number of clusters or not. |
binaryData | The input data. |
z.true | An optional vector of cluster assignments considered as the ground-truth clustering of the observations. Useful for simulations. |
See the label.switching package.
Panagiotis Papastamoulis
This function implements full Gibbs sampling to simulate an MCMC sample from the posterior distribution, assuming a known number of mixture components.
gibbsBinMix(alpha, beta, gamma, K, m, burn, data, thinning, z.true, outputDir)
alpha | First shape parameter of the Beta prior distribution (strictly positive). Defaults to 1. |
beta | Second shape parameter of the Beta prior distribution (strictly positive). Defaults to 1. |
gamma | |
K | Number of clusters. |
m | Number of MCMC iterations. |
burn | Burn-in period. |
data | Binary data. |
thinning | Thinning of the simulated chain. |
z.true | An optional vector of cluster assignments considered as the ground-truth clustering of the observations. Useful for simulations. |
outputDir | Output directory. |
Not really used.
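One sweep of a full Gibbs sampler for a Bernoulli mixture with known K can be sketched as follows. This is the textbook conjugate scheme (allocations, then weights, then Bernoulli parameters), written as an illustration and not taken from the package source. The function name gibbsSweep is hypothetical, gamma is assumed to be a vector of K positive Dirichlet parameters, and the data are assumed to contain no missing values.

```r
# Illustrative helper (hypothetical name, not a BayesBinMix function).
# One full Gibbs sweep for a K-component Bernoulli mixture.
# x: n x d binary matrix, p: current weights, theta: current K x d parameters.
gibbsSweep <- function(x, p, theta, alpha = 1, beta = 1,
                       gamma = rep(1, length(p))) {
  n <- nrow(x); d <- ncol(x); K <- length(p)
  # 1. allocations: P(z_i = k | ...) proportional to p_k * prod_j f(x_ij | theta_kj)
  logW <- sapply(seq_len(K), function(k) {
    log(p[k]) + x %*% log(theta[k, ]) + (1 - x) %*% log(1 - theta[k, ])
  })
  logW <- matrix(logW, nrow = n)
  z <- apply(logW, 1, function(lw) sample.int(K, 1, prob = exp(lw - max(lw))))
  # 2. weights: Dirichlet(gamma + cluster sizes), via normalised gamma draws
  nk <- tabulate(z, nbins = K)
  g <- rgamma(K, shape = gamma + nk, rate = 1)
  p <- g / sum(g)
  # 3. Bernoulli parameters: Beta(alpha + successes, beta + failures)
  for (k in seq_len(K)) {
    skj <- colSums(x[z == k, , drop = FALSE])
    theta[k, ] <- rbeta(d, alpha + skj, beta + nk[k] - skj)
  }
  list(z = z, p = p, theta = theta)
}

set.seed(1)
x <- matrix(rbinom(60, size = 1, prob = 0.5), nrow = 20, ncol = 3)
state <- gibbsSweep(x, p = c(0.5, 0.5), theta = matrix(c(0.2, 0.8), 2, 3))
str(state)
```

Repeating the sweep, feeding each output back in as the next input, yields the MCMC sample; empty clusters are handled because their parameters are then drawn from the prior.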
Panagiotis Papastamoulis
This function simulates random vectors from a Dirichlet distribution.
myDirichlet(alpha)
alpha | Vector of positive numbers denoting the parameters of the Dirichlet distribution. |
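A common way to simulate from a Dirichlet distribution is to normalise independent Gamma draws. The sketch below shows that construction in base R; whether myDirichlet uses exactly this approach internally is an assumption, and the name rDirichletSketch is hypothetical.

```r
# Illustrative helper (hypothetical name, not a BayesBinMix function).
# Draw one vector from Dirichlet(alpha) by normalising Gamma(alpha_k, 1) draws.
rDirichletSketch <- function(alpha) {
  stopifnot(all(alpha > 0))
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(1)
w <- rDirichletSketch(rep(10, 2))   # weights for two clusters
w
sum(w)
```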
Panagiotis Papastamoulis
This function prints a summary of objects returned by the coupledMetropolis function.
## S3 method for class 'bbm.object'
print(x, printSubset, ...)
x | An object of class |
printSubset | Logical indicating whether to print the header or the whole matrix of estimates. Default value: TRUE. |
... | ignored. |
The function prints the estimated distribution of the number of clusters, the estimated number of observations assigned to each cluster after post-processing the output with three label switching algorithms, as well as the header of the posterior mean estimates of $\theta_{kj}$ (probability of success for cluster $k$ and feature $j$), conditionally on the most probable number of clusters.
Panagiotis Papastamoulis
Approximately solves equation (25) of Nobile and Fearnside (2007).
toSolve(a, n, p0)
a | Positive number. |
n | Positive integer. |
p0 | Probability. |
Panagiotis Papastamoulis