Package 'multinomialLogitMix' reference manual

Title:	Clustering Multinomial Count Data under the Presence of Covariates
Description:	Methods for model-based clustering of multinomial counts under the presence of covariates using mixtures of multinomial logit models, as implemented in Papastamoulis (2023) <DOI:10.1007/s11634-023-00547-5>. These models are estimated under a frequentist as well as a Bayesian setup using the Expectation-Maximization algorithm and Markov chain Monte Carlo sampling (MCMC), respectively. The (unknown) number of clusters is selected according to the Integrated Completed Likelihood criterion (for the frequentist model), and estimating the number of non-empty components using overfitting mixture models after imposing suitable sparse prior assumptions on the mixing proportions (in the Bayesian case), see Rousseau and Mengersen (2011) <DOI:10.1111/j.1467-9868.2011.00781.x>. In the latter case, various MCMC chains run in parallel and are allowed to switch states. The final MCMC output is suitably post-processed in order to undo label switching using the Equivalence Classes Representatives (ECR) algorithm, as described in Papastamoulis (2016) <DOI:10.18637/jss.v069.c01>.
Authors:	Panagiotis Papastamoulis [aut, cre]
Maintainer:	Panagiotis Papastamoulis <[email protected]>
License:	GPL-2
Version:	1.1
Built:	2025-03-07 02:48:13 UTC
Source:	https://github.com/cran/multinomialLogitMix

Clustering Multinomial Count Data under the Presence of Covariates

Description

Methods for model-based clustering of multinomial counts under the presence of covariates using mixtures of multinomial logit models, as implemented in Papastamoulis (2023) <DOI:10.1007/s11634-023-00547-5>. These models are estimated under a frequentist as well as a Bayesian setup using the Expectation-Maximization algorithm and Markov chain Monte Carlo sampling (MCMC), respectively. The (unknown) number of clusters is selected according to the Integrated Completed Likelihood criterion (for the frequentist model), and estimating the number of non-empty components using overfitting mixture models after imposing suitable sparse prior assumptions on the mixing proportions (in the Bayesian case), see Rousseau and Mengersen (2011) <DOI:10.1111/j.1467-9868.2011.00781.x>. In the latter case, various MCMC chains run in parallel and are allowed to switch states. The final MCMC output is suitably post-processed in order to undo label switching using the Equivalence Classes Representatives (ECR) algorithm, as described in Papastamoulis (2016) <DOI:10.18637/jss.v069.c01>.

Details

The DESCRIPTION file:

Package:	multinomialLogitMix
Type:	Package
Title:	Clustering Multinomial Count Data under the Presence of Covariates
Version:	1.1
Date:	2023-07-13
Authors@R:	c(person(given = "Panagiotis", family = "Papastamoulis", email = "[email protected]", role = c( "aut", "cre"), comment = c(ORCID = "0000-0001-9468-7613")))
Maintainer:	Panagiotis Papastamoulis <[email protected]>
Description:	Methods for model-based clustering of multinomial counts under the presence of covariates using mixtures of multinomial logit models, as implemented in Papastamoulis (2023) <DOI:10.1007/s11634-023-00547-5>. These models are estimated under a frequentist as well as a Bayesian setup using the Expectation-Maximization algorithm and Markov chain Monte Carlo sampling (MCMC), respectively. The (unknown) number of clusters is selected according to the Integrated Completed Likelihood criterion (for the frequentist model), and estimating the number of non-empty components using overfitting mixture models after imposing suitable sparse prior assumptions on the mixing proportions (in the Bayesian case), see Rousseau and Mengersen (2011) <DOI:10.1111/j.1467-9868.2011.00781.x>. In the latter case, various MCMC chains run in parallel and are allowed to switch states. The final MCMC output is suitably post-processed in order to undo label switching using the Equivalence Classes Representatives (ECR) algorithm, as described in Papastamoulis (2016) <DOI:10.18637/jss.v069.c01>.
License:	GPL-2
Imports:	Rcpp (>= 1.0.8.3), MASS, doParallel, foreach, label.switching, ggplot2, coda, matrixStats, mvtnorm, RColorBrewer
LinkingTo:	Rcpp, RcppArmadillo
NeedsCompilation:	yes
Packaged:	2023-07-14 07:55:32 UTC; panos
Author:	Panagiotis Papastamoulis [aut, cre] (<https://orcid.org/0000-0001-9468-7613>)
Date/Publication:	2023-07-17 05:00:02 UTC
Repository:	https://mqbssppe.r-universe.dev
RemoteUrl:	https://github.com/cran/multinomialLogitMix
RemoteRef:	HEAD
RemoteSha:	91a3e35ac2777aebcf9ff1c7fb83ef09728a901c

Index of help topics:

dealWithLabelSwitching
                        Post-process the generated MCMC sample in order
                        to undo possible label switching.
expected_complete_LL    Expected complete LL
gibbs_mala_sampler      The core of the Hybrid Gibbs/MALA MCMC sampler
                        for the multinomial logit mixture.
gibbs_mala_sampler_ppt
                        Prior parallel tempering scheme of hybrid
                        Gibbs/MALA MCMC samplers for the multinomial
                        logit mixture.
log_dirichlet_pdf       Log-density function of the Dirichlet
                        distribution
mala_proposal           Proposal mechanism of the MALA step.
mixLoglikelihood_GLM    Log-likelihood of the multinomial logit.
mix_mnm_logistic        EM algorithm
multinomialLogitMix     Main function
multinomialLogitMix-package
                        Clustering Multinomial Count Data under the
                        Presence of Covariates
multinomial_logistic_EM
                        Part of the EM algorithm for multinomial logit
                        mixture
myDirichlet             Simulate from the Dirichlet distribution
newton_raphson_mstep    M-step of the EM algorithm
shakeEM_GLM             Shake-small EM
simulate_multinomial_data
                        Synthetic data generator
splitEM_GLM             Split-small EM scheme.

See the main function of the package, multinomialLogitMix, which automatically wraps calls to the MCMC sampler gibbs_mala_sampler_ppt and the EM algorithm mix_mnm_logistic.

Author(s)

Panagiotis Papastamoulis [aut, cre] (<https://orcid.org/0000-0001-9468-7613>)

Maintainer: Panagiotis Papastamoulis <[email protected]>

References

Papastamoulis, P. Model based clustering of multinomial count data. Advances in Data Analysis and Classification (2023). https://doi.org/10.1007/s11634-023-00547-5

Papastamoulis, P. and Iliopoulos, G. (2010). An Artificial Allocations Based Solution to the Label Switching Problem in Bayesian Analysis of Mixtures of Distributions. Journal of Computational and Graphical Statistics, 19(2), 313-331. http://www.jstor.org/stable/25703571

Papastamoulis, P. (2016). label.switching: An R Package for Dealing with the Label Switching Problem in MCMC Outputs. Journal of Statistical Software, Code Snippets, 69(1), 1-24. https://doi.org/10.18637/jss.v069.c01

Rousseau, J. and Mengersen, K. (2011), Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73: 689-710. https://doi.org/10.1111/j.1467-9868.2011.00781.x

Post-process the generated MCMC sample in order to undo possible label switching.

Description

This function implements the Equivalence Classes Representatives (ECR) algorithm from the label.switching package in order to undo the label switching phenomenon.

Usage

dealWithLabelSwitching(gs, burn, thin = 10, zPivot = NULL, returnRaw = FALSE, maxM = NULL)

Arguments

`gs`	An object generated by the main function of the package.
`burn`	Number of draws that will be discarded as burn-in.
`thin`	Thinning of the MCMC sample.
`zPivot`	Optional vector of allocations that will be used as the pivot of the ECR algorithm. If this is not supplied, the pivot will be selected as the allocation vector that corresponds to the iteration that maximized the log-likelihood of the model.
`returnRaw`	Boolean. If true, the function will also return the raw output.
`maxM`	Not used.

Details

See Papastamoulis (2016).

Value

`cluster`	Single best clustering of the data, according to the Maximum A Posteriori rule.
`nClusters_posterior`	Estimated posterior distribution of the number of clusters.
`mcmc`	Post-processed MCMC output.
`posteriorProbabilities`	Estimated posterior membership probabilities.
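
As a small illustration of how the returned quantities relate to each other, the sketch below derives a Maximum A Posteriori clustering from a toy matrix of membership probabilities and tabulates toy draws of the number of clusters. The layout (rows are observations, columns are clusters) and all object names are assumptions made for this example, not the documented structure of the returned object.

```r
# Toy posterior membership probabilities: one row per observation, one column per cluster
posterior_probs <- rbind(c(0.9, 0.1, 0.0),
                         c(0.2, 0.5, 0.3),
                         c(0.1, 0.2, 0.7))

# MAP rule: assign each observation to its most probable cluster
map_clustering <- apply(posterior_probs, 1, which.max)

# Toy MCMC draws of the number of non-empty components
nClusters_draws <- c(2, 2, 3, 2, 2, 3, 2)

# Relative frequencies estimate the posterior distribution of the number of clusters
nClusters_post <- prop.table(table(nClusters_draws))
```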

Author(s)

Panagiotis Papastamoulis

References

Papastamoulis, P. (2016). label.switching: An R Package for Dealing with the Label Switching Problem in MCMC Outputs. Journal of Statistical Software, 69(1), 1-24.

Expected complete LL

Description

This function is not used at the moment.

Usage

expected_complete_LL(y, X, b, w, pr)

Arguments

`y`	count data.
`X`	design matrix.
`b`	Logit coefficients.
`w`	mixing proportions.
`pr`	mixing proportions.

Value

Complete log-likelihood of the model.

Author(s)

Panagiotis Papastamoulis

The core of the Hybrid Gibbs/MALA MCMC sampler for the multinomial logit mixture.

Description

This function implements Gibbs sampling to update the mixing proportions and latent allocation variables of the mixture model. The coefficients of the logit model are updated according to a Metropolis-Hastings type move, based on a Metropolis-adjusted Langevin algorithm (MALA) proposal.
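
As a rough illustration of a Metropolis-adjusted Langevin move, the sketch below applies MALA updates to a generic differentiable log-density. It is not the package's internal code: the helper `mala_step`, its arguments, and the standard normal target are assumptions chosen for illustration, and `step` only plays a role loosely analogous to the proposal scale `nu2`.

```r
# Minimal MALA sketch for a generic target (hypothetical helper, not package code).
# log_target: function returning the log-density up to an additive constant
# grad_log_target: function returning the gradient of log_target
# step: proposal scale (an assumption; loosely analogous to `nu2`)
mala_step <- function(x, log_target, grad_log_target, step) {
  # Langevin proposal: drift along the gradient plus Gaussian noise
  mean_fwd <- x + 0.5 * step * grad_log_target(x)
  prop <- mean_fwd + sqrt(step) * rnorm(length(x))
  # Mean of the reverse proposal, needed for the Metropolis-Hastings correction
  mean_bwd <- prop + 0.5 * step * grad_log_target(prop)
  log_q_fwd <- -sum((prop - mean_fwd)^2) / (2 * step)
  log_q_bwd <- -sum((x - mean_bwd)^2) / (2 * step)
  log_ratio <- log_target(prop) - log_target(x) + log_q_bwd - log_q_fwd
  accepted <- log(runif(1)) < log_ratio
  list(x = if (accepted) prop else x, accepted = accepted)
}

# Toy target: standard bivariate normal
log_target <- function(x) -0.5 * sum(x^2)
grad_log_target <- function(x) -x

set.seed(1)
x <- c(3, -3)
draws <- matrix(NA_real_, nrow = 2000, ncol = 2)
n_accepted <- 0
for (i in seq_len(nrow(draws))) {
  out <- mala_step(x, log_target, grad_log_target, step = 0.5)
  x <- out$x
  n_accepted <- n_accepted + out$accepted
  draws[i, ] <- x
}
acceptance_rate <- n_accepted / nrow(draws)
```

In a sketch like this one, the scale would typically be decreased when the acceptance rate falls below a lower threshold and increased when it exceeds an upper threshold, which is the kind of tuning the `ar_low`, `ar_up` and `checkAR` arguments refer to.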

Usage

gibbs_mala_sampler(y, X, tau = 3e-05, nu2, K, mcmc_iter = 100, 
	alpha_prior = NULL, start_values = "EM", em_iter = 10, 
	thin = 10, verbose = FALSE, checkAR = NULL, 
	probsSave = FALSE, ar_low = 0.4, ar_up = 0.6)

Arguments

`y`	matrix of counts.
`X`	design matrix (including constant term).
`tau`	the variance of the normal prior distribution of the logit coefficients.
`nu2`	scale of the MALA proposal (positive).
`K`	number of components of the (overfitting) mixture model.
`mcmc_iter`	Number of MCMC iterations.
`alpha_prior`	Parameter of the Dirichlet prior distribution for the mixing proportions.
`start_values`	Optional list of starting values. Random initialization is used if this is not provided.
`em_iter`	Maximum number of iterations if an EM initialization is enabled.
`thin`	optional thinning of the generated MCMC output.
`verbose`	Boolean.
`checkAR`	Number of iterations to adjust the scale of the proposal in MALA mechanism during the initial warm-up phase of the sampler.
`probsSave`	Optional.
`ar_low`	Lowest threshold for the acceptance rate of the MALA proposal (optional).
`ar_up`	Highest threshold for the acceptance rate of the MALA proposal (optional).

Value

`nClusters`	sampled values of the number of clusters (non-empty mixture components).
`allocations`	sampled values of the latent allocation variables.
`logLikelihood`	Log-likelihood values per MCMC iteration.
`mixing_proportions`	sampled values of mixing proportions.
`coefficients`	sampled values of the coefficients of the multinomial logit.
`complete_logLikelihood`	Complete log-likelihood values per MCMC iteration.
`class_probs`	Classification probabilities per iteration (optional).
`AR`	Acceptance rate of the MALA proposal.

Note

This function is called inside the prior parallel tempering scheme (gibbs_mala_sampler_ppt), which is the main MCMC function of the package.

Author(s)

Panagiotis Papastamoulis

Examples

#	Generate synthetic data
	K <- 2
	p <- 2
	D <- 2
	n <- 2
	set.seed(116)
	simData <- simulate_multinomial_data(K = K, p = p, D = D, n = n, size = 20, prob = 0.025)   


	gs <- gibbs_mala_sampler(y = simData$count_data, X = simData$design_matrix, 
		tau = 0.00035, nu2 = 100, K = 2, mcmc_iter = 3, 
		alpha_prior = rep(1,K), start_values = "RANDOM", 
		thin = 1, verbose = FALSE, checkAR = 100)

Prior parallel tempering scheme of hybrid Gibbs/MALA MCMC samplers for the multinomial logit mixture.

Description

The main MCMC scheme of the package. Multiple chains are run in parallel and swaps between chains are proposed. Each chain uses different parameters for the Dirichlet prior of the mixing proportions. The smallest concentration parameter should correspond to the first chain, which is the one used for inference. Subsequent chains should have larger values of the concentration parameter of the Dirichlet prior.
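
When chains share the same likelihood and differ only in the Dirichlet prior of the mixing proportions, the likelihood terms cancel in the usual Metropolis-Hastings ratio for swapping two complete states, leaving only prior terms. The sketch below computes the log acceptance ratio under that assumption. The helper names and toy values are hypothetical, and the package's own implementation may differ in its details.

```r
# Log-density of a Dirichlet distribution (base R, for self-containment)
log_dirichlet <- function(w, alpha) {
  lgamma(sum(alpha)) - sum(lgamma(alpha)) + sum((alpha - 1) * log(w))
}

# Log acceptance ratio for swapping the states of chains i and j.
# w_i, w_j: current mixing proportions; alpha_i, alpha_j: Dirichlet parameters
swap_log_ratio <- function(w_i, w_j, alpha_i, alpha_j) {
  log_dirichlet(w_j, alpha_i) + log_dirichlet(w_i, alpha_j) -
    log_dirichlet(w_i, alpha_i) - log_dirichlet(w_j, alpha_j)
}

K <- 3
alpha_cold <- rep(0.005, K)  # sparse prior (the chain used for inference)
alpha_hot  <- rep(0.5, K)    # larger concentration parameter
w_cold <- c(0.98, 0.01, 0.01)
w_hot  <- c(0.40, 0.35, 0.25)

log_ratio <- swap_log_ratio(w_cold, w_hot, alpha_cold, alpha_hot)

# Accept the swap with probability min(1, exp(log_ratio))
set.seed(1)
accept_swap <- log(runif(1)) < log_ratio
```

The normalizing constants cancel, so the ratio reduces to `sum((alpha_i - alpha_j) * (log(w_j) - log(w_i)))`; two chains with identical priors would always accept a swap.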

Usage

gibbs_mala_sampler_ppt(y, X, tau = 3e-05, nu2, K, 
	mcmc_cycles = 100, iter_per_cycle = 10, dirPriorAlphas, 
	start_values = "EM", em_iter = 10, nChains = 4, nCores = 4, 
	warm_up = 100, checkAR = 50, probsSave = FALSE, 
	showGraph = 50, ar_low = 0.4, ar_up = 0.6, withRandom = TRUE)

Arguments

`y`	matrix of counts.
`X`	design matrix (including constant term).
`tau`	the variance of the normal prior distribution of the logit coefficients.
`nu2`	scale of the MALA proposal (positive).
`K`	number of components of the (overfitting) mixture model.
`mcmc_cycles`	Number of MCMC cycles. At the end of each cycle, a swap between chains is attempted.
`iter_per_cycle`	Number of iterations per cycle.
`dirPriorAlphas`	Vector of concentration parameters for the Dirichlet priors in increasing order.
`start_values`	Optional list of start values. Randomly generated values are used if this is not provided.
`em_iter`	Maximum number of iterations if an EM initialization is enabled.
`nChains`	Total number of parallel chains.
`nCores`	Total number of CPU cores for parallel processing of the `nChains`.
`warm_up`	Initial warm-up period of the sampler, in order to adaptively tune the scale of the MALA proposal (optional).
`checkAR`	Number of iterations to adjust the scale of the proposal in MALA mechanism during the initial warm-up phase of the sampler.
`probsSave`	Optional.
`showGraph`	Optional.
`ar_low`	Lowest threshold for the acceptance rate of the MALA proposal (optional).
`ar_up`	Highest threshold for the acceptance rate of the MALA proposal (optional).
`withRandom`	Logical. If TRUE (default) then a random permutation is applied to the supplied starting values.

Details

See the paper for details.

Value

`nClusters`	sampled values of the number of clusters (non-empty mixture components).
`allocations`	sampled values of the latent allocation variables.
`logLikelihood`	Log-likelihood values per MCMC iteration.
`mixing_proportions`	sampled values of mixing proportions.
`coefficients`	sampled values of the coefficients of the multinomial logit.
`complete_logLikelihood`	Complete log-likelihood values per MCMC iteration.
`class_probs`	Classification probabilities per iteration (optional).
`AR`	Acceptance rate of the MALA proposal.

Note

The output of the MCMC sampler is not identifiable, due to possible label switching. In order to draw meaningful inferences, the output should be post-processed by dealWithLabelSwitching.

Author(s)

Panagiotis Papastamoulis

References

Papastamoulis, P. (2022). Model-based clustering of multinomial count data.

Examples

#	Generate synthetic data

	K <- 2
	p <- 2
	D <- 3
	n <- 2
	set.seed(116)
	simData <- simulate_multinomial_data(K = K, p = p, D = D, n = n, size = 20, prob = 0.025)   



# apply mcmc sampler based on random starting values 

Kmax = 2
nChains = 2
dirPriorAlphas  = c(1, 1 + 5*exp((seq(2, 14, length = nChains - 1)))/100)/(200)
nCores <- 2
mcmc_cycles <- 2
iter_per_cycle = 2
warm_up <- 2

mcmc_random1 <-  gibbs_mala_sampler_ppt( y = simData$count_data, X = simData$design_matrix, 
		tau = 0.00035, nu2 = 100,  K = Kmax, dirPriorAlphas = dirPriorAlphas,
		mcmc_cycles = mcmc_cycles, iter_per_cycle = iter_per_cycle, 
		start_values = 'RANDOM', 
		nChains = nChains, nCores = nCores, warm_up = warm_up, showGraph = 1000, 
		checkAR = 1000)

#sampled values for the number of clusters (non-empty mixture components) per chain (columns)
mcmc_random1$nClusters

Log-density function of the Dirichlet distribution

Description

Log-density function of the Dirichlet distribution

Usage

log_dirichlet_pdf(alpha, weights)

Arguments

`alpha`	Parameter vector
`weights`	Vector of weights.

Value

Log-density of the Dirichlet distribution $D(\alpha_1,\ldots,\alpha_k)$ evaluated at $w_1,\ldots,w_k$.
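
For reference, the standard form of the Dirichlet log-density, writing $\Gamma$ for the gamma function and assuming $w_j > 0$ with $\sum_{j=1}^{k} w_j = 1$, is:

$$\log f(w_1,\ldots,w_k \mid \alpha_1,\ldots,\alpha_k) = \log\Gamma\Big(\sum_{j=1}^{k}\alpha_j\Big) - \sum_{j=1}^{k}\log\Gamma(\alpha_j) + \sum_{j=1}^{k}(\alpha_j - 1)\log w_j$$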

Author(s)

Panagiotis Papastamoulis

Proposal mechanism of the MALA step.

Description

Only the mala_proposal_cpp function, which is written with Rcpp, is used in the package.

Usage

mala_proposal(y, X, b, z, tau, A = FALSE, pr, nu2)

Arguments

`y`	count data
`X`	design matrix
`b`	coefficients (array)
`z`	allocation vector
`tau`	prior variance
`A`	A
`pr`	mixing proportions
`nu2`	parameter nu2

Value

`theta`	theta values
`b`	coefficients
`acceptance`	log-likelihood.
`gradient`	log-likelihood.

Author(s)

Panagiotis Papastamoulis

EM algorithm

Description

Estimation of the multinomial logit mixture using the EM algorithm. The algorithm exploits a careful initialization procedure (Papastamoulis et al., 2016) combined with a ridge-stabilized implementation of the Newton-Raphson method (Goldfeld et al., 1966) in the M-step.

Usage

mix_mnm_logistic(y, X, Kmax = 10, maxIter = 100, emthreshold = 1e-08, 
	maxNR = 5, nCores, tsplit = 8, msplit = 5, split = TRUE, 
	shake = TRUE, random = TRUE, criterion = "ICL", 
	plotting = FALSE, R0 = 0.1, method = 5)

Arguments

`y`	matrix of counts
`X`	design matrix (including constant term).
`Kmax`	Maximum number of mixture components.
`maxIter`	Maximum number of iterations.
`emthreshold`	Minimum log-likelihood difference between successive iterations in order to terminate.
`maxNR`	maximum number of Newton-Raphson iterations
`nCores`	number of cores for parallel computations.
`tsplit`	positive integer denoting the number of different runs for each call of the splitting small EM used by split-small EM initialization procedure.
`msplit`	positive integer denoting the number of different runs for each call of the splitting small EM.
`split`	Boolean indicating if the split initialization should be enabled in the small-EM scheme.
`shake`	Boolean indicating if the shake initialization should be enabled in the small-EM scheme.
`random`	Boolean indicating if random initializations should be enabled in the small-EM scheme.
`criterion`	set to "ICL" to select the number of clusters according to the ICL criterion.
`plotting`	Boolean for displaying intermediate graphical output.
`R0`	controls the step size of the update: smaller values result in larger step sizes. See the description in the paper.
`method`	this should be set to 5.

Value

`estimated_K`	selected value of the number of clusters.
`all_runs`	detailed output per run.
`BIC_values`	values of the Bayesian information criterion (BIC).
`ICL_BIC_values`	values of ICL-BIC.
`estimated_clustering`	Single best-clustering of the data, according to the MAP rule.

Author(s)

Panagiotis Papastamoulis

References

Papastamoulis P (2022). Model-based clustering of multinomial count data. arXiv:2207.13984 [stat.ME]

Examples

#	Generate synthetic data

	K <- 2
	p <- 2
	D <- 3
	n <- 2
	set.seed(116)
	simData <- simulate_multinomial_data(K = K, p = p, D = D, n = n, size = 20, prob = 0.025)   

	
	SplitShakeSmallEM <- mix_mnm_logistic(y = simData$count_data, 
		X = simData$design_matrix, Kmax = 2, maxIter = 1, 
		emthreshold = 1e-8, maxNR = 1, nCores = 2, tsplit = 1, 
		msplit = 2, split = TRUE, R0 = 0.1, method = 5, 
		plotting = FALSE)
	#selected number of clusters
	SplitShakeSmallEM$estimated_K
	#estimated single best-clustering, according to MAP rule
	SplitShakeSmallEM$estimated_clustering
	# detailed output for all parameters of the selected number of clusters
	SplitShakeSmallEM$all_runs[[SplitShakeSmallEM$estimated_K]]
	

Log-likelihood of the multinomial logit.

Description

Log-likelihood of the multinomial logit.

Usage

mixLoglikelihood_GLM(y, theta, pi)

Arguments

`y`	matrix of counts
`theta`	a three-dimensional array containing the multinomial probabilities per cluster, for each observation.
`pi`	a numeric vector of length K (the number of mixture components) containing the mixing proportions.

Value

Log-likelihood value.
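
To make the quantity concrete, here is a minimal base-R sketch of a mixture-of-multinomials log-likelihood. The function name `mix_loglik_sketch` is hypothetical, and the array layout `theta[i, d, k]` (observation, category, cluster) is an assumption for this example; the manual does not state the ordering of the dimensions expected by mixLoglikelihood_GLM.

```r
# Sketch of a mixture-of-multinomials log-likelihood (hypothetical helper).
# Assumed layout: theta[i, d, k] = probability of category d for observation i in cluster k.
mix_loglik_sketch <- function(y, theta, pi) {
  n <- nrow(y)
  K <- length(pi)
  ll <- 0
  for (i in seq_len(n)) {
    # log(pi_k) + log multinomial pmf of y[i, ] under cluster k
    log_terms <- vapply(seq_len(K), function(k) {
      log(pi[k]) + dmultinom(y[i, ], prob = theta[i, , k], log = TRUE)
    }, numeric(1))
    # log-sum-exp for numerical stability
    m <- max(log_terms)
    ll <- ll + m + log(sum(exp(log_terms - m)))
  }
  ll
}

# Toy example: n = 2 observations, D = 3 categories, K = 2 clusters
y <- rbind(c(5, 3, 2), c(1, 0, 9))
theta <- array(NA_real_, dim = c(2, 3, 2))
theta[, , 1] <- rbind(c(0.5, 0.3, 0.2), c(0.5, 0.3, 0.2))
theta[, , 2] <- rbind(c(0.1, 0.1, 0.8), c(0.1, 0.1, 0.8))
loglik <- mix_loglik_sketch(y, theta, pi = c(0.6, 0.4))
```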

Author(s)

Panagiotis Papastamoulis

Part of the EM algorithm for multinomial logit mixture

Description

Part of the EM algorithm for multinomial logit mixture

Usage

multinomial_logistic_EM(y, x, K, w_start, b_start, 
	maxIter = 1000, emthreshold = 1e-08, maxNR = 5, 
	nCores = NULL, verbose = FALSE, R0, method)

Arguments

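`y`	matrix of counts.
`x`	design matrix (including constant term).
`K`	number of mixture components.
`w_start`	starting values of the mixing proportions.
`b_start`	starting values of the logit coefficients.
`maxIter`	maximum number of iterations.
`emthreshold`	minimum log-likelihood difference between successive iterations in order to terminate.
`maxNR`	maximum number of Newton-Raphson iterations.
`nCores`	number of cores for parallel computations.
`verbose`	Boolean.
`R0`	controls the step size of the update (see mix_mnm_logistic).
`method`	this should be set to 5.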

Value

value

Author(s)

Panagiotis Papastamoulis

Main function

Description

The main function of the package.

Usage

multinomialLogitMix(response, design_matrix, method, 
	Kmax = 10, mcmc_parameters = NULL, em_parameters = NULL, 
	nCores, splitSmallEM = TRUE)

Arguments

`response`	matrix of counts.
`design_matrix`	design matrix (including constant term).
`method`	character with two possible values: "EM" or "MCMC" indicating the desired method in order to estimate the model.
`Kmax`	number of components of the (overfitting) mixture model.
`nCores`	Total number of CPU cores for parallel processing.
`mcmc_parameters`	List with the parameter set-up of the MCMC sampler. See details for changing the defaults.
`em_parameters`	List with the parameter set-up of the EM algorithm. See details for changing the defaults.
`splitSmallEM`	Boolean value, indicating whether the split-small EM scheme should be used to initialize the `method`. Default: TRUE (suggested).

Details

The parameter setup of the EM algorithm and the MCMC sampler is detailed below. The following specifications correspond to the minimal default settings. Larger values of tsplit will result in better performance.

em_parameters <- list(maxIter = 100, emthreshold = 1e-08, maxNR = 10, tsplit = 16, msplit = 10, split = TRUE, R0 = 0.1, plotting = TRUE)

mcmc_parameters <- list(tau = 0.00035, nu2 = 100, mcmc_cycles = 2600, iter_per_cycle = 20, nChains = 8, dirPriorAlphas = c(1, 1 + 5 * exp((seq(2, 14, length = nChains - 1)))/100)/(200), warm_up = 48000, checkAR = 500, probsSave = FALSE, showGraph = 100, ar_low = 0.15, ar_up = 0.25, burn = 100, thin = 1, withRandom = TRUE)
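
Note that the `dirPriorAlphas` expression refers to `nChains`, so a variable with that name must exist before the list is built. The short sketch below evaluates the expression on its own for `nChains = 8`, to show that it yields one concentration parameter per chain in increasing order.

```r
nChains <- 8
dirPriorAlphas <- c(1, 1 + 5 * exp(seq(2, 14, length = nChains - 1)) / 100) / 200

length(dirPriorAlphas)           # one concentration parameter per chain
dirPriorAlphas[1]                # 1/200, the sparsest prior (chain used for inference)
all(diff(dirPriorAlphas) > 0)    # values increase along the chains
```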

Value

`EM`	List with the results of the EM algorithm.
`MCMC_raw`	List with the raw output of the MCMC sampler - not identifiable MCMC output.
`MCMC_post_processed`	Post-processed MCMC, used for the inference.

Author(s)

Panagiotis Papastamoulis

References

Papastamoulis, P. Model based clustering of multinomial count data. Advances in Data Analysis and Classification (2023). https://doi.org/10.1007/s11634-023-00547-5

Examples

#	Generate synthetic data

	K <- 2	#number of clusters
	p <- 2	#number of covariates (constant incl)
	D <- 5	#number of categories
	n <- 20 #generated number of observations
	set.seed(1)
	simData <- simulate_multinomial_data(K = K, p = p, D = D, n = n, size = 20, prob = 0.025)   


	# EM parameters
em_parameters <- list(maxIter = 100, emthreshold = 1e-08, 
    maxNR = 10, tsplit = 16, msplit = 10, split = TRUE, 
    R0 = 0.1, plotting = TRUE)

	#  MCMC parameters - just for illustration
	#	typically, set `mcmc_cycles` and `warm_up` to larger values,
	#	such as `mcmc_cycles = 2500` or more
	#	and `warm_up = 40000` or more.
	nChains <- 2 #(set this to a larger value, such as 8 or more)
	mcmc_parameters <- list(tau = 0.00035, nu2 = 100, mcmc_cycles = 260, 
	    iter_per_cycle = 20, nChains = nChains, dirPriorAlphas = c(1, 
		1 + 5 * exp((seq(2, 14, length = nChains - 1)))/100)/(200), 
	    warm_up = 4800, checkAR = 500, probsSave = FALSE, 
	    showGraph = 100, ar_low = 0.15, ar_up = 0.25, burn = 100, 
	    thin = 1, withRandom = TRUE)

	# run EM with split-small-EM initialization, and then use the output to 
	#	initialize MCMC algorithm for an overfitting mixture with 
	#	Kmax = 5 components (max number of clusters - usually this is 
	#	set to a larger value, e.g. 10 or 20).
	#	Note: 
	#		1. the MCMC output is based on the non-empty components
	#		2. the EM algorithm clustering corresponds to the selected 
	#			number of clusters according to ICL.
	#		3. `nCores` should be adjusted according to your available cores.
	
	mlm <- multinomialLogitMix(response = simData$count_data, 
		design_matrix = simData$design_matrix, method = "MCMC", 
             Kmax = 5, nCores = 2, splitSmallEM = TRUE, 
             mcmc_parameters = mcmc_parameters, em_parameters = em_parameters)
	# retrieve clustering according to EM
	mlm$EM$estimated_clustering
	# retrieve clustering according to MCMC
	mlm$MCMC_post_processed$cluster	
	

Simulate from the Dirichlet distribution

Description

Generate a random draw from the Dirichlet distribution $D(\alpha_1,\ldots,\alpha_k)$.

Usage

myDirichlet(alpha)

Arguments

`alpha`	Parameter vector

Value

Simulated vector
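
One standard way to obtain such a draw is to normalize independent gamma variates. The sketch below shows this construction in base R. The name `rdirichlet_sketch` is hypothetical, and the manual does not say whether myDirichlet uses this construction internally.

```r
# Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_j, 1) variates
rdirichlet_sketch <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(1)
w <- rdirichlet_sketch(c(2, 3, 5))
```

For very small concentration parameters the gamma draws can underflow to zero, which a more careful implementation would need to guard against.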

Author(s)

Panagiotis Papastamoulis

M-step of the EM algorithm

Description

Implements the maximization step of the EM algorithm based on a ridge-stabilized version of the Newton-Raphson algorithm, see Goldfeld et al. (1966).
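
The idea of a ridge-stabilized Newton-Raphson update can be sketched on a simpler problem. The example below maximizes a binary logistic regression log-likelihood, adding a ridge term to the negative Hessian and enlarging it whenever a step fails to increase the log-likelihood. This is an illustrative stand-in, not the package's multinomial M-step: the function names, the doubling rule, and the parameter `lambda` (only loosely analogous to `R0`, in that smaller values give larger steps) are all assumptions.

```r
# Log-likelihood of a binary logistic regression
loglik <- function(b, X, y) {
  eta <- drop(X %*% b)
  sum(y * eta - log1p(exp(eta)))
}

# Ridge-stabilized Newton-Raphson sketch (hypothetical helper)
ridge_newton <- function(X, y, b = rep(0, ncol(X)), lambda = 0.1,
                         max_iter = 50, tol = 1e-10) {
  for (iter in seq_len(max_iter)) {
    p <- plogis(drop(X %*% b))
    grad <- drop(crossprod(X, y - p))
    neg_hess <- crossprod(X, X * (p * (1 - p)))  # minus the Hessian
    repeat {
      # The ridge term lambda * I keeps the linear system well conditioned
      b_new <- b + drop(solve(neg_hess + lambda * diag(ncol(X)), grad))
      if (loglik(b_new, X, y) >= loglik(b, X, y)) break
      lambda <- 2 * lambda  # shrink the step if the log-likelihood decreased
    }
    converged <- max(abs(b_new - b)) < tol
    b <- b_new
    if (converged) break
  }
  p <- plogis(drop(X %*% b))
  list(b = b, ll = loglik(b, X, y), gradient = drop(crossprod(X, y - p)))
}

set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))
y <- rbinom(n, size = 1, prob = plogis(drop(X %*% c(-0.5, 1))))
fit <- ridge_newton(X, y)
```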

Usage

newton_raphson_mstep(y, X, b, w, maxNR = 5, R0 = 0.1, method = 5, verbose = FALSE)

Arguments

`y`	count data matrix
`X`	design matrix (including constant term).
`b`	coefficients of the multinomial logit mixture
`w`	mixing proportions
`maxNR`	maximum number of Newton-Raphson iterations.
`R0`	initial value for the parameter that controls the step size of the update.
`method`	this should always be set to 5.
`verbose`	Boolean.

Value

`b`	coefficients
`theta`	theta values
`ll`	log-likelihood.

Author(s)

Panagiotis Papastamoulis

References

Goldfeld, S. M., Quandt, R. E., and Trotter, H. F. (1966). Maximization by quadratic hill-climbing. Econometrica: Journal of the Econometric Society, 541-551.

Shake-small EM

Description

Assume that there are at least two clusters in the fitted model. We randomly select 2 of them and propose to randomly re-allocate the assigned observations within those 2 clusters.
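
The re-allocation step described above can be sketched on a vector of hard cluster labels. The helper `shake_allocations` is hypothetical, and the uniform re-allocation between the two selected clusters is an assumption; the manual does not specify how the package draws the new allocations.

```r
# Sketch of a "shake" move on a hard clustering (hypothetical helper).
# z: vector of cluster labels with at least two distinct clusters
shake_allocations <- function(z) {
  labels <- unique(z)
  stopifnot(length(labels) >= 2)
  picked <- sample(labels, 2)      # two randomly selected clusters
  idx <- which(z %in% picked)      # observations assigned to either of them
  # Randomly re-allocate those observations between the two clusters
  z[idx] <- sample(picked, length(idx), replace = TRUE)
  z
}

set.seed(1)
z <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
z_new <- shake_allocations(z)
```

Observations outside the two selected clusters keep their labels, so the move perturbs the current partition only locally before the EM iterations are resumed.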

Usage

shakeEM_GLM(y, x, K, equalModel, tsplit = 10, maxIter = 20, 
	emthreshold = 1e-08, maxNR = 5, nCores, 
	split = TRUE, R0, method)

Arguments

`y`	matrix of counts.
`x`	design matrix (including constant term).
`K`	number of mixture components.
`equalModel`	eq
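`tsplit`	positive integer denoting the number of different runs (see mix_mnm_logistic).
`maxIter`	maximum number of iterations.
`emthreshold`	minimum log-likelihood difference between successive iterations in order to terminate.
`maxNR`	maximum number of Newton-Raphson iterations.
`nCores`	number of cores for parallel computations.
`split`	Boolean indicating if the split initialization should be enabled (see mix_mnm_logistic).
`R0`	controls the step size of the update (see mix_mnm_logistic).
`method`	this should be set to 5.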

Value

valu

Author(s)

Panagiotis Papastamoulis

Synthetic data generator

Description

This function simulates data from a mixture of multinomial logistic regression models.

Usage

simulate_multinomial_data(K, p, D, n, size = 20, prob = 0.025, betaTrue = NULL)

Arguments

`K`	Number of clusters.
`p`	Number of covariates, including constant.
`D`	Number of multinomial categories.
`n`	Number of data points to simulate.
`size`	Negative Binomial parameter (number of successes). Default: 20.
`prob`	Negative Binomial parameter (probability of success). Default: 0.025.
`betaTrue`	An array which contains the true values of the logit coefficients per cluster. Default: randomly generated values.

Value

`count_data`	matrix of data counts.
`design_matrix`	design matrix.
`clustering`	Ground-truth partition of the data.
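
To show what data of this kind look like, the following base-R sketch simulates counts from a mixture of multinomial logit models. It is not the package's generator: the helper `simulate_sketch`, the single standard normal covariate, equal cluster weights, the last category as baseline, and negative binomial row totals are all assumptions made for illustration.

```r
# Sketch: simulate counts from a mixture of multinomial logit models (hypothetical helper)
simulate_sketch <- function(K, D, n, size = 20, prob = 0.025) {
  p <- 2
  X <- cbind(1, rnorm(n))                                  # n x p design matrix
  beta <- array(rnorm(K * (D - 1) * p), dim = c(K, D - 1, p))
  z <- sample(K, n, replace = TRUE)                        # cluster labels
  totals <- rnbinom(n, size = size, prob = prob)           # total count per row
  y <- matrix(0L, nrow = n, ncol = D)
  for (i in seq_len(n)) {
    eta <- c(beta[z[i], , ] %*% X[i, ], 0)                 # linear predictors, baseline = 0
    probs <- exp(eta - max(eta)) / sum(exp(eta - max(eta)))  # softmax
    y[i, ] <- rmultinom(1, size = totals[i], prob = probs)
  }
  list(count_data = y, design_matrix = X, clustering = z, totals = totals)
}

set.seed(1)
sim <- simulate_sketch(K = 2, D = 3, n = 10)
```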

Author(s)

Panagiotis Papastamoulis

Split-small EM scheme.

Description

Split two randomly selected clusters based on a model with one component fewer than the current one. This procedure is repeated within a small-EM scheme. The best split is chosen to initialize the model.

Usage

splitEM_GLM(y, x, K, smallerModel, tsplit = 10, maxIter = 20, 
	emthreshold = 1e-08, maxNR = 5, nCores, 
	split = TRUE, R0, method)

Arguments

`y`	matrix of counts.
`x`	design matrix (including constant term).
`K`	number of mixture components.
`smallerModel`	smla
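`tsplit`	positive integer denoting the number of different runs (see mix_mnm_logistic).
`maxIter`	maximum number of iterations.
`emthreshold`	minimum log-likelihood difference between successive iterations in order to terminate.
`maxNR`	maximum number of Newton-Raphson iterations.
`nCores`	number of cores for parallel computations.
`split`	Boolean indicating if the split initialization should be enabled (see mix_mnm_logistic).
`R0`	controls the step size of the update (see mix_mnm_logistic).
`method`	this should be set to 5.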

Value

val

Author(s)

Panagiotis Papastamoulis

References

Papastamoulis, P., Martin-Magniette, M. L., and Maugis-Rabusseau, C. (2016). On the estimation of mixtures of Poisson regression models with large number of components. Computational Statistics & Data Analysis, 93, 97-106.
