Estimation of the Mixture of Gaussians Using Mata

Dallakyan Aramayis
Dec 23, 2020


In this series of articles, I will cover one of the vital topics in the statistical and machine learning literature: Mixtures of Gaussians. I will develop the topic in the context of cluster analysis. For these purposes, three popular techniques for the Mixture of Gaussians will be discussed: the EM algorithm, Variational Inference, and MCMC. The first is a frequentist approach, while the last two are Bayesian. Some prior knowledge of optimization and Bayesian analysis is required. For presentational purposes, we will concentrate only on the univariate case; the extension to the multivariate case follows after some algebra. The coverage is based on two fundamental books, "Pattern Recognition and Machine Learning" by C. M. Bishop and "Machine Learning: A Probabilistic Perspective" by K. P. Murphy.

I will use Mata and Stata as the main software. I will assume that the reader is familiar with both languages, in particular with classes and user-written functions. If the reader is not a frequent user of these features, I suggest reading the pithy introduction "The Mata Book" by William W. Gould.

Mixture of Gaussians

The Mixture of Gaussians, as the name suggests, is a finite sum of Gaussian densities,

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2),

where each Gaussian density

\mathcal{N}(x \mid \mu_k, \sigma_k^2)

is called a component of the mixture and has its own mean and variance. The mixing coefficients \pi_k satisfy

0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1.
For example, the plot below illustrates a histogram of data generated from a three-component mixture of Gaussians. The green line shows an erroneous single-component Gaussian fit, which fails to capture the true structure.

The data in the plot above were generated using the Mata-written function genMixed. Details are given in the implementation section below.

The problem we are going to solve is to estimate the mixing coefficients and the mean and variance of each of the K components, with the final goal of finding the clusters.
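To make the definition concrete, the short Mata snippet below evaluates a three-component mixture density at a single point. The parameter values are purely illustrative and are not taken from the article's example.

mata:
// illustrative mixing coefficients, means, and standard deviations
prob  = (0.3, 0.5, 0.2)
mu    = (-3, 0, 4)
sigma = (1, 0.8, 1.5)

// evaluate p(x) = sum_k prob[k] * N(x | mu[k], sigma[k]^2) at x = 1.2
x = 1.2
p = 0
for (k = 1; k <= cols(prob); k++) {
    p = p + prob[k]*normalden(x, mu[k], sigma[k])
}
p
end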

EM for Gaussian Mixtures

Introduction

As we discussed, the Gaussian mixture can be viewed as a linear superposition of Gaussian components, which expands the class of density models compared with a single Gaussian. The key idea is to formulate the Gaussian mixture in terms of discrete latent variables. The introduction of latent variables follows from the clever idea of data augmentation: we introduce a new variable z and, instead of working with the marginal distribution of x, which is hard to handle in some cases, we deal with the more tractable joint distribution. Details are given below.
Recall that

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2).
We augment the model with a K-dimensional binary random variable z, in which a particular element z_j equals 1 and all other elements are equal to 0. It is easy to see that

z_k \in \{0, 1\}, \qquad \sum_{k=1}^{K} z_k = 1,
and the vector z has K possible states, according to which of its elements is nonzero.
The fundamental reason for bringing in the latent variable is, as we will see later in mathematical form, to avoid working directly with the marginal distribution p(x), whose likelihood is hard to maximize, and instead work with the joint distribution p(x, z).
We assume that the marginal distribution over z is specified in terms of the mixing coefficients,

p(z_k = 1) = \pi_k,
and, since z has a one-of-K representation, the distribution over the whole vector z takes the form

p(z) = \prod_{k=1}^{K} \pi_k^{z_k}.
On the other hand, given the definition of z_k, as soon as it is known that z_k = 1 we know which component the data point comes from, i.e. the conditional distribution is

p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \sigma_k^2),

and in vector form

p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \sigma_k^2)^{z_k}.
For example, if our data has three clusters, K = 3, then z is a 3 x 1 vector. Suppose a particular observation belongs to the second cluster; then z_2 = 1 and

z = (0, 1, 0)^{\top}, \qquad p(z) = \pi_2.
A similar expression holds for p(x|z). Now, recalling that the joint distribution is p(x, z) = p(z)p(x|z), the marginal distribution of x can be written as

p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2).
In other words, we arrive at an equivalent formulation of the Gaussian mixture, but this time we explicitly incorporate the latent variable in it.

The innocent reader may claim that we did not gain anything significant from the augmentation and instead made our life harder. However, as the old statistical saying goes, if you can choose the distribution, always choose the joint one. Here, because of the above marginal representation, given several observations x_1, …, x_N we can represent the marginal distribution in the form

p(x_1, \ldots, x_N) = \prod_{n=1}^{N} \sum_{z_n} p(z_n)\, p(x_n \mid z_n),
from which we see, first, that for every observed data point x_n there is a corresponding latent variable z_n, and second, that we can work with the joint distribution, which leads to significant simplifications in the EM algorithm. Before diving into it, let me introduce one more important quantity, the conditional probability of z given x, which we denote by 𝛾. It is easy to show that

\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \sigma_j^2)}.
This quantity can be interpreted as the responsibility that component k takes for “explaining” the observation x.
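Since the responsibilities reappear in every step of the algorithm below, here is a small Mata helper that computes them for a whole data vector at once. The function name and signature are illustrative; they are not part of the article's code.

mata:
// resp[n, k] = responsibility of component k for observation x[n]
// x: N x 1 data vector; prob, mu, sigma: 1 x K parameter vectors
real matrix responsibilities(real colvector x, real rowvector prob,
                             real rowvector mu, real rowvector sigma)
{
    real matrix resp
    real scalar k

    resp = J(rows(x), cols(prob), .)
    for (k = 1; k <= cols(prob); k++) {
        // numerator: pi_k * N(x_n | mu_k, sigma_k^2)
        resp[., k] = prob[k] :* normalden(x, mu[k], sigma[k])
    }
    // divide each row by its sum so the responsibilities sum to one
    return(resp :/ rowsum(resp))
}
end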

EM Algorithm

The curious reader may object: since the marginal distribution is known, why can't we use the maximum likelihood approach to estimate parameters such as the mixing coefficients, means, and variances? Suppose we have i.i.d. observations x_1, …, x_N; then the log-likelihood takes the form

\ln p(x_1, \ldots, x_N \mid \pi, \mu, \sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \sigma_k^2) \right\}.
A person familiar with optimization might see a black cloud hanging overhead when trying to optimize the function above from first principles. The problem is that the sum sits inside the log, so the maximization has no closed-form solution. That is when the expectation-maximization (EM) algorithm, a powerful and elegant method for finding maximum likelihood solutions in latent variable models, enters the arena.
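To see the objection in code, the helper below (an illustrative function, not one from the article) evaluates this log-likelihood; note that the sum over components sits inside ln(), which is exactly what rules out a closed-form maximizer.

mata:
// log-likelihood of a univariate Gaussian mixture:
// sum_n ln( sum_k prob[k] * N(x[n] | mu[k], sigma[k]^2) )
real scalar mixLogLik(real colvector x, real rowvector prob,
                      real rowvector mu, real rowvector sigma)
{
    real matrix dens
    real scalar k

    dens = J(rows(x), cols(prob), .)
    for (k = 1; k <= cols(prob); k++) {
        dens[., k] = prob[k] :* normalden(x, mu[k], sigma[k])
    }
    return(sum(ln(rowsum(dens))))    // the inner sum is inside the log
}
end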

We begin by writing down the first-order conditions that must be satisfied at a maximum of the likelihood function. In other words, we take derivatives with respect to the parameters μ_k, σ_k, and π_k. After some algebra we obtain

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \sigma_k^2 = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,(x_n - \mu_k)^2, \qquad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk}).
The maximization with respect to the mixing coefficients π needs a little more work, since we need to respect the following conditions:

0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1.
The latter can easily be incorporated into the objective function using a Lagrange multiplier (omitted here), which gives

\pi_k = \frac{N_k}{N}.
The critical point to remember is that the above three equations do not form a closed-form solution and cannot be evaluated independently: each of them contains the common γ term, which depends on all of the parameters μ, σ, and π in an intertwined way. Therefore, we resort to an iterative procedure. The EM algorithm for the Gaussian mixture iterates the following steps:

1. Initialize the means μ_k, variances σ_k², and mixing coefficients π_k.
2. E step: evaluate the responsibilities γ(z_nk) using the current parameter values.
3. M step: re-estimate μ_k, σ_k², and π_k using the equations above with the current responsibilities.
4. Evaluate the log-likelihood and check for convergence; if the criterion is not satisfied, return to step 2.

One of the drawbacks of this algorithm is that we have to pre-specify the number of clusters K. Unfortunately, in real-life applications the latter is usually not available. The second drawback is that the estimated clusters are not identifiable. In the follow-up articles, we will see that the Variational Inference and Gibbs sampling-based algorithms do not require pre-specifying the number of clusters, but the identifiability issue still holds.

Mata and Stata Implementation
In this section, I am going to use Mata to implement the above algorithm. As mentioned before, I assume that the reader has a basic knowledge of working with Stata and Mata. Before diving into details, we introduce the function genMixed, which was used to generate the plot in the first part of the article. genMixed generates clustered data from a mixture of Gaussians. In the listing, lines 15-44 are the main workhorse written in Mata, while lines 1-9 are the API for Stata. The function takes four inputs (a condensed sketch of such a generator follows the list):

1. n - number of observations to generate
2. prob - vector of probabilities for the Gaussian mixture components
3. means - vector of means for the components
4. sds - vector of standard deviations

and returns data and the corresponding class for each observation.
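The full listing is not reproduced here, but the sketch below shows what a generator with these inputs can look like. The function name genMixedSketch and its internals are illustrative and may differ from the article's actual genMixed.

mata:
// draw n observations from a univariate Gaussian mixture and
// return an n x 2 matrix: column 1 = data, column 2 = class label
real matrix genMixedSketch(real scalar n, real rowvector prob,
                           real rowvector means, real rowvector sds)
{
    real colvector grp, x
    real scalar i, k

    grp = rdiscrete(n, 1, prob')          // component label for each observation
    x   = J(n, 1, .)
    for (i = 1; i <= n; i++) {
        k    = grp[i]
        x[i] = rnormal(1, 1, means[k], sds[k])
    }
    return((x, grp))
}

// example call with the illustrative parameters used earlier:
// D = genMixedSketch(1000, (0.3, 0.5, 0.2), (-3, 0, 4), (1, 0.8, 1.5))
end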

After solving the data generation issue, we are ready for the EM algorithm. The function EM takes four inputs

1. x - univariate data
2. nclass - number of clusters
3. tolerance - tolerance for the algorithm convergence.
4. niter - maximum number of iterations.

and returns the estimated μ, σ, and π, and the assigned cluster for each observation.

Lines 1-17 of the listing introduce the EM function API in Stata. The main updates of the parameters are implemented in lines 74-84. The rest are utility functions and are self-explanatory. A condensed sketch of the core loop is given below.
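The full EM listing is likewise not reproduced here. The sketch below implements the same E and M updates for the univariate case; the name emSketch, the initialization, and the output-by-reference design are illustrative choices, not the article's exact code.

mata:
// condensed EM loop for a univariate Gaussian mixture; results are
// returned through the last four arguments (Mata passes them by address)
void emSketch(real colvector x, real scalar nclass,
              real scalar tolerance, real scalar niter,
              real rowvector mu, real rowvector sigma, real rowvector prob,
              real colvector cl)
{
    real matrix resp
    real rowvector Nk
    real scalar n, k, i, it, ll, llOld

    n = rows(x)

    // crude initialization: equal weights, means at random data points,
    // common standard deviation
    prob = J(1, nclass, 1/nclass)
    mu   = J(1, nclass, .)
    for (k = 1; k <= nclass; k++) {
        mu[k] = x[1 + floor(n*runiform(1, 1))]
    }
    sigma = J(1, nclass, sqrt(variance(x)))

    llOld = .
    for (it = 1; it <= niter; it++) {
        // E step: responsibilities resp[n, k]
        resp = J(n, nclass, .)
        for (k = 1; k <= nclass; k++) {
            resp[., k] = prob[k] :* normalden(x, mu[k], sigma[k])
        }
        ll   = sum(ln(rowsum(resp)))          // current log-likelihood
        resp = resp :/ rowsum(resp)

        // M step: re-estimate mu, sigma, and prob from the responsibilities
        Nk = colsum(resp)
        for (k = 1; k <= nclass; k++) {
            mu[k]    = sum(resp[., k] :* x) / Nk[k]
            sigma[k] = sqrt(sum(resp[., k] :* (x :- mu[k]):^2) / Nk[k])
        }
        prob = Nk :/ n

        // stop when the log-likelihood improvement falls below tolerance
        if (it > 1 & abs(ll - llOld) < tolerance) break
        llOld = ll
    }

    // hard assignment: component with the largest responsibility
    cl = J(n, 1, 1)
    for (i = 1; i <= n; i++) {
        for (k = 2; k <= nclass; k++) {
            if (resp[i, k] > resp[i, cl[i]]) cl[i] = k
        }
    }
}

// usage sketch: initialize the outputs, then call by reference
// mu = .; sigma = .; prob = .; cl = .
// emSketch(x, 5, 1e-6, 500, mu, sigma, prob, cl)
end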

Measuring Performance

Here, we use the function genMixed to generate univariate data with K=5 classes and n=1000 observations. It is instructive to note that the true classes are available to us. Therefore, after running the EM function, we use the adjusted Rand index (ARI) as a measure of performance for the EM algorithm (see Chapter 25 of Murphy (2012) for more details). The ARI is at most 1 and is close to 0 for a random labeling; values closer to 1 indicate better performance. For the adjusted Rand index estimation, we resort to the ari function available from the package sadi.
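For readers who prefer not to install an extra package, the adjusted Rand index can also be computed directly in Mata from the two label vectors. The sketch below follows the standard Hubert-Arabie formula; the function name ariSketch is illustrative, and it assumes the labels are coded 1, 2, …, K.

mata:
// adjusted Rand index between two labelings a and b (Hubert-Arabie form);
// assumes labels are positive integers 1, 2, ..., K
real scalar ariSketch(real colvector a, real colvector b)
{
    real matrix T
    real scalar n, i, sumCells, sumRows, sumCols, expected, maximum

    n = rows(a)

    // contingency table: T[i, j] = number of observations with a = i and b = j
    T = J(max(a), max(b), 0)
    for (i = 1; i <= n; i++) {
        T[a[i], b[i]] = T[a[i], b[i]] + 1
    }

    // pair counts, using C(m, 2) = m*(m - 1)/2
    sumCells = sum(T :* (T :- 1)) / 2
    sumRows  = sum(rowsum(T) :* (rowsum(T) :- 1)) / 2
    sumCols  = sum(colsum(T) :* (colsum(T) :- 1)) / 2

    expected = sumRows * sumCols / (n*(n - 1)/2)
    maximum  = (sumRows + sumCols) / 2

    return((sumCells - expected) / (maximum - expected))
}
end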

The ARI index for this simulation is equal to 0.66. We fixed the random seed for reproducibility purposes.

What next?
In the follow-up articles, we will estimate the Mixture of Gaussians using Variational Inference and Gibbs sampling. There we will not need to know the number of clusters in advance; instead, we will rely on the Dirichlet distribution to learn it.

Reference

Bishop, C. M. (2006), Pattern Recognition and Machine Learning. Springer.

Gould, W. W. (2018), The Mata Book. Stata Press.

Murphy, K. P. (2012), Machine Learning: A Probabilistic Perspective. The MIT Press.


Dallakyan Aramayis

Currently, I am a Ph.D. student in Statistics. My research interests are at the intersection of applied, computational, and theoretical statistics.