Abstract: This paper applies cluster analysis to corporate management research. Cluster analysis is a technique that can address our research question: it enables users to identify the natural underlying structure of complex data sets. In doing so, we can distinguish types of firms as well as their particular board organization and quality levels.
based on new cluster memberships.
Iterate through steps 2 and 3 until convergence.
The algorithm converges when the cluster memberships of the data points remain unchanged. At that point, other widely used convergence criteria, such as the computed centroids and the sum of squared distances from the data points to their centroids, also remain constant.
K-means clustering iteratively reassigns objects to clusters, seeking to minimize the sum of squared distances, denoted by J, between each object and its cluster centroid. The sum of squared distances Ji for the ith cluster, denoted Ci, is given by,

$$J_i = \sum_{x \in C_i} \lVert x - y_i \rVert^2$$

where: ||x − yi||² is the squared Euclidean distance from object x in Ci to its centroid yi.

The sum of squared distances over all k clusters is given by,

$$J = \sum_{i=1}^{k} J_i = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - y_i \rVert^2$$
In step 1, different sets of initial centroids can ultimately lead to different local minima of J, whereas we would like to find the cluster solution that attains the global minimum. Trying every possible set of initial centroids would guarantee this, but it is computationally expensive and thus not viable. As an alternative, we repeat K-means clustering (for k = 2 to 35) 500 times with 500 random sets of initial centroids, retaining the solution that is either the global minimum or at least the local minimum closest to the global minimum among those found.
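As a concrete sketch of this restart procedure (a minimal NumPy illustration of standard Lloyd's-algorithm k-means, not the implementation used in the study), each run starts from random centroids and the run with the lowest J is kept:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=None):
    """One k-means run; returns (labels, centroids, J)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Step 2: assign each object to its nearest centroid
        # (squared Euclidean distance).
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # memberships unchanged: converged
        labels = new_labels
        # Step 3: recompute each centroid from its new members.
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    # J: sum of squared distances of all objects to their centroids.
    J = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, J

def best_of_restarts(X, k, n_restarts=500):
    """Repeat k-means with random initial centroids; keep the lowest J."""
    return min((kmeans(X, k, seed=s) for s in range(n_restarts)),
               key=lambda run: run[2])
```

Keeping the minimum-J run over many restarts is what makes the solution a global minimum, or at least the best of the local minima encountered.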
Expectation Maximization for Gaussian Mixture Model
In the Gaussian Mixture Model (GMM), the Expectation Maximization (EM) algorithm seeks the maximum likelihood estimates of the mixture parameters when the model depends on unobserved latent variables.
The main steps of the EM method are (Dempster et al., 1977):
1. Initialize the parameters (mean and variance) of each of the k Gaussian distributions.
2. Using the probability density function of the Gaussian distribution, calculate the probability density of each feature vector under each of the k clusters.
3. With the probability densities calculated in step 2, re-compute the parameters of each of the k Gaussian distributions.
4. Repeat steps 2 and 3 until convergence.
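These four steps can be sketched for a one-dimensional mixture (a minimal NumPy illustration; the quantile-based initialization and fixed iteration budget are our assumptions, and the paper itself uses MIXMOD for Matlab):

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=200):
    """EM for a univariate Gaussian mixture; returns (p, mu, var)."""
    # Step 1: initialize proportions, means, and variances.
    p = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means
    var = np.full(k, x.var())
    for _ in range(n_iter):
        # Step 2 (E-step): density of each point under each component,
        # weighted by the mixing proportions.
        dens = p * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
                 / np.sqrt(2 * np.pi * var)
        t = dens / dens.sum(axis=1, keepdims=True)   # responsibilities
        # Step 3 (M-step): re-estimate the parameters of each Gaussian.
        nk = t.sum(axis=0)
        p = nk / len(x)
        mu = (t * x[:, None]).sum(axis=0) / nk
        var = (t * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    # Step 4: a fixed iteration budget stands in for a convergence test;
    # a fuller implementation would stop when the log-likelihood stabilizes.
    return p, mu, var
```

On well-separated data the estimated means converge to the true component means, with the mixing proportions summing to one.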
We perform EM clustering using MIXMOD for Matlab (Biernacki et al., 2006); its statistical documentation is summarized as follows. Clustering with mixture models partitions the n objects into K clusters denoted by labels z = (z1, …, zn), with zi = (zi1, …, ziK) and zik = 1 or 0 according to whether xi is assigned to the kth cluster or not. In a mixture model where the n independent vectors of a dataset are represented by x = {x1, …, xn}, each xi arises from a probability distribution with density,

$$f(x_i \mid \theta) = \sum_{k=1}^{K} p_k\, h(x_i \mid \lambda_k)$$
where: pk is the mixing proportion of the kth component (0 < pk < 1 for all k = 1, …, K and p1 + … + pK = 1)
h(.|λk) is the d-dimensional distribution parameterized by λk.
As such, we can show how each xi arises from a probability distribution with density in a GMM by replacing h(·|λk) with the associated d-dimensional Gaussian density with mean μk and variance matrix Σk,

$$f(x_i \mid \theta) = \sum_{k=1}^{K} p_k\, (2\pi)^{-d/2}\, \lvert \Sigma_k \rvert^{-1/2} \exp\!\left(-\tfrac{1}{2}(x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k)\right)$$

where:
θ = (p1, …, pK, μ1, …, μK, Σ1, …, ΣK) is the vector of the mixture parameters.
Clusters can be derived from the maximum likelihood estimates of the mixture parameters, obtained using the Expectation Maximization (EM) algorithm. The maximum likelihood estimate of the GMM maximizes the log-likelihood,

$$L(\theta \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} \ln\!\left(\sum_{k=1}^{K} p_k\, h(x_i \mid \mu_k, \Sigma_k)\right)$$
Each xi is assigned to the cluster that provides the largest conditional probability that xi arises from it,

$$t_k(x_i \mid \theta) = \frac{p_k\, h(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} p_j\, h(x_i \mid \mu_j, \Sigma_j)}$$
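As an illustration of this assignment rule, the following sketch uses hypothetical fitted parameters p, mu, and var (chosen for the example, not values from the study) and assigns each point to the component with the largest conditional probability, here for the univariate case:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    # Univariate Gaussian density (d = 1 keeps the sketch short).
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def map_assign(x, p, mu, var):
    """Assign each x_i to the component with the largest
    conditional (posterior) probability."""
    dens = p * gauss_pdf(x[:, None], mu, var)   # p_k * h(x_i | mu_k, var_k)
    t = dens / dens.sum(axis=1, keepdims=True)  # conditional probabilities
    return t.argmax(axis=1)

# Hypothetical fitted two-component mixture, for illustration only.
p = np.array([0.5, 0.5])
mu = np.array([0.0, 5.0])
var = np.array([1.0, 1.0])
x = np.array([-0.3, 0.2, 4.8, 5.1])
print(map_assign(x, p, mu, var))  # -> [0 0 1 1]
```

With equal proportions and variances, as here, the rule reduces to assigning each point to the component with the nearest mean.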