Abstract: This paper applies cluster analysis to business management. Cluster analysis is a technique well suited to our research questions: it enables the user to uncover the natural underlying structure of a complex data set. In doing so, we can distinguish types of firms along with their particular board organization and level of corporate quality.
Single linkage clustering computes the similarity of two clusters as the similarity of their most similar members, whereas complete linkage clustering measures the similarity of two clusters as the similarity of their most dissimilar members. In our analysis, we choose Ward's method (Ward, 1963), a method distinct from both of the aforementioned methods, as our linkage function. Ward's method chooses each successive merging step by the criterion of minimizing the increase in the error sum of squares (ESS) at each step. The ESS of a set X of NX values is given by the functional relation,

ESS(X) = Σi=1..NX |xi − x̄X|²

where: x̄X is the mean of the NX values in X;
|.| is the absolute value of a scalar value or the norm (the 'length') of a vector.
The linkage function giving the distance between clusters X and Y is,

d(X, Y) = ESS(XY) − [ESS(X) + ESS(Y)]

where: XY is the combined cluster resulting from merging clusters X and Y;
ESS(.) is the error sum of squares described above.
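Ward's criterion above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function names `ess` and `ward_linkage` and the sample clusters are our own.

```python
import numpy as np

def ess(X):
    """Error sum of squares: squared distances of each row from the cluster mean."""
    centroid = X.mean(axis=0)
    return float(((X - centroid) ** 2).sum())

def ward_linkage(X, Y):
    """Increase in ESS caused by merging clusters X and Y (Ward's criterion)."""
    XY = np.vstack([X, Y])
    return ess(XY) - (ess(X) + ess(Y))

# At each agglomeration step, the pair of clusters with the smallest
# ward_linkage value would be merged.
X = np.array([[1.0, 2.0], [1.5, 1.8]])
Y = np.array([[5.0, 8.0], [6.0, 8.5]])
merge_cost = ward_linkage(X, Y)
```

Note that the merge cost is always non-negative: combining two clusters can never reduce the total within-cluster scatter.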
In addition to a linkage function, a metric for measuring the distance between two objects is required. In our study, the Squared Euclidean Distance (SED) is chosen as the distance metric for both Hierarchical Clustering and K Means clustering. If two objects, x1 and x2, in Euclidean n-space are given by x1 = (x11, x12, …, x1n) and x2 = (x21, x22, …, x2n), then the SED between these two objects is,

SED(x1, x2) = Σi=1..n (x1i − x2i)²
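The SED formula is straightforward to compute; the following one-line sketch (the function name `sed` is ours) makes the definition concrete:

```python
def sed(x1, x2):
    """Squared Euclidean Distance between two n-dimensional points."""
    return sum((a - b) ** 2 for a, b in zip(x1, x2))

# Componentwise: (1-4)^2 + (2-6)^2 + (3-3)^2 = 9 + 16 + 0 = 25
d = sed((1, 2, 3), (4, 6, 3))
```

Unlike the plain Euclidean distance, SED omits the square root, which progressively weights larger component differences more heavily.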
While Agglomerative Hierarchical Clustering (AHC) does not require the user to specify the number of clusters, k, a priori, a drawback of AHC is that it is subject to input order instability. In steps 2 and 3 of an AHC, a problem arises when two pairs of clusters are both calculated to have the smallest distance value. “In such cases arbitrary [italics added] decisions must be made” (Sneath & Sokal, 1973) to choose the pair of clusters that will be merged. These arbitrary decisions extend to computer programs (Spaans & Van der Kloot, 2005), and as a result, different input orders of objects in the proximity matrix can produce significantly different cluster solutions (Van der Kloot et al., 2005). To avoid this pitfall, we employ PermuCLUSTER for SPSS (Van der Kloot et al., 2005). This program repeats AHC a user-specified number of times, permuting the rows and columns of the proximity matrix each time. Thereafter, it evaluates the quality of each AHC solution using a goodness-of-fit measure (SSDIFN) given by,

SSDIFN = Σi<j (dij − cij)² / Σi<j dij²
where: dij are the distances between the objects in the original proximity matrix;
cij are the distances between the objects in the AHC tree.
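As a sketch of this goodness-of-fit measure, the following function (our own reconstruction, not PermuCLUSTER's code; the name `ssdifn` follows the paper's label) computes the normalized sum of squared differences between the original distances and the tree distances, where a smaller value indicates a better fit:

```python
import numpy as np

def ssdifn(d, c):
    """Normalized sum of squared differences between the original proximity
    distances d and the cophenetic (tree) distances c; 0 means a perfect fit."""
    d = np.asarray(d, dtype=float)
    c = np.asarray(c, dtype=float)
    return float(((d - c) ** 2).sum() / (d ** 2).sum())
```

Both `d` and `c` would be the condensed (upper-triangular) forms of the respective distance matrices.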
In our analysis, we first employ PermuCLUSTER with the number of AHC repetitions set at 500 and evaluate the resultant optimal solutions. Thereafter, we validate the cluster solutions (k = 2 to 35) of the optimal solution.
K Means clustering
In K Means clustering, K refers to the number of clusters, which, though unknown a priori, has to be specified by the user. Each cluster has a centroid, usually computed as the mean of the variable vectors in that cluster. Each object is assigned to the cluster of its nearest centroid.
The basic process of the K Means clustering (MacQueen, 1967) is:
Determine initial centroids.
Find the closest centroid to each object and assign the object to the cluster associated with this centroid.
Recalculate the centroid for each of the new clusters.
Repeat steps 2 and 3 until the cluster assignments no longer change.
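The steps above can be sketched as follows. This is a minimal illustration under our own assumptions (random initial centroids, squared Euclidean distance, a fixed iteration cap), not MacQueen's original implementation:

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Plain K Means: assign each object to its nearest centroid, then
    recompute centroids as cluster means, until assignments stabilize."""
    rng = np.random.default_rng(seed)
    # Step 1: initial centroids chosen as k distinct objects at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the cluster of its closest centroid
        # (squared Euclidean distance to every centroid).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = np.array(
            [points[labels == j].mean(axis=0) for j in range(k)]
        )
        # Stop once the centroids (and hence assignments) no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

A production implementation would also handle clusters that become empty during iteration; this sketch omits that case for brevity.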