
a clustering tendency map. In a clustering tendency map, high U-matrix values (represented by dark-coloured hexagons) indicate possible cluster borders, while uniform areas of low values (represented by light-coloured hexagons) indicate possible clusters. Figure 4.3 illustrates a high clustering tendency map and a low clustering tendency map.

Internal Validation

We mainly use internal validation indices to evaluate the fitness of a cluster solution. Fitness measures are associated with the geometrical properties of clusters (i.e. compactness, separation and connectedness). We rely on these properties because most clustering methods optimize them to discover the underlying group structure in the data (Johnson, 1967; Dempster et al., 1977; Kaufman and Rousseeuw, 1990; Handl and Knowles, 2006). Internal validation indices also allow us to find the optimal number of clusters (k), indicated by the cluster solution with the highest quality. For Hierarchical Clustering and K-Means clustering, we use the program CVAP (Wang et al., 2009) to validate our cluster solutions with two different indices, Average Silhouette Width and C Index, to ensure that our clustering results are robust to different validation measures.

Average Silhouette Width

Average Silhouette Width is a composite index that measures both the compactness and the separation of clusters (Kaufman and Rousseeuw, 1990). Silhouette width compares the similarity between an object and the other objects in its own cluster with the similarity between that object and the objects in a neighbour cluster. The neighbour cluster N(Xi) of object Xi in cluster C(Xi) is defined as the cluster, among all clusters other than C(Xi), whose objects have the shortest average distance to Xi. The neighbour cluster N(Xi) is given by,

$$N(X_i) = \arg\min_{C_k \neq C(X_i)} \frac{1}{|C_k|} \sum_{X_j \in C_k} d(X_i, X_j)$$

where:
Xi is an object in the dataset
Ck ranges over the clusters of the solution, and |Ck| is the number of objects in Ck
d(Xi, Xj) is the distance between two objects Xi and Xj

The silhouette width for Xi, denoted by Si, is given by,

$$S_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where:
ai is the average distance between Xi and the other objects in cluster C(Xi)
bi is the average distance between Xi and the objects in neighbour cluster N(Xi)

Silhouette width, Si, ranges from -1 to 1. When Si is close to 1, Xi is likely to have been assigned to the appropriate cluster. When Si is close to 0, Xi could equally well be assigned to another cluster, and when Si is close to -1, Xi is likely to have been assigned to the wrong cluster. Average Silhouette Width (AS) is given as,

$$AS = \frac{1}{m} \sum_{i=1}^{m} S_i$$

where m is the number of objects in the dataset.

Thus, the best cluster solution, and with it the optimal number of clusters (k), is the one with the largest AS.
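To make the definition concrete, the following is a minimal sketch of the silhouette computation in plain Python. It is not the CVAP implementation. It assumes Euclidean distance, at least two clusters, and the standard Kaufman and Rousseeuw form of the width, (b - a) / max(a, b), where a is the average distance to the object's own cluster and b is the average distance to its neighbour cluster. The function names, the variable names a_i and b_i, and the four sample points are illustrative only.

```python
from math import dist  # Euclidean distance, Python 3.8+ (assumed metric)


def silhouette_widths(points, labels):
    """Return the silhouette width of every object (needs >= 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Singleton cluster: the width is conventionally set to 0.
            widths.append(0.0)
            continue
        # a_i: average distance to the other objects in the same cluster.
        a_i = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b_i: average distance to the neighbour cluster, i.e. the other
        # cluster with the smallest average distance to object i.
        b_i = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a_i, b_i)
        widths.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return widths


def average_silhouette_width(points, labels):
    """Mean of the individual silhouette widths (AS)."""
    widths = silhouette_widths(points, labels)
    return sum(widths) / len(widths)


# Made-up data: two tight pairs of points that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = average_silhouette_width(points, [0, 0, 1, 1])  # natural grouping
bad = average_silhouette_width(points, [0, 1, 0, 1])   # mixed grouping
```

On this toy data the natural grouping receives a clearly positive AS and the mixed grouping a negative one, which is the behaviour used to compare cluster solutions across values of k.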

C Index

C Index (Hubert and Levin, 1976) is a cluster similarity measure. The best cluster solution is the one that yields the lowest value. C Index (C) is given by,

$$C = \frac{S - S_{min}}{S_{max} - S_{min}}$$

where: S is the sum of the pairwise dissimilarities between all pairs of objects in the same cluster

If there are n such within-cluster pairs, then Smin is the sum of the n smallest pairwise dissimilarities among all pairs of objects

Similarly, Smax is the sum of the n largest pairwise dissimilarities among all pairs of objects
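The definitions of S, Smin and Smax can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the CVAP code: it assumes Euclidean distance as the dissimilarity and the usual form C = (S - Smin) / (Smax - Smin). The function name c_index and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+ (assumed dissimilarity)


def c_index(points, labels):
    """C Index of a cluster solution; lower values indicate better solutions."""
    pairs = list(combinations(range(len(points)), 2))
    # Dissimilarities of all pairs of objects, sorted in ascending order.
    all_d = sorted(dist(points[i], points[j]) for i, j in pairs)
    # Dissimilarities of the pairs whose objects share a cluster.
    within = [
        dist(points[i], points[j]) for i, j in pairs if labels[i] == labels[j]
    ]
    n = len(within)
    if n == 0:
        return 0.0  # no within-cluster pairs: index left at 0 in this sketch
    s = sum(within)
    s_min = sum(all_d[:n])   # sum of the n smallest dissimilarities
    s_max = sum(all_d[-n:])  # sum of the n largest dissimilarities
    if s_max == s_min:
        return 0.0  # index undefined when all dissimilarities are equal
    return (s - s_min) / (s_max - s_min)


# Made-up data: two tight pairs of points that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
c_good = c_index(points, [0, 0, 1, 1])  # natural grouping
c_bad = c_index(points, [0, 1, 0, 1])   # mixed grouping
```

In this example the natural grouping uses exactly the two smallest dissimilarities, so S equals Smin and the index reaches its minimum of 0, while the mixed grouping scores close to 1.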

In CVAP (Wang et al., 2009), however, the optimal k is given by the value that results in the steepest knee. The steepest knee refers to the greatest jump in index value between two successive values of k.
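One possible reading of the "greatest jump" criterion is sketched below. This is an assumption about the idea, not a reproduction of how CVAP computes the knee: the sketch takes index values for a range of k, measures the absolute change between successive values of k, and reports the pair of k values with the largest change. The function name steepest_jump and the index values are made up for illustration.

```python
def steepest_jump(index_by_k):
    """Return (k_before, k_after, jump) for the largest change in index value
    between successive values of k. Hypothetical helper, not CVAP's routine."""
    ks = sorted(index_by_k)
    if len(ks) < 2:
        raise ValueError("need index values for at least two values of k")
    jumps = [
        (abs(index_by_k[k2] - index_by_k[k1]), k1, k2)
        for k1, k2 in zip(ks, ks[1:])
    ]
    size, k1, k2 = max(jumps)
    return k1, k2, size


# Made-up C Index values for k = 2..5, for illustration only.
c_index_by_k = {2: 0.40, 3: 0.12, 4: 0.10, 5: 0.09}
k_before, k_after, jump = steepest_jump(c_index_by_k)
```

The sketch returns both values of k around the jump instead of a single k, because the text does not state which side of the jump is reported as the optimal number of clusters.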

Bayesian Infor