Clustering Categorical Data Based on Within-Cluster Relative Mean Difference

Show more

1. Introduction

Data clustering is a technique for identifying groups with data instances in the same group which are more similar than the instances belonging to different groups. The issue of database clustering with categorical variables has received intensive attention ( [1] - [8] ) along with other publications in the same issue. The categorical data come from different areas of research, both social and nature sciences; this type of variable does not present a natural ordering for their po- ssible values, results in the difficulty in clustering process.

There are various algorithms available for clustering categorical data, but no algorithm can achieve the best result for all the data sets. some new techniques have been developed recently, for example CACTUS (CAtegorical Clus Tering Using Summaries, see [9] ), ROCK (RObust Clustering using linKs, see [10] ), and neural networks based approaches (e.g. self-organizing map network) are used for this purpose. Arias-Castro and Xiao ( [11] ) proposed a sparse version of clustering method. Zhang et al. ( [12] ) proposed a novel statistical procedure (HD Vector) for clustering categorical data based on frequency distributions of the hamming distance vector upon comparing with the uniform distributed sample. Unfortunately, their categorical sample space is still complex in compu- tation since the comparison of the total number of possible positions in a ca- tegorical sample space is need.

This paper devotes to find the distinctive attributes among the categorical dataset using pooled relative within-cluster mean difference, then the data is clustered upon a single distinctive attribute. At each iteration, our algorithm recognizes one distinctive attribute and then identifies only one cluster with minimum of within-cluster mean relative difference, which will then be deleted from the dataset at the next iteration; this procedure repeats until there are no more significant clusters in the remained data.

The rest of the paper is organized as follows: A motivation example is illustrated in Section 2, and in Section 3, the methodologies are discussed. The performance of the proposed algorithm is explored through a real dataset in Section 4. Section 5 gives our conclusions.

2. Motivation

Considering the soybean disease dataset from the Machine Learning Depository at the University of California at Irvine, it comprises of 47 objects with 35 categorical attributes ${\left\{{v}_{k}\right\}}_{k=1}^{35}$ (There are 14 attributes have the identity observed value for all objects, so they can be treated as noninformationed attributes and be suppressed, thereafter only 21 attributes to be considered). As [12] pointed out, these data points come from four clusters: diaporthe stem canker, charcoal rot, rhizoctonia root rot, and phytophthora rot, with sample sizes 10, 10, 10, and 17 respectively. Summary in Table 1 shows that some subgroups possess the attibutes with identical property (value) which can be defined as the subspace,

Table 1. Summary of the soybean data.

also each subgroup has some attibutes with the value differently from other subgroup which called as the distinctive attributes, for example, all ${v}_{18}$ equals to 2 in subgroup 2 whereas different in others subgroups. When this dataset is used to give a clustering partition upon ${v}_{18}$ , the original subgroup 2 can be separated effectively from the database.

Also from Table 2, we can see that when one partition the data using different attibutes, the subgroups has different number of clusters and whin-cluster mean relative differences(Defined in Section 3), therefore results in different pooled whin-cluster mean relative difference $\left(\stackrel{\xaf}{W}\right)$ . Since ${v}_{18}\left({v}_{19}\right)$ gives the minimum of $\stackrel{\xaf}{W}$ among other attrbutes, so it can be considered as a distinctive attributes. Therefore in this paper we devote to a simple clustering method to partition the data along such distinctive attributes, that is, the clustering precedure of cate- gorical dataset fully depends on these attributes.

3. Methodology

Suppose that we have the data set $X=\left({X}_{ij}\right)$ , where the element ${X}_{ij}\left(i=1,\cdots ,n\text{and}j=1,\cdots ,p\right)$ denotes the $j$ -th attribute of the $i$ -th object. Notice that each categorical attributes ${v}_{k}$ has a finite number of category levels $N\left({v}_{k}\right)$ .

3.1. Useful Measurements

While the Euclidean-based measure could yield satisfactory results for numeric attributes, it is not appropriate for data sets with categorical attributes. Therefore, some alternative measurements must be explored.

Hamming distance, named after Richard Hamming, is widely used to give the difference between two equal-length categorical vectors. The Hamming distance between the object ${x}_{i}$ and ${x}_{j}$ is defined as:

Table 2. The attribute-based clustering performance on soybean data.

${d}_{ij}={\displaystyle \underset{k=1}{\overset{p}{\sum}}}{1}_{\left({x}_{ik}\ne {x}_{jk}\right)}$ (3.1)

i.e., the hamming distance measures the number of attributes at which the corresponding objects are different.

Our proposed method is based on the pooled within-cluster mean difference of the clusters. Intuitively, when a $p$ -dimension dataset is divided to some subgroups ${C}_{1},{C}_{2},\cdots ,{C}_{r}$ according to the attribute ${v}_{r}$ , this attribute has the same value in some specified subgroup, so it has no information in such sub- groups, therefore the dimension ${d}_{r}$ of the cluster becomes smaller and smaller. In order to give the dispersion corresponding to this phenomenon, a relative version of dispersion must be adopted.

Provided that we have partitioned the data into $N\left({v}_{k}\right)$ clusters ${C}_{1},{C}_{2},\cdots ,{C}_{N\left({v}_{k}\right)}$ upon attribute ${v}_{k}$ , denote ${n}_{r}$ the number of objects in ${C}_{r}$ and ${d}_{r}$ the corresponding dimensions (after eliminate the identical attributers). Let

${\stackrel{\xaf}{W}}_{r}=\frac{1}{{d}_{r}}\frac{1}{{n}_{r}\left({n}_{r}-1\right)}{\displaystyle \underset{{x}_{i},{x}_{j}\in {C}_{r}}{\sum}}{d}_{ij}$ (3.2)

be the within-cluster mean relative difference (WCMRD) in cluster ${C}_{k}$ , and

$\stackrel{\xaf}{W}\left({v}_{k}\right)={\displaystyle \underset{r=1}{\overset{N\left({v}_{k}\right)}{\sum}}}{\stackrel{\xaf}{W}}_{r}$ (3.3)

be the pooled within-cluster mean relative difference (PWCMRD).

The idea of our method is to select the distinctive attributes sequentially, which results in the minimum pooled within-cluster mean relative difference com- paring with the other attibutes, i.e.,

${v}_{m}=\mathrm{arg}\underset{{v}_{k}}{\mathrm{min}}\stackrel{\xaf}{W}\left({v}_{k}\right),$ (3.4)

thereafter, partition the dataset upon the finite characters of the selected attibutes and give the subspace of each subgroup at each iteration.

3.2. Clustering Procedure

Step 1 Initially the data set $D$ is clustered according to the characters of ${v}_{k}\left(k=1,2,\cdots ,p\right)$ , i.e., the objects are partitioned to $N\left({v}_{k}\right)$ clusters such that the objects in each cluster have the same character on ${v}_{k}$ ;

Step 2 Find a distinctive attribute ${v}_{g}$ satisfies

${v}_{g}=\mathrm{arg}\underset{k=1,2,\cdots ,p}{\mathrm{min}}\stackrel{\xaf}{W}\left({v}_{k}\right)$ (3.5)

where $\stackrel{\xaf}{W}\left({v}_{k}\right)$ be the pooled within-cluster mean relative difference of the clusters partitioned upon ${v}_{k}$ .

Step 3 Partition the dataset based on ${v}_{g}$ , and calculate the corresponding within-cluster mean relative difference ${\stackrel{\xaf}{W}}_{r}$ for each cluster ${C}_{r}\left(r=1,2,\cdots ,N\left({v}_{g}\right)\right)$ .

Step 4 While ${\stackrel{\xaf}{W}}_{r}>{W}_{T}$ (where ${W}_{T}$ is the threshold predefined to stop the procedure),

Update the data set $D$ by ${C}_{r}$ ,

Repeat Step 1 and Step 2 until all ${\stackrel{\xaf}{W}}_{r}\le {W}_{T}$ .

End.

3.3. The Stop Threshold W_{T}

The stop threshold ${W}_{T}$ can be chosen arbitrarily. In fact, different ${W}_{T}$ results are in different hierarchical clustering. In our paper, the threshold is adopted to be 0.35, means a different of $35\%$ attributes in a cluster is accepted.

4. The Performance of the Proposed Method

4.1. Numerical Experiments

In the section, a simulated sample is deduced as reference [12] . Also the criterion of classification rate (CR) is adopted to give the accuracy of the assignment. The classification rate measures the accuracy of an algorithm to assign data points into correct clusters. With given $K$ clusters,

$CR\left(K\right)={\displaystyle \underset{i=1}{\overset{K}{\sum}}}\frac{{n}_{i}}{n}$

where ${n}_{i}$ is the number of data points that have been correctly assigned by an algorithm, $n$ is the total number of the data.

For the simulated sample, [12] obtains a mean CR $94.62\%$ , with standard derivation $3.14\%$ , we obtains a mean CR $96.02\%$ , with standard derivation $2.57\%$ , a litter better than their method.

4.2. Soybeansmall Data

The data set is derived from UCI Machine Learning Repository (archive.ics.uci.edu/ml/), it contains 47 objects, each has 35 categorical attributes. There are some attributes with exactly the same value, so after eliminate the attributes redundant, there only 21 attributes left in data set.

Table 3(a) gives the clustering results of the proposed method. It shows that only one objects are clustered incorrectly. This diagram is different from Table 1 where we identified ${v}_{11}$ (with PWCMRD 0.6548) instead of ${v}_{14}$ (with PWC- MRD 0.6584). Table 3(b) describes the detail of the proposed method on Soy- bean data, all except one object is assigned correctly. Also we can see the accu- racy of the proposed method with $CR=0.98$ .

Figure 1 gives the details of partition for Soybean data. In each step, a distinct attribute is identified, and the data is separated along this attribute, and also the subgroup with largest $W$ is chosen to be the target one to be separated in next step.

4.3. Zoo Data

The Zoo data set is available from UCI Machine Learning Repository (archive.ics.uci.edu/ml/), it contains 101 objects, each has 16 categorical attri- butes. There are some objects who posses exactly same value on all attributes, so

Table 3. Cluster result of soybean data by the proposed method.

Figure 1. Performance on soybeansmall data.

it can be considered as the same ones, after eliminate the redundant objects, there only 59 objects left in data set.

Table 4 gives the clustering results of the proposed method, where “group” means the true category of the objects. It shows that the performance is poor on group 5, 6, 7, with 1 object in group 5 is clustered incorrectly into group 3, and the group 6 and 7 each has one object are considered as member of the new cluster. Since the objects in group 5 (frog, frog, newt, toad) and group 3 (pitvi- per, seasnake, slowworm, tortoise, tuatara) have more similarity than others (there are 11 attributes are the same among 16 attributes), so the two group can roughly be considered as one group. After combining the two subgroups, our proposed methods has a precision clustering with only one incorrect.

Table 5 describes the detail of the proposed method on zoo data.

5. Comparison with HD Vector Method

Zhang et al. ( [12] ) indicates that their method can archive a good results in both the zoo data and Soybean disease data, the comparison between HD vector method and K-modes as well as Autocluss algorithm shows the superiority of their method, the drawback of their method is the comparison of possible data with the number equals to the total number of possible positions in a categorical sample space, in our proposed this possible positions are not needed, therefore the algorithm can be faster than their method.

6. Conclusions

Categorical variables are widely explored in different fields to give a native

Table 4. Cluster result of Zoo data by the proposed method.

Table 5. Cluster result of Zoo data by the proposed method.

clustering algorithm to deal with such type data; a pooled-within-cluster-mean- different based method is proposed to select some distinctive attributes, and then the data are clustered upon such distinctive attributes; the subspaces are also investigated.

The applications on zoo data and soybean data (from UC Irvine Machine Learning Repository) illustrate the performance of the proposed method. The results show a high accuracy and simplicity in practical applications.

References

[1] He, Z., Xu, X. and Deng, S. (2008) k-ANMI: A Mutual Information Based Clustering Algorithm for Categorical Data. Information Fusion, 9, 223-233.

[2] Andritsos, P. and Tsaparas, P. (2010) Categorical Data Clustering. In Sammut, C. and Webb, G., Eds., Encyclopedia of Machine Learning, Springer, Boston, 154-159.

[3] Bontemps, D. and Toussile, W. (2013) Clustering and Variable Selection for Categorical Multivariate Data. Electronic Journal of Statistics, 7, 2344-2371.

https://doi.org/10.1214/13-EJS844

[4] Anderlucci, L. and Hennig, C. (2014) The Clustering of Categorical Data: A Comparison of a Model-Based and a Distance-Based Approach. Communication in Statistics—Theory and Methods, 43, 704-721.

https://doi.org/10.1080/03610926.2013.806665

[5] Bouguessa, M. (2015) Clustering Categorical Data in Projected Spaces. Data Mining and Knowledge Discovery, 29, 3-38.

https://doi.org/10.1007/s10618-013-0336-8

[6] Silvestre Cardoso, C.M. and Figueiredo, M. (2015) Feature Selection for Clustering Categorical Data with an Embedded Modelling Approach. Expert Systems, 32, 444-453.

https://doi.org/10.1111/exsy.12082

[7] dos Santos, T.R.L. and Zarate, L.E. (2015) Categorical Data Clustering: What Similarity Measure to Recommend? Expert Systems with Applications, 42, 1247-1260.

[8] Clarke, B.S., Amiri, S. and Clarke, J.L. (2016) EnsCat: Clustering of Categorical Data via Ensembling. BMC Bioinformatics, 17, 380.

https://doi.org/10.1186/s12859-016-1245-9

[9] Ganti, V., Gehrke, J. and Ramakrishnan, R. (1999) CACTUS—Clustering Categorical Data Using Summaries. Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, San Diego, 73-83.

[10] Guha, S., Rastogi, R. and Shim, K. (2000) Rock: A Robust Clustering Algorithm for Categorical Attributes. Information Systems, 25, 345-366.

[11] Arias-Castro, E. and Xiao, P. (2017) A Simple Approach to Sparse Clustering. Computational Statistics & Data Analysis, 105, 217-228.

[12] Zhang, P., Wang, X. and Song, P.X.K. (2006) Clustering Categorical Data Based on Distance Vectors. Journal of the American Statistical Association, 101, 355-367.

https://doi.org/10.1198/016214505000000312