Comparison and Validation of Distance on the Balanced Assignments of Group Having Entities with Multiple Attributes

Show more

1. Introduction

Cluster analysis or clustering is the task of grouping a set of objects in such a way that entities in the same group are more similar in some sense to each other than to those in other groups. And atypical assignment is a fundamental combinatorial optimization problem. It consists of finding, in a weighted bipartite graph, or a matching in which the sum of weights of the edges is as large as possible. On the other hands, a balanced assignment is a way to make the subgroups equal in the process of distributing entities to multiple subgroups. It can be more complicated to allocate entities into subgroups using a well-balanced manner when each entity has multiple attributes. The results of clustering show that there exist characteristic differences within subgroup, and the similarity between subgroups with respect to the attributes. In this study, the properties inherent in an entity are defined as attributes, and the properties that represent the group are defined as characteristics [1] . Therefore, in the balanced assignment, the entities within a subgroup can be different, but no difference in the characteristics among subgroups. So that the characteristics among groups disappear as all groups have the same properties. For example, a group consisting of 10 male students and 10 female students is classified into two groups based on their gender. Regarding the gender as an attribute, the result of the clustering is that two groups are classified as a male group and a female group. On the other hand, in the balanced assignment problem, each group consists of five male students and five female students, and the characteristics of each group are the same.

The classification of entities with multiple attributes is different from the well-known solution methodology such as partitioning method or hierarchical method, but the efforts are being made to improve it due to the difficulty of the optimization process. A mathematical model or an application methodology with a constraint such as a series of processes for finding an appropriate compromise among attributes that are in conflict with each other is modeled as a multi-criteria function, and its necessity is increasing. Among the methods of solving the multi-criteria function, the most commonly used approaches are the weighting method and the goal programming method. The weights or specific numerical goals should be established appropriately for each function using the mathematical programming approaches in the optimization process. Barron and Schmidt [2] study the constrained problem using the distance and fuzzy measure as a mathematical approach to multiple attributes problem. The MTS (Mahalanobis Taguchi System) method is a pattern recognition method applied to classify data into some categories [3] [4] . In the MTS method, the Mahalanobis distance is used to measure the degree of abnormality of patterns, and principles of Taguchi methods are used to evaluate accuracy of predictions based on the scale constructed. The excellency of the Mahalanobis distance is that it considers correlations between the variables, which are essential in pattern analysis. Fora balanced assignment, the closeness between the entities is usually measured by the specific distance measure such as the Euclidean distance [5] . Recently, the MTS method is applied to solve the balanced assignment of allocating group with multiple attributes into small subgroups [1] .

In the clustering process, a distance between entities or attributes is applied in the Ward method and partitioning method as a measure of clustering accuracy. Here, the distance means not the physical distance but the distance between the attributes in the entities. Usually, the Manhattan distance, the Euclidian distance and the Mahalanobis distance are applied as a distance selection. The Manhattan distance, also known as the Taxicab distance, is a one-dimensional distance connecting two points. On the other hand, the Euclidean distance is a method to obtain the shortest distance between two points in n-dimensional space. This distance is a way to generalize dimensions to n dimensions using Pythagorean Theorem. Finally, the Mahalanobis distance is calculated by considering the correlation of variables as an index to measure the degree of diffusion of variables. Since the Mahalanobis distance is very sensitive to standardized variables, this distance could be increased significantly, even though the standardized variable for the reference group is slightly different. In this paper, the comparison between the distance criterion is shown by changing a specific assignment standard, and finally comparing it against the MTS method. This paper is a sequel to an earlier paper by Rhee [1] .

The paper is organized as follows. We review related works in Section 2. And a balanced assignment is considered with respect to associated distances for an example in Section 3. In Section 4, the comparison between the suggested distance criterion is shown by changing a specific assignment standard, and the result of the MTS method is checked by comparing it against the given criterion. Finally, Section 5 gives concluding remarks.

2. Literature Review

In this section, the distances required in the balanced assignment process are introduced since the effectiveness in clustering is often measured a distance as its closeness. The choice of distance measures is very crucial, and it has a strong influence on the clustering results. Usually, the Euclidean distance is considered to the common distance measure in clustering. Depending on the type of data and the researcher questions, correlation-based distance is often used as an alternative. The methods used for the distance measurement include the Manhattan distance, the Euclidian distance and the Mahalanobis distance. And the MTS method is also presented to implement the balanced assignment using these distances. The MTS method is one of the well-known clustering methodologies and this method is considered to be very helpful for the purpose of classifying large groups with multiple attributes into many subgroups.

2.1. Manhattan Distance

The Manhattan distance is a distance metric between two points in N-dimensional vector space. It is used extensively in a vast area of field from regression analysis to frequency distribution. It was introduced by Hermann Minkowski [6] . The Manhattan distance is also known as ${L}_{1}$ distance or city block distance. It is named so because it is the distance a car would drive in a city laid out in square blocks, like Manhattan. The Manhattan distance function computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed. The Manhattan distance between two items is the sum of the differences of their corresponding components. This distance between two points $x=\left({x}_{1},{x}_{2}\cdots ,{x}_{n}\right),h$ (1) in each dimension as follows and $y=\left({y}_{1},{y}_{2},\cdots ,{y}_{n}\right)$ in N-dimensional space is expressed the sum of the distances in each dimension as follows, $BD\left(x,y\right)={\displaystyle {\sum}_{i=1}^{n}\left|{x}_{i}-{y}_{i}\right|}$ .

The properties of the Manhattan distance are, first, there exist several paths between two points whose length is equal to the Manhattan distance. Secondly, a straight path with length equal to the Manhattan distance has two permitted moves such vertical or horizontal by one direction only. Finally, for a given point, the other point at a given the Manhattan distance lies in a square. The Manhattan distance is frequently applied in regression analysis, specially, linear regression to find a straight line that fits a given set of points. In solving an underdetermined system of linear equations, the regularization term for the parameter vector is expressed in terms of the Manhattan distance. This approach appears in the signal recovery framework called compressed sensing. The Manhattan distance is also used to assess the differences in discrete frequency distributions. Finally, the Manhattan distance heuristic is an attempt to measure the minimum number of steps required to find a path to the goal state. The closer to get the actual number of steps, the fewer nodes have to be expanded during search, where at the extreme with a perfect heuristic, and the nodes that are guaranteed to be on the goal path can be expanded.

2.2. Euclidean Distance

The choice of distance measures is very important, as it has a strong influence on the clustering results. For most common clustering software, the default distance measure is the Euclidean distance [7] . The Euclidean distance or Euclidean metric is an ordinary straight-line distance between two points in Euclidean space. Euclidean space was originally devised to study the relationships between angles and distances. This system of geometry is still in use today and is the one that high school students study most often. Euclidean geometry specifically applies to spaces of two and three dimensions. However, it can easily be generalized to higher order dimensions. It is, also, known as Euclidean norm, Euclidean metric, ${L}_{2}$ norm, ${L}_{2}$ metric and Pythagorean metric.

The Euclidean distance is applied under the assumption that the properties of the attributes that an object is inherent are consistent. Properties of the Euclidean distance are that there is a unique path between two points whose length is equal to Euclidean distance, and the other point lies in a circle such that the Euclidean distance is fixed for a given point. The radius of the circle is the fixed Euclidean distance. With this distance, Euclidean space becomes a metric space. The Euclidean distance is defined as the shortest distance connecting two points. For example, the distance of two points $x=\left({x}_{1},{x}_{2},\cdots ,{x}_{n}\right),h$ (1) in each dimension as follows and $y=\left({y}_{1},{y}_{2},\cdots ,{y}_{n}\right)$ in $n$ dimensions is expressed as $ED(x,y)=\sqrt[n]{{\displaystyle {\sum}_{i=1}^{n}{\left({x}_{i}-{y}_{i}\right)}^{2}}}$ . Simply, this is a basic distance measurement in which the correlation between attributes is not considered.

The Euclidean distance is frequently used in Euclidean Geometry to find the shortest distance between two points in a Euclidean space and the length of a straight line between two points. This distance is commonly used in clustering algorithms such as K-means. If the Euclidean distance is chosen, then observations with high values of features will be clustered together. The same holds true for observations with low values of features. Finally, it is used as a simple metric to measure the similarity between two data points in associated areas. Correlation-based distance considers two objects to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance. The distance between the two objects is 0 when they are perfectly correlated.

2.3. Mahalanobis Distance

The distance by correlation between the data can be very effective to the clustering analysis rather than the distance scale discussed in the previous section. In particular, it seems desirable to apply correlation when there are multiple attributes of an entity. This is because the disadvantages of the Euclidean distance can be compensated by analyzing the relationship between attributes and the effect of clustering can be augmented. Clustering by correlation based distance or by the Euclidean distance is quite sensitive to outliers, but generally the correlation based distance is more effective than the Euclidean distance. One of the correlation based distances is the Mahalanobis distance in the clustering methodology.

The Mahalanobis distance is known to be an appropriate measure of distance between two elliptic distributions having different locations but a common shape, and also known as an effective way to simply compare between groups with well-known characteristics and to those who are not familiar with the characteristics [8] . Since the Mahalanobis distance is very sensitive to standardized variables, it leads to a large increment, even though the standardized variable is slightly different for the reference group [4] . Applying this to all attributes in the entity, the Mahalanobis distance can be readjusted by considering the correlation between attributes. The Euclidian distance has the form of a circle, since it does not take into account the correlation between attributes. On the other hand, the Mahalanobis distance takes the form of an ellipse in consideration of the correlation, and is expressed as follows.

$MD\left(x,y\right)={\left[\left(x-y\right){S}^{-1}{\left(x-y\right)}^{t}\right]}^{1/2}$ (1)

In (1), $MD\left(x,y\right)$ represents the Mahalanobis distance between entity x and entity y, where x and y denote object vector. And also ${S}^{-1}$ denotes the inverse matrix of the covariance, and ${\left(x-y\right)}^{t}$ is expressed as the transpose matrix of $\left(x-y\right)$ . The Mahalanobis distance is applied the covariance matrix as a multivariate measure based on the correlation between attributes, and is effective when the units of attributes are different and there exists the correlation between attributes.

2.4. MTS Method

The first step to measure the Mahalanobis distance is to apply data conversion, a statistical process to provide a kind of reference point for comparing two or more different groups. The standard normal conversions are only applicable if the attributes of data in the group follow a standard normal distribution. Assuming that $X\left(\mu ,{\sigma}^{2}\right)$ has random variables with mean, $\mu $ and variance, ${\sigma}^{2}$ for the random variable in a data group, then $X$ can be transformed into Y using the simple data conversion without losing its statistical property. The result of data conversion can be used to data comparison between attributes, since the statistics that represent a data group are different each other. As can be seen in (2), ${y}_{ij}$ is indicating the distance measure, and converted from ${x}_{ij}$ , where ${\stackrel{\xaf}{x}}_{i.}$ denote the average of the attributes in group i.

${y}_{ij}=\frac{\left({x}_{ij}-{\stackrel{\xaf}{x}}_{i.}\right)}{{\sigma}_{i.}}$ (2)

The MTS method is a pattern information technology, which has been used in different diagnostic applications to help in making quantitative decisions by constructing a multivariate measurement scale using data analytic methods [9] . In the MTS approach, the Mahalanobis distance is used to measure the degree of abnormality of patterns or the closeness to the reference point, and principles of this methods are used to evaluate accuracy of predictions based on the scale constructed [10] . The advantage is that it considers correlations between the variables, which are essential in pattern analysis or clustering.

The Mahalanobis space should be defined before calculating the Mahalanobis distance. The process of defining the Mahalanobis space begins with the selection of reference entities and other entities to calculate the Mahalanobis distance. In the MTS method, the Mahalanobis space is selected using the standardized variables of normal data. This selection is generally effective in clustering by selecting entities with more or less extreme attribute values rather than selecting entities that are close to the average. The reason is that the clustering effectiveness is halved if the closest entity to the average is selected as a reference entity, since most of entities are located in closer to the average. Once the Mahalanobis space is established, the number of attributes is reduced using orthogonal array and SN (signal-to-noise ratio) by evaluating the contribution of each attribute. Each row of the orthogonal array determines a subset of the original system by the including and excluding that attribute of system. Some statistical processes such as data standardization and correlation matrix are required to obtain the Mahalanobis distance by applying (1) and (2). And the inverse matrix of covariance using the correlation analysis should be followed to convert the data. The correlation coefficient between attribute $i$ and attribute $j$ is already known as ${r}_{ij}=\frac{{\sigma}_{ij}}{{\sigma}_{i}\times {\sigma}_{j}}$ , and it becomes ${\sigma}_{i}=1,{\sigma}_{j}=1$ in the standard normal data conversion. In addition, the Mahalanobis distance accounts for the variance of each variable and the covariance between variables. Geometrically, it does this by transforming the data into standardized uncorrelated data and computing the ordinary Euclidean distance for the transformed data.

The SN is computed to determine how much each attribute is affected by the Mahalanobis space. Therefore, this procedure is to apply as an evaluation criterion by reducing the low impact characteristics and to select the high impact characteristics among the various characteristics affecting the Mahalanobis distance. The SN plays a critical role to determine the influence between entity and the Mahalanobis space. The quadratic loss function for the smaller the better is used as seen in (3), since the smaller distance between the Mahalanobis space and the entity means the closer it is.

$SN=-10\mathrm{log}\left(\frac{1}{n}{\displaystyle {\sum}_{i=1}^{n}{y}_{i}^{2}}\right)$ (3)

The balanced assignment should be executed to ensure that the characteristics of the subgroups are similar, and that the attributes included in the characteristics are also similar by assuming that the balanced assignments should be made taking into account all attributes specified in the entity [5] . After getting the SN, the orthogonal array is used to determine which objects are closer to the designated Mahalanobis space. The calculation of the Mahalanobis distance, which is the last stage of the MTS method, is to apply the SN as an influence indicator.

3. Data Collection

In this section, the case where an entity contains three attributes will be analyzed and the results of calculating the Mahalanobis distance suggested in the previous section will be presented. However, the Manhattan distances and the Euclidean distances, which are easier to compute than the Mahalanobis distance, are not presented in this section. The collected data is shown in Table 1 as an example of case study. The balanced assignment is executed to classify into 3 subgroups, and is tried to make the characteristic of 3 subgroups the same.

In order to compute the Mahalanobis distance, it is necessary to define the Mahalanobis space that can be used as a reference entity. In this study, the entities having the most extreme value of each attribute are set as the reference entities, and those are the Mahalanobis space. The entity with high value and low value for each attribute is defined as a reference point or the Mahalanobis space. Since the given example consists of three attributes, 6 shaded entities in Table 1 are represented as a Mahalanobis space

The entity by each attribute must be converted using (2), and followed by correlation inverse matrix. The Mahalanobis distance between the space and each entity using (1) is shown in Table 2, and 6 entities of A, B, C, D, E, and Fare represented as corresponding the Mahalanobis space.

Furthermore, the balanced assignment by the MTS method is accompanied by calculating SN ratios using (3), and by assigning all entities into many subgroups using orthogonal array.

4. Comparisons and Validation

In this section, the comparison between the suggested distance criterion is shown by changing assignment standard, and finally comparing it against the MTS method. The mean value of the suggested attributes in Table 1 is investigated as 172.6, 63.3, and 82.0 respectively, and also the corresponding variances are obtained as 44.2, 126.0, and 86.2. The above value can be expected to yield the similar for the mean when the number of assignment into subgroups is identical and fairly assigned. And the variance of each attribute in the subgroups is also deduced to be at least equal or greater than the above value.

Table 1. Collected data as an example.

Table 2. Computation of the Mahalanobis distance.

The result of the balanced assignment by applying the MTS method is shown in Table 3, as it were, all entities presented in Table 1 are distributed into 3 subgroups under the certain criterion. Table 3 also shows relatively good results when analyzed in terms of the mean in the balanced assignment. However, considering the variance as a criterion, it is not a satisfactory result. This result can be inferred that the balanced assignment by the MTS method is a balanced assignment using the average.

The result for the corresponding distance scale is presented in Table 4. As shown in Table 4, the number in the distance scale shows that the Manhattan distance is the biggest on average, followed by the Euclidian distance and the Mahalanobis distance. Since the criteria for calculating the distances are different, it is not meaningful to compare each other. However, the mean and the

Table 3. Balanced assignment by applying the MTS method.

Table 4. Distance comparisons.

difference between the maximum values and the minimum values among the subgroups, can be used to analyze which criterion represents a good indicator for the balanced assignment. The Mahalanobis distance is considered to be the better choice for the given example, even though the difference depends on the criterion for selecting attributes. Finally, the balanced assignment is carried out by applying the MTS method, and its result under each distance criteria is shown in Table 5.

Table 5. Distance by the MTS method.

As seen in Table 5, the results of the MTS method are comparatively satisfactory even if all distances are considered. In addition, since attribute 1 and 2 have a statistical property having a strong positive correlation, these two attributes can be integrated to reduce the number of attributes, which can lead to a dimensional reduction in terms of modeling.

5. Conclusion

In the clustering, the distance between entities or attributes is applied as a measure of clustering accuracy. The Manhattan distance, the Euclidian distance and the Mahalanobis distance are considered as a tool for measuring its closeness.

In this paper, the comparison between distance criterion is shown by changing specific assignment details, and finally comparing it against the MTS method. Since the standards for calculating the distances are different, it is not meaningful to compare them one by one. However, the mean and the difference between the maximum values and the minimum values within the subgroup, can be used to analyze which method represents a good indicator for the balanced assignment. In general, the balanced assignment by the Mahalanobis distance is seen as a better choice, even though the difference depends on the criterion for selecting attributes. Finally, the balanced assignment is carried out by applying the MTS method.

References

[1] Rhee, Y. (2018) A Study on the Balanced Assignment of Allocating Large Group with Multiple Attributes into Subgroups. American Journal of Industrial and Business Management, 8, 1418-1432.

[2] Barron, H. and Schmidt, C. (1988) Sensitivity Analysis of Additive Multi-Attributes Value Models. Operations Research, 36, 122-127.
https://doi.org/10.1287/opre.36.1.122

[3] Taguchi, G., Chowdhury, S. and Wu, Y. (2000) The Mahalanobis-Taguchi System. John Wiley and Sons, Hoboken.

[4] Taguchi, G. and Jugulum, R. (2002) The Mahalanobis-Taguchi Strategy: A Pattern Technology System. John Wiley and Sons, Hoboken.
https://doi.org/10.1002/9780470172247

[5] Kim, H. (2010) A Study of Clustering Analysis to Guarantee the Equality of Attributes among Small Group. Master Thesis, Keimyung University, Daegu.

[6] Louise, G. and Menger, K. (1990) Taxicap Geometry. Mathematics, 63, 213-232.

https://doi.org/10.2307/2690903

[7] Anton, H. (1994) Elementary Linear Algebra. 7th Edition, John Wiley & Sons, Hoboken, 170-171.

[8] De Maesschalck, R., Jouan-Rimbaud, D. and Massart, D.L. (2000) The Mahalanobis Distance. Chemometrics and Intelligent Laboratory Systems, 50, 1-18.

https://doi.org/10.1016/S0169-7439(99)00047-7

[9] Su, C.T. and Li, T.S. (2002) Neural and MTS Algorithms for Feature Selection. The Asian Journal on Quality, 3, 113-131. https://doi.org/10.1108/15982688200200023

[10] Woodall, M., et al. (2003) A Review and Analysis of the Mahalanobis Taguchi System. Technometrics, 45, 1-30. https://doi.org/10.1198/004017002188618626