will first calculate a matrix that is of a convenient form and has sufficient accuracy to represent information contained in the conditions of the task. To this end, we will sequentially apply randomization and indexing of information.
The randomization procedure is designed to transform the feature values in the pseudo-feature values for all k, where v is a random variable that is uniformly distributed in the interval, and is a constant. In the case of high data variety, which often occurs for quantitative features, is taken. As a result of the randomization, we obtain a matrix of pseudo-features of size, in which all the elements of each column are different.
We assume that the interval is the domain of pseudo-feature objects from the set, where and are the minimum and maximum values of, respectively. Let the points be placed in the interval with step, where with. The parameter n is an analogue of N and therefore is called the parameter of variety.
Now we can clarify the concept of the index. The value is called the index feature k for object s, if. Then, the number of indices does not exceed, and index if. For any n, the object s will be approximately described in the index space by the index vector, and the combined sample will be represented by the matrix of size. For the sequence we obtain a sequence of matrices.
In these mappings, the initial data for the problem are significantly transformed and boundaries of not quantitative features are become fuzzy. The ongoing changes could be illustrated by the example of the vector of an object with dissimilar features: at consecutive steps: before quantization, after quantization, after randomization and after indexation. An object can receive these values if the number options of feature 1 are not less than 5 at.
3. Design Formulas
It follows from the definition of index that is equal to the integer part value of for any n, and hence is a random variable. In the general case, for each k, the group of objects of different classes with the same values of the ordered pair falls within the interval. Taking into account the errors in the patterns, we introduce the assumption that it is possible to neglect the difference of feature values for objects pair. According to this assumption, the distribution of features in the domain will be piecewise constant on each interval and any object pair will match the index m.
Then we can partition the combined sample into subsets called granules that contain objects of ordered pairs. Of particular interest is a subset of training sample objects of a certain class, which we will call information granules.
To calculate the composition of the granules, we establish on the set of partial order relations, at which the values of on the axis object numbers will be arranged in non-decreasing order. The scheme for calculating indices and information granules of training sample objects for a feature varying from 12.5 to 15 at is revealed in Figure 2. It is shown that the composition of a granule has a stochastic nature.
As a result of these transformations, the values of features will be measured on a single scale with a division value equal to one index. Therefore, the matrix can be interpreted as an image of the combined sample, calculated by statistical modeling of its objects’ measurements. Now we are able to move on to analyzing the distribution of frequencies for the pair
Let and denote, respectively, the length of the granule of objects of all classes of the training sample and only the class i. If no object of class i is in the interval m, then, and by definition, the respective values of m are not indices. It is clear that is the frequency of the information granules of class i, and at the same time, it is the sample estimate for the probability of this granule as well as any of its objects. According to the problem statement, control and training samples belong to the single set. It follows that the objects of both samples have the same probability and frequency.
Since the appearance of each of the features is an incompatible event, and these features form a complete group, the estimate of conditional probability that the object s belongs to class i is equal to
From (1) it follows that equals the average frequency of index for the object s of class i. This frequency plays the role of a summarized object characteristic that considers the totality of feature values.
The estimation of the class of the object s is found using the obvious formula
The relation and for the objects of the training and control samples,
Figure 2. The scheme for calculating the indices and information granule with.
respectively, characterizes the algorithm’s ability to study and classify objects. The quality of teaching as a function of length of the training sample t is estimated by overage error rates for class i and by the overage rate for all classes. The classification accuracy is measured by the analogous frequencies and for the control sample objects. The calculated value of the class corresponds to some n, at which an acceptable accuracy of the solution is achieved.
4. Existence and Accuracy of the Solution
We will consider the issue of convergence for the sequence of error rates. Let the granule for a certain n contain an object s of the training sample of class and some object w. Then the following inequality is satisfied: is satisfied. If, then the step and. In addition, the matrix will be sparse, since an object falls into no more than from an infinite number of intervals, and for the remaining intervals.
In this case for we get that, since all the values of will be different. It follows that and is achieved error-free training. If, the granule can contain only objects for which. Since there cannot be two identical objects in a combined sample, this equality can be satisfied only for a narrow number of features. All the summands that define the average value of for the object s will be greater than zero, while for an arbitrary object w all or only some of the summands are equal to zero. Here it is possible to have any relation average values of for objects, and there is a risk of error. It is sufficient to take to prevent this error.
Classes can differ significantly in the convergence of error rates of learning (see Figure 3). The process of convergence of is illustrated in the graphs in Figure 4, which were calculated for the Abalone database for. To evaluate the results of calculations, we note here that the number of classes and, according to the UCI, the accuracies of solutions obtained by various classical methods were equal to 33% - 65%.
Figure 3. The impact of the variety parameter n on the error rate of training for the data set Glass at.
Figure 4. The impact of the variety parameter n on the error rate of classification for different lengths of the training sample t of data set Abalone at.
Nevertheless, the high accuracy of training does not guarantee acceptable classification accuracy since training is the first step in determining the class. Within the second step, it is necessary to evaluate the accuracies of solutions for the control sample objects based on relations (1) and (2), according to which any object for each n is characterized by the frequency, if granules exist. Otherwise,. Evidently, the accuracy of classification depends on the intersection lengths of the index sets for the same features in the training and control samples.
Here we face a problem of overtraining: if we restrict the value of n, it is possible to obtain a more accurate solution for the control sample, but the reliability of this result will be decreased because simultaneously the accuracy of learning will be lower. It is obvious that the simplicity of the algorithm reduces the computational complexity and severity of this problem, as well as allows us to estimate the effects of the characteristics of the matrix.
The impact of the parameter
Now consider the joint impact of n and on the accuracy of solution. Let the objects and have the same value for feature. Randomization should result in and belonging to different intervals of length, and these objects should be assigned different indices. Note that decreases with increasing n and. It follows that to preserve the randomization effect for large n, it is advisable to use larger values of.
5. The Results of Solving Particular Tasks
The information about data sets from the repository UCI is given in Table 1, where is the ratio of the maximum to the minimum numbers of objects in different classes of the training sample. The tasks cover a wide range of values in terms of the numbers of objects (150-42121), features (3-16), classes (2-29), types of features (quantitative and mixed, including nominal, ordinal and integer) and the degree of unevenness of the object distributions between classes of the training sample (1-689).
The algorithm of the method is based on the data grouping that predetermines the decreasing influence of various kinds of outliers. Calculations have shown that the solution is stable for an infinite set of acceptable solutions: small oscillations of the parameters and n usually lead to insignificant changes in the error rate, and perceptible influence are rendered only by changes in n that are comparable with. Therefore, for the parameter of variety, it is advisable to consider a “dimensionless” and more stable characteristic.
The calculations have confirmed that for sufficiently large, the errors are reduced with increases in the values and n. For, Table 1 shows
Figure 5. The impact of the length of the training sample t of the data set Car evaluation on the error rate of classification with different values of the parameter of variety n.
Table 1. Parameters of data sets.
aObjects were numbered via a random number generator. b Objects with data errors are not considered.
the lower boundary of the value range of these parameters, under which the error rates are and.
The results were verified by 10-fold cross-validation, in which splits of combined sample and for each variant were calculated for and, as listed in Table 1. It was found that all. The average error rates of classification over the splits, denoted by, are provided in Table 1.
From Table 1 it following that, in most cases, is smaller than the acceptable rate of 0.05, but for Haberman’s Survival data set the norm is exceeded 2.3 fold. This situation is caused by the fact that the data set has the lowest data variety when the objects are described by three features of integer type and are divided into two classes. To maintain high accuracy, this property was compensated for by significantly increasing up to. Calculations have shown that increasing by 6.2%, 12.5% and 25% causes the error rate to decrease to 0.089, 0.069 and 0.063, respectively. Note that the average error rate of currently used methods exceeds 18%  .
This analysis can greatly simplify the process of solving practical problems, in which, instead of a control sample, a test sample is specified, for which the distribution of classes is unknown. Now, we can use tabulated values of and as the first approximation, which is likely to be the final result. To verify this, from t objects of the training sample, we need to allocate objects (for example,) and use them directly for training; the remaining objects can be used to monitor the solution accuracy.
Another important result was obtained by testing the method experimentally. A direct application of the above algorithm for the data set Adult gives the minimum values of and at. The relatively large number of errors can be explained by the fact that the variants of feature values are distributed very unevenly. In particular, one variant belongs to 92% of the objects, and each of the other options belongs to only 0.07% of objects, on average. In this case, because of the large total repeatability of objects with rare feature values, the probabilities of matching for pairs of the training and control samples are decreased, which leads to an increased number of errors.
To mitigate of noted effects of rare information, we will introduce the additional procedure of compression, or pre-granulation, which is also based on the above considerations about the data reliability. Now, the values of corresponding feature k are sorted in ascending order and divided into groups with the same feature values, which are identified by the sequence of variant numbers, where is the number of feature variants. Then, we check whether the group with contains at least objects, and if not, then objects with are added to it, and the process continues until the condition is satisfied (is design parameter). The value of for all objects of the group, calculated in this way, is. Similar calculations are performed for subsequent values of and. Obviously, in the general case.
After the additional transformation, the solution of task Adult has remained stable and errors are now closer to zero. The corresponding table data calculated at. Note that the error rates of 16 different solutions of the task, derived from the well-known algorithms, are in the range 0.14 - 0.21  .
This procedure has proven effective in addressing a number of other data sets, for example, Letter Image Recognition. It can also be observed as a way to reduce the values of to reduce the computation time.
The paper proposes a methodology for solution classification problems, the core of which is an approximate calculation of the invariants of data matrix. The new approach implements the concepts of soft computing and granulation and is biologically inspired. In essence, it reduces to the transformation of all measured scale features, in which the values of features, called indexes, are defined according to a single scale in the new units.
Developed methodology is based on procedures of randomization and indexation of the data set (and, in some cases, also a pre-granulation procedure), which generate an infinite sequence of index matrices. These matrices are invariants of the data matrix in relation to a class of objects. They provide error-free training and allow us to calculate the object class under the simplest formulas of total probability for any single type or mixed types of features.
The proposed method differs from existing ones by the universality and simplicity of the algorithm and, as a rule, almost an order of magnitude higher accuracy.
The obtained results go beyond the problem of classification and have independent significance for the solutions of the problems of data analysis. It can be expected that the method will receive a hardware implementation, and its extension to multi-level data will lead to the development of effective image recognition systems and information retrieval. The application of the method does not require mathematical education, which increases its innovative potential.