Anomaly Detection of Store Cash Register Data Based on Improved LOF Algorithm

Show more

1. Introduction

Along with the development of living standards, the purchasing power of residents is also increasing. In Shopping malls, as the market with the most extensive sales of goods, a large number of commodities and customers in domestic generate huge amounts of cash register information every day. The anomaly detection of such information and maintaining the normal operation of the cash register system are critical [1] [2] .

At present, the anomaly detection of data based on the LOF algorithm has achieved lots of research results. For example, Chen Wei [3] improved the LOF algorithm by considering the influence of neighboring points, and constructed a fuzzy LOF algorithm. In the study of Hu Wei [4] , the LOF algorithm was combined with SVM to detect abnormal data. Therefore, applying LOF algorithm to detect abnormal data is a feasible method.

This paper applies the LOF algorithm to calculate the local degree of data outliers. Then a loose threshold is set to screen outliers. After deleting the outliers, the similarity between the screening data and reasonable data is measured by SAX test. If not, it will expand the abnormal limit, increase screening power, and perform loop testing and optimization. Through the gradual adjustment and optimization of the outlier threshold, false alarms can be avoided to the greatest extent.

2. Preliminary Data Processing

2.1. Data Sources

Firstly, we choose the transaction system data in late January of a shopping mall to analyze. We have a total of 12,954 transaction records. The data samples includes: transaction date, time, volume, success rate, response time. Some transaction data are as follows Table 1.

2.2. Principles of Optimized K-Means Clustering

The k-means clustering algorithm first selects k objects as the initial clustering center randomly. Then the distance between each object and each seed cluster center is calculated and each object is assigned to the nearest cluster center. The cluster centers and the objects assigned to them represent a cluster. After all objects have been assigned, the cluster centers of each cluster are recalculated based on the existing objects in the cluster. This process will be repeated until some termination condition is met. The termination condition may be that no object is reassigned to different clusters, and at that point the squared error sum is locally minimum.

The basic operation is as follows:

1) Take k elements randomly from element set d as the respective centers of k clusters.

2) Calculate the degree of dissimilarity between the remaining elements to the centers of k clusters, and assign these elements to clusters with the lowest

Table 1. Part of the transaction data.

dissimilarity, respectively. The dissimilarity algorithm is as follows:

$di{s}_{k}={\omega}_{i}{\displaystyle \underset{i=1}{\overset{n}{\sum}}\left({x}_{i}-{x}_{ki}\right)}$

Among them, ${x}_{i}$ is an attribute value of an i-th element. ${x}_{ki}$ is the i-th attribute value of the k-th cluster center. ${\omega}_{i}$ is attribute value weight. In order to avoid the influence of the dimension, the variation coefficient of each attribute variable is used as the weight, and the formula is:

${\omega}_{i}={S}_{i}/{\stackrel{\xaf}{x}}_{i}$

Among them, ${S}_{i}$ is variance of attribute variables. ${\stackrel{\xaf}{x}}_{i}$ is average value for attribute variables.

3) According to the clustering result, take the arithmetic average of the respective dimensions of all the elements in the cluster and recalculate the centers of the k clusters. The formula is:

${x}_{ki}={\displaystyle \sum {x}_{i}/{n}_{k}}$

Among them, ${x}_{i}$ is the i-th attribute of all elements in the k-th cluster. ${n}_{k}$ is the number of all elements in the k-th cluster.

4) Regroup all elements in d according to the new center.

5) Repeat step 4 until the clustering result no longer changes.

6) Output the result.

2.3. Choosing Optimal Cluster K Value Based on CH Index

The CH indicator describes the compactness by intra-class dispersion matrices. And the disparity matrix between classes describes the degree of separation. The indicators are defined as:

$CH\left(k\right)=\frac{trB\left(k\right)/\left(k-1\right)}{trW\left(k\right)/\left(n-k\right)}$

Among them, “n” denotes the number of clusters. “K” denotes the current class. “ $trB\left(k\right)$ ” denotes the trace of the disparity matrix between classes and “ $trW\left(k\right)$ ” denotes the trace of the intra-class dispersion matrix.

From the definition of the CH indicator, it can be known that the greater the CH indicator is, the closer the class itself is and the more dispersed is the class and another class, that is, the clustering result is better. In order to measure the effectiveness of the clustering results, the CH indicator was selected to measure the effectiveness of the cluster. K values ranged from 2 to 8, and cluster centers were randomly selected for clustering. Each k value was clustered 10 times and its average CH value results are as follows Figure 1.

According to the result graph, when the k value is 4, the clustering result is optimal.

2.4. Trading Period Clustering Results

Select k = 4 as the number of clusters, apply transaction time, volume, success rate, response time as the transaction category attribute, and cluster the transaction date to obtain the clustering result (Figure 2).

According to the results of the specific classification, the date of transaction data is basically divided into four periods: before the Spring Festival, after the Spring Festival, on the working day, and on the non-workdays.

K-means clustering is performed on the daily time period according to the above date classification. All dates in the four date categories are selected. The average value of each category attribute is obtained, and the timeline data of the transaction data is plotted (Figure 3).

From the trend of line chart, we know that the daily transaction volume trends are basically the same in all time periods. The transaction volume gradually increases from 0 o’clock. At midday, there is a small downtrend in transaction volume and then it rises and eventually begins to decrease. The trough periods and peak periods are more obvious. Therefore, the k-means clustering

Figure 1. CH indicator mean chart.

Figure 2. Date clustering result graph.

analysis is also performed on it, and the k value is set to 2 to obtain the time clustering result as shown in the following Table 2.

The trough periods and peak periods are basically the same in all periods. The clustering results are good.

3. The Mathematical Model

3.1. Calculating Local Outliers of Trading Data Based on LOF Algorithm

3.1.1. Principles of LOF Algorithm

Basic variable definition:

1) The distance between two points, p and o, is the difference between two points of data, where the trading volume distance is:

${d}_{1}\left(p,o\right)={x}_{1}\left(p\right)-{x}_{1}(0)$

The distance of transaction success rate is:

${d}_{\text{2}}\left(p,o\right)={x}_{\text{2}}\left(p\right)-{x}_{\text{2}}(0)$

The distance of response time distance is:

${d}_{\text{3}}\left(p,o\right)={x}_{\text{3}}\left(p\right)-{x}_{\text{3}}(0)$

Figure 3. Time trading data line chart.

Table 2. Clustering results at various periods.

2) The k-th distance: The distance from the point k away from the point p to p, excluding p.

3) The k-th distance neighborhood: The k-th distance neighborhood of point p, that is, all points within the k-th distance of, including the k-th distance. Therefore, the number of k-th neighbors of p.

4) Reachable distance:

$reach\text{-}distance\left(p,o\right)=\mathrm{max}\left\{{d}_{k}\left(o\right),d\left(po\right)\right\}$

5) Local accessible density: the higher the density is, the more likely it is to belong to the same cluster. The lower the density, the more likely it is to be an outlier. The local reachable density of point p is expressed as:

$lr{d}_{k}\left(p\right)=\frac{1}{{\displaystyle {\sum}_{o\in {N}_{k}\left(p\right)}reach\text{-}dist\left(p,o\right)}/\left|{N}_{k}\left(p\right)\right|}$

6) Local outlier factor: indicates the degree of abnormality of the data objects, and its size reflects the degree of isolation of the data object relative to the points in its data area, which referred to as:

$LO{F}_{k}\left(P\right)=\frac{{\displaystyle {\sum}_{o\in {N}_{k}\left(p\right)}\frac{lr{d}_{k}\left(o\right)}{lr{d}_{k}\left(p\right)}}}{\left|{N}_{k}\left(p\right)\right|}$

The basic operation is as follows [5] [6] [7] :

1) Query the neighborhood of each data object p in the overall data set d, and obtain the neighborhood ${N}_{k}\left(p\right)$ recalculation distance.

2) Sort the distance, and calculate the k-th distance and the k-th field.

3) Calculate the reachability density of each transaction data.

4) Calculate the local outlier factor for each transaction data.

5) Sort and output local outlier factors for each transaction data.

3.1.2. The Results of Local Outlier Degree Calculation

Based on the periods divided in question 1, the average transaction time, success rate, and response time for each of the eight periods were used to calculate the degree of outliers at each time point. Taking the maximum value of outliers as the abnormal condition under the mean value, the abnormal regions in each period are set as follows Table 3.

According to this, the degree of local outliers of the transaction data factors at any time of the day can be determined, and the abnormality can be determined based on the abnormality threshold. The following are the outlier excursions and abnormal point discrimination charts at each time on the 1.23 day (Figure 4, Figure 5).

3.2. Optimizing Anomaly Threshold Based on SAX Algorithm

3.2.1. Principles of SAX Algorithm

Symbolic Aggregate Appro-Ximation (SAX) is a data compression and information extraction method that can discretize data sequences and convert them into

Table 3. Abnormality threshold of all factors.

Figure 4. The degree of outliers of transaction volume.

Figure 5. Abnormal points of transaction volume.

symbol sequences according to the characteristics of data density [8] . The algorithm steps are as follows:

1) Z-score standardization of all data is converted into data that conforms to the standard normal distribution. The conversion function is:

${x}^{*}=\frac{x-\mu}{\sigma}$

2) A segmented aggregate approximate conversion PAA is performed on the original time series. The total length is n, and the normalized time series are divided into w groups one by one in chronological order. Then find the arithmetic mean value m of each set of sequences, and use m to replace the value of the entire sequence set, reduce the dimension of the original data by about n/w, and change the fluctuating time series into a staircase sequence.

3) Divide the probability density curve of N(0, 1) into a interval functions according to probability, replace the PAA segment with discrete letters, and complete the symbolization of the sequence.

4) Similarity measure and comparison of symbol sequences. Assuming that P, Q are two symbol sequences, and denotes the value of the ith element of the corresponding symbol sequence, then the distance between symbol sequences is defined as:

$D\left(P,Q\right)=\sqrt{\frac{n}{w}{\displaystyle \underset{i=1}{\overset{w}{\sum}}{\left(dist\left({p}_{i},{q}_{i}\right)\right)}^{2}}}$

where

$dist\left({p}_{i},{q}_{i}\right)=\{\begin{array}{l}0,\text{}\text{\hspace{0.05em}}\text{\hspace{0.05em}}\text{}\left|{p}_{i}-{q}_{i}\right|\le 1\\ {b}_{\mathrm{max}\left({p}_{i},{q}_{i}\right)-1}-{b}_{\mathrm{min}\left({p}_{i},{q}_{i}\right)-1}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\left|{p}_{i}-{q}_{i}\right|\ge 1\end{array}$

b is the area split point under the normal distribution curve.

3.2.2. The Results of Symbolic Aggregate Approximation

After eliminating the outliers in the original sequence, symbolic aggregation approximation processing is performed on the transaction data. Taking the 23rd in April transaction volume as an example, the PAA conversion graph is as follows Figure 6.

After replacing the interval segment with discrete letters, the symbolized data is as follows Figure 7.

3.2.3. Approximate Abnormal Threshold Based on Symbolic Aggregation

Calculate the approximate result of the symbol aggregation for the average data of each time period, and calculate the distance between the symbolization result and the mean value at each period of the transaction data, and perform the following test:

1) If $dist\left({p}_{i},{q}_{i}\right)=0$ , keep the original abnormal threshold;

2) If $dist\left({p}_{i},{q}_{i}\right)>\text{0}$ , increase the abnormality threshold.

After expanding the abnormal threshold, new abnormal points are deleted,

Figure 6. Ladder diagram of transaction volume.

Figure 7. Symbol map of transaction volume.

and the data is checked and corrected again until all the time periods have passed the test. The correction of the threshold value can increase the sensitivity of the anomaly detection model and minimize the false alarms and omissions of abnormal data. Based on the initial abnormality threshold SAX test results are as follows Table 4.

From Table 4, it can be seen that the initial abnormality threshold of transaction volume is good, and the transaction success rate and response time abnormality thresholds need to be optimized. After optimization by the cyclic SAX test, the optimized abnormality threshold is as follows Table 5.

Table 4. The results of initial anomaly threshold SAX test.

Table 5. The results of optimized anomaly threshold SAX test.

3.2.4. The Accuracy of Anomaly Detection

According to the initial abnormality threshold and the optimized abnormality threshold, the trading data is anomalously detected. The detection accuracy rate is as follows Table 6.

Table 6. The accuracy of transaction volume.

From Table 6, we can see that, except for the transaction volume threshold, the transaction success rate and response time anomaly detection accuracy rate are significantly improved, indicating that the optimization anomaly threshold has better properties than the initial anomaly threshold.

4. Conclusions and Suggestions

From the above results, it can be known that the LOF algorithm can measure the degree of local outliers of data points well and thus can be used in anomaly detection. However, the abnormal threshold setting is often subjective. Based on the SAX algorithm, the deletion of outlier data can be tested, which can effectively find the deficiency of artificially set abnormal threshold, so as to adjust and improve. Applying SAX to test the adjusted abnormality threshold can greatly improve the accuracy of anomaly detection.

References

[1] Yin, X.X., et al. (2012) Supermarket Cash Register Management System Analysis and Design. Market Modernization, 2, 6-7.

[2] Prater, E., Frazier, G.V. and Reyes, P.M. (2005) Future Impacts of RFID on E-Supply Chains in Grocery Retailing. Supply Chain Management: An International Journal, 10,134-142.

[3] Chen, M. (2016) Research on Credit Card Fraud Detection Based on Fuzzy Local Outlier Factor (LOF). Financial Theory and Practice, 10, 54-57.

[4] Hu, W., et al. (2016) Intelligent Distribution Network Fault Identification Method Based on LOF and SVM. Power Automation Equipment, 36, 7-12.

[5] Li, L., Huang, L.S., Yang, W., Yao, X.H. and Liu, A. (2015) Privacy-Preserving LOF Outlier Detection. Knowledge and Information Systems, 42, 579-597.

https://doi.org/10.1007/s10115-013-0692-0

[6] Kang, B., Kim, D. and Kang, S.-H. (2011) Real-Time Business Process Monitoring Method for Prediction of Abnormal Termination Using KNNI-Based LOF Prediction. Expert Systems with Application: An International Journal, 12, 6061-6068.

[7] Zhou, P., et al. (2017) An Improved LOF Outlier Detection Algorithm. Computer Technology and Development, 27, 115-118.

[8] Lin, J., Keogh, E., Lonardi, S. and Chiu, B. (2003) A Symbolic Representation of Time Series with Implications for Streaming Algorithms. Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, 13 June 2003, 2-11.

https://doi.org/10.1145/882082.882086