Research and Case Analysis of Apriori Algorithm Based on Mining Frequent Item-Sets

Show more

1. Introduction

1.1. Research Background and Purpose

China has fully entered the information age. Under the influence of the situation, this development has been promoted. People who did not have much contact with the Internet before have to start working online. And this is China, which has the largest number of netizens, and it generates more and more data resources. If used properly, it will generate more explicit or hidden benefits for China and reduce the GDP loss caused by the situation.

Data mining is a tool for obtaining knowledge and valuable information. Making good use of association rules and related algorithms can help the country make better predictions, provide decision support, and it is also a means to avoid outbreaks. In other industries, you can also formulate sales strategies based on customer needs to maximize your sales.

This topic is aimed at the Apriori algorithm, combined with related data, analyzes the algorithm, intersperses with my own understanding, and provides some suggestions on the improvement of the Apriori algorithm and its application in life.

1.2. Research Status in China and Abroad

The Apriori algorithm is an algorithm for mining association rules developed by Agrawal and Krishnan in 1994. After that, a variety of association rule algorithms improved based on the Apriori algorithm have also been produced.

The FP-growth algorithm proposed by Professor Han Jiawei in 2000 is one of them. It only needs to scan the database twice, which greatly saves the running time of the algorithm and does not generate candidate sets. Compared with the Apriori algorithm, the algorithm has a great improvement in algorithm efficiency, but it cannot find the association rules between data.

In recent years, Chinese research and applications on this issue have shown results in many fields. Such as, the personalized push in e-commerce which can satisfy consumers while expanding revenue; improvement of teaching work in colleges and universities; in terms of medical treatment, the degree of association between the disease and various factors, etc.

As shown in Figure 1, Figure 2, the visual analysis data obtained from CNKI can be seen that Chinese research in the Apriori algorithm field has grown rapidly since 2000 and reached its peak in 2009, but the research has not declined. The trend continues until now. In these years, the University of Electronic Science and Technology of China has published a large number of documents, and other universities have published a large number of related papers.

Figure 1.Recent trends of Chinese “Apriori algorithm”.

Figure 2.Number of documents published by Chinese universities on “Apriori algorithm”.

Figure 3.Recent Trends of “Apriori Algorithm” Abroad.

Following the proposal of the Apriori algorithm, many researchers abroad have successively proposed improved methods for the Apriori algorithm. For example, DHP algorithm and PCY algorithm. As shown in Figure 3 and compared with Figure 1, we can understand that the research on Apriori algorithm abroad is basically the same as that in China, but until 2009, the research on this has been tepid, with low attention, which is the opposite of Chinese research. After 2009, foreign research on the Apriori algorithm has gradually become hot topic, and there will be a growing trend in 2020.

1.3. Main Research Contents

This paper gives an explanation of the related concepts of data mining and association rules, and gives a brief description of the process and steps of the implementation of data mining and association rules, in order to better understand the process and related concepts of Apriori algorithm and the nouns in it.

In addition, this topic deeply analyzed the steps, properties, advantages and disadvantages of Apriori algorithm, and gives relevant examples to better explain the steps and implementation process of Apriori algorithm, and talk about the application of Apriori algorithm in some fields, as well as the results achieved.

2. Related Theories of Data Mining and Association Rules

2.1. Overview of Data Mining

Data mining is a process of digging out deep and valuable information for data users from large amounts of irregular data. Data mining is one of the important applications of database. Its function is to discover the hidden information of data.

Some people equate data mining with knowledge discovery, but the information from data mining may not necessarily form knowledge. In this respect, knowledge discovery can be regarded as the next operation or core component of the excavated data, collating, identifying and summarizing the excavated information, and finally forming knowledge.

The process of data mining:

First, select data: according to the actual needs, select the required data from the database.

Second, data pretreatment: simple processing of selected data, noise reduction, reduction of data redundancy and useless data, etc.

Third, data selection: according to the desired information, select the data that are simply processed.

Fourth, data transformation: the selected data is changed into Boolean data or unified into a data form. Boolean data is more suitable for the relevant algorithms of data mining.

Fifth, select the data mining method and algorithm: according to the target data type of mining, choose the appropriate mining method. Then according to the selection of data mining methods, combined with the characteristics of the data and actual needs, select the best data mining algorithm, such as Apriori algorithm, K-means algorithm.

Sixth, data mining: get hidden information.

Seventh, evaluation of data mode: delete the unnecessary information and leave the needed information through evaluation.

Eighth, representation of knowledge: make visual analysis and analysis of the information left behind to form easy-to-understand knowledge (Lin, 2017).

Method of data mining: The analysis methods of data mining mainly include the following five kinds: clustering analysis, classification and prediction, outlier analysis and association rule analysis (Liu, 2004).

2.2. Overview of Association Rules

Association rules reflect the interdependence and correlation between the two, and are used to dig out the correlation between valuable data items from a large number of data. Its purpose is to find frequent item-sets and strong association rules.

A typical case of association rules is the “Diapers and beer”. As shown below, in some supermarkets in the United States, when young fathers with new children went to buy diapers after work, nearly half of them also bought beer. Some stores that put diapers and beer next to each other had significantly higher sales than those that did not, and their sales have also increased after these supermarkets put diapers and beer next to each other (Jiao, 2013).

This is an application of association rules—shopping basket analysis. According to the goods in the customer’s shopping basket, the rules are found and frequent item-sets is generated, that is, which goods will be purchased by the customer several times at the same time. These association rules can be used in merchants’ marketing strategies, such as product promotion, product region division, etc.

3. Apriori Algorithm

It is an algorithm based on mining Boolean association rules. After each set of frequent item-sets is generated, the whole database is scanned and the association rules between data are mined from the generated frequent item sets, give us decision support.

3.1. The Idea of Apriori Algorithm

An item-set is a set of 0 or more items, and a frequent item-set is an item-set whose support is greater than the custom minimum support count (Shabtay et al., 2021).

Common evaluation criteria for frequent item-sets:

1) Support: It is one of the two basic parameters of association rules. It is the ratio of the number of transactions containing both *x *and *y* in All sample of dataset *D* to all transactions. If we have two data *x* and *y* that need to be analyzed for correlation, then the corresponding support degree is:

$\text{Support}\left(x\text{,}y\right)=P\left(xy\right)={\scriptscriptstyle \frac{\text{num}\left(xy\right)}{\text{num}\left(\text{AllSamples}\right)}}$. (1)

$\text{Support}\left(X\Rightarrow Y\right)=P\left(X\cup Y\right)={\scriptscriptstyle \frac{count\left(X\cup Y\right)}{\left|D\right|}}$. (2)

For example, a support rating of 28% means that “there is a 28% probability that an individual in the population will contain both *X* and *Y*”.

2) Confidence: It is the ratio of the number of transactions including *x* and *y* to the number of transactions including *y*, namely conditional probability.

$\text{Confidence}\left(x\Rightarrow y\right)=P\left(x|y\right)={\scriptscriptstyle \frac{P\left(xy\right)}{P\left(y\right)}}$. (3)

$\text{Confidence}\left(X\Rightarrow Y\right)=P\left(X|Y\right)={\scriptscriptstyle \frac{\text{Support}\left(X\cup Y\right)}{\text{Support}\left(X\right)}}$. (4)

Assuming that “52% of the terms containing *X* contain *Y*”, the confidence is 52%.

3) Lift: It is the ratio of the number of transactions containing *x* under the premise of including y to the total number of transactions occurring in *x*.

$\text{Lift}\left(x\Rightarrow y\right)={\scriptscriptstyle \frac{P\left(x|y\right)}{P\left(x\right)}}={\scriptscriptstyle \frac{\text{Confidence}\left(x\Rightarrow y\right)}{P\left(x\right)}}$. (5)

Lift uses 1 as the target value to show the relationship between *x* and *y*. If the value is greater than 1, then
$x\Rightarrow y$ is a valid strong association rule. Conversely,
$x\Rightarrow y$ is an invalid strong association rule. When the value is equal to 1, however, there is a special case, that is, the *x* and *y* at independence, at the time
$P\left(x|y\right)=P\left(x\right)$, so
$\text{Lift}\left(x\Rightarrow y\right)=1$ (Zhou et al., 2010).

Only custom minimum support, or a combination of custom support and confidence, can determine the frequent item-sets in the database.

Its core is to retrieve all frequent item-sets, and find all item-sets that are greater than or equal to the support by setting the minimum support count and iterating continuously.

3.2. Steps of Apriori Algorithm

The Apriori algorithm consists of two steps: connecting and pruning.

Connecting: The target is *L _{k}* (

$\left({l}_{1}\left[1\right]={l}_{2}\left[1\right]\right)\wedge \left({l}_{1}\left[2\right]={l}_{2}\left[2\right]\right)\wedge \cdots \wedge \left({l}_{1}\left[k-2\right]={l}_{2}\left[k-2\right]\right)\wedge \left({l}_{1}\left[k-l\right)<{l}_{2}\left[k-1\right]\right)$

Pruning:
${L}_{k}\subset {C}_{k}$, *C _{k}* is a superset of

The steps of code design are as follows:

Input: Data set *D*, minimum support count α.

Output: the largest frequent *k* item-set.

1) Scan the entire database, arrange all the data in the data set, and get *C*_{1}, which is the candidate frequent 1 item-set. *k *= 1, frequent 0 item-sets are empty sets.

2) Mining frequent *k* item-sets:

a) Scan the database and filter the data sets larger than α.

b) Remove the item-sets whose support degree is lower than α in *C _{k}*, and get

c) When the item-set in *L _{k}* has two or more items, the connection generates

3) Let*k *= *k *+ 1, repeat step 2.

The code steps make the disadvantages of Aprior algorithm, scanning the database every iteration, appear on paper. This also leads to the low efficiency of the algorithm code when the database is large and there are many data to be mined (Zhu, 2014).

3.3. Properties of Apriori Algorithms

The two properties of Apriori algorithm can effectively reduce the search complexity of the algorithm, reduce the amount of computation, and improve the efficiency of mining frequent item-sets.

First, if X is a frequent item-set, then all subsets of X are frequent item-sets.

Second, if X is a non-frequent item-set, then all supersets of X are non-frequent item-sets (Yu et al., 2010).

3.4. The Pros and Cons of Apriori Algorithm

The space complexity of the algorithm is small, and the coding difficulty of the programmer is reduced. Through the nature of the algorithm and pruning operation, many repeated operations can be avoided, and the running speed of the algorithm is improved. If a given minimum support threshold is large, the number of database scans will be significantly reduced. It is suitable for sparse data, and has high ductility and high exploitability (Wang, 2010).

If there is more data in the transactional database, the number of items in the candidate set will increase exponentially, which will consume more memory and create a burden. However, when the candidate set is pruned, the database must be scanned, which will consume more time and reduce the efficiency of the algorithm.

This results in an exponential decline in the efficiency of the algorithm as the size of the database increases.

In order to solve these problems, many algorithms are derived to improve the mining efficiency based on Apriori: FP-growth algorithm, hash-based technology, thing reduction technology, partition technology, etc. (Song, 2010).

3.5. Example of Apriori Algorithm

Figure 4 shows the purchase records of some consumers extracted from the transaction database of a retail store in Bozhou County, Anhui Province, China, so that the minimum support count (min_support) is 2. The process of mining frequent itemsets is as follows:

The supermarket consumer purchase record database D is shown in Figure 4.

Figure 4. Supermarket consumer purchase record database D.

Step 1: scan the entire dataset to get I = {Potato Chips, Ham Sausage, Coke, Milk, Instant Noodles}, and all the data obtained at this time is used as the 1 candidate set C_{1}. Compare the support count of all items with min_support, and strip out the data that is less than the threshold. In this case, all items should be greater than or equal to the min_support, and the frequent item-set L_{1} should be obtained.

Candidate set C_{1} and frequent item-set L_{1} are shown in Figure 5.

Step 2: use the items left by item-set L_{1} to perform the connection operation *l*_{1}*l*_{2}. And scan the whole data set again, count the data of two projects at the same time, calculate the degree of support, and form 2 candidate set C_{2}. Similar to step 1, compare min_support and strip out unqualified item-sets (hereinafter referred to as “compare and deletion”) to obtain frequent item-set L_{2}.

Candidate set C_{2} and frequent item-set L_{2} are shown in Figure 6.

Step 3: use the items left by item-set L_{2} to perform the connection operation, then the project contains three commodities, and scan the whole data set again, count the data of three items at the same time, calculate the degree of support, and form three candidate sets C_{3}. At this time, C_{3} contains many non-frequent item-sets, which need to be pruned to form candidate set P_{3}, which is compared and deleted to get frequent item-set L_{3}.

Candidate set C_{3} is shown in Figure 7, candidate set P_{3} and frequent item-set L_{3} are shown in Figure 8.

Step 4: repeat the above steps, after C4 pruning, P4=∅. The algorithm ends, all frequent item-sets are retrieved.

3.6. Application of Apriori Algorithm

Apriori algorithm is involved in various fields, including e-commerce, university

Figure 5. Candidate set C_{1} and Frequent item-set L_{1}.

Figure 6. Candidate set C_{2} and Frequent item-set L_{2}.

Figure 7. Candidate set C_{3}.

Figure 8. Candidate set P_{3} and Frequent item-set L_{3}.

information system, big data platform and so on.

Application of apriori algorithm in e-commerce: In the aspect of e-commerce, through mining association rules for a large amount of data, we can achieve the purpose of understanding customers’ consumption habits, and can more effectively carry out market analysis, market prediction and the selection of target customers to achieve personalized service, so that enterprises can carry out targeted sales activities and improve sales efficiency and market competitiveness (Wu, 2012).

Application of Apriori algorithm in universities: The application of Apriori algorithm in colleges and universities is analyzed in two points. First, it is the use of association rules algorithm, including Apriori algorithm in teaching evaluation. By using the data obtained from the students’ evaluation of the curriculum at the end of the semester, the association rules are analyzed from different factors, such as the interaction between students and teachers, the use of multimedia courseware, teaching content and style, to help the teaching management department to make decisions, and to provide feedback to substitute teachers to help teachers improve their teaching quality and efficiency (Zhang, 2017).

The second is the employment field of graduates. Now colleges and universities advocate personalized autonomous learning, school elective courses and Chinese universities MOOC and other online learning platforms, but also provide teaching services, college students can learn what they are interested in both online and offline. Therefore, in addition to their majors, college students also have other skills. Through correlation analysis, they can push more and more employment opportunities for graduates who are about to find jobs.

Third, the data mining of students’ life behavior, because of students’ self-esteem, some poor students will give up applying for poverty grants, so they cannot achieve a good financial aid effect. However, according to the credit card records of students’ campus cards, we can know the living standards of students and subsidize them through these data.

The fourth is the application in the detection of academic misconduct. By scanning the articles and judging the repetition rate, we can avoid academic fraud and ensure the quality of scientific research work.

4. Conclusion

Nowadays, the related algorithms of association rules have been successful in many fields. Through this topic, we can have a deeper understanding of the specific process and core of Apriori algorithm, and better understand the value of association rules. The main work of this paper includes the following points.

The main contents are as follows:

1) The basic theories of data mining and association rules are briefly described and the related processes of them are analyzed.

2) The Apriori algorithm is analyzed deeply. Through the example of mining association rules, the relevant steps and achievable results of the algorithm are explained.

3) The application of Apriori algorithm is simply analyzed, such as mining potential customers in e-commerce and teaching evaluation analysis of colleges and universities.

4) Suggestions for improving the Apriori algorithm: first, by sorting the generated candidate set, the subsequent candidate set can be formed by scanning the previous candidate set and ending when the min_support threshold is scanning. Pruning can be carried out to reduce the amount of computation and improve the efficiency of the algorithm; second, when scanning the database, remove the data of the items whose number of items is less than the candidate set, that need to be checked to improve the efficiency of scanning the database.

References

[1] Jiao, D. W. (2013). Research and Improvement of an Association Rule Algorithm. Ph.D. Thesis, Changchun: Changchun University of Science and Technology.

[2] Lin, P. (2017). Research and Application of Two Improved Key Rule Mining Algorithms. Ph.D. Thesis, Hefei: Hefei University of Technology.

[3] Liu, H. (2004). Research on the Theory and Method of Knowledge Discovery and Its Application. Ph.D. Thesis, Dalian: Dalian University of Technology.

[4] Qian, D. Y. (2006). Research on Association Rules Algorithm in Data Mining. Ph.D. Thesis, Tianjin: Tianjin University.

[5] Shabtay, L., Fournier-Viger, P., Yaari, R., & Dattner, I. (2021). A Guided FP-Growth Algorithm for Mining Multitude-Targeted Item-Sets and Class Association Rules in Imbalanced Data. Information Sciences, 553, 353-375.

https://doi.org/10.1016/j.ins.2020.10.020

[6] Song, X. D. (2010). Research on Key Technologies of Enterprise Group Data Warehouse System. Ph.D. Thesis, Dalian: Dalian University of Technology.

[7] Wang, D. (2010). Research and Implementation of Apriori Algorithm for Association Rules Based on Cloud Computing. Ph.D. Thesis, Nanchang: Nanchang University.

[8] Wu, W. B. (2012). Apply the Data Mining Technology of Apriori Association Rule Algorithm to Mine E-Commerce Potential Customers. Ph.D. Thesis, Zhejiang: Zhejiang University of Technology.

[9] Yu, M., Hu, M., King, K., Hu, L., & Zhao, K. (2010). Telecom Network Alarm Application of Association Rule Algorithm. Journal of Jilin University (Information Science Edition), 28, 264-269.

[10] Zhang, Q. H. (2017). Research on Association Rules Algorithm for Big Data. Ph.D. Thesis, Xi’an: Xi’an University of Science and Technology.

[11] Zhou, J. Z., Liu, J., Yu, C. Y. (2010). Research and Application of Data Mining Based on Web Log. Science, Technology and Engineering, 10, 2762-2766.

[12] Zhu, Y. (2014). Research on the Improvement of Apriori Algorithm Based on Array. Ph.D. Thesis, Harbin: Harbin normal University.