Nowadays, crime situation is becoming increasingly serious across the globe with more crime types and a higher number of criminals, posing a threat to human lives and property as well as social stability. Public security authorities are increasingly tasked with maintaining public security and fighting crimes with an ever-growing requirement on law enforcement. Given the constantly generated crime data, it is necessary for data analysts to reveal hidden patterns in the data, analyze implicit relationships between the data, predict occurrence of crimes and discover potential criminals, so as to improve the efficiency of law enforcement efficiency of public security authorities and prevent occurrence of crimes.
Association rule mining (ARM)  is a research focus in the field of data mining. As one of the classic algorithms for data mining, ARM has been widely used in crime research. It is possible to extract relevant criminal evidence from association rules of a large number of data items, and further explore the patterns, trends and links between different crimes, so as to provide support for the police in case investigation and crime prevention. ARM performs well in exploring the causes of crime, identifying the main crime suspects, and having a deeper insight into crime series. A variety of ARM algorithms have been playing an important role in crime analysis and crime prediction. In the international research community of association-rule-based crime mining, Ng et al.  introduced temporal association rules and proposed an incremental algorithm to solve the problem of how to process time series whose association rules contain time expressions, and employed the new algorithm to discover crime patterns in Hong Kong. Buczak et al.  explored applying fuzzy association rule mining to recognition of community crime patterns, and such application promoted the efficiency of local law enforcement. Tan et al.  were the first to analyze the role of the FP-growth algorithm in computer crime forensics, specifying the limitation that the FP-growth algorithm had when used to discover the latest crimes and serious crimes, and making some improvements to the FP-growth algorithm and its test. Joshi et al.  proposed the FP-Tree similarity algorithm and found it more effective than the Apriori algorithm. Usha et al.   tested the Apriori algorithm and other algorithms such as the Fp-growth algorithm in real and synthetic crime data sets, and found that each algorithm had its own advantages. Shekhar et al.  explored spatial frequent pattern mining (SFPM) based on criminal pattern analysis (CPA) and validated this mining method with a spatial crime dataset. Isafiade et al.  revisited frequent pattern growth models for crime pattern mining, proposing a revised frequent pattern growth (RFPG) model, and also proposed a descriptive statistical approach based on a quartile (floor-ceil) function, which was used to identify recurring trends in criminal activity. Based on geographic and demographic factors, Asmai et al.  used ARM to generate a crime mapping model for crime analysis, employing the model to examine the occurrence of crime at a specific location, and demonstrating that the model could be used to analyze future crime locations with relatively high crime potential and improve crime prevention implementation. Extensive research has been conducted in China on ARM-based crime mining. Based on fuzzy set and rough set theory, Chen and Yu   employed ARM to quantitatively analyze a criminal dataset, make deductions, and extract rules, providing theoretical guidance for crime prevention. ARM has gained widespread attention and application in the fields of criminal portraits and criminal forensic analysis  . Moreover, ARM has been widely used in crime research such as crime investigation  , criminal suspect analysis  , criminal behavior analysis  , and reoffending   . In view of the temporal and spatial characteristics of crime data, many studies have proposed a number of improved ARM algorithms such as spatio-temporal association rules   , cluster association rules  , and data cube-based association rules  , as well as other improved algorithms such as incremental association rules  .
In summary, association rules have been extensively applied to crime mining in relevant studies, the effectiveness of ARM in different fields of crime mining has been investigated in depth, and a large number of improved ARM algorithms have been proposed. These studies are mainly focused on using crime data for association rules mining, but in practice more frequently encountered is business data, which is not necessarily related to crime but is easier to obtain, thereby making it crucial to extract crime information from a large volume of ordinary business data. Moreover, existing studies are largely focused on crime case analysis and crime pattern mining instead of the mining and prediction of potential criminals, and most of the studies are focused on macro-level factors that affect the occurrence of crimes while not considering micro-level characteristics of criminals, while the micro-level characteristics are the internal factors that determine whether an individual would commit a crime. This study combines ARM algorithm and clustering algorithm to deal with crime and related data. This method is different from previous crime mining methods. It can directly find criminal suspects instead of criminal hotspots or others, and it makes full use of ordinary business data, which is easier to access and handle. In this study, ARM is performed on city S-related travel data and accommodation data of criminals and ordinary people, and meanwhile, tag clustering is performed on criminals and ordinary people, in order to discover as criminal suspects, the ordinary people who not only frequently travel with criminals but also have highly similar tags to them. Given that criminal acts are often carried out in the form of gangs, the people who are discerned to travel with the criminals and have a high similarity to the criminals in tags can be considered as potential key individuals. Monitoring criminal suspects can greatly reduce the incidence of infraction and crimes and improve public safety. In Section 2, we briefly describe the data used in the study and the preprocessing of the data. In Section 3, we provide the details of our business process modeling methodology. In Section 4, we describe the research results and analysis in detail, and test and compare the algorithm. In Section 5, we summarize the conclusions of this work.
2. Research Data
City S-related travel and accommodation data of criminals and ordinary people in 2016 as well as their personal tag data at that time are collected. The travel and accommodation data consist of shuttle ticketing data and hotel accommodation data (Table 1). A total of 600,000 and 2 million personal attribute data are collected on criminals and ordinary people, respectively, with the data consisting of personal ID numbers and relevant personal tags (Table 2). The criminal data is divided into a training data set and a validation data set at a 1:1 ratio. The training data set is used for criminal suspect-related computation, while the validation data set is used for verification of the method effectiveness by checking whether actual criminals are among the criminal suspects.
Clustering analysis and similarity calculation are performed on vector inputs. Tag vectorization is conducted according to the publicly accessible corpus of Chinese word vectors developed by Beijing Normal University and Renmin University of China―a corpus providing pre-trained 300-dimensional (300 d) character vectors. Each tag vector is calculated according to the following formula:
where V is a calculated tag vector, with m representing the number of Chinese characters in the tag and a character vector of the tag’s i-th character. Each individual has 12 tags, and therefore the tag matrix of each individual is 12 × 300 in size.
3. Research Methods
This paper first uses the FP-growth algorithm to mine the ordinary people who frequently travel with the criminals as criminal suspects based on travel data and hotel accommodation data; Then use the DBSCAN algorithm to cluster and analyze the label data of criminals and ordinary people, and obtain several tag clusters. Carry out the similarity calculation of the criminals and ordinary personnel in each cluster, and find some ordinary persons with the highest similarity with the criminals as criminal suspects; Finally, the criminal suspects based on the association rules and the criminal suspects based on the label clustering are interdigitated to obtain the final criminal suspects, and the criminals test data is used to check whether calculated criminal suspects obtain actual criminals.
Table 1. Travel accommodation data.
Table 2. Personal tags and meaning of the tag values.
The research method flow is shown in Figure 1.
3.1. Association Rule Mining Based on FP-Growth
Association rule mining is a procedure which is meant to find frequent patterns, correlations, associations, or causal structures from data sets found in various kinds of databases such as relational databases, transactional databases, and other
Figure 1. Research method flow chart.
forms of data repositories. Association rules are created by thoroughly analyzing data and looking for frequent if/then patterns. Then, depending on the following two parameters, the important relationships are observed:
1) Support: Indication of how frequently the itemset appears in the database. It is defined as the fraction of records that contain X∪Y to the total number of records in the database. Suppose, the support of an item is 0.1%, it means only 0.1% of the transactions contain that item.
2) Confidence: Fraction of the number of transactions that contain X∪Y to the total number of records that contain X.
It’s is a measure of strength of the association rules. Suppose, the confidence of the association rule X ⇒ Y is 80%, it means that 80% of the transactions that contain X also contain Y together.
The FP-Growth Algorithm, proposed by Han in  , is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure for storing compressed and crucial information about frequent patterns named frequent-pattern tree (FP-tree). In his study, Han proved that his method outperforms other popular methods for mining frequent patterns, e.g. the Apriori Algorithm  and the TreeProjection  . In some later works   it was proved that FP-Growth has better performance than other methods, including Eclat  and Relim  . The popularity and efficiency of FP-Growth Algorithm contributes with many studies that propose variations to improve his performance.
The FP-Growth Algorithm is an alternative way to find frequent itemsets without using candidate generations, thus improving performance. For so much it uses a divide-and-conquer strategy. The core of this method is the usage of a special data structure named frequent-pattern tree (FP-tree), which retains the itemset association information.
In simple words, this algorithm works as follows: first it compresses the input database creating an FP-tree instance to represent frequent items. After this first step it divides the compressed database into a set of conditional databases, each one associated with one frequent pattern. Finally, each such database is mined separately. Using this strategy, the FP-Growth reduces the search costs looking for short patterns recursively and then concatenating them in the long frequent patterns, offering good selectivity.
Based on travel data and hotel accommodation data, this study compares each car or daily hotel accommodation to a shopping basket, where the item is the ID of the person and the item set is all the people in the car or hotel. With the minimum support and confidence, the FP-growth algorithm is used to calculate the association rules between criminals and ordinary people. That is to mine ordinary people who frequently travel or stay with criminals as criminal suspects.
3.2. DBSCAN Tag Clustering and Cosine Similarity
As one of the most cited of the density-based clustering algorithms, DBSCAN is likely the best known density-based clustering algorithm in the scientific community today. The central idea behind DBSCAN and its extensions and revisions is the notion that points are assigned to the same cluster if they are density-reachable from each other. To understand this concept, we will go through the most important definitions used in DBSCAN and related algorithms. The definitions and the presented pseudo code follows the original by Ester et al., but are adapted to provide a more consistent presentation with the other algorithms discussed in the paper. Clustering starts with a dataset D containing a set of points . Density-based algorithms need to obtain a density estimate over the data space. DBSCAN estimates the density around a point using the concept of ϵ-neighborhood.
Definition 1. -Neighborhood. The ϵ-neighborhood, N(p), of a data point p is the set of points within a specified radius around p.
where d is some distance measure and . Note that the point p is always in its own ϵ-neighborhood, i.e., always holds.Following this definition, the size of the neighborhood can be seen as a simple unnormalized kernel density estimate around p using a uniform kernel and a bandwidth of ϵ. DBSCAN uses and a threshold called minPts to detect dense regions and to classify the points in a data set into core, border, or noise points.
Definition 2. Point classes. A point is classified as
・ a core point if has high density, i.e., where is a user-specified density threshold,
・ a border point if p is not a core point, but it is in the neighborhood of a core point , i.e., , or
・ a noise point, otherwise.
Definition 3. Directly density-reachable. A point is directly density-reachable from a point with respect to and minPts if, and only if,
1) , and
That is, p is a core point and q is in its ϵ-neighborhood.
Definition 4. Density-reachable. A point p is density-reachable from q if there exists in D an ordered sequence of points with q = p1 and p = pn such that pi+1 directly density-reachable from pi .
Definition 5. Density-connected. A point is density-connected to a point if there is a point such that both p and q are density-reachable from o.
Definition 6. Cluster. A cluster C is a non-empty subset of D satisfying the following conditions:
1) Maximality: If and q is density-reachable from p, then ; and
2) Connectivity: , p is density-connected to q.
The DBSCAN algorithm identifies all such clusters by finding all core points and expanding each to all density-reachable points. The algorithm begins with an arbitrary point p and retrieves its ϵ-neighborhood. If it is a core point then it will start a new cluster that is expanded by assigning all points in its neighborhood to the cluster. If an additional core point is found in the neighborhood, then the search is expanded to include also all points in its neighborhood. If no more core points are found in the expanded neighborhood, then the cluster is complete and the remaining points are searched to see if another core point can be found to start a new cluster. After processing all points, points which were not assigned to a cluster are considered noise.
This study uses cosine similarity to calculate the distance. Refer to the existing study  to set ε to 0.6 and MinPts to 255. The DBSCAN algorithm clusters the vectorized criminals and ordinary people’s tag matrix, and calculates the similarity and sorting of the criminals and criminal suspects for each cluster. Then, ordinary people with similarities greater than 0.8 are selected from various clusters as criminal suspects. The cosine similarity calculation formula is as follows:
We use the machine learning algorithm library Spark Mllib in the distributed computing framework Spark to perform association rule calculation and cluster analysis. A total of five physical nodes are deployed for fast and efficient computing. Compared to stand-alone computing, Spark distributed computing time efficiency is increased by 300%; compared to Hadoop MapReduce distributed clusters, time efficiency is increased by 150%.
4. Results and Analysis
4.1. Analysis of Association Rule Results
Association rule computation is based on a given minimum support (min_sup) and a given minimum confidence (min_conf). A change of min_sup and min_conf will lead to a different result. The settings of min_sup and min_conf are allowed to vary according to actual needs. In this study, their settings are as follows:
1) Travel association rules
For ARM of travel data, each route (including route code, departure time, start station, arrival station) is treated as a transaction ID, and all passengers on the route are treated as transaction items. Computation is performed on shuttle-related data according to a given min_sup and min_conf, and the results are expressed in the form of “route passenger X => passenger Y”, which represents that when the requirements of minimum support and minimum confidence are met, passengers X and Y are considered to have taken the same route within a certain time window. However, it is unclear whether the above rules contain criminals, and therefore filtering operation is performed on all association rules to find the association rules that contain criminals. Based on each association rule that contains criminals, such as “X => Y” wherein X is assumed to be a criminal, it is possible to find all the routes that both X and Y have taken together. The ordinary individual Y who often takes the same routes as the criminal X can be considered as a criminal suspect.
The calculation results, as listed in Table 3, shows that there is a total of 433 association rules, that is, 433 criminal suspects are found. The five fields in Table 3 present the ID number of the criminal, the ID number of the criminal suspect, the confidence, the same routes that both the criminal and the criminal suspect have taken, as well as the number of the same routes taken by both the criminal and the criminal suspect―which is used as a surrogate for the support. The second rule in the table, as an example, states that the criminal with ID number 6eb9b7199bxxxxxxxx and the suspect with ID number 4ab8b81987xxxxxxxx have traveled 78 times with each other in the same routes, and the confidence of 0.9 indicates that 90% of all the routes that are taken by the criminal also involve the criminal suspect; in addition, the route information indicates that, for example,
Table 3. Travel association rules.
Note: A total of 433 association rules.
there is one route named 653124xxxxxxxx, which started at 14:50:00 on March 16, 2016 from the aks station to the awt station. As shown in the above table, different association rules have different levels of support and confidence. For example, the support and confidence in the first association rule are 45 and 1.0, respectively, indicating that the criminal 6eb9b8197cxxxxxxxx and the criminal suspect 6eb9b8a9f8xxxxxxxx have traveled together with each other for all of the 45 routes. The second association rule shows that nearly 70 of the 78 routes that the criminal 6eb9bg199bxxxxxxxx has taken involve the criminal suspect 4ab8b81987xxxxxxxx. It is impossible yet to know which of the criminal suspects 6eb9b8a9f8xxxxxxxx and 4ab8b81987xxxxxxxx has a higher likelihood of being an actual criminal. In the third association rule, the number of the same routes and the confidence are relatively smaller compared to the first two rules, indicative of a smaller likelihood of the criminal suspect 6deab8199c xxxxxxxx being an actual criminal.
2) Hotel accommodation association rules
For each hotel, all persons who check in for accommodation on a given day are treated as one transaction. Computation is performed on the accommodation data according to given min_sup and min_conf, and the results are expressed in the form of “hotel code person X => person Y”, indicating that passengers X and Y are considered to have stayed at this hotel given the fulfillment of the min_sup and min_conf requirements. By processing the hotel accommodation association rules in a similar way to the travel association rules, it is possible to find the criminal suspect Y that often stays in the same hotels as the criminal X.
Table 4 shows the analysis results of hotel accommodation association rule. There is a total of 323 association rules, that is, 323 criminal suspects are found. The information about criminals and criminal suspects staying in the same hotels is revealed in the analysis results, with the time denoting the hotel check-in
Table 4. Hotel accommodation association rules.
Note: A total of 323 association rules.
date; in addition, the times of staying in the same hotels refers to the times of the criminal and the criminal suspect staying in the same hotels during the studied time window. There is a difference in support and confidence between different accommodation association rules, as is the case with the travel association rules, which indicates that different ordinary people have different likelihoods of being a criminal suspect. For example, the probability of an ordinary individual being a criminal suspect is higher in the first association rule than in the second association rule, as the former rule has higher support and confidence than the latter; the third association rule and the last association rule in Table 4 have the same confidence, but the former has a higher support, so the ordinary people in the third association rule are more likely to be a criminal suspect than in the last association rule.
Duplicates may exist between criminal suspects discovered by shuttle ticketing data and by hotel accommodation data―in other words, criminal suspects based on the association rules of shuttle ticketing may also appear in the association rules of hotel accommodation. Therefore, it is necessary to combine the two sets of results to eliminate the repetitive criminal suspects, and by dosing so a total of 648 criminal suspects has been finally obtained.
4.2. Result Analysis of DBSCAN Clustering
The results of DBSCAN clustering are shown in Table 5. Calculation gives 67 clusters, each containing a different number of people in the range of 2000 - 30,000. Each cluster contains both criminals and ordinary people, with the ratio of criminals to ordinary people varying from cluster to cluster.
Based on the tag clusters of criminals and ordinary people, the cosine similarity between ordinary people and criminals in each cluster is calculated, and ordinary people with cosine similarity greater than 0.8 are escalated as criminal suspects. The calculation results, as listed in Table 6, indicates that a total of 973 criminal suspects are obtained, with each cluster having a different number of
Table 5. DBSCAN clustering results.
Table 6. Calculation results of cosine similarity.
Note: A total of 973 criminal suspects.
criminal suspects―which is attributed to that each cluster is different than another in terms of both the total number of people and the number of criminals, so a cluster that is higher in both of the two numbers is also likely to have a higher number of criminal suspects identified by similarity calculation. In addition, DBSCAN clustering may result in a different compactness in a different cluster, and as a result the number of predicted criminal suspects also varies.
Moreover, the distribution of the 973 criminal suspects by their personal tags as shown in Figure 2, which shows that males are nearly twice as many as females, that is, the probability of a male individual being a criminal suspect is much higher than that of a female individual. In the distribution by age, nearly 90% of the criminal suspects are young and middle-aged, while the minors have a negligible proportion. The distribution by marital status indicates that unmarried criminal suspects are twice as many as married criminal suspects. Similarly, when it come to the distribution by employment situation, there is a huge difference between the number of employed and unemployed, with more than 80% of the criminal suspects being unemployed. In terms of household income, the low-income and middle-income people account for nearly 90% of the potential key individuals, and the number of people in each of these two income groups is nearly twice as great as that in the high-income group. Similar to income-based distribution, the education level-based distribution shows that criminal suspects with primary and junior high school education account for 80% of the potential key individuals. In addition, of the 973 criminal suspects, 65% are raised in single-parent families, 70% are raised in foster families, and more than 70% have no
Figure 2. The distribution of potential key individuals by their personal tags.
offspring. With respect to the ownership of house property, nearly 80% of the criminal suspects do not have their own house property. The distribution by household registration and migrant status indicates that 67% of the potential key individuals have rural household registration and 60% are migrants. In summary, criminal suspects mainly have the tags of being a male, being middle-aged, being unmarried, being unemployed, having low and middle income, having received elementary school and junior high school education, being raised in a single-parent family, having no offspring, being a tenant, being raised in a foster family, having rural household registration, and being a migrant. These tags are in line with the actual situation of criminals. For example, in the comparison of a male individual versus a female individual, an employed individual versus an unemployed individual, and a low-income individual versus a high-income individual, the former is more prone than the latter to commit crimes. In summary, criminal suspects discovered by means of clustering analysis and tag similarity are in line with the tag-based distribution.
4.3. Result Validation
After ARM and DBSCAN clustering, the two sets of criminal suspects found by the two methods are subject to intersection operation. In this process, the number of criminal suspects appearing in both of the two sets is calculated to be 567, and the 567 criminal suspects are verified by the validation set of criminals, that is, these suspects are traversed in order to find the individuals appearing in the validation data―who are actual crimes, totaling 419. Therefore, with the elimination of the actual criminals, the number of criminal suspects is finally determined to be 148. We can say that the accuracy of the algorithm is 73.9%, but the actual accuracy needs to be tested in actual law enforcement activities. We can only think that the validation results here indicate the effectiveness of the above method for finding criminal suspects. Moreover, most studies predict the number of crime hotspots and crime cases, but do not predict criminal suspects. Therefore, it is difficult to compare with other crime prediction methods.
In this study, the FP-growth algorithm is used to perform ARM on travel data and hotel accommodation data, and the DBSCAN algorithm is used to achieve tag clustering of criminals and ordinary people, and finally the above results are verified. The results show that:
1) The FP-growth association rule mining algorithm shows that different association rules have different support and confidence, that is, different ordinary people have different possibilities of being criminal suspects.
2) By using the FP-growth association rule algorithm, 648 criminal suspects are found, while 973 are found through DBSCAN clustering of personnel tags; the number of criminal suspects in the intersection of the above two sets of criminal suspects is 567, and when verified against the validation data, the 567 criminal suspects are found to contain 419 actual criminals, thereby leaving 148 as the final criminal suspects with a prediction accuracy of 73.9%.
3) Criminal suspects mainly have the tags of being male, being middle-aged, being unmarried, being unemployed, having low and middle income, having received elementary school and junior high school education, being raised in a single-parent family, having no offspring, being a tenant, being raised in a foster family, having rural household registration, and being a migrant, and such tag-based distribution agrees with the situations of actual key individuals to some extent.
4) The validation results show that the method is effective in discovering criminal suspects, and it has great generalizability; that is, it can be used for data mining of train passenger information, passenger exit/entry records, Internet-café user information, as well as other travel spending information.