Credit card fraud, that is, theft and fraud committed using a payment card such as a credit or debit card, remains a serious problem, and many fraud detection algorithms are widely used in industry to combat it. Card fraud can occur through theft of the physical card or through compromise of the account, including skimming, data breaches, and account takeover, in which fraudulent transactions otherwise look legitimate. According to the Global Payments Report 2015, the credit card was the most used payment method globally in 2014, ahead of methods such as e-wallets and bank transfers. Along with the rise in credit card usage, the number of fraud cases has been steadily increasing. This rise has a large impact on the financial industry: global credit card fraud reached a staggering USD 21.84 billion in 2015.
Financial institutions today typically develop custom fraud detection systems targeted to their own portfolios. Data mining and machine learning techniques are widely embraced to analyze patterns of normal and unusual behavior, as well as individual transactions, in order to flag likely fraud. In practice, the most cost-effective option is to tease out possible evidence of fraud from the available data using statistical algorithms. Supervised models trained on labeled data examine all previously labeled transactions to mathematically determine what a typical fraudulent transaction looks like and assign a fraud probability score to each transaction. Among the supervised algorithms in common use, neural networks are popular, and support vector machines (SVMs), decision trees, and other models have also been applied. However, little attention has been devoted in the literature to a comparison of all the common algorithms, particularly on real data sets.
In this paper, we explore the application of various linear and nonlinear statistical modeling and machine learning models on credit card transaction data. The models built are supervised fraud models that attempt to identify which transactions are most likely fraudulent.
2. Description of Data
The data available for this research project are a collection of credit card transactions from a government agency located in Tennessee, U.S.A. The particular agency is not known.
The data consist of 96,753 credit card transactions during the year 2010, with 1059 labeled as fraud. The file contains the fields:
· Record: A unique identifier for each data record. This field also represents the time order;
· Cardnum: The account number for the transactions (we note that they are Mastercard transactions since the account numbers begin with the digits 54);
· Date: The date of the transaction. Month, day and year only (no time of day);
· Merchnum: A typically 12-digit merchant identification number;
· Merch Description: A brief text description field of the merchant, typically around 20 characters;
· Merch State: The state of the address for the merchant;
· Merch Zip: The zip code of the merchant;
· Transtype: A code denoting the type of transaction;
· Amount: The dollar amount of the transaction;
· Fraud: A label for the transaction to indicate whether or not it was a fraudulent transaction.
Table 1 shows summary information about all the fields. Only the Amount field is numeric; the other fields are all categorical or text. The summary statistics in the table were calculated with outliers removed. Three fields have some missing values: Merchnum, Merch state, and Merch zip. We notice that the number of unique values of the Merch state field is 227, which is unexpected because the U.S. has only 50 states. Some of the values in this field might be from other countries, such as Canada and/or Mexico.
Below we show some further information about the data.
Figure 1 shows the number of transactions each month. We notice a general upward trend through September, followed by a sharp drop in October. Monthly transaction counts are lower in the last quarter of the year than in the other quarters. This is likely due to the government fiscal year, which starts on October 1; agencies tend to be more cautious with spending in the first few months of a new fiscal year.
Figure 2 shows the 10 most frequently occurring merchant descriptions. The total transaction frequency of the top 15 categories is 13,256, about 13.7% of the records, and the top 200 categories account for 41% of the records. From Table 1 we see that there are 13,126 distinct values of the Merch description field, and 48.6% of the merchant descriptions occur only once.
Figure 3 depicts the 10 most frequently observed merchant states.
Table 1. Summary description of the data set.
*Statistical magnitude without outliers.
Figure 1. Monthly count of transactions shows seasonality.
Figure 2. Most common merchant descriptions.
Figure 3. Most common states.
The most frequent state is TN, accounting for about 12.6% of all transactions, which is not surprising because that is where the agency is located. The top 15 states contain 71,647 transactions, about 75% of all records. From Table 1 we see that there are 227 different state values in the data, and we note that 168 of them are numbers rather than letters.
3. Data Cleaning
When we examine the field Amount, shown in the box plot distribution Figure 4, we see that there is one large outlier with the Amount value recorded as over 3 million dollars. After thoroughly checking that particular record, which is not labeled as fraud, we discover that it is an unusual but known transaction in Mexican pesos from a particular Mexican organization, and we thus exclude it from further analysis.
Table 2 shows information about the fields with missing values. All three of these fields have a strong relationship with the Merch description field, although the same Merch description may correspond to different values in these three fields. The three fields are also strongly related to each other, so the mode of each field is used to fill in its missing values.
Figure 4. Box plot distribution of the Amount field.
Table 2. Fields with missing values.
4. Variable Creation
Creating expert variables is a critical step before any model can be built. We examine the original fields from the raw data set, as shown in Table 1, and from these we create a large universe of candidate variables for our supervised models. At this stage, we want to create as many candidate variables as possible, and later we will use a variety of feature selection methods to reduce the universe of candidate variables to those that will be our final set of possible model inputs.
This step of creating variables, also called feature engineering, requires us to know as much as possible about the dynamics of the particular problem we are trying to solve. We want to create variables that encapsulate as much information as possible about the signals of anomalous behavior that could indicate fraud. Thus we interview domain experts to understand as fully as possible the different modes of fraud for this problem, and we try to build variables that will capture the signals of these fraud modes.
This step of variable creation is arguably the most important step in machine learning. Encapsulating the problem dynamics as well as possible into these expert variables prepares the data for the model in as optimal a way as possible. One should always do as much work upfront at this stage as possible in order to minimize the work required of the machine learning algorithm. In theory these nonlinear algorithms can discover the expert-variable dynamics themselves, but in practice this is difficult in a high-dimensional space with sparse data; we note that 100,000 records are sparse in anything above a handful of dimensions. Thus we work hard to create a universe of candidate variables that encode as many indicators of potential fraud as we can think of. The logic behind these expert variables is described below.
There are four general types of expert variables that have been built:
1) Amount variables. These variables are important because the amount spent can be immediately indicative of unusual activity. For example, if a person typically spends about 400 dollars over some time window and one day he spends 400,000 dollars in one transaction, it would be a signal of odd behavior. Therefore these amount-type variables are good candidates to detect potential frauds.
2) Frequency variables. For example, if a person usually makes 2 transactions per day but one day makes 2000, that is suspicious. Frequency variables help detect large changes in the number of transactions.
3) Days-since variables. For example, if it has been 10 months since a person last used his card, a sudden transaction may mean someone else is using the card, so it might be fraud.
4) Velocity variables. These measure how many transactions a person makes over a short window, such as a day, compared to his average over a longer period. They are similar to the frequency variables.
With some variations of characteristics, we end up with 200 variables of the first type, 20 of the second, 5 of the third, and 12 of the fourth. These 237 expert variables are listed in the Appendix.
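As a concrete illustration of the four variable types above, the sketch below computes one example of each for a single card. It is a minimal, hypothetical implementation, not the paper's actual variable code; the transaction values, the 7-day window, and the function and field names are all illustrative assumptions.

```python
from datetime import date

# Hypothetical transaction history for one card: (transaction date, dollar amount).
txns = [
    (date(2010, 1, 5), 120.0),
    (date(2010, 1, 6), 80.0),
    (date(2010, 1, 6), 95.0),
    (date(2010, 1, 10), 4000.0),  # current transaction: much larger than usual
]

def expert_variables(history, current):
    """Compute one example of each expert-variable type for `current`."""
    cur_date, cur_amount = current
    # Amount-type: average amount over a trailing 7-day window.
    window = [a for d, a in history if 0 <= (cur_date - d).days <= 7]
    avg_amount_7d = sum(window) / len(window) if window else 0.0
    # Frequency-type: number of transactions in the same window.
    freq_7d = len(window)
    # Days-since: days since this card was last used.
    past_dates = [d for d, _ in history if d < cur_date]
    days_since = (cur_date - max(past_dates)).days if past_dates else None
    # Velocity-type: recent transaction rate per day.
    velocity = freq_7d / 7.0
    return {"avg_amount_7d": avg_amount_7d, "freq_7d": freq_7d,
            "days_since_last": days_since, "velocity_per_day": velocity}

vars_ = expert_variables(txns[:-1], txns[-1])
```

A $4000 transaction against a trailing average near $98 is exactly the kind of amount-type signal described above.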
5. Feature Selection
A counterintuitive fact of dimensionality is that all data points tend to become outliers as dimensionality increases. Another danger of high dimensionality is the exponentially growing amount of data needed to find true nonlinearity rather than noise. These facts, collectively called the curse of dimensionality, demand that one perform feature selection before attempting to build models in general, and particularly before building nonlinear models. Feature selection is essential throughout the building of a machine learning algorithm: it not only drastically decreases the dimensionality but also helps discover the particular variables that carry the most useful information.
In general, there are three categories of feature selection methods: filter, wrapper, and embedded methods. The former two are applied in this research. A filter is a method applied to the data set without using a particular standard modeling algorithm. The wrapper method “wraps” a particular model around the feature selection process, for example, stepwise selection. Finally, an embedded method has the feature selection built directly into the modeling process, such as various tree methods or the use of regularization.
Here we first apply a filter method of feature selection to reduce the number of variables by about half. By calculating certain performance measures for every variable individually and ranking the variables by these measures, a filter method can efficiently identify those with low performance. The particular filter measures we use for this fraud problem are a univariate Kolmogorov-Smirnov (KS) statistic and the univariate fraud detection rate (FDR) at 3%. The Kolmogorov-Smirnov test quantitatively measures how far apart two distributions over the same independent variable are, and is a common test for feature selection in binary classification problems. Here we build the distributions of the two categories of records, one for each value of the dependent variable, across each of the independent variables, one by one. The question we are asking is how well each independent variable by itself can differentiate the two classes. The equations for calculating the KS separation of two distributions are:
$$\mathrm{KS} = \max_{x}\left| \int_{-\infty}^{x} \left[ p_{\text{fraud}}(s) - p_{\text{non-fraud}}(s) \right] \mathrm{d}s \right|, \qquad \mathrm{KS} = \max_{x}\left| \sum_{s \le x} \left[ P_{\text{fraud}}(s) - P_{\text{non-fraud}}(s) \right] \right|$$

where the first representation is for a continuous distribution and the second is the discrete form we use in practice for finite data sets.
For each candidate variable, we build the two distributions of the frauds and the non-frauds by this variable, and we calculate the KS separation between these two distributions. This measure explicitly tells how well each candidate variable by itself can separate the classes, and is, in essence, a univariate model.
Another measure of goodness we can calculate is the univariate fraud detection rate (FDR) for each candidate variable. The FDR measures what percent of the total frauds are caught within a particular percent of the population, sorted by the fraud score or, in this case, the candidate variable. For a particular candidate variable, we sort all the records by the value of this variable. We then proceed from the "top" of this list and count the number of frauds observed as we penetrate deeper into the list. We record the number of frauds caught at each percent of penetration of the population and might discover, for example, that 10% of the frauds are contained in the top 3% of the records sorted by this particular candidate variable. This FDR (10% of frauds at 3% penetration) measures how well the candidate variable by itself separates the frauds from the non-frauds.
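Both filter measures above are straightforward to compute. The sketch below is a minimal pure-Python version, assuming each record is reduced to a candidate-variable value and a fraud label; the function names and the descending sort direction are illustrative assumptions (in practice one would score each variable in both directions).

```python
def ks_separation(fraud_vals, good_vals):
    """Empirical KS: maximum gap between the two cumulative distributions."""
    points = sorted(set(fraud_vals) | set(good_vals))
    best = 0.0
    for x in points:
        cdf_fraud = sum(v <= x for v in fraud_vals) / len(fraud_vals)
        cdf_good = sum(v <= x for v in good_vals) / len(good_vals)
        best = max(best, abs(cdf_fraud - cdf_good))
    return best

def fdr_at(records, pct):
    """records: (variable value, is_fraud) pairs. Returns the fraction of
    all frauds caught in the top `pct` of records sorted by the variable."""
    ordered = sorted(records, key=lambda r: r[0], reverse=True)
    cutoff = max(1, int(round(len(ordered) * pct)))
    caught = sum(is_fraud for _, is_fraud in ordered[:cutoff])
    total = sum(is_fraud for _, is_fraud in ordered)
    return caught / total

records = [(10.0, 1), (9.0, 1)] + [(float(i), 0) for i in range(8)]
```

Ranking all 237 candidate variables by a weighted combination of these two scores is exactly the filter step described above.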
A wrapper is a popular feature selection method to help determine which variables should be included in the machine learning model. It directly builds many models with different sets of variables, selecting them by examining the measure of goodness of the model itself. The most common wrapper method is stepwise selection, typically forward or backward selection. For forward selection, we first build univariate models, one for each of the n candidate variables, and examine the measure of goodness for each model. We select the variable xi whose univariate model is best and then proceed by building n-1 two-variable models, xi combined with all the remaining candidate variables, one by one. We select the best 2-variable model, now having identified xi and another xj as the strongest variables (xj strong in conjunction with xi). We continue this stepwise forward selection until we see negligible improvement when adding a new variable.
Backward selection is similar. Here we start with a single n-variable model that uses all candidate variables. We then build n new models, each using only n-1 variables, where one of the n variables has been removed from each model. The variable whose removal degrades the model the least is discarded, and we continue by building n-1 new models, each using n-2 of the remaining candidate variables. We continue this way until removing the next variable causes an undesirable degradation in model performance.
Both of these wrapper methods result in a rank-ordered list of the variables, sorted by importance in predicting the dependent variable. Note that this wrapper feature selection is done with a particularly selected modeling method wrapper, and while the rank-order importance is not strongly dependent on the choice of wrapper model method there may be some dependence on this choice.
Because so many models need to be built in stepwise feature selection, it is common to use a very fast modeling method as the wrapper, typically linear or logistic regression. We note that this common and expedient choice generally ignores potential variable interactions. In general, the choice of wrapper model can be independent of the choice of method for the final model. In this work, we used logistic regression as the wrapper model, FDR at 3% as the measure of goodness, and forward selection.
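The forward-selection loop described above can be sketched generically. This is a hypothetical skeleton, not the paper's code: `score_fn` stands in for "fit the wrapper model on this subset and return its FDR at 3%", and the toy scoring in the usage example is purely illustrative.

```python
def forward_select(candidates, score_fn, max_vars, min_gain=1e-6):
    """Greedy forward selection. score_fn(subset) returns a goodness
    measure (e.g. FDR at 3% of a logistic model); higher is better.
    Stops when adding another variable yields negligible improvement."""
    selected, best_score = [], float("-inf")
    while len(selected) < max_vars:
        gains = []
        for v in candidates:
            if v in selected:
                continue
            gains.append((score_fn(selected + [v]), v))
        if not gains:
            break
        score, var = max(gains)
        if score - best_score <= min_gain and selected:
            break  # negligible improvement: stop adding variables
        selected.append(var)
        best_score = score
    return selected

# Toy demonstration: a subset's score is the sum of known signal strengths,
# so the pure-noise variable adds nothing and selection stops before it.
signal = {"a": 3.0, "b": 2.0, "c": 1.0, "noise": 0.0}
chosen = forward_select(list(signal), lambda s: sum(signal[v] for v in s), max_vars=4)
```

The returned list is rank-ordered by importance, which is the property the two-step process in Figure 5 relies on.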
We begin our feature selection process with the universe of 237 expert variables. Combining the KS and FDR measures with equal weight, the filter method first reduces the set from 237 to 120 variables, about half. We then apply the wrapper to reduce to the final 20 variables, still rank-ordered by importance. Figure 5 illustrates this two-step feature selection process.
Table 3 below lists the final 20 variables after feature selection.
6. Model Algorithms
We explore the use of a variety of supervised modeling methods, starting with a baseline logistic regression and then a number of nonlinear statistical/machine
Figure 5. Feature selection process.
Table 3. The selected top 20 variables ranked by importance after feature selection.
learning methods. For each method, we describe the general principles of the algorithm and the parameter searches performed.
In this analysis we divide our data set into three parts: training, testing, and out-of-time validation. The third data set consists of the final 2 months of our data. We use the first 10 months to build our best models, randomly dividing those records into training and testing sets in multiple ways, and then we evaluate the model on the out-of-time data set. This gives the most likely model performance to be experienced when the model is implemented. Although we use this out-of-time evaluation multiple times during model tuning, it still provides a more realistic performance expectation than the in-time testing data.
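A minimal sketch of this three-way split, assuming the records are already in time order (as the Record field guarantees) and that the out-of-time fraction is 2 of 12 months; the function name, fractions, and seed are illustrative assumptions, not the paper's actual procedure.

```python
import random

def time_split(records, oot_fraction=1/6, test_fraction=0.3, seed=42):
    """Hold out the final oot_fraction of time-ordered records as the
    out-of-time set; shuffle the rest into training and testing sets.
    Assumes the data is long enough that the OOT slice is nonempty."""
    n_oot = int(len(records) * oot_fraction)
    in_time, oot = records[:-n_oot], records[-n_oot:]
    shuffled = in_time[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test], oot

train, test, oot = time_split(list(range(120)))
```

Varying the seed gives the multiple random train/test divisions mentioned above, while the OOT slice stays fixed.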
6.1. Logistic Regression
With 20 variables selected, logistic regression serves as a baseline model because of the simplicity of its parameter interpretation. We note that, given its simplicity and generally good performance, logistic regression is frequently the method of choice for many real-world business classification problems. The first step is to build a baseline logistic regression using all the variables from our feature selection process. Table 4 below summarizes the main parameters of this model. Each variable xi is as given in Table 3.
Note that x3 and x16 both have p-values higher than 0.05 and are therefore excluded as not statistically significant. The new logistic regression on the remaining 18 variables (removing x3 and x16) is summarized in Table 5 below.
Table 4. First logistic regression output on the training data.
Table 5. Final logistic regression output on the training data.
To optimize the logistic regression, the number of variables is an adjustable factor in the tuning process. Models with different numbers of variables are compared in Table 6 below. Each experiment with a different number of variables is obtained by removing the last several variables, since they are ranked by importance.
Table 6 shows the results of FDR (at 3%) on the out of time data for logistic regression using different numbers of variables, always selected in the rank order shown in Table 3. The nonmonotonic nature of the results is simply a statistical variation due to the sparsity of records.
6.2. Artificial Neural Network
An artificial neural network (ANN) is an algorithm whose structure loosely mimics the neural networks of the human brain, as shown in Figure 6. The overall network architecture consists of an input layer, hidden layers, and an output layer. Each layer contains nodes, or neurons, which receive and transmit numerical signals from the previous layer to the next. Each neuron receives a signal from the nodes in the previous layer, applies a transfer function, and outputs a new signal to the nodes in the next layer. The signal received is in general a linear combination of the outputs of the previous layer's nodes. This combined signal is then passed through a transfer function, typically a sigmoid or logit function. Other transfer functions are also used, but the sigmoid/logit is the most common.
In this work, we use this sigmoid/logistic nonlinear transfer function:

$$f(x) = \frac{1}{1 + e^{-x}}$$
Table 6. Logistic regression performance for different number of variables.
Figure 6. Neural net architecture1.
The variable x in this equation is the linear combination of the weighted signals received from the nodes in the previous layer.
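The node computation just described is compact enough to write out directly. This is a sketch of a single neuron, not the network used in the study; the function names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic transfer function applied to the combined input signal."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias):
    """One node: weighted linear combination of the previous layer's
    outputs, passed through the sigmoid transfer function."""
    x = sum(w * i for w, i in zip(weights, inputs)) + bias
    return sigmoid(x)
```

With zero weights the node outputs 0.5, the sigmoid's midpoint; large positive or negative combined signals saturate toward 1 or 0.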
During the parameter tuning process, the key parameters that define the structure of the ANN are the number of input variables, the hidden layer sizes (number of layers and number of nodes per layer), the regularization weight alpha, and the learning rate, all of which become searchable parameters in our tuning experiments. Table 7 below shows the parameter settings of the optimal ANN after tuning.
We note that, even though this was our best ANN model, it is likely not the best possible. With only one node in a single hidden layer and the logit transfer function, it is mathematically equivalent to a logistic regression.
6.3. Support Vector Machine
A support vector machine (SVM) is a widely applied binary classifier for supervised machine learning problems. The main characteristics of the SVM algorithm are:
Table 7. Final parameters for the Artificial Neural Net model.
*layer: 1 nodes: 1.
· Expand the dimensionality;
· Look for a linear classifier separation surface in this higher dimensional space.
It is counterintuitive to expand the dimensionality in a nonlinear machine learning algorithm since, as we argued previously, dimensionality is in a sense the enemy of nonlinear modeling. Expanding the dimensionality can provide new variables, such as products and higher-order powers of the original variables, and a linear separation in this higher-dimensional space might be a good separation surface. But adding these dimensions runs against the principles recognized by the curse of dimensionality.
The SVM algorithm achieves this balance by not creating explicit new variables in the extended dimensionality, but instead introducing "virtual" new variables through a kernel trick. The observation is that as the algorithm searches for the best linear separator, all that is needed is some measure of distance. The kernel is a generalized measure of distance and can be employed in this virtual higher dimension without explicitly calculating new variables or explicitly expanding the space.
Another important characteristic of the SVM algorithm is the use of a margin concept, where the best linear classifier is selected not just on the basis of the classification error but also considering the best spread around the data, tilting the surface so that this margin of classification/misclassification is as optimal as possible. This is shown in Figure 7.
A radial basis function (RBF) kernel is used in this study. The number of variables, C, and gamma are the key parameters of the SVM algorithm with the RBF kernel: C is a weight for each of the classes, set to one here, and gamma is a scale factor for the width of the radial basis functions. Table 8 below shows the final parameter settings of the SVM after tuning.
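The RBF kernel mentioned above is the distance-based similarity measure that stands in for an inner product in the virtual expanded space. A minimal sketch, with an illustrative function name:

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2): a similarity that
    plays the role of an inner product in the virtual expanded space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)
```

Identical points have similarity 1, and similarity decays toward 0 with squared distance; gamma sets how quickly that decay happens, which is why it is one of the tuned parameters.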
6.4. Random Forest
A decision tree algorithm separates the space of data into a number of distinct boxes that, combined, cover the entire space. It is built by making iterative cuts of the space, typically dividing an existing box into two "child" boxes, which makes it a binary tree. At each iteration, the algorithm considers a box of data and examines the best way to split it into two boxes so that some criterion is optimized. The splitting criterion for a binary classification problem is typically an impurity measure, where the split
Figure 7. Support vector machine separation surface2.
Table 8. Parameter settings for the SVM model.
attempts to separate into two child boxes of the greatest type purity, meaning, the two classes are separated as well as possible. The splitting algorithm continues, considering each child box as a new parent box, and continues to split until a stopping criterion is reached.
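The impurity-driven split just described can be sketched for one variable. This is an illustrative toy using the common Gini impurity, one of several impurity measures; it is not the splitting code of any particular library.

```python
def gini(labels):
    """Impurity of a box: 1 minus the sum of squared class proportions.
    Zero means the box is pure (a single class)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(values, labels):
    """Find the threshold on one variable that minimizes the
    size-weighted impurity of the two child boxes."""
    n = len(values)
    best_t, best_imp = None, float("inf")
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / n
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp
```

On well-separated classes the search finds the threshold that produces two pure child boxes, the "greatest type purity" goal stated above.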
Decision trees are an early and ubiquitous nonlinear statistical modeling and machine learning methodology. Used extensively decades ago, they are known to have serious flaws: the primary deficiency is a tendency to overfit, followed by instability and fragility. The structure of a tree can change substantially depending on the near-random choices of the first few splitting locations, which can easily change with small changes in the fitting data. These deficiencies have largely been overcome by more modern algorithms that take advantage of the good properties of trees while avoiding these known pitfalls. Using multiple trees rather than a single complex tree removes many of the problems. Two widely used architectures that use multiple trees are random forests and boosted trees.
A random forest, shown in Figure 8, is a large collection of decision trees, where the final output is a committee choice across many individual trees. The principle of random forest is that many trees are built, and each uses a randomly-chosen subset of features. All the results are then combined, typically by
Figure 8. Random forest architecture.
averaging or voting. A random forest, as compared to a single decision tree, can give a more accurate prediction. Further, since the tree depths are usually smaller and many trees are combined, overfitting is largely avoided. When tuning the parameters, it is important to know what each parameter means. Several important parameters can be tuned for the random forest algorithm selected:
· max_features: Maximum number of features the random forest can try in each individual tree;
· max_depth: Maximum depth of each tree;
· min_samples_leaf: The minimum number of samples required to be at a leaf node;
· n_estimators: The number of trees to be built.
After multiple trials, the parameter settings were finalized as shown in Table 9 below.
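The bootstrap-plus-committee idea behind the random forest can be sketched independently of any tree library. Below, `fit_tree` is a stand-in for growing one tree on a bootstrap sample with a random feature subset (the max_features analogue); all names and the trivial mean-label "tree" in the usage are illustrative assumptions.

```python
import random

def bootstrap_forest(fit_tree, records, n_trees, n_features, all_features, seed=0):
    """Grow each tree on a bootstrap resample of the records, restricted
    to a randomly chosen feature subset. fit_tree(sample, feats) must
    return a callable scoring function for one record."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(records) for _ in records]   # sample with replacement
        feats = rng.sample(all_features, n_features)      # random feature subset
        trees.append(fit_tree(sample, feats))
    return trees

def forest_score(trees, record):
    """Committee output: average the individual tree scores."""
    return sum(tree(record) for tree in trees) / len(trees)

def fit_tree(sample, feats):
    # Stand-in weak model: scores every record with the sample's fraud rate.
    mean = sum(label for _, label in sample) / len(sample)
    return lambda record: mean

records = [("t1", 0), ("t2", 1), ("t3", 1), ("t4", 0)]
trees = bootstrap_forest(fit_tree, records, n_trees=25,
                         n_features=1, all_features=["Amount"], seed=7)
```

Because each tree sees a perturbed sample and feature subset, averaging the committee smooths out the single-tree instability described above.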
6.5. Boosted Tree
Boosted trees, or gradient boosted trees, are a type of supervised model that again uses a collection of decision trees, but constructed very differently from the random forest. As shown in Figure 9, the boosting process is an iterative approximation, where we incrementally add more trees, in a fashion similar to a Taylor series, each adding slightly more predictive power to the whole.
This series of “weak learners” has several important characteristics. First, each tree is a very simple tree, limited in depth and variables. The first tree in the series makes a very crude fit to the target output. The error in this fit is calculated
Table 9. Parameter settings for the random forest model.
Figure 9. Boosted tree architecture.
for each record, and this error becomes a weighting factor for the next training iteration. Thus, as this series of weak learners continues, each additional tree tends to focus its learning on the records with the largest error so far. The use of the latest errors as weights for the next training iteration is what gives the algorithm the name boosting; indeed, any weak learner can be used in a boosting algorithm.
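The additive series of weak learners can be sketched end-to-end with depth-one stumps. Note this sketch is the residual-fitting (gradient boosting) variant, where each stump is fit directly to the current errors, rather than the record-reweighting (AdaBoost-style) variant also covered by the description above; all names are illustrative.

```python
def fit_stump(xs, residuals):
    """Weak learner: one split on a single variable, predicting the
    mean residual on each side of the threshold."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Additive series of weak learners, each fit to the current errors."""
    preds = [0.0] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors so far
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(learning_rate * s(x) for s in stumps)

model = boost([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 1.0])
```

Each round the remaining error shrinks by a factor tied to the learning rate, which is why the learning_rate (shrinkage) parameter discussed below matters so much.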
Boosted tree models have a set of parameters that can be tuned to improve the quality of the series of weak learners. The following important parameters were explored:
Max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
Learning_rate: Step size shrinkage applied to each new tree; it can help prevent overfitting. Empirically, small learning rates have been found to yield dramatic improvements in generalization over gradient boosting without shrinkage.
Scale_pos_weight: Controls the balance of positive and negative class weights; useful for unbalanced classes.
Min_child_weight: Minimum sum of instance weights (hessian) needed in a child. If a tree partition step results in a leaf node with a sum of instance weights less than min_child_weight, the building process gives up further partitioning.
Table 10 below illustrates the final settings of these parameters for the boosted tree after tuning.
7. Model Results
Table 11 below summarizes the FDR at 3% for each model.
The ultimate purpose of this research is to find a relatively robust model which
Table 10. Parameter settings for the boosted tree model.
Table 11. Model results on the card transaction data on the training, testing and out of time validation data sets. These numbers are averages over 10 runs for each data set.
can handle the real-time data flow required for a credit card transaction fraud model. The out-of-time (OOT) data set is set aside, separated from the training/testing data, to simulate how the model will perform when implemented. Thus, the key measure in deciding which nonlinear model to deploy is naturally the FDR performance on the OOT data. The boosted tree (FDR = 49.83%) slightly outperformed the next best model (SVM, FDR = 47.54%) and becomes our choice for the nonlinear model.
Having settled on the boosted tree as our final model, we then examined the FDR at different population cutoff locations, separately on the training, testing, and OOT data sets. Tables 12-14 below illustrate these results. Here the FPR is the false positive ratio, the number of incorrectly flagged non-frauds divided by the number of correctly caught frauds, another common measure of model goodness.
These three tables show the model performance statistics on the three data sets (Table 12 for training, Table 13 for testing, Table 14 for OOT), and the OOT results (Table 14) give our best estimate of how the model will perform when implemented. We see that the model pushes the majority of the fraud records into the top bins, which is what is desired. The bin statistics tell us what is happening in each population percentile bin, and the cumulative statistics tell us what would happen if we selected any particular percentage as a score cutoff. In the cumulative statistics, the FDR of the OOT data set at 3% means that if we set the score cutoff at 3%, we expect to catch about 50.84% of all fraudulent transaction attempts. A score cutoff of 3% means that the business will reject the top 3% of transaction volume as sorted by the fraud score. Note that the FDR at 3% for each data set here differs from Table 11 because Table 11 reports the average FDR over 10 runs for each data set, while the FDR here comes from a single run of the boosted tree.
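The per-bin and cumulative statistics in these tables can be produced with a simple pass over the scored records. The sketch below is a hypothetical helper, not the paper's reporting code; it computes cumulative FDR and FPR at each 1% population cutoff, with FPR defined as above (flagged non-frauds divided by caught frauds).

```python
def fdr_fpr_table(scores_labels, bin_pct=0.01, n_bins=20):
    """scores_labels: (fraud score, is_fraud) pairs. Returns a row per
    percentile cutoff: (bin number, cumulative FDR, cumulative FPR)."""
    ordered = sorted(scores_labels, key=lambda r: r[0], reverse=True)
    total_fraud = sum(label for _, label in ordered)
    rows = []
    for i in range(1, n_bins + 1):
        cutoff = int(round(len(ordered) * bin_pct * i))
        caught = sum(label for _, label in ordered[:cutoff])
        flagged_good = cutoff - caught
        fdr = caught / total_fraud
        fpr = flagged_good / caught if caught else float("inf")
        rows.append((i, fdr, fpr))
    return rows

# Toy scored population: 100 records, 2 frauds near the top of the score order.
data = [(100.0, 1), (99.0, 0), (98.0, 1)] + [(float(i), 0) for i in range(97)]
rows = fdr_fpr_table(data)
```

Row 3 of the output corresponds to the 3% cutoff used throughout the paper.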
Table 12. Training results.
Table 13. Testing results.
Table 14. OOT results.
8. Conclusions and Further Work
In this paper, we explore the application of linear and nonlinear statistical and machine learning models on credit card transaction data. The models we build are supervised fraud models that attempt to identify which transactions are most likely fraudulent.
As we would expect, the nonlinear models slightly outperform the linear model, with the exception of the artificial neural network. We believe this underperformance has two causes. First, we likely did not sufficiently tune the neural net model, and improvement could be found with a different set of model parameters. Second, the data set is substantially limited, with only 179 labeled fraud events in the OOT data set. The results of any model, particularly a nonlinear one, can be volatile and sensitive to statistical aberrations and variations in model parameters. The boosted tree model performs best and can detect about half of the fraud attempts within only the top 3% of the data sorted by the fraud algorithm score.
The resulting model can be utilized in a credit card fraud detection system. We note that a similar model development process can be performed in related business domains, such as insurance and telecommunications, to avoid or detect fraudulent activity.
We also found that we can use statistical significance in the logistic regression models as a criterion for selecting variables: a high significance score, or a low p-value, indicates a strong correlation between fraud and the related variable. The logistic regression model works quite well once sufficiently good expert variables are constructed, something also seen quite often in practice: well-designed linear models are difficult to beat even with sophisticated nonlinear algorithms. The most important step is the construction of good expert variables that encode the signals of the problem dynamics as much as possible. It is for this reason that linear and logistic regression models are so prevalent in business.
More data with more fields (for example, adding point of sale information, time of day, or other cardholder or merchant information) would certainly allow model performance improvements. Further model parameter tuning would also provide improvements to any of the models.
The authors would like to thank the Torhea Online Education Research Program for sponsoring and hosting this research project.
Appendix: List of 237 Candidate Expert Variables
1The figure is quoted from http://image.baidu.com.
2The figure is quoted from http://image.baidu.com.