Assignment 07: Supervised learning for fraud detection
In this week’s assignment, you’ll analyze a subset of a large synthetic dataset designed for fraud analytics (taken from E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. “PaySim: A financial mobile money simulator for fraud detection.” In: The 28th European Modeling and Simulation Symposium (EMSS), Larnaca, Cyprus, 2016).
The dataset I want you to analyze is here: assignment-07.zip in CSV format.
You’ll need to decompress it, and then load it into an analysis tool that lets you build classifiers using supervised learning. I’ll give step-by-step directions for the ML stuff in Weka below, but if you’d like to use a Python framework for machine learning or something else, that’s fine too. Just be clear in your writeup about the steps you took to get to your results.
First, load the data in the analysis tool of your choice. Notice that there are two fraud-related columns (attributes): isFlaggedFraud, an automatically-generated flag that tells an analyst without access to labeled data that a transaction might be fraudulent, and isFraud, a (manually-created) label identifying the actually-fraudulent transactions.
1. How many instances of actual (labeled) fraud are in the data set?
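If you’ve opted for a Python workflow instead of Weka, a minimal loading-and-counting sketch might look like the following (the filename assignment-07.csv is an assumption; use whatever the decompressed zip actually contains):

import pandas as pd

# Load the decompressed CSV (filename is a guess; adjust as needed).
df = pd.read_csv("assignment-07.csv")

# How many instances carry each isFraud label?
print(df["isFraud"].value_counts())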
If you inspect the data further, either manually in something like Excel or using a visualization tool, you’ll find something interesting. There’s a variable you can use to help winnow down this data set. I’ll spare you some work and give you a hint: there are only a few values of the type attribute that correlate with known fraud.
2. What values of type correspond with fraud?
Next, you can remove all rows (instances) of data that contain type values that never correlate with known fraud. This action essentially “hardcodes” into your model your belief that transactions with these type values will never be fraudulent. This belief might be wrong, so it’s important to be cognizant of what you’re doing! But removing training data that carries little information will help build a more accurate model – tradeoffs, tradeoffs. (Note: This also means that if you were asked to classify a new, unlabeled instance of data, you’d first check whether its type field was one of these values and, if so, say “not fraud” – in other words, you need to hardcode your result, too.)
You could do this in Excel or the like; if you are familiar with SQL, you could do it after importing the data into a table via a particular SELECT command. If you want to do it in Weka, open the data in the Explorer, then choose the “RemoveWithValues” filter. In the “attributeIndex” field enter “1” (as that is the number of the type field). In the “nominalIndices” field, enter the indices of the type values you want to remove, separated by commas, as listed in Weka when you click on that attribute. For example, if you wanted to remove PAYMENT, TRANSFER, and DEBIT, then you’d enter “1,2,4”. Click OK to close the window, then Apply on the right to remove those instances.
If you do this correctly, you’ll be left with (I think) 1,055 instances.
There’s still a very small proportion of fraud in the data set, but something’s better than nothing. Still: This is a hard problem, and we can’t expect magic from ML here.
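If you’re doing this step in Python, the same row-filtering might look like the sketch below; the fraud_types list is a placeholder for whatever type values you identified in question 2.

# Placeholder: the type values you found to correlate with fraud (question 2).
fraud_types = [...]

# Keep only the rows whose type can, under our assumption, involve fraud.
df = df[df["type"].isin(fraud_types)].copy()
print(len(df))  # should match the instance count mentioned above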
The fields nameOrig and nameDest introduce a different kind of problem into the dataset in terms of modeling. While they would be useful in finding out which accounts engage in fraud, they’re not actually very useful in modeling the behavior of fraud. If we knew which accounts engaged in fraud, we’d just ban them! More to this assignment’s point, they will lead many classification algorithms down various blind alleys.
So, let’s remove these attributes from the dataset. Again, you can do this in Excel by deleting columns, but in Weka, you can select the checkbox next to them in the “Preprocess” tab, then choose “Remove”.
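The pandas equivalent is a one-liner; a sketch, continuing from the earlier snippets:

# Drop the account-identifier columns so they cannot mislead the learner.
df = df.drop(columns=["nameOrig", "nameDest"])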
Let’s build a classifier on the data we now have. If you’re following along in Weka, let’s build a simple rule-based classifier. Click on the “Classify” tab, then select the “JRip” classifier under “rules”. Run the classifier. Shortly, you’ll see a summary of results; you may need to scroll back up to see the beginning. You’ll see a set of rules and the results of applying those rules, for example:
(oldbalanceDest <= 0) and (newbalanceDest <= 0) => isFraud=TRUE (19.0/1.0)
This rule says that if both oldbalanceDest and newbalanceDest are less than or equal to zero, classify the transaction as fraudulent. The numbers in parentheses indicate how many instances the rule covers and how many of those it misclassifies. This rule seems to capture something actually informative about the pattern of fraud in the data; the other rules may or may not.
Rules are applied in order. The last rule:
=> isFraud=FALSE (1021.0/5.0)
essentially says “otherwise, no fraud.”
The classifier has high overall accuracy, but that’s not surprising.
3. If we just classified everything as non-fraud, what would our accuracy be? (Side note: this is the “ZeroR” classifier’s action.)
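(For scikit-learn users: the ZeroR analog is DummyClassifier with the “most_frequent” strategy. A sketch, continuing from the earlier snippets; the one-hot encoding of type is my choice of preprocessing, not something the assignment requires:)

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# One-hot encode the remaining nominal attribute(s); isFraud is the target.
X = pd.get_dummies(df.drop(columns=["isFraud"]))
y = df["isFraud"]

# ZeroR equivalent: always predict the most frequent class.
zero_r = DummyClassifier(strategy="most_frequent")
print(cross_val_score(zero_r, X, y, cv=10).mean())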
The final thing to look at is the confusion matrix at the bottom, which will look something like this:
=== Confusion Matrix ===
    a    b   <-- classified as
 1007   10 |    a = FALSE
   15   23 |    b = TRUE
The rows represent the true values; the columns, the predictions. In this case, there were 1007 non-frauds classified as non-frauds by our rules and 10 non-frauds classified as frauds (false positives); 15 of the frauds were missed (false negatives) and 23 were correctly labeled as frauds. (These are actually estimates based upon 10-fold cross-validation on the dataset; if you trained and then tested on the same data, you’d get an artificially high estimate of your classifier’s accuracy.)
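scikit-learn has no JRip-style rule learner, but you can get an analogous 10-fold cross-validated confusion matrix for any classifier it does offer. A sketch, using a decision tree as a stand-in and reusing X and y from the ZeroR sketch above:

from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Cross-validated predictions, mirroring Weka's default 10-fold evaluation.
pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(confusion_matrix(y, pred))  # rows = true classes, columns = predictions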
Click on the “JRip” text to bring up a dialog box, and adjust the MinNo field to be 10. If you click on “More”, it explains what each parameter does; setting this parameter requires that each rule apply to at least this many instances in order to be generated.
4. If you’re using Weka, report the rule(s) generated by the above settings, as well as the confusion matrix overall. If you’re using another tool, generate a rule-based classifier for the data set. Explain which, if any, of the discovered rules you think are telling you something fundamental, and which might be learning “noise” in the dataset.
5. (If you’re using Weka, use its results for this (and future) questions. Otherwise, use your tool’s equivalents.) Now, let’s try another classifier. Switch to “J48”, which is a version of the C4.5 classifier. It’s under “trees”. Run this classifier on your dataset. Report the confusion matrix.
Right-click on the trees.J48 entry in the “Result list”, and choose “Visualize tree”. Decision trees work similarly to decision rules, starting at the top of the tree (like a flowchart) and working their way down.
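Outside Weka, scikit-learn's DecisionTreeClassifier is a rough analog of J48 (it is not C4.5, but it builds a comparable flowchart-style tree), and export_text gives a textual view of it. A sketch, reusing X and y from above:

from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a tree on the (filtered) data just to inspect its structure.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))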
6. Include a snapshot of the tree you generate in either textual format or based upon the visualization. Comment on how the decision tree does (or does not) capture a rule similar to the first rule we found (just before question 3). (Does this make you re-evaluate whether that rule was actually something fundamental, or perhaps something created by the JRip rule maker? It should!)
7. Try adjusting the “confidenceFactor” parameter of the J48 classifier – read the documentation on what it does. Try also tuning the “minNumObj” parameter. Build a new tree. Submit its confusion matrix and the tree itself.
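(scikit-learn users: there are no exact equivalents of confidenceFactor and minNumObj, but min_samples_leaf and ccp_alpha play roughly similar coverage and pruning roles. A sketch, continuing the tree example above with illustrative values:)

# Rough analogs only: min_samples_leaf bounds how few instances a leaf may
# cover; ccp_alpha controls cost-complexity pruning.
tree = DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=0.01,
                              random_state=0).fit(X, y)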
8. Build one more classifier on your own. I’d suggest either a logistic regression, Naive Bayes, or a PART decision list, but feel free to explore some of the many available choices; maybe even Google them to see what the fancy machine you’re playing with is doing. You could also explore the bonus option below. Explain which classifier you built. Provide its confusion matrix using cross-validation on the dataset and (if possible) a visualization or explanation of what it’s doing.
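For the scikit-learn route, logistic regression and Naive Bayes are both readily available; a sketch of the logistic-regression option, again reusing X and y:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# max_iter is raised so the solver converges on this unscaled data.
logreg = LogisticRegression(max_iter=5000)
pred = cross_val_predict(logreg, X, y, cv=10)
print(confusion_matrix(y, pred))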
9. Suppose you had the following data. Choose a classifier you built in one of the questions above, and tell me whether and why each instance would be classified as fraudulent or not by that classifier.
type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFlaggedFraud
PAYMENT,3765.86,C1806062974,82083.0,78317.14,M521611410,0.0,0.0,0
TRANSFER,120074.73,C1409933277,120074.73,0.0,C162114152,0.0,0.0,0
CASH_OUT,120074.73,C1174000532,120074.73,0.0,C410033330,0.0,120074.73,0
TRANSFER,26768.5,C457596841,26768.5,0.0,C1956477953,0.0,0.0,0
CASH_OUT,26768.5,C682812632,26768.5,0.0,C256417920,101976.0,128744.5,0
(Note that they DO NOT include the isFraud column!)
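If you’re answering this with a scikit-learn model, remember to apply the same hardcoding and preprocessing you used for training before calling predict. A sketch, assuming the five rows above are saved in a file named new-instances.csv (a made-up name):

import pandas as pd

# Load the unlabeled instances (filename is made up for this sketch).
new = pd.read_csv("new-instances.csv")

# Hardcoded step: any type that never correlated with fraud is simply "not fraud".
candidates = new[new["type"].isin(fraud_types)].copy()

# Same preprocessing as training: drop account IDs, one-hot encode, align columns.
X_new = pd.get_dummies(candidates.drop(columns=["nameOrig", "nameDest"]))
X_new = X_new.reindex(columns=X.columns, fill_value=0)

print(tree.predict(X_new))  # or whichever fitted classifier you chose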
Bonus: When a particular class value is rare (as fraud is in this case), you can build a “cost-sensitive” classifier. In essence, this lets you put your thumb on the scale and force the classifier to make more of one kind of error (say, more positives, both true and false) in exchange for making fewer of another (say, fewer negatives, both true and false). This might be important if you didn’t get many true positives (for fraud) and you wanted more, even if some were wrong.
To do this in Weka, choose the “meta” “CostSensitiveClassifier”. Open its configuration panel and choose a classifier to use (e.g., “J48”, which you can then further configure if you want to). Finally, click on “costMatrix”. In that dialog, set the number of classes to 2 and hit return (true/false for fraud). Then in the matrix on the left, enter the relative cost of errors – the locations in this matrix correspond to those in the confusion matrix. For example:
0.0 0.1
1.0 0.0
means that false positives are weighted 1/10th (0.1) as much as false negatives (1.0). Then run the classifier, and you’ll get a confusion matrix with many more positives (both true and false) than without the cost-sensitive settings.
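scikit-learn has no CostSensitiveClassifier wrapper, but the class_weight parameter on many of its classifiers achieves a similar thumb-on-the-scale effect. A sketch with the same one-to-ten asymmetry (the specific weights, and the assumption that isFraud is coded 0/1, are mine):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Make missing a fraud (a false negative) ten times as costly as a false alarm.
# Adjust the dictionary keys if your isFraud labels are not 0/1.
weighted = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=0)
pred = cross_val_predict(weighted, X, y, cv=10)
print(confusion_matrix(y, pred))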