Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection
About this ebook
Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques is an authoritative guidebook for setting up a comprehensive fraud detection analytics solution. Early detection is a key factor in mitigating fraud damage, but it involves more specialized techniques than detecting fraud at the more advanced stages. This invaluable guide details both the theory and technical aspects of these techniques, and provides expert insight into streamlining implementation. Coverage includes data gathering, preprocessing, model building, and post-implementation, with comprehensive guidance on various learning techniques and the data types utilized by each. These techniques are effective for fraud detection across industry boundaries, including applications in insurance fraud, credit card fraud, anti-money laundering, healthcare fraud, telecommunications fraud, click fraud, tax evasion, and more, giving you a highly practical framework for fraud prevention.
It is estimated that a typical organization loses about 5% of its revenue to fraud every year. More effective fraud detection is possible, and this book describes the various analytical techniques your organization must implement to put a stop to the revenue leak.
- Examine fraud patterns in historical data
- Utilize labeled, unlabeled, and networked data
- Detect fraud before the damage cascades
- Reduce losses, increase recovery, and tighten security
The longer fraud is allowed to go on, the more harm it causes. It expands exponentially, sending ripples of damage throughout the organization, and becomes more and more complex to track, stop, and reverse. Fraud prevention relies on early and effective fraud detection, enabled by the techniques discussed here. Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques helps you stop fraud in its tracks, and eliminate the opportunities for future occurrence.
Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques - Bart Baesens
Table of Contents
Title Page
Copyright
Dedication
List of Figures
Foreword
Preface
Acknowledgments
Chapter 1: Fraud: Detection, Prevention, and Analytics!
Introduction
Fraud!
Fraud Detection and Prevention
Big Data for Fraud Detection
Data-Driven Fraud Detection
Fraud-Detection Techniques
Fraud Cycle
The Fraud Analytics Process Model
Fraud Data Scientists
A Scientific Perspective on Fraud
References
Chapter 2: Data Collection, Sampling, and Preprocessing
Introduction
Types of Data Sources
Merging Data Sources
Sampling
Types of Data Elements
Visual Data Exploration and Exploratory Statistical Analysis
Benford's Law
Descriptive Statistics
Missing Values
Outlier Detection and Treatment
Red Flags
Standardizing Data
Categorization
Weights of Evidence Coding
Variable Selection
Principal Components Analysis
RIDITs
PRIDIT Analysis
Segmentation
References
Chapter 3: Descriptive Analytics for Fraud Detection
Introduction
Graphical Outlier Detection Procedures
Statistical Outlier Detection Procedures
Clustering
One-Class SVMs
References
Chapter 4: Predictive Analytics for Fraud Detection
Introduction
Target Definition
Linear Regression
Logistic Regression
Variable Selection for Linear and Logistic Regression
Decision Trees
Neural Networks
Support Vector Machines
Ensemble Methods
Multiclass Classification Techniques
Evaluating Predictive Models
Other Performance Measures for Predictive Analytical Models
Developing Predictive Models for Skewed Data Sets
Fraud Performance Benchmarks
References
Chapter 5: Social Network Analysis for Fraud Detection
Networks: Form, Components, Characteristics, and Their Applications
Is Fraud a Social Phenomenon? An Introduction to Homophily
Impact of the Neighborhood: Metrics
Community Mining: Finding Groups of Fraudsters
Extending the Graph: Toward a Bipartite Representation
References
Chapter 6: Fraud Analytics: Post-Processing
Introduction
The Analytical Fraud Model Life Cycle
Model Representation
Selecting the Sample to Investigate
Fraud Alert and Case Management
Visual Analytics
Backtesting Analytical Fraud Models
Model Design and Documentation
References
Chapter 7: Fraud Analytics: A Broader Perspective
Introduction
Data Quality
Privacy
Capital Calculation for Fraud Loss
An Economic Perspective on Fraud Analytics
In Versus Outsourcing
Modeling Extensions
The Internet of Things
Corporate Fraud Governance
References
About the Authors
Index
End User License Agreement
List of Illustrations
Chapter 1: Fraud: Detection, Prevention, and Analytics!
Figure 1.1 Fraud Triangle
Figure 1.2 Fire Incident Claim-Handling Process
Figure 1.3 The Fraud Cycle
Figure 1.4 Outlier Detection at the Data Item Level
Figure 1.5 Outlier Detection at the Data Set Level
Figure 1.6 The Fraud Analytics Process Model
Figure 1.7 Profile of a Fraud Data Scientist
Figure 1.8 Screenshot of Web of Science Statistics for Scientific Publications on Fraud between 1996 and 2014
Chapter 2: Data Collection, Sampling, and Preprocessing
Figure 2.1 Aggregating Normalized Data Tables into a Non-Normalized Data Table
Figure 2.2 Pie Charts for Exploratory Data Analysis
Figure 2.3 Benford's Law Describing the Frequency Distribution of the First Digit
Figure 2.4 Multivariate Outliers
Figure 2.5 Histogram for Outlier Detection
Figure 2.6 Box Plots for Outlier Detection
Figure 2.7 Using the z-Scores for Truncation
Figure 2.8 Default Risk Versus Age
Figure 2.9 Illustration of Principal Component Analysis in a Two-Dimensional Data Set
Chapter 3: Descriptive Analytics for Fraud Detection
Figure 3.1 3D Scatter Plot for Detecting Outliers
Figure 3.2 OLAP Cube for Fraud Detection
Figure 3.3 Example Pivot Table for Credit Card Fraud Detection
Figure 3.4 Break-Point Analysis
Figure 3.5 Peer-Group Analysis
Figure 3.6 Cluster Analysis for Fraud Detection
Figure 3.7 Hierarchical Versus Nonhierarchical Clustering Techniques
Figure 3.8 Euclidean Versus Manhattan Distance
Figure 3.9 Divisive Versus Agglomerative Hierarchical Clustering
Figure 3.10 Calculating Distances between Clusters
Figure 3.11 Example for Clustering Birds. The Numbers Indicate the Clustering Steps
Figure 3.12 Dendrogram for Birds Example. The Thick Black Line Indicates the Optimal Clustering
Figure 3.13 Scree Plot for Clustering
Figure 3.14 Scatter Plot of Hierarchical Clustering Data
Figure 3.15 Output of Hierarchical Clustering Procedures
Figure 3.16 k-Means Clustering: Start from Original Data
Figure 3.17 k-Means Clustering Iteration 1: Randomly Select Initial Cluster Centroids
Figure 3.18 k-Means Clustering Iteration 1: Assign Remaining Observations
Figure 3.19 k-Means Iteration Step 2: Recalculate Cluster Centroids
Figure 3.20 k-Means Clustering Iteration 2: Reassign Observations
Figure 3.21 k-Means Clustering Iteration 3: Recalculate Cluster Centroids
Figure 3.22 k-Means Clustering Iteration 3: Reassign Observations
Figure 3.23 Rectangular Versus Hexagonal SOM Grid
Figure 3.24 Clustering Countries Using SOMs
Figure 3.25 Component Plane for Literacy
Figure 3.26 Component Plane for Political Rights
Figure 3.27 Must-Link and Cannot-Link Constraints in Semi-Supervised Clustering
Figure 3.28 δ-Constraints in Semi-Supervised Clustering
Figure 3.29 ε-Constraints in Semi-Supervised Clustering
Figure 3.30 Cluster Profiling Using Histograms
Figure 3.31 Using Decision Trees for Clustering Interpretation
Figure 3.32 One-Class Support Vector Machines
Chapter 4: Predictive Analytics for Fraud Detection
Figure 4.1 A Spider Construction in Tax Evasion Fraud
Figure 4.2 Regular Versus Fraudulent Bankruptcy
Figure 4.3 OLS Regression
Figure 4.4 Bounding Function for Logistic Regression
Figure 4.5 Linear Decision Boundary of Logistic Regression
Figure 4.6 Other Transformations
Figure 4.7 Fraud Detection Scorecard
Figure 4.8 Calculating the p-Value with a Student's t-Distribution
Figure 4.9 Variable Subsets for Four Variables V1, V2, V3, and V4
Figure 4.10 Example Decision Tree
Figure 4.11 Example Data Sets for Calculating Impurity
Figure 4.12 Entropy Versus Gini
Figure 4.13 Calculating the Entropy for Age Split
Figure 4.14 Using a Validation Set to Stop Growing a Decision Tree
Figure 4.15 Decision Boundary of a Decision Tree
Figure 4.16 Example Regression Tree for Predicting the Fraud Percentage
Figure 4.17 Neural Network Representation of Logistic Regression
Figure 4.18 A Multilayer Perceptron (MLP) Neural Network
Figure 4.19 Local Versus Global Minima
Figure 4.20 Using a Validation Set for Stopping Neural Network Training
Figure 4.21 Example Hinton Diagram
Figure 4.22 Backward Variable Selection
Figure 4.23 Decompositional Approach for Neural Network Rule Extraction
Figure 4.24 Pedagogical Approach for Rule Extraction
Figure 4.25 Two-Stage Models
Figure 4.26 Multiple Separating Hyperplanes
Figure 4.27 SVM Classifier for the Perfectly Linearly Separable Case
Figure 4.28 SVM Classifier in Case of Overlapping Distributions
Figure 4.29 The Feature Space Mapping
Figure 4.30 SVMs for Regression
Figure 4.31 Representing an SVM Classifier as a Neural Network
Figure 4.32 One-Versus-One Coding for Multiclass Problems
Figure 4.33 One-Versus-All Coding for Multiclass Problems
Figure 4.34 Training Versus Test Sample Set Up for Performance Estimation
Figure 4.35 Cross-Validation for Performance Measurement
Figure 4.36 Bootstrapping
Figure 4.37 Calculating Predictions Using a Cut-Off
Figure 4.38 The Receiver Operating Characteristic Curve
Figure 4.39 Lift Curve
Figure 4.40 Cumulative Accuracy Profile
Figure 4.41 Calculating the Accuracy Ratio
Figure 4.42 The Kolmogorov-Smirnov Statistic
Figure 4.43 A Cumulative Notch Difference Graph
Figure 4.44 Scatter Plot: Predicted Fraud Versus Actual Fraud
Figure 4.45 CAP Curve for Continuous Targets
Figure 4.46 Regression Error Characteristic (REC) Curve
Figure 4.47 Varying the Time Window to Deal with Skewed Data Sets
Figure 4.48 Oversampling the Fraudsters
Figure 4.49 Undersampling the Nonfraudsters
Figure 4.50 Synthetic Minority Oversampling Technique (SMOTE)
Chapter 5: Social Network Analysis for Fraud Detection
Figure 5.1a Königsberg Bridges
Figure 5.1b Schematic Representation of the Königsberg Bridges
Figure 5.2 Identity Theft. The Frequent Contact List of a Person is Suddenly Extended with Other Contacts (Light Gray Nodes). This Might Indicate that a Fraudster (Dark Gray Node) Took Over that Customer's Account and Shares His/Her Contacts
Figure 5.3 Network Representation
Figure 5.4 Example of a (Un)Directed Graph
Figure 5.5 Follower–Followee Relationships in a Twitter Network
Figure 5.6 Edge Representation
Figure 5.7 Example of a Fraudulent Network
Figure 5.8 An Egonet. The Ego is Surrounded by Six Alters, of Whom Two are Legitimate (White Nodes) and Four are Fraudulent (Gray Nodes)
Figure 5.9 Toy Example of Credit Card Fraud
Figure 5.10 Mathematical Representation of (a) a Sample Network; (b) the Adjacency or Connectivity Matrix; (c) the Weight Matrix; (d) the Adjacency List; and (e) the Weight List
Figure 5.11 A Real-Life Example of a Homophilic Network
Figure 5.12 A Homophilic Network
Figure 5.13 Sample Network
Figure 5.14a Degree Distribution
Figure 5.14b Illustration of the Degree Distribution for a Real-Life Network of Social Security Fraud. The Degree Distribution Follows a Power Law (log-log axes)
Figure 5.15 A 4-regular Graph
Figure 5.16 Example Social Network for a Relational Neighbor Classifier
Figure 5.17 Example Social Network for a Probabilistic Relational Neighbor Classifier
Figure 5.18 Example of Social Network Features for a Relational Logistic Regression Classifier
Figure 5.19 Example of Featurization with Features Describing Intrinsic Behavior and Behavior of the Neighborhood
Figure 5.20 Illustration of Dijkstra's Algorithm
Figure 5.21 Illustration of the Number of Connecting Paths Between Two Nodes
Figure 5.22 Illustration of Betweenness Between Communities of Nodes
Figure 5.23 PageRank Algorithm
Figure 5.24 Illustration of Iterative Process of the PageRank Algorithm
Figure 5.25 Sample Network
Figure 5.26 Community Detection for Credit Card Fraud
Figure 5.27 Iterative Bisection
Figure 5.28 Dendrogram of the Clustering of Figure 5.27 by the Girvan-Newman Algorithm. The Modularity Q is Maximized When Splitting the Network into Two Communities ABC–DEFG
Figure 5.29 Complete (a) and Partial (b) Communities
Figure 5.30 Overlapping Communities
Figure 5.31 Unipartite Graph
Figure 5.32 Bipartite Graph
Figure 5.33 Connectivity Matrix of a Bipartite Graph
Figure 5.34 A Multipartite Graph
Figure 5.35 Sample Network of Gotcha!
Figure 5.36 Exposure Score of the Resources Derived by a Propagation Algorithm. The Results are Based on a Real-life Data Set in Social Security Fraud
Figure 5.37 Egonet in Social Security Fraud. A Company Is Associated with its Resources
Figure 5.38 ROC Curve of the Gotcha! Model, which Combines both Intrinsic and Relational Features
Chapter 6: Fraud Analytics: Post-Processing
Figure 6.1 The Analytical Model Life Cycle
Figure 6.2 Traffic Light Indicator Approach
Figure 6.3 SAS Social Network Analysis Dashboard
Figure 6.4 SAS Social Network Analysis Claim Detail Investigation
Figure 6.5 SAS Social Network Analysis Link Detection
Figure 6.6 Distribution of Claim Amounts and Average Claim Value
Figure 6.7 Geographical Distribution of Claims
Figure 6.8 Zooming into the Geographical Distribution of Claims
Figure 6.9 Measuring the Efficiency of the Fraud-Detection Process
Figure 6.10 Evaluating the Efficiency of Fraud Investigators
Chapter 7: Fraud Analytics: A Broader Perspective
Figure 7.1 RACI Matrix
Figure 7.2 Anonymizing a Database
Figure 7.3 Different SQL Views Defined for a Database
Figure 7.4 Aggregate Loss Distribution with Indication of Expected Loss, Value at Risk (VaR) at 99.9 Percent Confidence Level and Unexpected Loss
Figure 7.5 Snapshot of a Credit Card Fraud Time Series Data Set and Associated Histogram of the Fraud Amounts
Figure 7.6 Aggregate Loss Distribution Resulting from a Monte Carlo Simulation with Poisson Distributed Monthly Fraud Frequency and Associated Pareto Distributed Fraud Loss
List of Tables
Chapter 1: Fraud: Detection, Prevention, and Analytics!
Table 1.1 Nonexhaustive List of Fraud Categories and Types
Table 1.2 Call Detail Records of a Customer with Outliers Indicating Suspicious Activity (deviating behavior starting at a certain moment in time) at the Customer Subscription (Fawcett and Provost 1997)
Table 1.3 Example Credit Card Transaction Data Fields
Table 1.4 Key Characteristics of Successful Fraud Analytics Models
Chapter 2: Data Collection, Sampling, and Preprocessing
Table 2.1 Dealing with Missing Values
Table 2.2 z-Scores for Outlier Detection
Table 2.3 Coarse Classifying the Product Type Variable
Table 2.4 Pivot Table for Coarse Classifying the Product Type Variable
Table 2.5 Coarse Classifying the Product Type Variable
Table 2.6 Empirical Frequencies Option 1 for Coarse Classifying Product Type
Table 2.7 Independence Frequencies Option 1 for Coarse Classifying Product Type
Table 2.8 Calculating Weights of Evidence (WOE)
Table 2.9 Filters for Variable Selection
Table 2.10 Calculating the Information Value Filter Measure
Table 2.11 Contingency Table for Marital Status versus Good/Bad Customer
Chapter 3: Descriptive Analytics for Fraud Detection
Table 3.1 Transaction Data Set for Peer-Group Analysis
Table 3.2 Transactions Database for Insurance Fraud Detection
Table 3.3 Data Set for Hierarchical Clustering
Table 3.4 Output from a k-Means Clustering Exercise (k=4)
Chapter 4: Predictive Analytics for Fraud Detection
Table 4.1 Data Set for Linear Regression
Table 4.2 Example Classification Data Set
Table 4.3 Reference Values for Variable Significance
Table 4.4 Example Data Set for Performance Calculation
Table 4.5 Confusion Matrix
Table 4.6 Table for ROC Analysis
Table 4.7 Multiclass Confusion Matrix
Table 4.8 Data for REC Curve
Table 4.9 Values for PFA for a Data Set with No Fraudsters
Table 4.10 Values for PFB for a Data Set with No Fraudsters
Table 4.11 Values for PFC for a Data Set with No Fraudsters
Table 4.12 Values for PFA, PFB, and PFC for a Data Set with Fraudsters
Table 4.13 Adjusting the Posterior Probability
Table 4.14 Misclassification Costs
Table 4.15 Performance Benchmarks for Fraud Detection
Chapter 5: Social Network Analysis for Fraud Detection
Table 5.1 Example of Credit Card Transaction Data
Table 5.2 Overview of Neighborhood Metrics
Table 5.3 Summary of the Total, Fraudulent, and Legitimate Degree
Table 5.4 Summary of the Number of Legitimate, Fraudulent, and Semi-Fraudulent Triangles
Table 5.5 Summary of the Density
Table 5.6 Summary of Relational Neighbor Probabilities
Table 5.7 Summary of Relational Features by Lu and Getoor (2003)
Table 5.8 Centrality Metrics
Table 5.9 Summary of Geodesic Paths to Fraudulent Nodes
Table 5.10 Summary of Closeness and Closeness Centrality for Each Node of the Network in Figure 5.13
Table 5.11 Summary of the Betweenness Centrality for Each Node of the Network in Figure 5.13
Table 5.12 PageRank Algorithm
Table 5.13 Featurization Process. The Unstructured Network Is Mapped into Structured Data Variables
Table 5.14 Overview of the 100 Companies with the Highest Score as Output by the Detection Model
Chapter 6: Fraud Analytics: Post-Processing
Table 6.1 Fully Expanded Decision Table
Table 6.2 Contracted Decision Table
Table 6.3 Minimized Decision Table
Table 6.4 Decision Table for Rule Verification
Table 6.5 Using the Expected Fraud Amount to Decide on Further Investigation
Table 6.6 Calculating the System Stability Index (SSI)
Table 6.7 Monitoring the SSI through Time
Table 6.8 Calculating the SSI for Individual Variables
Table 6.9 Monitoring the Performance Metric of a Fraud Model
Table 6.10 Monitoring the Calibration of a Classification Model
Table 6.11 Monitoring the Calibration of a Regression Model
Chapter 7: Fraud Analytics: A Broader Perspective
Table 7.1 Example Costs for Calculating Total Cost of Ownership (TCO)
Wiley & SAS Business Series
The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.
Titles in the Wiley & SAS Business Series include:
Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications by Bart Baesens
Bank Fraud: Using Technology to Combat Losses by Revathi Subramanian
Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst
Big Data, Big Innovation: Enabling Competitive Differentiation through Business Analytics by Evan Stubbs
Business Analytics for Customer Intelligence by Gert Laursen
Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure by Michael Gendron
Business Intelligence and the Cloud: Strategic Implementation Guide by Michael S. Gendron
Business Transformation: A Roadmap for Maximizing Organizational Insights by Aiman Zeid
Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media by Frank Leistner
Data-Driven Healthcare: How Analytics and BI Are Transforming the Industry by Laura Madsen
Delivering Business Analytics: Practical Guidelines for Best Practice by Evan Stubbs
Demand-Driven Forecasting: A Structured Approach to Forecasting, Second Edition by Charles Chase
Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain by Robert A. Davis
Developing Human Capital: Using Analytics to Plan and Optimize Your Learning and Development Investments by Gene Pease, Barbara Beresford, and Lew Walker
The Executive's Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business by David Thomas and Mike Barlow
Economic and Business Forecasting: Analyzing and Interpreting Econometric Results by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard
Financial Institution Advantage and The Optimization of Information Processing by Sean C. Keenan
Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications by Robert Rowan
Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data Driven Models by Keith Holdaway
Health Analytics: Gaining the Insights to Transform Health Care by Jason Burke
Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World by Carlos Andre Reis Pinheiro and Fiona McNeill
Human Capital Analytics: How to Harness the Potential of Your Organization's Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz
Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education by Jamie McQuiggan and Armistead Sapp
Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet by Mark Brown
Predictive Analytics for Human Resources by Jac Fitz-enz and John Mattox II
Predictive Business Analytics: Forward-Looking Capabilities to Improve Business Performance by Lawrence Maisel and Gary Cokins
Retail Analytics: The Secret Weapon by Emmett Cox
Social Network Analysis in Telecommunications by Carlos Andre Reis Pinheiro
Statistical Thinking: Improving Business Performance, Second Edition by Roger W. Hoerl and Ronald D. Snee
Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics by Bill Franks
Too Big to Ignore: The Business Case for Big Data by Phil Simon
The Value of Business Analytics: Identifying the Path to Profitability by Evan Stubbs
The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions by Phil Simon
Understanding the Predictive Analytics Lifecycle by Al Cordoba
Unleashing Your Inner Leader: An Executive Coach Tells All by Vickie Bevenour
Using Big Data Analytics: Turning Big Data into Big Money by Jared Dean
Win with Advanced Business Analytics: Creating Business Value from Your Data by Jean Paul Isson and Jesse Harriott
For more information on any of the above titles, please visit www.wiley.com.
Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques
A Guide to Data Science for Fraud Detection
Bart Baesens
Véronique Van Vlasselaer
Wouter Verbeke
Copyright © 2015 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Baesens, Bart.
Fraud analytics using descriptive, predictive, and social network techniques : a guide to data science for fraud detection / Bart Baesens, Veronique Van Vlasselaer, Wouter Verbeke.
pages cm. — (Wiley & SAS business series)
Includes bibliographical references and index.
ISBN 978-1-119-13312-4 (cloth) — ISBN 978-1-119-14682-7 (epdf) — ISBN 978-1-119-14683-4 (epub)
1. Fraud— Statistical methods. 2. Fraud— Prevention. 3. Commercial crimes— Prevention. I. Title.
HV6691.B34 2015
364.16′3015195—dc23
2015017861
Cover Design: Wiley
Cover Image: ©iStock.com/aleksandarvelasevic
To my wonderful wife, Katrien, and kids, Ann-Sophie, Victor, and Hannelore.
To my parents and parents-in-law.
To my husband and soul mate, Niels, for his never-ending support.
To my parents, parents-in-law, and siblings-in-law.
To Luit and Titus.
List of Figures
Figure 1.1 Fraud Triangle
Figure 1.2 Fire Incident Claim-Handling Process
Figure 1.3 The Fraud Cycle
Figure 1.4 Outlier Detection at the Data Item Level
Figure 1.5 Outlier Detection at the Data Set Level
Figure 1.6 The Fraud Analytics Process Model
Figure 1.7 Profile of a Fraud Data Scientist
Figure 1.8 Screenshot of Web of Science Statistics for Scientific Publications on Fraud between 1996 and 2014
Figure 2.1 Aggregating Normalized Data Tables into a Non-Normalized Data Table
Figure 2.2 Pie Charts for Exploratory Data Analysis
Figure 2.3 Benford's Law Describing the Frequency Distribution of the First Digit
Figure 2.4 Multivariate Outliers
Figure 2.5 Histogram for Outlier Detection
Figure 2.6 Box Plots for Outlier Detection
Figure 2.7 Using the z-Scores for Truncation
Figure 2.8 Default Risk Versus Age
Figure 2.9 Illustration of Principal Component Analysis in a Two-Dimensional Data Set
Figure 3.1 3D Scatter Plot for Detecting Outliers
Figure 3.2 OLAP Cube for Fraud Detection
Figure 3.3 Example Pivot Table for Credit Card Fraud Detection
Figure 3.4 Break-Point Analysis
Figure 3.5 Peer-Group Analysis
Figure 3.6 Cluster Analysis for Fraud Detection
Figure 3.7 Hierarchical Versus Nonhierarchical Clustering Techniques
Figure 3.8 Euclidean Versus Manhattan Distance
Figure 3.9 Divisive Versus Agglomerative Hierarchical Clustering
Figure 3.10 Calculating Distances between Clusters
Figure 3.11 Example for Clustering Birds. The Numbers Indicate the Clustering Steps
Figure 3.12 Dendrogram for Birds Example. The Thick Black Line Indicates the Optimal Clustering
Figure 3.13 Scree Plot for Clustering
Figure 3.14 Scatter Plot of Hierarchical Clustering Data
Figure 3.15 Output of Hierarchical Clustering Procedures
Figure 3.16 k-Means Clustering: Start from Original Data
Figure 3.17 k-Means Clustering Iteration 1: Randomly Select Initial Cluster Centroids
Figure 3.18 k-Means Clustering Iteration 1: Assign Remaining Observations
Figure 3.19 k-Means Iteration Step 2: Recalculate Cluster Centroids
Figure 3.20 k-Means Clustering Iteration 2: Reassign Observations
Figure 3.21 k-Means Clustering Iteration 3: Recalculate Cluster Centroids
Figure 3.22 k-Means Clustering Iteration 3: Reassign Observations
Figure 3.23 Rectangular Versus Hexagonal SOM Grid
Figure 3.24 Clustering Countries Using SOMs
Figure 3.25 Component Plane for Literacy
Figure 3.26 Component Plane for Political Rights
Figure 3.27 Must-Link and Cannot-Link Constraints in Semi-Supervised Clustering
Figure 3.28 δ-Constraints in Semi-Supervised Clustering
Figure 3.29 ε-Constraints in Semi-Supervised Clustering
Figure 3.30 Cluster Profiling Using Histograms
Figure 3.31 Using Decision Trees for Clustering Interpretation
Figure 3.32 One-Class Support Vector Machines
Figure 4.1 A Spider Construction in Tax Evasion Fraud
Figure 4.2 Regular Versus Fraudulent Bankruptcy
Figure 4.3 OLS Regression
Figure 4.4 Bounding Function for Logistic Regression
Figure 4.5 Linear Decision Boundary of Logistic Regression
Figure 4.6 Other Transformations
Figure 4.7 Fraud Detection Scorecard
Figure 4.8 Calculating the p-Value with a Student's t-Distribution
Figure 4.9 Variable Subsets for Four Variables V1, V2, V3, and V4
Figure 4.10 Example Decision Tree
Figure 4.11 Example Data Sets for Calculating Impurity
Figure 4.12 Entropy Versus Gini
Figure 4.13 Calculating the Entropy for Age Split
Figure 4.14 Using a Validation Set to Stop Growing a Decision Tree
Figure 4.15 Decision Boundary of a Decision Tree
Figure 4.16 Example Regression Tree for Predicting the Fraud Percentage
Figure 4.17 Neural Network Representation of Logistic Regression
Figure 4.18 A Multilayer Perceptron (MLP) Neural Network
Figure 4.19 Local Versus Global Minima
Figure 4.20 Using a Validation Set for Stopping Neural Network Training
Figure 4.21 Example Hinton Diagram
Figure 4.22 Backward Variable Selection
Figure 4.23 Decompositional Approach for Neural Network Rule Extraction
Figure 4.24 Pedagogical Approach for Rule Extraction
Figure 4.25 Two-Stage Models
Figure 4.26 Multiple Separating Hyperplanes
Figure 4.27 SVM Classifier for the Perfectly Linearly Separable Case
Figure 4.28 SVM Classifier in Case of Overlapping Distributions
Figure 4.29 The Feature Space Mapping
Figure 4.30 SVMs for Regression
Figure 4.31 Representing an SVM Classifier as a Neural Network
Figure 4.32 One-Versus-One Coding for Multiclass Problems
Figure 4.33 One-Versus-All Coding for Multiclass Problems
Figure 4.34 Training Versus Test Sample Set Up for Performance Estimation
Figure 4.35 Cross-Validation for Performance Measurement
Figure 4.36 Bootstrapping
Figure 4.37 Calculating Predictions Using a Cut-Off
Figure 4.38 The Receiver Operating Characteristic Curve
Figure 4.39 Lift Curve
Figure 4.40 Cumulative Accuracy Profile
Figure 4.41 Calculating the Accuracy Ratio
Figure 4.42 The Kolmogorov-Smirnov Statistic
Figure 4.43 A Cumulative Notch Difference Graph
Figure 4.44 Scatter Plot: Predicted Fraud Versus Actual Fraud
Figure 4.45 CAP Curve for Continuous Targets
Figure 4.46 Regression Error Characteristic (REC) Curve
Figure 4.47 Varying the Time Window to Deal with Skewed Data Sets
Figure 4.48 Oversampling the Fraudsters
Figure 4.49 Undersampling the Nonfraudsters
Figure 4.50 Synthetic Minority Oversampling Technique (SMOTE)
Figure 5.1a Königsberg Bridges
Figure 5.1b Schematic Representation of the Königsberg Bridges
Figure 5.2 Identity Theft. The Frequent Contact List of a Person is Suddenly Extended with Other Contacts (Light Gray Nodes). This Might Indicate that a Fraudster (Dark Gray Node) Took Over that Customer's Account and Shares His/Her Contacts
Figure 5.3 Network Representation
Figure 5.4 Example of a (Un)Directed Graph
Figure 5.5 Follower–Followee Relationships in a Twitter Network
Figure 5.6 Edge Representation
Figure 5.7 Example of a Fraudulent Network
Figure 5.8 An Egonet. The Ego is Surrounded by Six Alters, of Whom Two are Legitimate (White Nodes) and Four are Fraudulent (Gray Nodes)
Figure 5.9 Toy Example of Credit Card Fraud
Figure 5.10 Mathematical Representation of (a) a Sample Network; (b) the Adjacency or Connectivity Matrix; (c) the Weight Matrix; (d) the Adjacency List; and (e) the Weight List
Figure 5.11 A Real-Life Example of a Homophilic Network
Figure 5.12 A Homophilic Network
Figure 5.13 Sample Network
Figure 5.14a Degree Distribution
Figure 5.14b Illustration of the Degree Distribution for a Real-Life Network of Social Security Fraud. The Degree Distribution Follows a Power Law (log-log axes)
Figure 5.15 A 4-regular Graph
Figure 5.16 Example Social Network for a Relational Neighbor Classifier
Figure 5.17 Example Social Network for a Probabilistic Relational Neighbor Classifier
Figure 5.18 Example of Social Network Features for a Relational Logistic Regression Classifier
Figure 5.19 Example of Featurization with Features Describing Intrinsic Behavior and Behavior of the Neighborhood
Figure 5.20 Illustration of Dijkstra's Algorithm
Figure 5.21 Illustration of the Number of Connecting Paths Between Two Nodes
Figure 5.22 Illustration of Betweenness Between Communities of Nodes
Figure 5.23 PageRank Algorithm
Figure 5.24 Illustration of Iterative Process of the PageRank Algorithm
Figure 5.25 Sample Network
Figure 5.26 Community Detection for Credit Card Fraud
Figure 5.27 Iterative Bisection
Figure 5.28 Dendrogram of the Clustering of Figure 5.27 by the Girvan-Newman Algorithm. The Modularity Q is Maximized When Splitting the Network into Two Communities ABC–DEFG
Figure 5.29 Complete (a) and Partial (b) Communities
Figure 5.30 Overlapping Communities
Figure 5.31 Unipartite Graph
Figure 5.32 Bipartite Graph
Figure 5.33 Connectivity Matrix of a Bipartite Graph
Figure 5.34 A Multipartite Graph
Figure 5.35 Sample Network of Gotcha!
Figure 5.36 Exposure Score of the Resources Derived by a Propagation Algorithm. The Results are Based on a Real-Life Data Set in Social Security Fraud
Figure 5.37 Egonet in Social Security Fraud. A Company Is Associated with its Resources
Figure 5.38 ROC Curve of the Gotcha! Model, which Combines both Intrinsic and Relational Features
Figure 6.1 The Analytical Model Life Cycle
Figure 6.2 Traffic Light Indicator Approach
Figure 6.3 SAS Social Network Analysis Dashboard
Figure 6.4 SAS Social Network Analysis Claim Detail Investigation
Figure 6.5 SAS Social Network Analysis Link Detection
Figure 6.6 Distribution of Claim Amounts and Average Claim Value
Figure 6.7 Geographical Distribution of Claims
Figure 6.8 Zooming into the Geographical Distribution of Claims
Figure 6.9 Measuring the Efficiency of the Fraud-Detection Process
Figure 6.10 Evaluating the Efficiency of Fraud Investigators
Figure 7.1 RACI Matrix
Figure 7.2 Anonymizing a Database
Figure 7.3 Different SQL Views Defined for a Database
Figure 7.4 Aggregate Loss Distribution with Indication of Expected Loss, Value at Risk (VaR) at 99.9 Percent Confidence Level and Unexpected Loss
Figure 7.5 Snapshot of a Credit Card Fraud Time Series Data Set and Associated Histogram of the Fraud Amounts
Figure 7.6 Aggregate Loss Distribution Resulting from a Monte Carlo Simulation with Poisson Distributed Monthly Fraud Frequency and Associated