Forget About the Algorithms – It’s the Data that Prevents Fraud

Myth: it’s all about the algorithm. Reality: it’s all about the data: the training data, the sources of data, and the quality of the data.

Training Data

The basic concept of machine learning is that it gives computers the tools to learn from data without having to be explicitly programmed. Let’s say you want your e-commerce site to prevent fraudulent credit card purchases automatically. Machine learning could help you do that. Hypothetically speaking, you could build a machine learning model and algorithm for preventing credit card fraud.

But the model and algorithm won’t work until you train them to detect and identify credit card fraud, and to do that you need high-quality training data and lots of it.
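To make the idea of training concrete, here is a minimal sketch in Python. It is not a production fraud model. The nearest-centroid method, the two features (order amount and account age), the function names, and every data value are assumptions invented for illustration. The point is only that the decision rule comes from labeled examples rather than from hand-written rules.

```python
from statistics import mean

def train_centroids(rows):
    """Learn one average feature vector (centroid) per label.

    rows: list of (features, label) pairs, where features is a tuple of numbers.
    """
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in by_label.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(centroids[label], features))

# Hypothetical, made-up training data:
# (order amount in dollars, account age in days) -> label
training = [
    ((25.0, 400.0), "legit"),
    ((40.0, 900.0), "legit"),
    ((60.0, 650.0), "legit"),
    ((900.0, 1.0), "fraud"),
    ((1200.0, 3.0), "fraud"),
    ((750.0, 2.0), "fraud"),
]

centroids = train_centroids(training)
print(predict(centroids, (1000.0, 2.0)))   # large order from a brand-new account
print(predict(centroids, (30.0, 700.0)))   # small order from an old account
```

With only six rows the model knows almost nothing, and it does not scale its features. A real model would need far more labeled examples, which is exactly the constraint this section describes.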

The number of training datasets you need depends on the problem you’re trying to solve. You may only need a small number of datasets for a very basic credit card fraud prevention model. But credit card fraud is just one of many types of fraud.

Javelin Strategy & Research reported that U.S. identity fraud losses totaled $16 billion in 2016. Account takeover losses alone reached $2.3 billion, an increase of 61% from 2015. If you wanted to tackle promotion abuse, account takeover, and other forms of fraud as well, you would first have to build more models and algorithms. Then you would need to obtain high-quality training datasets that are specific to finance and fraud.

Sources of Training Data

The American research scientist and entrepreneur Alexander Wissner-Gross reviewed highly publicized AI advances from the past 30 years. He concluded that breakthroughs followed the availability of high-quality datasets about six times faster than they followed the availability of the algorithms behind them.

If you go back about a decade, you’ll find that the number of available public training datasets was limited. Today, however, you can find many publicly available high-quality training datasets for a wide variety of AI and machine learning applications such as object recognition, action recognition, and sentiment analysis. ImageNet, for example, is one of the most well-known sources of image datasets, and it provides human-annotated images for use in computer vision research.

There are public datasets available for numerous industries and applications. For example, Quandl offers a wide variety of financial and economic data (free and paid) that comes from central banks, financial exchanges, private companies, and government agencies. You can even find public datasets for credit card and other types of financial fraud at sites like Kaggle.

Public datasets are generally useful for testing algorithms and for training models that perform a simple, general task. But fraud is a complex problem, whether it’s credit card fraud, account opening fraud, or any other type of fraud. So instead of trying to find a publicly available dataset for the type of fraud you want to prevent, you would probably be better off using a solution to create your own custom datasets. Or better yet, find a machine learning fraud prevention solution with thousands of models already trained on high-quality datasets.

It is crucial that machine learning models and algorithms are trained with high-quality, well-annotated training data. It only takes a small amount of flawed data to throw off your fraud detection model.
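The effect of flawed labels can be sketched with a toy example. The Python below is a simplified illustration, not evidence about any real system: the midpoint-threshold rule, the function name, and all of the amounts are assumptions. It learns a dollar threshold from ten labeled transactions, then learns it again after two fraud cases have been mislabeled as legitimate.

```python
from statistics import mean

def learn_threshold(rows):
    """Learn a cutoff halfway between the average legit and average fraud amount.

    rows: list of (amount, is_fraud) pairs.
    """
    legit = [amount for amount, is_fraud in rows if not is_fraud]
    fraud = [amount for amount, is_fraud in rows if is_fraud]
    return (mean(legit) + mean(fraud)) / 2

# Hypothetical, made-up labeled transactions: (amount in dollars, is_fraud)
clean = [
    (20.0, False), (40.0, False), (60.0, False), (80.0, False), (100.0, False),
    (400.0, True), (450.0, True), (500.0, True), (550.0, True), (600.0, True),
]

# Same rows, but the two largest fraud cases are mislabeled as legitimate.
noisy = [(amount, is_fraud and amount < 550.0) for amount, is_fraud in clean]

clean_threshold = learn_threshold(clean)
noisy_threshold = learn_threshold(noisy)

suspicious_amount = 300.0
print(f"clean threshold: {clean_threshold:.2f}")   # 280.00
print(f"noisy threshold: {noisy_threshold:.2f}")   # 328.57
print("flagged with clean labels:", suspicious_amount > clean_threshold)  # True
print("flagged with noisy labels:", suspicious_amount > noisy_threshold)  # False
```

In this sketch, two wrong labels are enough to move the cutoff so that a $300 transaction is no longer flagged.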

Crowdsourced Data

Training datasets are not the only sources of data where quality is a concern. AI and machine learning models that learn from application users could have data quality problems, even if the models were trained on high-quality data. Public sources of data like chatbot platforms, social media networks, and forums can be problematic for machine learning-based apps.

Remember Microsoft’s now-defunct chatbot Tay? It took less than 24 hours for Twitter users to teach Tay how to make racist and offensive statements. The chatbot learned from the biased and flawed data that came from its users. Tay was a great example of the adage “garbage in, garbage out.” Machine learning can go sideways quickly if it’s fed low-quality data.

Waze is another example of an app that is driven by crowdsourced data. Wazers can edit the live map and exchange traffic and navigation data through the Waze app. Accidents, police nearby, construction ahead, and other real-time traffic alerts are the result of data contributed by the Waze community. Waze relies on its more than 50 million users for much of the data that runs its app. To encourage accurate information and keep users coming back, Waze uses a rank and points system.

Machine learning-based fraud prevention solutions are not all the same. Some rely on feedback from users; some do not. But they all must train their machine learning models and algorithms with high-quality data. And if the fraud prevention solution does rely on user feedback, the quality of that feedback will impact how well or how poorly the solution works.
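One hypothetical way to limit the damage from low-quality feedback is to weight each report by how reliable its author has been in the past. The Python sketch below is an assumption made for illustration. It does not describe how Waze or any particular fraud prevention product works, and the function name, the trust scores, and the reports are all invented.

```python
def weighted_fraud_score(reports):
    """Return the trust-weighted share of reports that say "fraud".

    reports: list of (said_fraud, reporter_trust) pairs, where reporter_trust
    is a hypothetical reliability score between 0 and 1.
    """
    total_trust = sum(trust for _, trust in reports)
    if total_trust == 0:
        return 0.0  # no usable feedback
    fraud_trust = sum(trust for said_fraud, trust in reports if said_fraud)
    return fraud_trust / total_trust

# Hypothetical feedback on one transaction: one reliable reporter says fraud,
# two unreliable reporters say it is fine.
reports = [(True, 0.9), (False, 0.2), (False, 0.1)]

unweighted = sum(1 for said_fraud, _ in reports if said_fraud) / len(reports)
weighted = weighted_fraud_score(reports)

print(f"unweighted share saying fraud: {unweighted:.2f}")
print(f"trust-weighted share saying fraud: {weighted:.2f}")
```

A simple majority vote would dismiss this transaction, while the trust-weighted score (about 0.75) leans toward fraud. Either way, the result is only as good as the feedback and the trust scores behind it.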

Algorithms are only part of the fraud prevention puzzle. It’s the data that prevents fraud.