Guide for banks to prepare data for machine learning algorithms

Investing in machine learning without an understanding of data preparation best practices is like buying a top-of-the-line car without knowing how to drive. It doesn’t matter how sleek it is or how fast it can go. You’re going to have a hard time driving down the road. 

The same goes for artificial intelligence (AI) and machine learning. If a financial institution's data science team can't operate these tools correctly, they won't be of much use. That's why understanding the data preparation process at the beginning of your machine learning journey is so consequential.

What Banks Don’t Understand about Data Preparation for Machine Learning

Assumptions about how the data preparation process will unfold are among the biggest obstacles in preparing data for machine learning. Here are several key misconceptions many banks hold about their own data before they deploy machine learning algorithms and models.

Data is Frequently Siloed

Many banks worldwide have legacy banking systems in place that were built 30 or 40 years ago. Different types of data can live on these systems, including transactions from ATMs, credit cards, debit cards, home loans, call center activity, merchant data, mobile app information, and more. Banks will struggle to access this important data if it is stored across several legacy systems.

The Right Data Architecture Matters

The good news for banks is they don’t have to remain saddled with data silos or fragmented systems. Banks can implement an architecture that centralizes data and provides greater visibility into how customers engage across different banking channels. These solutions enable banks to better understand and address their risk management priorities.

Data Prep for Real-World Scenarios, Not Just the Sandbox

There’s an old saying that assumptions are the mother of all…mistakes. Data prep is no different. The more efficient your data prep is in testing, the faster you can move your machine learning project into production. 

Data is the key for banks and financial institutions (FIs) to develop their risk management strategies. That's why it's essential that banks ensure their data behaves the same way in production environments as in testing environments. Assuming the data behaves identically in both can prove to be a costly oversight. 

Here are three things to remember as you prepare your data.

  1. Be Mindful of Changing APIs: Fields such as response codes can change over time. Your data pipeline should be resilient to those changes.
  2. Different Environments, Same Data Transformations: A typical pitfall is translating data transformations from the sandbox environment to the production system; unwanted surprises can occur when logic gets lost in this process.
  3. Real Systems Are Not Perfect: Production systems fail and data goes missing from time to time. Evaluate those scenarios on historical data, and make sure they cannot hurt your risk strategy in real-time scenarios.
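As a sketch of the first and third points, a pipeline can normalize a field defensively so that renamed or missing fields degrade gracefully instead of breaking the strategy. The field names, aliases, and codes below are assumptions for illustration, not a real banking schema.

```python
# Illustrative sketch (assumed field names, not a real schema): defensively
# normalizing a response code that may be renamed by an API change or go
# missing entirely due to a production issue.

def parse_response_code(record: dict) -> str:
    """Return a normalized two-character response code, or a sentinel."""
    # The upstream API may rename the field over time; check known aliases.
    for field in ("response_code", "resp_code", "rc"):
        value = record.get(field)
        if value is not None and str(value).strip():
            return str(value).strip().zfill(2)
    # Missing data becomes an explicit sentinel instead of a crash,
    # so the risk strategy can handle it deliberately.
    return "UNKNOWN"

print(parse_response_code({"resp_code": "5"}))  # normalized to "05"
print(parse_response_code({}))                  # "UNKNOWN"
```

Handling the "field is missing" case on historical data first lets you measure how often it actually happens before the model meets it in real time.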

Banks’ Guide to a Smooth Data Preparation Process

When you’re ready to build and deploy your own machine learning models, you’ll want to ensure the models aren’t hampered by fragmented data and match your expectations when they are deployed into a live environment. Here’s how FIs can ensure a smooth data preparation workflow.

Use Multiple Datasets 

The more data a machine learning system can access, the better decisions it can make. And the better the decisions, the more effective an FI's risk management strategy will be. 

An important step in data preparation is to use data from multiple internal and external sources. There are several avenues available. One option is data lakes, which can centralize fragmented data located across different legacy systems. Another option is integrating a machine learning system with external data sources to further enrich the data. 
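As a minimal sketch of that enrichment step (the tables and column names are assumptions for illustration), a left join keeps every core transaction while attaching whatever external data is available:

```python
import pandas as pd

# Core transactions from one internal system (illustrative columns).
transactions = pd.DataFrame({
    "txn_id": [1, 2, 3],
    "merchant_id": ["m1", "m2", "m1"],
    "amount": [20.00, 15.50, 99.90],
})

# Enrichment from an external source, e.g. a hypothetical merchant risk feed.
merchant_risk = pd.DataFrame({
    "merchant_id": ["m1", "m2"],
    "risk_score": [0.1, 0.8],
})

# A left join keeps every transaction, even when enrichment data is missing.
enriched = transactions.merge(merchant_risk, on="merchant_id", how="left")
```

The `how="left"` choice matters: an inner join would silently drop transactions whose merchant is absent from the external feed, which is exactly the kind of quiet data loss the checklist below is meant to catch.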

Start with a Sample of Your Data

Banks can process millions or even billions of transactions each year. Validating a machine learning model's accuracy against such a large volume of data is time-consuming, and if you only discover a problem with the model's accuracy after a full-scale run, you will lose time and resources fixing it.

Validating a sample of your dataset is an effective way to confirm your model’s performance before deploying it into production. 
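One hedged sketch of how to draw such a sample: stratify it by outcome so that rare fraud cases are not lost. The `label` column marking fraud is an assumption for the example.

```python
import pandas as pd

# Illustrative data: 1,000 transactions, of which 10 are fraud (label = 1).
df = pd.DataFrame({
    "amount": range(1000),
    "label": [1 if i % 100 == 0 else 0 for i in range(1000)],
})

# A stratified 10% sample keeps fraud and legitimate rows in proportion,
# so the rare class survives the sampling step.
sample = df.groupby("label", group_keys=False).sample(frac=0.1, random_state=42)
```

A naive `df.sample(frac=0.1)` could easily draw zero fraud rows here; stratifying guarantees the sample still exercises both classes during validation.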

Sample Data Validation Checklist

  1. Data Formatting: The system that will ultimately use the data has its own unique requirements. Take care that your data preparation efforts match those requirements, and that field names and data types match any applicable schema.
  2. Account for All Data: Make sure you have all the data you need. Count the number of rows in the sample and check that the number of fields per line is consistent.
  3. Check Field Formatting: Ensure field formats are as you expect (e.g., numerical fields contain numbers, and date and time fields use the correct format).
  4. Check for Duplicate Data: This includes duplicate rows and duplicate values in IDs that should be unique.
  5. Only Use Relevant Data: Identify null values and remove fields that have a significant share of missing values.
  6. Be Mindful of Entity Fields: These include entities that can be used in metrics, such as customer, merchant, or device IDs.
  7. Amounts and Events: Monitor transaction amount fields and fields that identify event types.
  8. Check Data Quality: Watch for dummy values added to some fields, especially fields containing free text.
  9. Double-check Data Accuracy: Make sure personal information (emails, phone numbers, addresses) is entered correctly.
  10. Numerical Field Factors: Among numerical fields, check for outliers, maximum and minimum values, 10th and 90th percentile values, and fields stuck at a single constant value.
  11. Consistent Currencies: For monetary fields, make sure all amounts are in the same currency.
  12. Time Lapses: Identify any time periods with data gaps.
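Several of these checks can be automated. The sketch below (with assumed column names such as `txn_id` and `amount`) covers row counts, duplicate IDs, null shares, and the basic numerical factors:

```python
import pandas as pd

def validate_sample(df: pd.DataFrame) -> dict:
    """Run a handful of the checklist items; column names are assumed."""
    amounts = df["amount"]
    return {
        "row_count": len(df),                                   # item 2
        "duplicate_ids": int(df["txn_id"].duplicated().sum()),  # item 4
        "null_share": df.isna().mean().to_dict(),               # item 5
        "amount_min": amounts.min(),                            # item 10
        "amount_max": amounts.max(),
        "amount_p10": amounts.quantile(0.10),
        "amount_p90": amounts.quantile(0.90),
    }

report = validate_sample(pd.DataFrame({
    "txn_id": [1, 2, 2],           # one duplicate ID to catch
    "amount": [10.0, 20.0, None],  # one missing value to catch
}))
```

Running a report like this on every fresh sample turns the checklist from a one-off exercise into a repeatable gate before model validation.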

Additionally, machine learning models can gain more context from the raw data through feature engineering. Moving away from siloed data enables a clearer, fuller view of customers' behaviors, empowering FIs to detect fraud effectively. By connecting multiple data points, for example, an FI can determine whether a customer's debit card was used at an ATM in one location while a purchase was made with the same card several miles away. 
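That ATM example can be sketched as an "impossible travel" feature: flag a card seen at two locations too far apart for the elapsed time. The event format and speed threshold below are assumptions for illustration, not a production rule.

```python
from datetime import datetime
from math import radians, sin, cos, asin, sqrt

MAX_SPEED_KMH = 500.0  # illustrative threshold: faster than any ground travel

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(h))

def impossible_travel(events):
    """events: (timestamp, lat, lon) tuples for one card, sorted by time."""
    for (t1, lat1, lon1), (t2, lat2, lon2) in zip(events, events[1:]):
        hours = (t2 - t1).total_seconds() / 3600
        if hours > 0 and haversine_km(lat1, lon1, lat2, lon2) / hours > MAX_SPEED_KMH:
            return True
    return False

# ATM withdrawal in New York, then a purchase in Los Angeles 30 minutes later.
suspicious = impossible_travel([
    (datetime(2024, 1, 1, 12, 0), 40.71, -74.01),
    (datetime(2024, 1, 1, 12, 30), 34.05, -118.24),
])
```

A feature like this is only possible once ATM, card, and merchant events live in the same place, which is exactly what moving off siloed data buys.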

Use Shadow Mode to Test Models

You’ve gone through the above checklist step-by-step. But how will you know if your model behaves correctly once deployed to a live environment? Banks and FIs can build new machine learning models in “shadow mode” – a side-by-side, real-time comparison between champion and challenger models in which the new model does not make any decisions.

Data scientists can use shadow mode to test whether a model built in a sandbox behaves as expected once deployed into production. From there, data science teams can adjust the model based on real-world activity and deploy updated versions. This option enables FIs to identify gaps in their data collection workflows (if a model consistently encounters issues with missing data, for example) and to respond quickly when new data comes into play.
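A minimal shadow-mode sketch looks like this: the champion's score drives the decision, while the challenger scores the same event purely for offline comparison. The model callables, threshold, and log format are illustrative assumptions.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def decide(txn, champion, challenger, threshold=0.5):
    """Return the champion's decision; the challenger only logs its score."""
    champion_score = champion(txn)
    challenger_score = challenger(txn)  # scored in parallel, but never acts
    log.info("txn=%s champion=%.2f challenger=%.2f",
             txn["id"], champion_score, challenger_score)
    return champion_score > threshold   # only the champion makes the decision

# Illustrative stand-in models: the challenger disagrees, but cannot act.
decision = decide({"id": "t1"},
                  champion=lambda txn: 0.2,
                  challenger=lambda txn: 0.9)
```

Comparing the logged champion and challenger scores over real traffic is what tells you whether the sandbox-built model is ready to be promoted.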

Machine learning technology is a must-have for today’s banks. And for good reason. Machine learning models can read vast troves of data and give banks actionable intelligence to improve their risk strategies, prevent fraud, and stop money laundering. FIs need to understand the fundamentals of good data preparation for a machine learning system to be effective. From there, FIs can streamline their data preparation processes to ensure the models they build offline easily transition to online environments.

Banks that still use on-premise solutions will struggle to meet the needs of digital-first customers. Download our solution sheet Upgrade From On-Prem to Cloud-Based Solutions and learn how to give your fraud prevention and AML systems an upgrade.