
Machine Learning: The Data Science Loop

Demystifying Machine Learning for Banks is a five-part blog series that details how the machine learning era came to be, explains why machine learning is the key artificial intelligence technology, outlines how machine learning works, and explains how to put all this information together in the data science loop. This article is the fourth post in the series.

Machine learning systems can degrade over time as patterns in the underlying data change, a phenomenon referred to as concept drift. To counter concept drift, data scientists continually monitor their models and adjust them as needed via the data science loop: the iterative approach for developing and improving a machine learning system.

The six phases of the data science loop

The data science loop is powerful because it combines sophisticated algorithms with practical context: algorithms provide superior analytical ability, while data scientists supply the domain knowledge. Data scientists evaluate the performance of algorithms in real time and make adjustments to develop the most accurate model possible. The six phases of the data science loop are:

Data cleaning

The data scientist selects the training and testing data, validates its quality, and, where needed, transforms it into a valid format for use in the next steps of the data science loop.
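As a minimal sketch of this cleaning step, the snippet below validates and normalizes raw transaction records, dropping rows with malformed fields. The field names (`amount`, `timestamp`, `merchant`) are illustrative assumptions, not a fixed schema:

```python
from datetime import datetime

def clean_transaction(raw):
    """Validate one raw transaction record and normalize its fields.
    Field names here are illustrative, not from any specific dataset."""
    try:
        amount = float(raw["amount"])                   # reject non-numeric amounts
        if amount < 0:
            return None                                 # drop invalid rows
        ts = datetime.fromisoformat(raw["timestamp"])   # normalize timestamps
    except (KeyError, ValueError):
        return None
    return {"amount": amount, "timestamp": ts,
            "merchant": raw.get("merchant", "unknown").strip().lower()}

raw_rows = [
    {"amount": "12.50", "timestamp": "2021-03-01T10:15:00", "merchant": " Acme "},
    {"amount": "oops", "timestamp": "2021-03-01T10:16:00"},  # malformed, dropped
]
cleaned = [r for r in (clean_transaction(x) for x in raw_rows) if r is not None]
```

Real pipelines would typically do this with a dataframe library and a larger set of validation rules, but the shape of the work is the same: reject what cannot be trusted, normalize what can.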

Analyze and sample

The data scientist analyzes the data to determine which information will be most useful for training a machine learning model to identify the target variable. In some cases, data scientists sample the data to ensure that the target variable (e.g., fraud) is sufficiently represented. This can be done through stratified sampling, a method that consists of splitting the data into groups based on the target feature and sampling independently from each group. By choosing the frequency with which each group is sampled, we can tune how much each is represented in the final sample. Sampling also reduces the volume of data the system must process to build a model, which shortens training time and enables faster model iterations.

Feature engineering

This phase kicks off the modeling process and is where art meets science. A feature is a measurable property of the data, such as the value of a purchase. Sophisticated data analysis determines which features contribute the most to model training and which are harmful to it. Based on this analysis, the data scientist selects the measurable attributes from the underlying data to include in the machine learning model, and usually creates new features derived from the original data (e.g., an is-holiday feature based on a date feature). This is one of the most important and time-consuming parts of developing a machine learning model.
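A small sketch of feature derivation, including the is-holiday example from the text. The holiday set and feature names are hypothetical choices made for illustration:

```python
import math
from datetime import date

HOLIDAYS = {date(2021, 1, 1), date(2021, 12, 25)}  # illustrative subset

def add_features(txn):
    """Derive new features from the raw date and amount fields.
    Feature names here are hypothetical examples, not a fixed schema."""
    d = txn["date"]
    return {
        **txn,
        "is_holiday": d in HOLIDAYS,              # derived from the date feature
        "is_weekend": d.weekday() >= 5,           # Saturday/Sunday flag
        "log_amount": math.log1p(txn["amount"]),  # compress heavy-tailed amounts
    }

row = add_features({"date": date(2021, 12, 25), "amount": 100.0})
```

Each derived feature encodes a hypothesis about what matters (holiday spending spikes, weekend behavior, skewed amount distributions); testing those hypotheses is exactly the "art meets science" part.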

Model building

In this phase, data scientists select the machine learning algorithm(s) and build the model(s). Many types of machine learning algorithms are readily available, each with its own advantages and disadvantages; the best choice depends on the kind of data and the task being performed. Data scientists also pay special attention to requirements such as explainability and prediction speed (e.g., in transaction fraud, predictions are typically made in real time, so scoring time is critical).
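To make "building a model" concrete, here is a deliberately tiny one: a decision stump, the one-split special case of a decision tree. Real projects would use an established library; this toy version, on made-up data, just shows what training means, i.e., searching for parameters that fit the labels:

```python
def train_stump(xs, ys):
    """Fit a one-split decision stump on a single numeric feature.
    A toy stand-in for the model-building step, not a production model."""
    best = None
    pairs = sorted(zip(xs, ys))
    for i in range(1, len(pairs)):
        thresh = (pairs[i - 1][0] + pairs[i][0]) / 2     # candidate split point
        preds = [1 if x > thresh else 0 for x in xs]
        acc = sum(p == y for p, y in zip(preds, ys)) / len(ys)
        if best is None or acc > best[1]:
            best = (thresh, acc)                         # keep the best split
    return best  # (threshold, training accuracy)

# toy data: transaction amounts vs. a 0/1 fraud label
threshold, acc = train_stump([5, 10, 200, 500], [0, 0, 1, 1])
```

A stump is also a good reminder of the explainability requirement mentioned above: its single threshold is trivially explainable, while more powerful models trade some of that transparency for accuracy.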

Hyperparameter optimization

When buying a car, one can pick a model, but there are many additional configuration choices on top of that model that change the vehicle's safety, comfort, and fuel efficiency. Likewise, machine learning models and their training algorithms have many settings that can be adjusted and that affect how well the resulting model performs. These settings are called hyperparameters. For a decision-tree-based model, for example, one hyperparameter is the maximum depth of the trees to generate. Deeper trees can make the model more powerful, with more decision splits capturing more information about the data, but also more prone to overfitting (fitting the training data so closely that the model does not generalize well to new data). Hyperparameters therefore have to be tuned carefully: many models are trained with many hyperparameter configurations, and the data scientists then evaluate, compare, and select the best ones.
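One common way to try many configurations is a grid search: enumerate every combination of hyperparameter values and keep the best-scoring one. The sketch below assumes a generic `evaluate` callback standing in for train-plus-validate, and uses a fake scorer with made-up "best" values purely for illustration:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustively try every hyperparameter combination and keep the best.
    `evaluate` stands in for training a model and scoring it on validation data."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)               # e.g. validation accuracy
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# fake scorer: pretends depth 4 with learning rate 0.1 generalizes best
def fake_validation_score(p):
    return -abs(p["max_depth"] - 4) - abs(p["learning_rate"] - 0.1)

best, score = grid_search(
    {"max_depth": [2, 4, 8], "learning_rate": [0.01, 0.1, 0.5]},
    fake_validation_score,
)
```

Grid search is the simplest strategy; in practice random search or Bayesian optimization often finds good configurations with far fewer training runs.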

Evaluate and compare

For every trained model, the data scientists extract metrics of the model's performance, such as the fraud detection rate or the false-positive rate. This is done by running each model against a test set (data that was not included in the training phase). By examining how different features and hyperparameter configurations influence these metrics, the data scientists can devise new configurations that show promise of superior performance.
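The two metrics named above can be computed directly from a confusion of test-set labels and predictions. A minimal sketch, with toy labels where 1 means fraud:

```python
def fraud_metrics(y_true, y_pred):
    """Compute the fraud detection rate (recall on the fraud class) and the
    false-positive rate from held-out test labels and model predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # fraud caught
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # fraud missed
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # legit flagged
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # legit passed
    return {
        "detection_rate": tp / (tp + fn),       # caught fraud / all fraud
        "false_positive_rate": fp / (fp + tn),  # flagged legit / all legit
    }

# held-out test labels vs. one model's predictions (toy values)
m = fraud_metrics([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```

Comparing these numbers across models and configurations, rather than a single accuracy score, matters in fraud because the classes are highly imbalanced and the two error types have very different business costs.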

Data scientists can repeat the loop at any time to improve their models. This is especially useful when the underlying data or business requirements change.

Key Takeaways

Systematic investments in open machine learning platforms allow for the automation of repetitive steps in the data science loop. This allows your data scientists to do what they do best: understand the data, create models, and generate insights or predictions that drive your business.

Elizabeth Cruz, Director of Data Science


