How Data Visualization Reveals the Shape of Fraud

At Feedzai, we leverage vast amounts of data to help firms fight fraud efficiently. It’s part of our battle against ever-evolving financial crime. And it’s a team effort. Data scientists use our platform to build machine learning models from historical data to stop fraudsters across the globe in real-time. Fraud analysts investigate the most complex cases and take action. Investigators look into what’s trending among fraudsters, creating rules to complement the model and stop future attacks.

All of these roles use data to successfully reach their goals. For instance, a data scientist handles terabytes of transactions while an analyst investigates customers’ histories. However, despite differences in the size and scope of the data they handle, both need to make sense of the data.

If you ask me, visualization is one of the best tools for doing that, and research keeps supporting it.

Fraud-fighting charts

Data visualization engineers strive to make complex data interpretable through visual representations. We take data, abstract the information, and encode its properties through visual channels to create a visual representation.

Human perception, through preattentive visual processing, quickly decodes shapes, colors, positions, and movements. As Colin Ware puts it, “The features that pop out are hardwired in the brain, not learned.” So, you can think of our vision as a high bandwidth channel to the brain.

Data visualizations enable people to understand patterns and quickly spot relationships hidden in data.

This fast data visualization works well in Feedzai’s reporting tool, Insights, which makes data (including real-time fraud metrics, the most frequently triggered rules, and overviews of day-to-day analyst operations) visible across multiple dashboards. You can design charts in these dashboards that can be read at a glance, usually alongside gauges and metrics.

In the world of fast data visualization, there’s a tendency to stick to simplicity and the most conventional charts, but there should still be some room for creativity. As Amanda Cox said,

“There’s a strand of the data viz world that argues that everything could be a bar chart. That’s possibly true but also possibly a world without joy.”

Not all visualizations are simple. Our data scientists often create elaborate charts that can be used to explore different perspectives on the data (e.g., how the shape of a distribution of a categorical variable changes between legitimate and fraudulent transactions, and how that relates to each in-category fraud rate). These are not the sort of plots you can glance over, no matter how great your preattentive visual processing is. These charts are meant to be studied because they encode many aspects of the data. It’s slow data visualization: you invest more time and get more robust, in-depth analyses.

Fast and slow data visualization

To further explore the differences between fast and slow data visualization, check out Elijah Meeks’s article Data Visualization, Fast and Slow.

As data visualization engineers at Feedzai, we always need to understand where a visualization falls on this fast-slow spectrum, how much time readers will spend on the chart, and how data-literate they are. Not every situation calls for fast data visualization, and not every user wants an elaborate interactive chart. Like most things, there’s a time and place for each type.

It’s all connected

This blog post narrates our exploration of node-link diagrams and graph-based features, which are key to uncovering intricate fraud patterns, and how our discovery kick-started one of Feedzai’s innovations, Genome.

Let’s start with a glimpse of the world of payments and fraud.

We’re usually told to be aware of our passwords and credit card details. We’re told to protect them from malicious, lonely hackers who hide behind computer screens and try to steal our data and money.

While there are hackers who operate solo, fraudsters rarely work alone. They are often members of fraudulent organizations and take part in multiple activities, including installing card-cloning devices on ATMs and deploying phishing attacks online.

When fraudsters try to convert stolen cards into resellable goods and attempt to shop online, they tend to connect through private networks, switch devices, and then spoof their location around the world. However, all that switching makes it easy for them to accidentally leave a trace. All it takes is for one sloppy fraudster (someone who stayed up late last night, binging through too many episodes of “Stranger Things” on Netflix) to forget a step in the switching procedure. Once that happens, the organization’s cover is blown, and you can stop them. It sounds a bit cliché, but everything is connected.

Tables are a lousy medium for fraud stories

As we pieced together fraud stories through these connections — a shared device between two users, someone using over ten different cards (each of which is shared with other distinct users) — we discovered that tables are a lousy medium for this task.

Things started to look a bit like this (see image below). Because of that, we decided to give node-link diagrams and graph-based features a go at Feedzai’s first Zaickathon.

Screenshot of a panicking man with paper clutter in the background, from “It’s Always Sunny in Philadelphia”.

Interested in reading the original post or other similar articles? Check out the Feedzai TechBlog, a compilation of tales on how we fight villains through data science, AI and engineering.

Feedzai’s Zaickathon

Feedzai hosted its inaugural Zaickathon in February 2018. For two days, there were pizzas, tons of coffee, t-shirts, and a whole bunch of ideas. It was also the perfect opportunity to test out a new visualization method.

We picked a dataset of digital wallet transactions with known fraud schemes and visualized it with a node-link diagram. Each node represented an entity (a distinct customer, credit card, email, or device) in the dataset. Two nodes were linked (connected by an edge) if their entities participated in the same transaction. Here’s an example:

Imagine we are an online sneaker shop. We get an order from Zoe. From the transaction data, we know that Zoe uses her smartphone, a Samsung S9, to purchase a new pair of sneakers. We also know that she pays with her debit card. Given these entities, we can build the following graph:

A few weeks pass, and we get a couple more orders from Zoe. She pays with the same card, but now she uses a new Huawei device. We update our graph and connect the new device to Zoe and her debit card. The edge thickness changes because it’s proportional to the number of transactions both entities (nodes) participate in.
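Here is a minimal Python sketch of how a graph like this could be assembled from transaction records. It is not Genome's implementation: the field names, the card identifier, and the `build_edges` helper are assumptions made up for illustration. Each entity becomes a node, every pair of entities in the same transaction gets an edge, and the edge weight (the thickness in the diagram) counts the transactions the pair shares.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions for the sneaker shop example; field names are assumptions.
transactions = [
    {"customer": "Zoe", "card": "debit-card-1", "device": "Samsung S9"},
    {"customer": "Zoe", "card": "debit-card-1", "device": "Huawei"},
    {"customer": "Zoe", "card": "debit-card-1", "device": "Huawei"},
]

def build_edges(transactions):
    """Count, for each pair of entities, how many transactions they share."""
    edges = Counter()
    for tx in transactions:
        # A node is a (type, value) pair, so a card and a device can never collide.
        nodes = sorted(tx.items())
        for a, b in combinations(nodes, 2):
            edges[(a, b)] += 1
    return edges

edges = build_edges(transactions)
for (a, b), weight in sorted(edges.items()):
    print(f"{a[1]} -- {b[1]} (weight {weight})")
```

In this toy data, the edge between Zoe and her debit card ends up with the largest weight because both take part in all three orders, while each device edge only counts the orders placed from that device.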

Now let’s say we have relevant historical data (previous transactions the new device participated in that were identified as fraud). In that case, we would want to encode this information in the diagram visually. Let’s update the graph once again.

The red borders and edges show which entities link to fraud, based on historical data.
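As a rough sketch of that encoding step, assuming each historical transaction carries a hypothetical `is_fraud` label, we could collect every entity that took part in a fraudulent transaction and pick its border color from that set. The second customer and card below are invented for the example, and edge coloring is left out to keep the sketch short.

```python
# Hypothetical labeled history; field names and the "is_fraud" flag are assumptions.
history = [
    {"customer": "Zoe", "card": "debit-card-1", "device": "Samsung S9", "is_fraud": False},
    {"customer": "Zoe", "card": "debit-card-1", "device": "Huawei", "is_fraud": False},
    {"customer": "Max", "card": "credit-card-7", "device": "Huawei", "is_fraud": True},
]

ENTITY_FIELDS = ("customer", "card", "device")

def fraud_linked_entities(transactions):
    """Return every entity that took part in at least one fraudulent transaction."""
    flagged = set()
    for tx in transactions:
        if tx["is_fraud"]:
            flagged.update((field, tx[field]) for field in ENTITY_FIELDS)
    return flagged

flagged = fraud_linked_entities(history)

def border_color(node):
    """Red border for fraud-linked nodes, neutral gray otherwise."""
    return "red" if node in flagged else "gray"

print(border_color(("device", "Huawei")))
print(border_color(("customer", "Zoe")))
```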

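To recap, up to this point, we’ve used the following visual encodings: nodes for entities, edges for entities that share a transaction, edge thickness for the number of shared transactions, and red borders and edges for entities linked to fraud.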

The result? We visually found fraud! In the middle of a sea of small connected components, huge intertwined subgraphs surfaced.

We had successfully mapped the sequence of different fraud patterns. For the first time, we saw the shape of fraud.
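The contrast between the sea of small connected components and the huge subgraphs can also be checked numerically. The sketch below is an illustration only, using a made-up edge list and a plain breadth-first search rather than anything from the hackathon code. It groups nodes into connected components and lists their sizes, so unusually large components stand out.

```python
from collections import defaultdict, deque

# Hypothetical edge list: two small components and one larger, intertwined one.
edges = [
    ("user-1", "card-1"),
    ("user-2", "card-2"),
    ("user-3", "card-3"), ("user-3", "device-9"), ("user-4", "device-9"),
    ("user-4", "card-4"), ("user-5", "card-4"), ("user-5", "device-9"),
]

def connected_components(edges):
    """Group nodes into connected components using breadth-first search."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components

sizes = sorted(len(c) for c in connected_components(edges))
print(sizes)
```

Sorting components by size is only a first filter. The visual inspection described above is what reveals how the entities inside a large component are actually tied together.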

The hackathon project was a success, and a cross-functional team was soon established to kick-start a product out of it. Genome went on to become a visual link analysis tool that leverages Feedzai’s powerful AI technology, providing an intuitive way for investigators and data analysts to quickly identify emerging financial crime patterns.

Larger and larger networks

We assembled the team, and the two-day prototype gave way to a scalable graph visualization built from scratch. The team was still in the early days of building a product. There were post-its with ideas scattered around walls, daily brainstorms, and quick experimentations.

We wanted complete freedom to render the graph any way we desired — custom interactions, complex node and edge design, and different layouts (from the standard d3-force to alternative layouts with WebCoLa). This meant one thing: we had to build a graph renderer.

We played around with several front-end technologies, starting with SVG but quickly realizing it didn’t scale for the large graphs received from real data. We moved from SVG to Canvas and worked a bit on WebGL. In the end, Canvas worked best. Victor Fernandes was the magician who used creative, ingenious strategies to push its performance for our needs (that’s a topic for a whole other blog post). After plotting larger and larger graphs, we concluded that the bottleneck was no longer the browser performance — it was in the graph legibility for the user. You could potentially plot 30,000 nodes and 100,000 edges, but why on earth would you want to do that?

As most data visualization practitioners are aware, node-link diagrams are a tricky domain. While seemingly perfect for understanding relationships, they can quickly get out of control if we’re dealing with large, highly-connected graphs — something commonly referred to as “the hairball problem.”

It was never realistic to think we’d see the full network in the browser. The fraud analyst usually only sees a subgraph generated from current alerted event data and relevant historical context, and then they can expand nodes to find additional connections. An investigator starts with a more generic query (e.g., “all missed fraud from last week”), sees the graph generated from all those events, and tries to identify interesting new patterns. In both cases, we’re using a “search and expand-on-demand” approach instead of the classical “overview first, zoom and filter, then details-on-demand” data visualization paradigm.
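To make the “search and expand-on-demand” idea concrete, here is a simplified sketch. The node identifiers and the `search` and `expand` helpers are hypothetical and do not reflect Genome's actual API. The full network stays out of view, the initial view contains only a seed node and its direct neighbors, and each expansion reveals the neighbors of one visible node.

```python
from collections import defaultdict

# Hypothetical full network, kept outside the view; identifiers are made up.
FULL_EDGES = [
    ("alert-tx", "card-1"), ("alert-tx", "device-1"),
    ("device-1", "user-2"), ("user-2", "card-2"), ("card-2", "user-3"),
]

neighbors = defaultdict(set)
for a, b in FULL_EDGES:
    neighbors[a].add(b)
    neighbors[b].add(a)

def search(seed):
    """Start the view from a seed node and its direct neighbors only."""
    return {seed} | neighbors.get(seed, set())

def expand(visible, node):
    """Reveal the neighbors of one visible node, leaving the rest hidden."""
    if node not in visible:
        raise ValueError(f"{node} is not in the current view")
    return visible | neighbors.get(node, set())

view = search("alert-tx")
view = expand(view, "device-1")
print(sorted(view))
```

The view grows only where the user asks it to, which is one way to keep the visible subgraph small enough to stay legible.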

This means we can usually avoid those large graphs, but sometimes we have to face them. Even though they are not interpretable, they can be quite good-looking, so we call them “Genome Data Art.” They make good desktop wallpapers and t-shirts, and maybe they can even partake in a data art exhibit one day.

Storytelling fraud

As we learned more about connected fraud, it became clear that there was an essential dimension to some of these patterns that we weren’t encoding visually: time. The temporal aspect of a fraud attack is of utmost importance: the frequency and periodicity of the events are significant indicators of fraudulent activity.

Because of this, we went on to develop a time histogram for Genome. The histogram shows the distribution over time of the events generating the graph. While this seems simple enough to do, it comes with a few challenges, as developing data visualizations for a product usually does. Luís Cardoso has written a blog post on it that’s well worth checking out.

The histogram supports adjustable binning, pan, and zoom, and it even has an overview strip plot. All this solved many of our problems. However, we still wanted to relate the node-link diagram to the histogram, and for this, we used another visual channel: animation. By animating the time histogram, we see the story unfold over time as new nodes and connections appear on the screen.

GIF showing a time histogram of compromised cards after a data breach — The story of three compromised cards (green nodes) after a data breach.
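The adjustable binning and the animation described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the Genome histogram: the timestamps are invented, bins are aligned to the earliest event, and each animation frame simply contains every event that happened before the end of a given bin.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical event timestamps; real ones would come from the events behind the graph.
events = [
    datetime(2018, 2, 1, 9, 5),
    datetime(2018, 2, 1, 9, 40),
    datetime(2018, 2, 1, 11, 15),
    datetime(2018, 2, 2, 9, 30),
]

def histogram(events, bin_width):
    """Count events per bin; bins are aligned to the earliest event."""
    start = min(events)
    counts = Counter((t - start) // bin_width for t in events)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]

def frames(events, bin_width):
    """Yield the events visible at the end of each bin, for a cumulative animation."""
    start = min(events)
    n_bins = (max(events) - start) // bin_width + 1
    for i in range(1, n_bins + 1):
        cutoff = start + i * bin_width
        yield [t for t in events if t < cutoff]

hourly = histogram(events, timedelta(hours=1))
daily = histogram(events, timedelta(days=1))
print(daily)
print([len(frame) for frame in frames(events, timedelta(days=1))])
```

Changing `bin_width` re-bins the same events, and rebuilding the graph from each frame in turn is one way new nodes and connections could appear as the animation plays.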

What the future holds

Genome has already provided analysts with insights into how fraud looks. Now the question is, “How can we make it even better?” How do we make the graph even easier to read and interpret?

Our next goal is to create the perfect conditioner to take Genome up a notch so that we can untangle the messy “hairballs” — the large, highly-connected graphs. In other words, we’re putting together the right mix of “ingredients” (alternative layouts, edge bundling, node grouping, or graph summarization) to create a scalable overview mode that complements the investigation view.

We’re also figuring out how to add a geospatial dimension to the graph. Like time, location is a telling sign of some fraud modi operandi. We look forward to exploring that side of GeoViz.

Machine learning is always at the core

Finally, AI is always at the core of Feedzai, and our data scientists work hard to make Genome even smarter so it can best assist fraud analysts in their investigations. The team has already made great strides on this front, including scoring subgraphs to recommend interesting areas of the graph, and clustering subgraphs that are all instances of the same fraud pattern (known as Genometries). It’s all about making machine learning interpretable and actionable to improve the analyst’s experience with Genome.

However, there’s still much to do, as data scientists at Feedzai are brimming with innovative ideas they want to explore and state-of-the-art experiments they want to run. What started as a two-day prototype in a hackathon has grown and evolved. We’ve found our footing and created a solid foundation to keep building on this next-generation graph visualization platform. Now the fun continues, as a world of possibilities for network visualization is at our fingertips.