Does Not Compute: Data Inconsistencies in Machine Learning Pipelines

Benjamin Cohen-Wang

Ben is a machine learning engineer at Robust Intelligence.


One of the risks associated with deploying machine learning algorithms is their handling of anomalies. Anomalies can be the result of data pipeline errors or malicious actors, or might simply be naturally occurring outliers. Whatever the cause, models typically make predictions as usual when presented with anomalies, despite being much more likely to make mistakes on them.

When anomalies appear in production, it is valuable not just to know the model’s best guess at the label, but also to be alerted that an anomaly is present. Such an alert might be used to flag the data point for manual review, or to adjust the model’s confidence score. Beyond familiar anomaly types such as numeric outliers and rare or previously unseen categories, there is a subtler kind of anomaly that can go unnoticed by a machine learning system while still degrading its performance: what we call a data inconsistency.

What are data inconsistencies?

Data inconsistencies are data points in which a combination of feature values violates the patterns generally observed in the training data. Consider a machine learning system whose inputs include a country and a time zone. Then the following values together would be a data inconsistency:

{"Country": "France", "Time Zone": "Pacific Standard Time"}

When the country is "France", we expect the time zone to be "Central European Time", so something unusual is going on here. This inconsistency might be due to a bug in our data collection process, it might be due to a malicious actor manipulating the inputs to the model, or it might just represent a Californian tourist visiting France who didn’t change their time zone. Regardless of the cause, a machine learning model expecting a particular relationship between the country and time zone would be more likely to make a mistake on this data point.
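To make this concrete, here is a minimal sketch of the idea (using toy data and a simple conditional-probability check, not RIME’s actual method): a point is flagged when its (Country, Time Zone) pair was rarely or never seen together in training.

```python
import pandas as pd

# Toy training data; in practice these patterns would be learned
# from a much larger dataset.
train = pd.DataFrame({
    "Country": ["France"] * 100 + ["US"] * 100,
    "Time Zone": ["Central European Time"] * 100
                 + ["Pacific Standard Time"] * 100,
})

# Estimate P(Time Zone | Country) from the training data.
probs = (train.groupby("Country")["Time Zone"]
              .value_counts(normalize=True)
              .to_dict())  # keys are (country, time_zone) tuples

def is_inconsistent(row, threshold=0.01):
    """Flag a row if its (Country, Time Zone) pair was rarely or
    never seen together in training."""
    prob = probs.get((row["Country"], row["Time Zone"]), 0.0)
    return prob < threshold

point = {"Country": "France", "Time Zone": "Pacific Standard Time"}
print(is_inconsistent(point))  # this pair never co-occurs in training
```

The `threshold` parameter here stands in for the kind of configurable sensitivity knob such a system needs, since what counts as "inconsistent" is partly subjective.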

Why do we care?

In our experience, there are three common causes of inconsistencies. In each case, there are actions that can be taken to eliminate future inconsistencies or to mitigate their effects on model performance.

  1. Anomalies: Data inconsistencies might simply be naturally occurring data points that don’t follow the common patterns in the data, as in the example above. Although such inconsistencies aren’t a symptom of any particular issue, the model may not be well-equipped to make predictions on them. Instead of blindly trusting the model’s prediction when inconsistencies are present, it may be prudent to flag these data points for manual review or to adjust the model’s confidence accordingly.
  2. Data Pipeline Bugs: Errors in the way data is collected might result in inconsistencies. In the example above, it might be the case that the time zone input is updated more frequently than the country input, resulting in a brief span of time when contradictory values appear. Flagging inconsistencies is a key debugging step for fixing such pipeline errors.
  3. Malicious Actors: Data inconsistencies can be the result of malicious actors intentionally attempting to circumvent fraud detection systems. For example, to commit a fraudulent transaction using stolen credit card information, a fraudster might want the transaction to appear to have been made from wherever the real owner of the credit card is (perhaps by using an emulator). However, the fraudster might forget to change their time zone to match the new location, resulting in an inconsistency like the one in the example above. Identifying inconsistencies in the fraud domain can therefore augment existing fraud detection systems.

Identifying data inconsistencies with RIME

What are the challenges of identifying data inconsistencies?

The key technical challenges of algorithmically identifying inconsistencies are (1) efficiently learning the relationships between features that generally hold and (2) filtering out inconsistencies that are uninteresting or the result of random noise. For learning feature relationships, the primary challenge is computational complexity: the number of combinations of feature values to consider grows exponentially with the number of features. Robust Intelligence has developed methods for learning key relationships while avoiding exponential running time. For surfacing only the interesting inconsistencies, we have found that several heuristics are needed to constrain the search to combinations of features likely to yield meaningful results.
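One simple way to sidestep the exponential blowup, shown here as an illustrative sketch and not RIME’s actual algorithm, is to restrict the search to pairs of low-cardinality categorical features and apply heuristic filters (a minimum group size, a rarity threshold) to suppress noise:

```python
from itertools import combinations
import pandas as pd

def find_rare_pairs(df, max_cardinality=50, min_count=30, threshold=0.01):
    """Illustrative sketch: search only pairs of low-cardinality
    non-numeric columns, and flag value pairs that are rare given
    the first value. Parameter names and defaults are hypothetical."""
    cols = [c for c in df.columns
            if not pd.api.types.is_numeric_dtype(df[c])
            and df[c].nunique() <= max_cardinality]
    flagged = []
    for a, b in combinations(cols, 2):
        counts = df.groupby(a)[b].value_counts()
        group_sizes = df[a].value_counts()
        for (va, vb), n in counts.items():
            total = group_sizes[va]
            # Heuristic filters: only well-supported groups, and only
            # value pairs that are rare *given* the first value.
            if total >= min_count and n / total < threshold:
                flagged.append((a, va, b, vb))
    return flagged
```

Restricting attention to feature pairs keeps the search quadratic in the number of features rather than exponential, at the cost of missing higher-order relationships.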

Data Inconsistencies in a Fraud Detection Dataset

RIME algorithmically finds inconsistencies by learning relationships between features from the data and flagging violations when they occur. As there is some subjectivity in what counts as an inconsistency, RIME provides a set of configurable parameters for detection. We run RIME with default parameters on the IEEE-CIS Fraud Detection dataset [1].

The following are a few of the inconsistencies we find (the features "id_30" and "id_31" appear to denote the operating system and browser, respectively).

{"DeviceType": "mobile", "DeviceInfo": "MacOS"}
{"DeviceType": "desktop", "DeviceInfo": "iOS Device"}
{"id_30": "Windows 7", "id_31": "safari generic"}

In the first two instances, there’s likely some form of emulation happening, since a mobile device shouldn’t be running macOS and a desktop device shouldn’t be running iOS. The last instance may not involve an emulator, since it’s possible to run the Safari browser on a system running Windows 7, but it still represents an anomaly that may interfere with the model’s predictive performance.

To measure the effects of data inconsistencies on model performance, we train a gradient boosted decision tree model to detect fraud. Over a test set of ~200K data points, the model has an AUC score of 0.895. However, over the 336 inconsistent data points in this test set, the model has an AUC score of only 0.786. This drop in performance reflects the model’s inability to make high-quality predictions when inconsistencies occur.

[Figure: The performance (AUC) of the model on the entire IEEE-CIS Fraud Detection test set and on the samples flagged by RIME as inconsistent. Performance on the inconsistent samples is substantially lower.]
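The subset comparison itself is straightforward to reproduce. Here is a minimal sketch on synthetic stand-in data (the labels, scores, and inconsistency mask below are fabricated for illustration, so the numbers will not match those reported above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: true fraud labels, model scores, and a
# boolean mask marking rows flagged as inconsistent.
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(0.3 * y_true + rng.normal(0.5, 0.25, size=1000), 0.0, 1.0)
inconsistent = rng.random(1000) < 0.05

# AUC over the full test set vs. over the inconsistent subset only.
overall_auc = roc_auc_score(y_true, y_score)
subset_auc = roc_auc_score(y_true[inconsistent], y_score[inconsistent])
print(f"overall AUC: {overall_auc:.3f}, inconsistent-subset AUC: {subset_auc:.3f}")
```

In a real pipeline, `inconsistent` would come from the detection step rather than being sampled at random, and a persistent gap between the two AUC values is the signal that inconsistencies are hurting the model.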

Conclusion

Data inconsistencies present a risk to the deployment of machine learning models, particularly in the domain of fraud detection. Deploying a system for identifying and flagging inconsistent values among other anomalies reduces this risk.
