November 23, 2021

Machine Learning Actionability: Fixing Problems with Your Model Pipelines


You’ve spent months collecting data, cleaning features, choosing the right model, and deploying it to production, but your model’s performance still isn’t up to par. You breathe a sigh of desperation: despite all your best efforts, your model is inadequate. You want to know why and how you can fix it, but you have no idea where to start. Do you choose a new model? Do you revise your data collection process? Do you start over from the beginning? As full-time machine learning engineers, we at Robust Intelligence understand how frustrating it is when everything you try just isn’t enough. In this post, we’ll walk you through some of the steps that we use to diagnose and resolve problems with machine learning pipelines, and show how you can begin to address your own.

Diagnosing the Problem

When it comes to machine learning pipelines, there are a host of issues that can pop up. Thinking of problems from the top down helps us get an idea of how to drill down to specific issues. If none of your data collection or output steps are failing, then the two main sources of error that might contribute to performance failure are data processing and model training. From these, we get two simple questions:

1. What is wrong with my data?

2. What is wrong with my model?

Answering these questions requires breaking them down even further. We can start with the stages at which these problems might arise. When we think about data issues, for example, the different sources of data (i.e., the production data and the training data) are a natural concern.

In this context, some failures are easy to pinpoint: maybe the training or production data is not cleaned properly, so it’s not in a format the model expects. Others, which result from differences between the two datasets, can be much harder to identify. Since each source contributes to model training and prediction in its own way, new kinds of performance issues can surface. For instance, production data might contain values not seen in training. When a model trained on the old dataset is used to predict on the new one, it will often fail to perform well.

This interaction between data and model is a particularly difficult part of the diagnostic process, and it relates to issues in model training that you might want to identify. Training data can be deficient in many different ways, and models can respond to each deficiency differently. This, in turn, raises questions about subsets of the training data and concerns of fairness and bias. If the data collected does not represent all subsets of the population proportionately, then the model can’t be expected to learn well and perform equally well across all subsets. Diagnosing exactly what is wrong with the model, therefore, requires an understanding of how the training data and the model itself might have contributed to a skewed representation of the task at hand.

Three predominant themes emerge:

1. Is training data processed in a clean, acceptable format? 

2. Is the production data consistent with the training data? 

3. Is the model training on a balanced dataset and performing equally well across its subsets?

Solutions to these questions can be very difficult to produce and implement. Let’s tackle them one at a time.

Tackling Dirty Data

In the case of the first question, the issue is with the preprocessing pipeline. Making sure that the training data contains no duplicates and no columns with mixed data types before it is passed to the model should solve a lot of problems. The goal is to make sure that relationships in the data are as simple as possible, so the model can learn as efficiently as possible. There are likely many constraints related to the specific use case for your model as well. You’ll want to think through each one and make sure that the data used to train the model is up to par.
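As a rough illustration of the two checks mentioned above, here is a minimal sketch using only the Python standard library. It assumes the training data is a list of dictionaries, one per row; the function names and the example rows are hypothetical, and a real pipeline would likely run equivalent checks in its own data framework.

```python
def find_duplicate_rows(rows):
    """Return the indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        # Sort the items so that key order does not affect the comparison.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


def find_mixed_type_columns(rows):
    """Return {column: type names} for columns holding more than one type."""
    types_by_column = {}
    for row in rows:
        for column, value in row.items():
            if value is None:  # treat missing values separately from type checks
                continue
            types_by_column.setdefault(column, set()).add(type(value).__name__)
    return {col: names for col, names in types_by_column.items() if len(names) > 1}


# Hypothetical training rows for illustration only.
rows = [
    {"age": 34, "income": 52000.0},
    {"age": "41", "income": 61000.0},  # age stored as a string
    {"age": 34, "income": 52000.0},    # exact duplicate of the first row
]

duplicate_indices = find_duplicate_rows(rows)
mixed_columns = find_mixed_type_columns(rows)
print("duplicate rows:", duplicate_indices)
print("mixed-type columns:", mixed_columns)
```

Checks like these are cheap enough to run every time the training set is rebuilt, before any model sees the data.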

Addressing Inconsistencies

Solutions get more difficult when we reach the question of data consistency. What does consistency really mean? The root problem here is that the production and training data don’t align properly. Usually, there is some sort of drift in the data. Before acting, you should consider the cost of a wrong prediction. If it’s low, maybe you shouldn’t be making changes, because an inconsistent prediction isn’t the end of the world. If it’s high, on the other hand, you’ll want to act immediately and set up warnings in your pipeline. Labeling key portions of your incoming production data and training your model on these new samples is usually sufficient to solve the problem. This ensures that any drift in your underlying data is reflected in your model.

Realistically, however, continuously labeling new data and retraining a model is both extremely difficult and expensive; an alternative is to relabel and retrain in batches. Inconsistent data can appear through shifting distributions, categorical outliers, new features, and new subsets. Flagging these cases when they occur and setting retraining thresholds is the first step towards solving the problem. By acting proactively instead of reactively, you can decide for yourself which types of data points you’re willing to accept and the threshold at which you’re willing to accept them. Changing your real-time pipelines accordingly is an essential step towards preventing your model’s failures.
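To make the idea of flagging inconsistencies and setting retraining thresholds concrete, here is a small sketch. It assumes two simple drift signals: the share of production values in a categorical feature that never appeared in training, and the shift of a numeric feature’s mean measured in training standard deviations. The feature names, data, and threshold values are made up for illustration; suitable checks and thresholds depend on your use case and the cost of wrong predictions.

```python
from statistics import mean, stdev


def unseen_category_rate(train_values, prod_values):
    """Fraction of production values that never appeared in training."""
    if not prod_values:
        return 0.0
    known = set(train_values)
    unseen = sum(1 for value in prod_values if value not in known)
    return unseen / len(prod_values)


def mean_shift_in_std(train_values, prod_values):
    """Absolute shift of the production mean, in training standard deviations."""
    shift = abs(mean(prod_values) - mean(train_values))
    spread = stdev(train_values)
    if spread == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def checks_over_threshold(checks, thresholds):
    """Return the names of the checks whose score exceeds its threshold."""
    return [name for name, score in checks.items() if score > thresholds[name]]


# Hypothetical features and thresholds, for illustration only.
train_plan = ["basic", "pro", "basic", "pro", "basic"]
prod_plan = ["basic", "enterprise", "pro", "enterprise"]
train_latency = [10.0, 12.0, 11.0, 9.0, 13.0]
prod_latency = [18.0, 20.0, 19.0, 21.0]

checks = {
    "plan_unseen_rate": unseen_category_rate(train_plan, prod_plan),
    "latency_mean_shift": mean_shift_in_std(train_latency, prod_latency),
}
thresholds = {"plan_unseen_rate": 0.1, "latency_mean_shift": 2.0}

flagged = checks_over_threshold(checks, thresholds)
if flagged:
    print("Consider relabeling and retraining; checks over threshold:", flagged)
```

Running a check like this on each batch of production data turns relabeling and retraining into a decision triggered by a threshold rather than a reaction to a failure.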

Firefighting Fairness

By the time you’ve implemented the first two solutions, you might think that you’ve got most of it figured out. Unfortunately, the last question is likely the most difficult to address. We want to know both whether the dataset is properly representative of the true population and whether model performance is fair across different subsets. In this case, we want to be able to distinguish between the reasons that a model might perform poorly on one subset compared to another.

We can think of performance failures in terms of two underlying causes: problems with the training data and problems with the model task. On one hand, a model may underperform on a subset because that subset is underrepresented in the training dataset. For instance, training data can be biased by historical data collection methods and underrepresent key gender, race, or income subsets. During training, the model may learn only the patterns in the dominant subsets and disregard performance on the rare ones. Of course, subset size is no guarantee of performance, which is where the second explanation becomes important. If the distributions of different subsets in a dataset differ, certain subsets may have more complex patterns and, in turn, may be harder to train and predict on. As a result, regardless of the number of training samples, a model may still predict worse on them.
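A useful first step is simply to measure performance separately for each subset, alongside the subset’s size. The sketch below assumes a classification task and uses accuracy as the metric; the subset labels and predictions are invented for illustration, and the right metric will depend on your task.

```python
from collections import defaultdict


def accuracy_by_subset(subsets, y_true, y_pred):
    """Return {subset: (accuracy, sample_count)} for each subset label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subset, truth, prediction in zip(subsets, y_true, y_pred):
        total[subset] += 1
        correct[subset] += int(truth == prediction)
    return {subset: (correct[subset] / total[subset], total[subset]) for subset in total}


# Hypothetical evaluation data: subset "b" is small and poorly predicted.
subsets = ["a", "a", "a", "a", "a", "a", "b", "b"]
y_true = [1, 0, 1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

report = accuracy_by_subset(subsets, y_true, y_pred)
for subset, (accuracy, count) in sorted(report.items()):
    print(f"subset={subset} n={count} accuracy={accuracy:.2f}")
```

Reading accuracy and sample count together helps separate the two causes: a small subset with poor accuracy points towards underrepresentation, while a well-represented subset with poor accuracy suggests the subset itself is hard to learn.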

Stratified random sampling is a great way of addressing the issues with underrepresented subsets. There are two primary ways to go about this: proportionate sampling and disproportionate sampling. With proportionate sampling, rather than drawing training examples purely at random from our dataset, we sample each subset in the proportion in which it exists in the population. Simple random sampling can, by chance, leave low-count subsets even more underrepresented in the sample than they already are in the data, leading to further skew in model performance on these subsets. Ensuring these subsets are represented in proportional amounts forces the model to train on more of their examples, likely increasing performance for those subsets. The alternative, disproportionate sampling, can be used if the error is extreme. With a disproportionate method, we overrepresent poorly performing subsets by oversampling their examples in the training data. These subsets then influence the loss function more strongly than they did in the original training data, increasing the probability that a trained model learns their patterns accurately.
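Both sampling strategies can be sketched in a few lines of standard-library Python. The helper names, the 90/10 split, the sampling fraction, and the oversampling target below are all assumptions chosen for illustration rather than recommended values.

```python
import random
from collections import Counter, defaultdict


def group_by_subset(samples, key):
    """Group samples into lists keyed by their subset label."""
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)
    return groups


def proportionate_sample(samples, key, fraction, rng):
    """Draw the same fraction from every subset, keeping at least one example each."""
    picked = []
    for members in group_by_subset(samples, key).values():
        count = min(len(members), max(1, round(len(members) * fraction)))
        picked.extend(rng.sample(members, count))
    return picked


def oversample_subset(samples, key, subset, target_count, rng):
    """Disproportionate sampling: resample one subset with replacement up to target_count."""
    members = [sample for sample in samples if key(sample) == subset]
    if not members:
        return list(samples)
    extra = max(0, target_count - len(members))
    return list(samples) + rng.choices(members, k=extra)


def get_subset(sample):
    return sample["subset"]


# Hypothetical dataset: 90 majority examples and 10 minority examples.
rng = random.Random(0)
samples = [{"subset": "majority", "id": i} for i in range(90)]
samples += [{"subset": "minority", "id": i} for i in range(90, 100)]

proportionate = proportionate_sample(samples, get_subset, 0.2, rng)
oversampled = oversample_subset(samples, get_subset, "minority", 45, rng)

proportionate_counts = Counter(get_subset(s) for s in proportionate)
oversampled_counts = Counter(get_subset(s) for s in oversampled)
print("proportionate:", dict(proportionate_counts))
print("oversampled:", dict(oversampled_counts))
```

Note that oversampling with replacement repeats existing examples rather than creating new information, so it changes how much the subset counts during training without adding new patterns to learn from.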

The stratified sampling techniques above are part of a larger family of solutions that focus model training on poorly performing subsets. Adjusting sample weights is another example of such a technique. By upweighting samples from underrepresented subsets, we modify the model’s loss function to become more sensitive to those areas of the training space. Minimizing the loss function, therefore, means making better predictions in that space and on the corresponding subset.
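Here is a minimal sketch of how upweighting changes the loss. It assumes inverse-frequency weights (each subset’s total weight is made equal) and a binary log loss; both are illustrative choices, not the only way to weight samples, and the data is invented.

```python
import math
from collections import Counter


def inverse_frequency_weights(subsets):
    """Weight each sample by n / (number_of_subsets * size_of_its_subset)."""
    counts = Counter(subsets)
    n = len(subsets)
    return [n / (len(counts) * counts[subset]) for subset in subsets]


def weighted_log_loss(y_true, y_prob, weights):
    """Weighted average binary log loss, with probabilities clipped for stability."""
    eps = 1e-12
    total = 0.0
    for truth, prob, weight in zip(y_true, y_prob, weights):
        prob = min(max(prob, eps), 1 - eps)
        total += -weight * (truth * math.log(prob) + (1 - truth) * math.log(1 - prob))
    return total / sum(weights)


# Hypothetical data: the model is confident on subset "a" but poor on subset "b".
subsets = ["a"] * 8 + ["b"] * 2
y_true = [1] * 10
y_prob = [0.9] * 8 + [0.4] * 2

weights = inverse_frequency_weights(subsets)
unweighted = weighted_log_loss(y_true, y_prob, [1.0] * len(y_true))
weighted = weighted_log_loss(y_true, y_prob, weights)
print(f"unweighted loss={unweighted:.3f} weighted loss={weighted:.3f}")
```

With these weights, the errors on the small subset contribute as much to the loss as the much larger subset does, so a training procedure that minimizes the weighted loss has more incentive to fix them.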

Still, the techniques above are not guaranteed to work in all cases. When we consider the second underlying cause of performance failure, a difficult-to-learn subset, resampling and reweighting techniques are not very helpful. This is largely because even training exclusively on such a subset (without attention to anything else) is not guaranteed to improve model performance much. The subset simply has too complex a pattern.

In this case, collecting more data for the training set is usually the way to go. More examples of the difficult subset give the model a better chance of picking up its complex patterns.


Diagnosing and resolving model failures is difficult. Even if you know where to look, finding out if something is wrong can be a big task. You need to check your training data, production data, model, and all interactions between the three. Problems can pop up anywhere, after all. At Robust Intelligence, we make sure that you don’t have to worry about most of these things, and we make finding the solutions easier. Next time you’re frustrated with your model performance or don’t even know how it is performing, keep these steps and keep us in mind. We’d love to help.

