Prototyping a machine learning model in a Jupyter notebook, training it on a given dataset, and evaluating it on a test set using mean accuracy metrics: these are standard practices today for machine learning engineers and data scientists. Yet moving such a model into production raises expectations of its reliability and robustness. Models that seem to perform well during prototyping can run into a host of failure modes in a production setting.
In this blog post, we'll describe three common failure modes of ML models that data scientists often don't account for when they first prototype a model, and that may then come back to bite you in production! These failure modes are model input failures, performance bias failures, and robustness failures.
Model Input Failures
Setting up a functioning data pipeline is a necessary prerequisite for any company looking to use machine learning in a production setting. Unique to a machine learning pipeline is the fact that an upstream change in data processing can have hidden, negative, downstream effects on model performance and future data collection [2, 3].
This can happen in a variety of ways. A data engineer could unintentionally change the distribution of a feature by modifying how it's computed, or introduce a bug that causes features to be rendered as NaNs. A data pipeline change could lead to your feature values being treated as strings rather than floats. The end users themselves could supply gibberish or invalid values that end up being used as features.
We observe that standard machine learning packages do not provide an out-of-the-box way to check whether the data is “valid” before returning a result. Validity can take numerous forms:
- Numeric values fall within an acceptable range.
- Predictions are invariant to type conversions of the input.
- Every feature the model expects is present in the input.
In these situations, machine learning packages behave inconsistently and do not alert the user when an input feature clearly violates validity constraints.
Machine learning packages do not check for input validity. Imagine that a user enters an "Age" feature of 100,000 or accidentally converts the currency in the "Amount Payable" feature from USD to Yen. Chances are your ML model won't say anything about it! It'll happily return a prediction even though the input is clearly invalid.
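One way to catch inputs like these is to check each row against a small feature schema before it reaches the model. The following is only a minimal sketch: the `NumericFeature` class, the `SCHEMA` dictionary, the feature names, and the bounds are all hypothetical and chosen purely for illustration.

```python
import math
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class NumericFeature:
    """Hypothetical schema entry: the accepted range for one numeric feature."""
    min_value: float
    max_value: float


# Illustrative schema; names and bounds are assumptions, not real production values.
SCHEMA: Dict[str, NumericFeature] = {
    "age": NumericFeature(0, 120),
    "amount_payable_usd": NumericFeature(0, 1_000_000),
}


def validate_row(row: Dict[str, Any], schema: Dict[str, NumericFeature]) -> List[str]:
    """Return a list of violations; an empty list means the row passed every check."""
    errors = []
    for name, spec in schema.items():
        if name not in row:
            errors.append(f"{name}: missing")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{name}: expected a number, got {type(value).__name__}")
        elif isinstance(value, float) and math.isnan(value):
            errors.append(f"{name}: NaN")
        elif not spec.min_value <= value <= spec.max_value:
            errors.append(f"{name}: {value} outside [{spec.min_value}, {spec.max_value}]")
    return errors
```

Calling `validate_row({"age": 100000}, SCHEMA)` reports both the out-of-range age and the missing amount, so the caller can reject the request or log it instead of silently returning a prediction.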
Subtle changes to the input can adversely affect model output. As an example, the data type of a given feature could change from a numeric representation of the value (e.g. 123) to a string representation of the same value (e.g. “123”). Surprisingly, we found that many existing ML packages are not robust to such changes. While sklearn produces the same prediction in the presence of such a type change, other libraries do not: LightGBM, XGBoost, TensorFlow, and PyTorch all produce modified predictions when a given feature changes from numeric to string.
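A defensive option is to normalize types explicitly at the model boundary, so that a numeric string is either converted in a visible way or rejected, rather than being left to each library's own handling. Below is a rough sketch under that assumption; `EXPECTED_TYPES`, `coerce_row`, and the feature names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical features and the type the model was trained on for each.
EXPECTED_TYPES: Dict[str, Callable[[Any], Any]] = {
    "account_age_days": float,
    "num_logins": float,
}


def coerce_row(
    row: Dict[str, Any], expected_types: Dict[str, Callable[[Any], Any]]
) -> Tuple[Dict[str, Any], List[str]]:
    """Convert each feature to its expected type and record every string coercion."""
    coerced: Dict[str, Any] = {}
    warnings: List[str] = []
    for name, expected in expected_types.items():
        value = row[name]
        if isinstance(value, str):
            try:
                coerced[name] = expected(value)
            except ValueError:
                # Gibberish such as "abc" is rejected instead of reaching the model.
                raise ValueError(
                    f"{name}: cannot parse {value!r} as {expected.__name__}"
                ) from None
            # The conversion succeeded, but the type drift is still reported.
            warnings.append(f"{name}: str coerced to {expected.__name__}")
        else:
            coerced[name] = expected(value)
    return coerced, warnings
```

With this in place, `coerce_row({"account_age_days": "123", "num_logins": 4}, EXPECTED_TYPES)` yields numeric values plus a warning that a string was coerced, which gives the pipeline owner a signal that something changed upstream.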
It turns out that most teams never get around to validating their inputs. Breck et al. [1] noted that over 80% of teams at Google working on ML-centric projects neglected to implement tests on their data and model pipeline (including this very idea of having a feature schema!), even while the majority concurred that such tests were important. Data and input validation is clearly a tedious, challenging problem that you would probably rather offload to another platform.
Performance Bias Failures
The second failure mode is one that data scientists often don't test for in their evaluation pipeline: performance bias failures. For instance, the standard objective used for training a binary classification model is the log-likelihood, averaged over every example in the dataset. In a similar vein, metrics used for evaluating binary classification models include zero-one accuracy, F1 score, and the AUC for the ROC and PR curves, all of which are computed over the entire evaluation dataset.
These default approaches to training and evaluation, which focus on mean error, can hide model biases against certain subgroups. There is a large literature on ML fairness [4, 5] analyzing the many sources of bias, from data to model architecture. We briefly describe a few implications of model bias:
- Discrimination based on gender or race. As an example, Bolukbasi et al. [4] found that word embeddings trained on Google News articles can broadly reflect gender stereotypes, notably with occupations. Deploying such a model in a production setting would only serve to amplify gender biases and has huge ethical ramifications.
- Hidden longer-term effects from certain subsets. If a recommender system model underperforms for the subset of new users, this underperformance will compound over time and potentially lead to longer-term engagement/retention drops.
- Unexpected discrepancies between data subsets. Underperformance on a given feature value can seem inexplicable and surprising. For instance, suppose that a fraud detection model performs much better on the device_type="iPhone" subset than on the device_type="Android" subset. It is important to surface this to the data scientist/developer so that they can dig into the source of the discrepancy: perhaps the label distribution differs substantially between device types, or perhaps additional features need to be added.
Performance bias failures reflect an ongoing concern over fairness in the machine learning community, and affect even the biggest tech companies. In 2019, a Washington Post article reported that Apple's black box algorithm offered a Danish entrepreneur a credit limit 20 times higher than that of his wife, despite his wife having a higher credit score. Yet perhaps even more broadly, all of the above failures can also be symptoms of the general ML challenge of modeling the long tail, which affects everything from search traffic to popular image datasets [6]: the most popular categories are included and the rest are ignored.
We must have rigorous evaluation tools to ensure that we can detect when a subgroup is being ignored by the model.
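As a rough illustration of what such a check could look like, the sketch below computes accuracy per subgroup and flags any subgroup that trails the overall accuracy by more than a chosen gap. The function names, the `max_gap` threshold, and the toy data are assumptions made for this example only.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def accuracy_by_group(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> Dict[Hashable, float]:
    """Compute zero-one accuracy separately for each subgroup value."""
    correct: Dict[Hashable, int] = defaultdict(int)
    total: Dict[Hashable, int] = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def flag_underperforming(
    per_group: Dict[Hashable, float], overall: float, max_gap: float = 0.1
) -> List[Hashable]:
    """Return subgroups whose accuracy trails the overall accuracy by more than max_gap."""
    return sorted(
        (g for g, acc in per_group.items() if overall - acc > max_gap), key=str
    )


# Toy data: the model is right on every "iPhone" row and wrong on every "Android" row.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
groups = ["iPhone"] * 8 + ["Android"] * 2

overall = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(y_true, y_pred, groups)
flagged = flag_underperforming(per_group, overall)
```

In this toy data the overall accuracy is 0.8, which looks acceptable on its own, while the small Android subgroup has an accuracy of 0.0 and is the only one flagged.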
Robustness Failures
Finally, we arrive at the last failure mode: robustness failures. This is a fun one because it's closely related to the popular research area of adversarial attacks. We've all seen some variant of this in the past (source: OpenAI):
[Figure: adversarially perturbed panda image]
As an example, imagine that you perform perturbation analysis for a single datapoint on a single feature, where the goal is to find the minimal perturbation that causes the largest change in the model score. These types of failures are not just problematic in an adversarial setting; the fact that they can be exploited by malicious actors is only one side of the story. More broadly, they indicate a general lack of robustness that suggests the model may not generalize well to new or unseen data points.
When we perform this analysis, we find that even run-of-the-mill machine learning models (random forests, gradient boosted trees, simple MLPs) contain sharp changes in the decision boundary for unexpected feature values! We can uncover even more interesting insights when we perturb a set of features rather than just a single feature: perhaps certain combinations of feature values uniquely cause the model output to change drastically.
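A single-feature version of this analysis can be sketched as a simple sweep over candidate values. This is only an illustration: `toy_model` is a hypothetical stand-in for a trained model's scoring function, and the feature names and candidate values are made up.

```python
from typing import Callable, Dict, Sequence, Tuple

Row = Dict[str, float]


def largest_score_change(
    model: Callable[[Row], float], row: Row, feature: str, candidates: Sequence[float]
) -> Tuple[float, float]:
    """Return (value, delta) for the candidate value that changes the score the most.

    Ties are broken in favor of the value closest to the original one,
    i.e. the smallest perturbation that achieves the largest change.
    """
    base = model(row)
    original = row[feature]
    best_value, best_delta = original, 0.0
    for value in candidates:
        perturbed = dict(row)  # copy so the caller's row is left untouched
        perturbed[feature] = value
        delta = model(perturbed) - base
        bigger = abs(delta) > abs(best_delta)
        tie_but_closer = (
            abs(delta) == abs(best_delta)
            and abs(value - original) < abs(best_value - original)
        )
        if bigger or tie_but_closer:
            best_value, best_delta = value, delta
    return best_value, best_delta


def toy_model(row: Row) -> float:
    """Hypothetical scorer with a sharp jump once `amount` exceeds 1000."""
    return 0.75 if row["amount"] > 1000 else 0.25


row = {"amount": 990.0, "num_items": 3.0}
result = largest_score_change(toy_model, row, "amount", [980, 995, 1005, 1050])
```

Here a shift of only 15 units in `amount` moves the score by 0.5, which is the kind of sharp boundary the analysis is meant to expose. Extending the sweep to pairs or sets of features follows the same pattern, at a higher computational cost.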
Why should I care about robustness failures? There are a few reasons. First, as illustrated with the panda image above, such perturbations induce mistakes by your model. There are regions in the input space where your model clearly does not provide the expected output. At best, this can lead to a frustrating user experience and a loss of trust in the system; at worst, these vulnerabilities can be exploited by an outside attacker. Moreover, robustness failures typically indicate sharp changes in your output space, and a high degree of nonlinearity can be indicative of overfitting. Robustness failures can also be another indication of performance bias: imagine swapping the "Gender" feature of an income prediction model, with all else being equal, and finding that the predictions change drastically.
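The feature-swap check mentioned above can be written as a small invariance test. The sketch below is an assumption-laden illustration: `biased_income_model` is a deliberately biased, hypothetical stand-in for a real model, and the feature names and values are invented.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def mean_swap_effect(
    model: Callable[[Row], float], rows: List[Row], feature: str, swap: Dict[Any, Any]
) -> float:
    """Average absolute prediction change when `feature` is replaced via `swap`,
    holding every other feature fixed."""
    changes = []
    for row in rows:
        flipped = dict(row)  # copy so the original row is not modified
        flipped[feature] = swap[row[feature]]
        changes.append(abs(model(flipped) - model(row)))
    return sum(changes) / len(changes)


def biased_income_model(row: Row) -> float:
    """Hypothetical model that (wrongly) adds a bonus based on gender alone."""
    bonus = 5000 if row["gender"] == "M" else 0
    return 1000 * row["years_experience"] + bonus


rows = [
    {"gender": "M", "years_experience": 5},
    {"gender": "F", "years_experience": 5},
]
effect = mean_swap_effect(biased_income_model, rows, "gender", {"M": "F", "F": "M"})
```

A model that ignores the swapped feature would give an effect of zero; here every prediction shifts by 5000, which flags the dependence for further investigation.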
In short, model input failures, performance bias failures, and robustness failures are three key considerations when comprehensively evaluating whether an ML model is suitable for production. The tools and habits of data science practitioners generally do not account for these failures; we believe that additional tooling and effort must be devoted to expanding model evaluation in order to help identify failures in model bias and robustness.
[1] Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction.”
[2] Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, Lora Aroyo. “‘Everyone wants to do the model work, not the data work’: Data Cascades in High-Stakes AI.”
[3] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison. “Hidden Technical Debt in Machine Learning Systems.” NIPS’15.
[4] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” NIPS’16.
[5] Moritz Hardt, Eric Price, Nathan Srebro. “Equality of Opportunity in Supervised Learning.” NIPS’16.
[6] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, Stella X. Yu. “Large-Scale Long-Tailed Recognition in an Open World.” CVPR’19.