In recent years, adoption of machine learning technology has accelerated rapidly. Large tech companies now train, deploy, and run hundreds if not thousands of models every single day. Even beyond the tech industry, companies have begun bringing model development in-house thanks to the democratization of machine learning tools and infrastructure. As more and more companies develop AI systems, however, one critical question remains unanswered:
How do you know your AI system is ready to be deployed into production?
Take a minute to think about how your team approaches this question today. Based on all the conversations we’ve had with our partners here at Robust Intelligence, I’m willing to bet your answer goes something like this:
- We look at model summary metrics like accuracy, false positive rate, or AUC.
- We calculate these metrics not only on the training data, but also on unseen test data.
- We use methods like cross-validation to get more reliable estimates of these metrics.
Don’t measure your AI systems with a single metric
While this is considered the industry standard today, such an approach is fundamentally limited in providing a comprehensive measure of your AI system’s production-readiness. By reducing model behavior to a few simple numbers, you might completely overlook numerous possibilities for model failure in production settings. Here are a few common shortcomings of the summary metric approach we’ve seen firsthand with our customers:
It does not capture model behavior with enough granularity
After all, these metrics measure the model’s average performance, aggregated over every instance in the dataset. This aggregated view can mask specific subsets of data on which the model consistently performs poorly, namely subsets that correspond to a specific segment of real-world users, items, or actions. Even if the model performs well “on average,” failing to provide accurate results for individual segments can lead to dissatisfaction and churn.
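To make the masking effect concrete, here is a minimal sketch in plain Python. The segment names ("desktop", "mobile") and the toy predictions are made up for illustration, and `accuracy_by_segment` is a hypothetical helper, not part of any particular library.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """Compute accuracy separately for each segment.

    `records` is an iterable of (segment, y_true, y_pred) tuples.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for segment, y_true, y_pred in records:
        totals[segment] += 1
        correct[segment] += int(y_true == y_pred)
    return {segment: correct[segment] / totals[segment] for segment in totals}


# Hypothetical toy data: 90 desktop users (all predicted correctly)
# and 10 mobile users (only 2 predicted correctly).
records = (
    [("desktop", 1, 1)] * 90
    + [("mobile", 1, 1)] * 2
    + [("mobile", 1, 0)] * 8
)

# The aggregate number looks healthy...
overall = sum(y_true == y_pred for _, y_true, y_pred in records) / len(records)

# ...while the per-segment view shows that mobile users are poorly served.
per_segment = accuracy_by_segment(records)

print(f"overall accuracy: {overall:.2f}")
for segment, acc in per_segment.items():
    print(f"{segment}: {acc:.2f}")
```

In this toy example the overall accuracy is 92%, yet the model is right for only 2 out of 10 mobile users. Which segments are worth slicing on depends on your data and your business.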
It treats machine learning as a black-box model rather than a system
Your AI system is not your Jupyter notebook. It’s a significantly more complex pipeline of components ranging from an API interface to services handling input validation, data preprocessing, storage, and so on. Each of these components has the potential to introduce new bugs into the system. Evaluating the performance of a machine learning model in an isolated research environment does not provide any protection from errors or engineering-related bugs that may be present in other parts of the pipeline.
It assumes the model will always be queried in a certain, predictable way
No matter the use case, AI systems are almost always used by real people, leaving them vulnerable to intentional or unintentional human error. Consider, for example, an internal machine learning service trained and deployed by a team of data scientists. Engineers from other parts of the company may inadvertently misuse the model by inputting data in a format that the data scientists never anticipated. It’s even harder to make guarantees about usage of external-facing services. In such cases, the end users may seek to exploit the model for their own benefit by feeding in adversarial inputs, or even try to intentionally cause system failures or crashes. Metrics like accuracy cannot capture corner cases, nor can they control for the noisy or adversarial data that you should expect to see in production.
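As a rough illustration of the kind of unexpected inputs described above, the sketch below runs a handful of corner cases through a simple input check. Everything here is an assumption made for the example: the feature names (`age`, `income`), their allowed ranges, and the `validate_features` helper are hypothetical, and a real service would need checks tailored to its own schema.

```python
import math


def validate_features(row, schema):
    """Return a list of problems found in `row`; an empty list means the row passed."""
    problems = []
    for name, (low, high) in schema.items():
        if name not in row:
            problems.append(f"{name}: missing")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: expected a number, got {type(value).__name__}")
        elif math.isnan(value) or not low <= value <= high:
            problems.append(f"{name}: {value} is outside [{low}, {high}]")
    return problems


# Hypothetical schema: feature name -> (minimum, maximum) allowed value.
SCHEMA = {"age": (0, 120), "income": (0, 1e7)}

# Inputs a caller might plausibly send, by accident or on purpose.
corner_cases = {
    "valid": {"age": 35, "income": 52000.0},
    "number_as_string": {"age": "35", "income": 52000.0},
    "missing_field": {"age": 35},
    "out_of_range": {"age": -1, "income": 52000.0},
    "not_a_number": {"age": 35, "income": float("nan")},
}

report = {name: validate_features(row, SCHEMA) for name, row in corner_cases.items()}
for name, problems in report.items():
    print(name, "->", problems or "ok")
```

None of these cases would show up in an accuracy score computed on a clean test set, which is why they are worth testing separately before deployment.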
It disregards potential system-level security vulnerabilities
Just like any other mission-critical software, ML systems should be routinely scanned and tested for cybersecurity vulnerabilities that adversaries may exploit. For example, pickle, a Python serialization format widely used by data scientists to save models, can pose serious security risks to your organization: loading a pickle file from an untrusted source can execute arbitrary code. Given that many models operate on highly sensitive or confidential data, it’s even more vital to routinely review and test your ML development practices on cybersecurity grounds.
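The pickle risk is easy to demonstrate with the standard library alone. In the sketch below, the `Payload` class is a harmless stand-in for a tampered model file: it only calls `print`, but an attacker could name any importable callable instead. The `NoGlobalsUnpickler` class and `restricted_loads` function are hypothetical names for one possible mitigation, based on the "restricting globals" approach described in the Python documentation. It narrows what can be loaded, but it is not a complete defense.

```python
import io
import pickle


class Payload:
    """Harmless stand-in for a tampered model file."""

    def __reduce__(self):
        # pickle will call print(...) when this object is loaded.
        # A real attacker could reference a far more dangerous callable.
        return (print, ("this ran during pickle.loads()",))


blob = pickle.dumps(Payload())

# Loading the bytes executes the callable; no Payload object comes back.
result = pickle.loads(blob)


class NoGlobalsUnpickler(pickle.Unpickler):
    """Refuse to resolve any global, so only plain data types can be loaded."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def restricted_loads(data):
    """Hypothetical helper: unpickle `data` with all globals blocked."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


# Plain containers and numbers still load fine...
safe_value = restricted_loads(pickle.dumps({"weights": [0.1, 0.2]}))

# ...but the payload above is rejected before anything is called.
try:
    restricted_loads(blob)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print("payload blocked:", blocked)
```

Blocking every global is too strict for most real model files, which typically need to reconstruct library classes. In practice you would only load pickles from sources you trust, or choose a serialization format that cannot carry executable instructions.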
At first glance, it might seem daunting to consider all of these failure modes before deploying your model into the wild. But proactive assessment of AI systems is critical to prevent your business from incurring significantly larger costs down the line. In most cases, by the time you uncover an error in production, the damage from poor model performance or an exposed security vulnerability has already been done. To make matters worse, AI systems are notoriously complex and typically involve several different roles such as software engineers, data engineers, data scientists, and machine learning engineers. The entire time these teams are coordinating to diagnose the failure and build the proper controls, prolonged model downtime eats away at your company’s bottom line. Time-consuming reactive firefighting also jeopardizes your planned roadmap, weakening your ability to execute on longer-term investments.
Why is testing AI systems so hard?
Unfortunately, exhaustively stress-testing your model before deployment is easier said than done. Data scientists and machine learning engineers are in short supply and rarely have the bandwidth to secure models on top of their existing responsibilities. And compared to calculating basic model evaluation metrics, tasks such as testing model behavior on corner cases or defending against model misuse are significantly more complex and open-ended, often requiring domain-specific expertise.
Solution: Platform for AI Security and Integrity
At Robust Intelligence, we develop technologies to address exactly the challenges outlined above. Our enterprise offering RIME (Robust Intelligence Model Engine) provides a comprehensive platform for eliminating AI failures by stress-testing your AI systems and ensuring they’re truly production-ready. RIME lets you do the following with minimal to no extra engineering effort:
- Test model behaviors: Preemptively assess model performance in production-type environments by feeding in common pitfall test cases and algorithmically crafted data samples.
- Evaluate model robustness: Analyze model responses to sensitive and corner-case inputs and identify vulnerable data patterns.
In short, RIME closes the loop between model development and model deployment. If you've worked on a software team, this is the stage where you'd typically run QA and various related tests (e.g. unit testing, integration testing, penetration testing) on your system. We believe that ML system development should embody the same, if not higher, level of rigor in assessing production-readiness.
In this post, we mainly focused on the post-training, pre-production phase of AI development. In future posts, we will also be writing about how RIME can overcome problems in other phases of the model life cycle (e.g. pre-training, post-deployment). If you’re interested in learning more about our platform or making your models production-ready, email me anytime at email@example.com.