Operational Risks of AI — Get Rid of Them, Now

Kojin Oshiba

Kojin is a co-founder of Robust Intelligence.

AI is the future of every business. The availability of large-scale data and computing power is making AI technologies transformational for organizations. AI is steadily trickling into industries far removed from high tech, creating entirely new categories of products and possibilities.

The Pain: Operational Risk of AI

Everything comes with a cost, however, and AI is not an exception. While the benefits of AI are immense, it also introduces serious operational risks. Here are a few examples:

  • Broken data pipelines feed corrupted data into models, producing garbage outputs
  • Bugs introduced in model serving cause the whole system to crash
  • Models are misused by engineers outside your data science organization
  • Corner-case inputs you didn't account for during development break the model in production
  • Drift in the data significantly degrades the performance of your models
  • Models make discriminatory decisions without you being aware of it
  • Bad actors try to "hack" model decisions by feeding in malicious inputs

The list goes on and on... These are all symptoms of the same underlying disease: the operational risks of AI.
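
To make the first two failure modes concrete, here is a minimal sketch, in Python, of the kind of input validation a team might bolt onto a pipeline to catch corrupted data before it reaches a model. The column names, ranges, and thresholds are all hypothetical, and this is purely illustrative rather than a description of any particular product.

```python
import pandas as pd

# Hypothetical schema for a batch of model inputs: column -> (dtype, allowed range).
EXPECTED_SCHEMA = {
    "age": ("int64", (0, 120)),
    "account_balance": ("float64", (-1e6, 1e9)),
    "num_logins_7d": ("int64", (0, 10_000)),
}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems found in an input batch."""
    problems = []
    for col, (dtype, (lo, hi)) in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected dtype {dtype}, got {df[col].dtype}")
            continue  # skip range checks on a column with the wrong type
        null_rate = df[col].isna().mean()
        if null_rate > 0.01:
            problems.append(f"{col}: {null_rate:.1%} null values")
        out_of_range = ((df[col] < lo) | (df[col] > hi)).mean()
        if out_of_range > 0:
            problems.append(f"{col}: {out_of_range:.1%} values outside [{lo}, {hi}]")
    return problems

# Usage: refuse to score a batch that fails validation instead of serving garbage.
# problems = validate_batch(incoming_batch)
# if problems:
#     raise ValueError("refusing to score corrupted batch: " + "; ".join(problems))
```

Checks like this are easy enough to write for a single pipeline; the hard part, as we'll see below, is doing it exhaustively and consistently across every model in an organization.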

Do any of these sound familiar to you? Chances are, if you've been involved with data science or machine learning, regardless of the industry or the companies you've worked at, you've faced many of the problems above. I can say this with confidence: since the birth of the company, we've had countless conversations with AI practitioners in tech, finance, insurance, and government, and they consistently named the issues listed above as the key challenges their AI teams face. We have also been the victims of these operational risks ourselves. Many members of the Robust Intelligence team have experienced this firsthand at companies ranging from large tech (Google, Uber, Salesforce) to mid-size tech (Wish, Postmates, Quora) to startups and digital consulting firms. The operational risks of AI are prevalent and will only worsen as more companies adopt AI, build AI teams of increasing scale, and develop and deploy more models on more data.

Ignore operational risks of AI at your own peril

Are the operational risks of AI really that bad? If you're not yet convinced they're a serious problem, consider some of the consequences of leaving them unaddressed in your AI systems:

First, your data pipelines and model systems will break. With issues like bugs and broken data pipelines, your AI system will crash, literally. Not only will it break, it will break all the time. If you've worked anywhere along the spectrum from data infrastructure engineering to model prototyping and productionization, you know how fragile these systems are. Data and ML pipelines are always under active development, and the characteristics of the data change constantly.

Consequently, you will have to firefight these issues in production, leaving no room for focused development work. You and your team will waste precious time digging through error logs and hunting for the root cause of the problem, all while your model keeps crashing. How wasteful and nerve-wracking is that!

Even when you've fixed all the visible errors in your pipeline, you have only solved a subcomponent of the bigger problem. Perhaps an even more pernicious form of operational risk is the silent error. The tricky thing with AI models is that even when they're taking in garbage input or producing garbage output, they're not necessarily going to crash. For example, when the model is doing terribly on a specific subset of the data, or when the distribution of the input data is shifting drastically and inducing wrong model predictions, you won't see any error logs or PagerDuty alerts in your system monitoring dashboard by default. These silent errors are tough to triage and have subtle but compounding effects on your downstream metrics. The model will keep producing garbage predictions silently until, a month later, you realize your customer churn is higher than ever.

Figure: silent errors occur when garbage-in, garbage-out behavior is not captured as a system failure.
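
As a rough illustration of what catching one class of silent error could look like, here is a short sketch of a per-feature drift check using a two-sample Kolmogorov-Smirnov test. The feature names and the p-value threshold are hypothetical, and a real setup would feed the output into alerting rather than simply returning it.

```python
import numpy as np
from scipy import stats

def drift_report(train_features: dict[str, np.ndarray],
                 prod_features: dict[str, np.ndarray],
                 p_threshold: float = 0.01) -> dict[str, float]:
    """Flag numeric features whose production distribution has drifted from training.

    Uses a two-sample Kolmogorov-Smirnov test per feature; a small p-value
    means the two samples are unlikely to come from the same distribution.
    """
    drifted = {}
    for name, train_vals in train_features.items():
        prod_vals = prod_features.get(name)
        if prod_vals is None or len(prod_vals) == 0:
            continue
        _, p_value = stats.ks_2samp(train_vals, prod_vals)
        if p_value < p_threshold:
            drifted[name] = p_value
    return drifted

# Nothing in the serving path raises an exception when drift happens: the model
# keeps returning predictions either way. Without a scheduled check like this
# wired to an alert, the drift only surfaces weeks later as degraded
# downstream metrics.
```

The point is not this particular test; it is that silent errors only become visible when something outside the model is explicitly looking for them.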

Why is it so hard to get rid of these risks?

Most of the time, the priorities of data science teams are elsewhere, e.g., developing more performant models, generating a new set of features, or improving the latency of the model service. Data scientists and machine learning engineers will, at best, get a few hours a week to think about these risks. As a result, the risks are only partially tackled in manual ways, and they continue to pile up.

Data scientists, in particular, tend to focus on ad-hoc efforts at model improvement. Yet this means data science teams will never eliminate operational risks at the organizational level. If one data scientist validates model behavior differently than another, it becomes hard to tell whether a model is production-ready or whether a deployed model is performing as expected.

Finally, getting rid of AI operational risks is dang hard. While software engineering has well-established practices for testing and documentation, machine learning introduces its own complexities, hidden dependencies, and anti-patterns unique to data pipelines and AI models (Sculley et al.).

There needs to be a way to measure operational risks of models across your organization in a unified manner. However, this entails both AI and engineering challenges:

  • AI challenge: how would you measure and mitigate operational risks of AI across your models exhaustively, effectively, and consistently?
  • Engineering challenge: how would you build infrastructure that ensures both in-development and deployed models are continuously evaluated for operational risks?

These challenges are extremely tricky, and it is nearly impossible to overcome them while also developing the AI models your business actually needs.
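
To give a flavor of what a unified answer might look like, here is a rough sketch of unit-test-style checks that could run both in CI before a model ships and on a schedule against the deployed model. The model interface, the "region" slice column, and the thresholds are all hypothetical; the point is that the same checks apply to every model, rather than each data scientist asserting model behavior in their own way.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical thresholds an organization might standardize on for every binary classifier.
MIN_OVERALL_AUC = 0.80
MAX_SLICE_AUC_GAP = 0.05

def check_overall_auc(model, eval_df: pd.DataFrame) -> None:
    """The model must clear a minimum AUC on a held-out evaluation set."""
    features = eval_df.drop(columns=["label", "region"])
    scores = model.predict_proba(features)[:, 1]
    auc = roc_auc_score(eval_df["label"], scores)
    assert auc >= MIN_OVERALL_AUC, f"overall AUC {auc:.3f} below {MIN_OVERALL_AUC}"

def check_auc_by_region(model, eval_df: pd.DataFrame) -> None:
    """No region's AUC may lag the overall AUC by more than the allowed gap.

    This is the kind of check that catches a model doing terribly on a
    specific subset of the data even when aggregate metrics look fine.
    """
    features = eval_df.drop(columns=["label", "region"])
    scored = eval_df.assign(score=model.predict_proba(features)[:, 1])
    overall = roc_auc_score(scored["label"], scored["score"])
    for region, group in scored.groupby("region"):
        slice_auc = roc_auc_score(group["label"], group["score"])
        gap = overall - slice_auc
        assert gap <= MAX_SLICE_AUC_GAP, f"region {region!r} lags overall AUC by {gap:.3f}"

# In practice, checks like these would be wrapped in a test runner (e.g., pytest)
# for pre-production models and in a scheduler for post-production monitoring.
```

Writing two such checks for one model is easy; defining, maintaining, and enforcing a comprehensive suite of them across every model and data pipeline in an organization is exactly the AI-plus-engineering problem described above.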

Let's eliminate the operational risks of AI, together

The good news is that you're not tackling this problem alone, not anymore. At Robust Intelligence, we've translated years of research and industry experience into the Robust Intelligence Model Engine (RIME), built with a single goal: eliminating the operational risks inherent in AI systems. The platform provides two complementary tools: automated unit testing of pre-production models and automated quality assurance of in-production models, ensuring that your AI system is risk-free. I'll keep the product intro brief here, as the main purpose of this post is to introduce the concept of operational risks in AI and convince you of their seriousness. In our upcoming posts, we'll discuss the underlying principle that drives our product and why it's so effective at eliminating the operational risks of AI. In the meantime, if you'd like to learn more, feel free to reach out to Kojin Oshiba at kojin@robustintelligence.com.
