Head in the Clouds: Designing the RI On-Cloud/On-Prem Deployment

Mukil Loganathan

Mukil is a Software Engineer at Robust Intelligence.

If you are a data scientist or follow tech news at all, chances are you are no stranger to AI failures. Over the last year, we have seen cases like the Zillow iBuying scandal, where even large multi-billion-dollar companies fall prey to them. Pretty scary, right? Companies around the world clearly see this as a pressing issue: the Robust Intelligence (RI) team has seen tremendous interest in the RI Model Engine (RIME) stress testing, continuous testing, and firewall products.

RIME Lite?

With the team launching dozens of RIME Lite trials per month across a variety of machine learning domains and industries, you may be wondering, “What is RIME Lite exactly?” As the name suggests, RIME Lite is a lightweight, no-frills way to explore RIME’s proprietary stress testing engine. A data scientist can quickly spin up RIME on their own machine and stress test their models in a matter of minutes. RIME Lite has received great feedback from the data science teams we work with, but because it is not a hosted solution, it has some limitations. Many of our customers told us they would love a more collaborative, full-featured version of the product, which naturally spawned the idea for a RIME cloud/on-prem solution.

Cloud/On-Prem vs SaaS:

One might ask: why choose On-Cloud/On-Prem over a more in-vogue SaaS offering? The team thought long and hard and zeroed in on localized deployments for three main reasons:

  1. Data Privacy: Many of our customers work with extremely sensitive data and pour hundreds of thousands of dollars into developing new models. Naturally, they are reluctant to send sensitive information outside of their network, so many customers prefer localized solutions.
  2. Customizability: Our customers often have different usage patterns/use cases for RIME. One customer may want to run nightly model stress tests while another customer wants data scientists to integrate RIME into Databricks pipelines. Deploying directly onto a customer’s infrastructure enables us to tailor a RIME deployment to fit the customer’s specific needs.
  3. Integration: By deploying onto customer infrastructure, our deployment automatically integrates with existing security schemes, logging mechanisms, VPNs, etc. This helps us get our product into the hands of data scientists as quickly as possible.

While there are clearly many advantages in building out localized deployments, architecting the stack and building out the installation process was easier said than done.

With great power comes great responsibility:

As the saying goes, the devil is in the details. We needed to create a service that is:

  1. Flexible: Because we deploy onto customer infrastructure, we must support a variety of infrastructure stacks. With customers using multiple cloud platforms, differing tool versions, and several operating systems, the RIME Teams solution had to work in a world where no two environments are exactly alike.
  2. Backward Compatible: Since everything is hosted by the customer, the RI team does not have direct access to any of the customer’s infrastructure. This makes updates and major releases that much more difficult. We also iterate very quickly, sometimes faster than customers like to update, meaning that we need to stay backward compatible across several versions.
  3. Secure: We cannot secure production models with insecure software or infrastructure. Any deployment needs to be airtight, with tightly controlled access points to prevent accidental leakage of data.
  4. Scalable: Due to the collaborative nature of the production system, there can be dozens of data scientists working with RIME at a time. To top it off, some datasets/models can require substantial amounts of compute and memory to stress test. We need a solution that can reliably handle this kind of load without making users wait hours to see the results of a single test run.
  5. Easy to deploy/use: Finally, we wanted to keep the deployment quick and easy to install. No data scientist wants to spend days setting up a step in their model pipeline.
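To make the backward-compatibility requirement above a little more concrete, here is a minimal Python sketch of a version support-window check. It is purely illustrative and not RIME's actual mechanism: the function names, the `major.minor.patch` version scheme, and the three-minor-version window are all assumptions made for this example.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into integers (assumed scheme)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_supported(client: str, server: str, minor_window: int = 3) -> bool:
    """Return True if an older client can still talk to the server.

    Hypothetical policy: same major version, and the client is at most
    `minor_window` minor releases behind the server (never ahead of it).
    """
    client_major, client_minor, _ = parse_version(client)
    server_major, server_minor, _ = parse_version(server)
    if client_major != server_major:
        return False
    return 0 <= server_minor - client_minor <= minor_window
```

Under this assumed policy, `is_supported("1.4.0", "1.6.2")` holds because the client is two minor releases behind, while a client four minor releases behind would be asked to upgrade.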

After months of development, the platform team at RI delivered the RIME Teams solution, which has already been deployed to many of our customers.

Introducing RIME Teams:

Leveraging the power of Terraform and Helm, the platform team at RI created a Kubernetes-powered solution that can be deployed to a customer in less than two hours. To set up a RIME installation, all an administrator needs to do is populate an RI-provided Terraform module with a few variables to configure their setup. The module does the heavy lifting from there, provisioning any necessary infrastructure for the installation. Customers are always free to use existing infrastructure, which can easily be imported into the module when needed.

Once the infrastructure is in place, a quick Helm installation deploys the RIME services into a Kubernetes cluster and provisions external touchpoints such as DNS records for select services. And voila, your data science team now has a working RIME installation. A data scientist is a quick pip install and an auth credential away from being able to use RIME stress testing wherever they see fit!

Kubernetes' built-in scaling ensures that the deployment scales under heavy load (configurable by the customer), Helm provides an elegant migration mechanism when performing upgrades, and Terraform allows us to hook into whatever infrastructure provider (AWS, bare metal, etc.) a customer may desire. To update the RIME application, a customer only needs to change a few values in their installation setup, and new features will be installed with a simple apply.

As always, there is plenty of work left to do, building out new features and handling new use cases, but the RIME Teams product marks an important milestone in our quest to help data science teams avoid AI Operational Risk.
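The provision-then-install flow described above can be sketched as a small Python helper that assembles the underlying CLI calls. This is a rough illustration under stated assumptions, not RI's installer: the function name, directory, variable file, release, chart, and namespace are all hypothetical placeholders. Only the `terraform` and `helm` subcommands and flags themselves are standard.

```python
import shlex
from typing import List


def build_install_commands(
    module_dir: str,
    var_file: str,
    release: str,
    chart: str,
    namespace: str,
    values_file: str,
) -> List[List[str]]:
    """Return the CLI invocations for a provision-then-install flow.

    All argument values are supplied by the caller; nothing here reflects
    the real names used by the RI-provided module or chart.
    """
    return [
        # Download the providers and modules the Terraform configuration needs.
        ["terraform", f"-chdir={module_dir}", "init"],
        # Provision (or reconcile) infrastructure from the administrator's variables.
        ["terraform", f"-chdir={module_dir}", "apply", f"-var-file={var_file}"],
        # Install the release if it is absent, otherwise upgrade it in place.
        ["helm", "upgrade", "--install", release, chart,
         "--namespace", namespace, "--create-namespace",
         "--values", values_file],
    ]


if __name__ == "__main__":
    # Hypothetical values, printed rather than executed.
    for command in build_install_commands(
        "infra", "rime.tfvars", "rime", "example-repo/rime", "rime", "values.yaml"
    ):
        print(shlex.join(command))
```

Because `helm upgrade --install` covers both the first install and later upgrades, the same sequence can be rerun after changing a few values, which mirrors the "simple apply" update path described above.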


The Proof is in the Pudding:

Reading about our deployment process is all well and good, but you can only experience the power of RIME by trying it. We love working with data scientists in any industry and would love to hear any feedback or thoughts you may have. If this sounds intriguing to you or your team, or even if you just want to chat, please feel free to reach out to me (mukil@robustintelligence.com) or any of my colleagues.
