July 1, 2021

Stress Testing NLP Models using the Declaration of Independence


This Fourth of July, the RI team decided to play around with a timely text: the Declaration of Independence. Below are the results of some fun machine learning experiments using two versions of the Declaration of Independence as our input. But first, a quick history lesson.

The Untold Story of King George III vs. the Declaration of Independence

On June 28, 1776, Thomas Jefferson submitted a draft of the Declaration of Independence to Congress. A loyalist spy caught wind of this and managed to sneak into Congress, transcribe the draft, and send a copy back to King George III of Great Britain. George, hard at work trying to sustain a profitable empire, didn’t have time to read each and every word of the pesky declaration, so he passed it off to his chief advisor, Spacy the Small. He tasked Spacy with highlighting all of the important objects, such as people, places, and countries (let us call all of these important objects named entities, and the task of automatically identifying and categorizing them named entity recognition), which were the only elements of the text he really cared about.

After a day of hard work, Spacy proudly presented a list of all the named entities in the text. The king, a naturally distrusting fellow, asked for some assurances of the quality of the results. The final version of the document would inevitably differ from the draft, and the king needed to know that he could trust Spacy on unseen pieces of text. Spacy, undeniably useful though he could be on narrowly-defined tasks, was rather incomprehensible in regular conversation (a meaningful shortcoming in any advisor). When probed, he could provide nothing but the most vague explanations and analyses of his own methods.

Having received no reassurances from Spacy, and not wishing to comb through the entire document himself, the clever king came up with the following scheme: he called in a second advisor named Robusta the Intelligent and gave her a list of rule-based revisions, or transformations, to apply to the text. These were transformations that George III expected to not meaningfully change the semantics of the document, and therefore to preserve the named entities. Each time Robusta applied a transformation, she was to take the revised document back to Spacy and have him once more highlight the named entities. By looking at the changes in Spacy’s highlighting caused by each revision, the king could begin to tease apart the aspects of the text that Spacy most relied on when highlighting, and which words Spacy was most uncertain about.


OK, so maybe this story isn’t entirely true. But it does illustrate the setup of many real-world ML systems nicely. Real-world decision-makers (King George III) often have lots of text data (the draft Declaration) at their disposal; to efficiently extract insights from that data, they use ML models (Spacy); the text data available pre-deployment is often unlabeled or only sparsely labeled, and there are few guarantees on how similar this initial data will be to inference-time data (the finalized Declaration); using models to make decisions in such settings therefore introduces risk, and decision-makers are (or should be) interested in reducing this risk.

In the experiments that follow, we (Robusta the Intelligent) use various transformations and these two versions of the Declaration of Independence (draft and finalized) to try to better understand an off-the-shelf spaCy model, which we use to perform named entity recognition.

Transforming the Declaration of Independence

Transformations can be interesting for two main reasons:

  1. If we can identify transformations that a) preserve the relevant semantics, and b) generate plausible real-world data, then transformations allow us to test out-of-distribution robustness/generalization without needing labels and without needing to collect more data. 
  2. Transformations help marginalize the effect of specific attributes of the data on the model output, providing a more principled understanding of model behavior.

Here are some of the transformations that we came up with and applied to our two documents: 

  • upper_sent/lower_sent: Upper/lower case a sentence
  • upper_word/lower_word: Upper/lower case a word
  • isolate: Input a word on its own, without the enclosing sentence
  • remove_punc: Remove all punctuation from a sentence
  • append_sent: Pass in a sentence and the subsequent sentence
  • lemmatize: Lemmatize each word in a sentence (e.g. convert “walking” to “walk”)
  • replace_ampersand: Swap all occurrences of “&” in a sentence with “and”
  • random_aug: Add a random one-character typo to a word
  • keyboard_aug: Add a one-character typo to a word based on common keyboard typos (e.g. swap “a” for “s”)
  • ocr_aug: Add a one-character typo to a word based on common OCR typos (e.g. swap “l” for “1”)
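As a rough illustration, a few of these transformations could be sketched in plain Python as follows. The typo tables, the helper names, and the random-number handling are assumptions made for this example, not the implementation used in the experiments.

```python
import random
import string
from typing import Callable, Dict, List

# Hypothetical, deliberately tiny typo tables (assumptions for illustration).
KEYBOARD_NEIGHBORS: Dict[str, str] = {"a": "s", "s": "a", "e": "r", "r": "e", "o": "p", "p": "o"}
OCR_CONFUSIONS: Dict[str, str] = {"l": "1", "o": "0", "s": "5", "b": "6"}


def upper_sent(sentence: str) -> str:
    """Upper-case a whole sentence."""
    return sentence.upper()


def remove_punc(sentence: str) -> str:
    """Remove all ASCII punctuation from a sentence."""
    return sentence.translate(str.maketrans("", "", string.punctuation))


def replace_ampersand(sentence: str) -> str:
    """Swap every "&" for "and"."""
    return sentence.replace("&", "and")


def _swap_one(word: str, table: Dict[str, str], rng: random.Random) -> str:
    """Replace one character that appears in `table`; return the word unchanged if none does."""
    positions = [i for i, ch in enumerate(word) if ch.lower() in table]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + table[word[i].lower()] + word[i + 1:]


def keyboard_aug(word: str, rng: random.Random) -> str:
    return _swap_one(word, KEYBOARD_NEIGHBORS, rng)


def ocr_aug(word: str, rng: random.Random) -> str:
    return _swap_one(word, OCR_CONFUSIONS, rng)


def random_aug(word: str, rng: random.Random) -> str:
    """Replace one randomly chosen character with a different lowercase letter."""
    if not word:
        return word
    i = rng.randrange(len(word))
    choices = [c for c in string.ascii_lowercase if c != word[i].lower()]
    return word[:i] + rng.choice(choices) + word[i + 1:]


def apply_to_word(words: List[str], index: int, transform: Callable[[str], str]) -> List[str]:
    """Apply a word-level transformation (e.g. str.upper for upper_word) to one word of a sentence."""
    return [transform(w) if i == index else w for i, w in enumerate(words)]
```

Passing an explicit `random.Random` instance keeps the typo transformations reproducible, which matters when comparing model outputs across runs.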

As an example, the original model output for the title of the draft Declaration looks like:

When we capitalize the entire sentence, this turns into:

When we remove all punctuation from the sentence:

And when we lemmatize each word in the sentence:

Stress Testing the Model

So what can we do with these transformations? We can start by applying them to each word in a document, and replacing our lower-fidelity and potentially brittle model output with something “fuzzier” and more fine-grained. For example, the original model output for the finalized Declaration starts out:

But after performing our suite of transformations on the data and feeding each transformed version through the model, we can construct the following picture:

The number under each word refers to the number of transformed versions of this word that were recognized as an entity (of any kind). Very dark and very light words are therefore the words for which the model provides fairly consistent predictions, while the output for greyer words can easily be flipped.
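The per-word counts described above could be computed along these lines. This is only a sketch: `toy_is_entity` is a made-up stand-in for a real NER model (it simply flags title-cased words), and the three word-level transformations are simplified assumptions.

```python
from typing import Callable, Dict, List


def toy_is_entity(word: str) -> bool:
    # Hypothetical stand-in for a real NER model: flags title-cased alphabetic words.
    return word.isalpha() and word.istitle()


# Simplified word-level transformations (assumptions for illustration).
WORD_TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "upper_word": str.upper,
    "lower_word": str.lower,
    "remove_punc": lambda w: w.strip(".,;:!?"),
}


def entity_counts(
    words: List[str],
    transforms: Dict[str, Callable[[str], str]],
    is_entity: Callable[[str], bool],
) -> List[int]:
    """For each word, count how many transformed versions are recognized as an entity."""
    return [sum(is_entity(t(w)) for t in transforms.values()) for w in words]


words = ["The", "unanimous", "Declaration", "of", "America,"]
counts = entity_counts(words, WORD_TRANSFORMS, toy_is_entity)
for word, count in zip(words, counts):
    print(f"{word}: {count}/{len(WORD_TRANSFORMS)}")
```

With a real model, `is_entity` would need the transformed word in its sentence context rather than the word alone; the word-only signature here is a simplification.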

By looking at the effectiveness of individual transformations, we can begin to understand which features of the data the model is most reliant upon, potentially to the point of overfitting. In the graph below we define effectiveness as the percentage of words that a transformation caused to flip from being identified as an entity to not being identified as an entity, or vice versa. By this metric, capitalization (upper_word, lower_word) and spelling (keyboard_aug, random_aug, ocr_aug) seem to be more important to the model than inflection (lemmatize) or context (isolate).
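The effectiveness metric defined here is straightforward to compute once the original and transformed predictions are available as per-word booleans. A minimal sketch, with made-up predictions that are not results from the experiments:

```python
from typing import Dict, List


def effectiveness(original: List[bool], transformed: Dict[str, List[bool]]) -> Dict[str, float]:
    """Percentage of words whose entity/non-entity prediction flips under each transformation."""
    if not original:
        raise ValueError("need at least one word")
    scores = {}
    for name, preds in transformed.items():
        if len(preds) != len(original):
            raise ValueError(f"{name}: prediction count does not match the original")
        flips = sum(o != p for o, p in zip(original, preds))
        scores[name] = 100.0 * flips / len(original)
    return scores


# Illustrative data only: True means "recognized as an entity".
original_preds = [True, False, True, False]
transformed_preds = {
    "lower_word": [False, False, False, False],
    "lemmatize": [True, False, True, False],
}
scores = effectiveness(original_preds, transformed_preds)
```

Counting flips in both directions, as the definition requires, means a transformation that creates spurious entities scores the same as one that erases real ones.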

From the augmented model output we can construct a “robustness score” for each word, which we define as the fraction of transformations that did not cause the prediction for the word to flip (1 meaning none of the transformations caused a prediction change, 0 meaning every transformation caused a prediction change). We can then parse out relationships between predictions and robustness and, if we wanted to, even use these to estimate the robustness of new data points without having to perform any auxiliary model queries. For example, one potential relationship we see below is that words that were classified as organisations (“ORG”) are on average slightly more robust than words classified as numbers (“CARDINAL”). More generally, words that received any classification at all are far less robust than words that went unclassified:
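Taking the robustness score to be the fraction of transformations that leave a word's prediction unchanged, the score and its per-label average could be sketched as below. The labels and scores in the example are invented for illustration (including the "NONE" label for unclassified words) and are not the experimental results.

```python
from collections import defaultdict
from typing import Dict, List


def robustness_score(original_pred: bool, transformed_preds: List[bool]) -> float:
    """Fraction of transformations that did NOT flip the prediction (1.0 = fully robust)."""
    if not transformed_preds:
        raise ValueError("need at least one transformed prediction")
    unchanged = sum(p == original_pred for p in transformed_preds)
    return unchanged / len(transformed_preds)


def mean_robustness_by_label(labels: List[str], scores: List[float]) -> Dict[str, float]:
    """Average robustness of the words sharing each originally predicted label."""
    grouped = defaultdict(list)
    for label, score in zip(labels, scores):
        grouped[label].append(score)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}


# Illustrative data only.
example_score = robustness_score(True, [True, False, True, True])
by_label = mean_robustness_by_label(
    ["ORG", "ORG", "CARDINAL", "NONE"],
    [0.5, 1.0, 0.25, 1.0],
)
```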

We can also look at the relationship between robustness and attributes of the raw data. Below we show the correlation between robustness and various numeric / binary attributes of the data. We control for whether or not a word was originally predicted to be an entity because this is highly correlated with many features of the data and, as we saw above, highly (anti-)correlated with robustness:

Some noteworthy attributes that emerge from this analysis:

  • text_char_len: longer words are far less robust than shorter ones
  • is_lower: amongst positive predictions, a word being all lower case suggests greater robustness, but amongst negative predictions it suggests lower robustness
  • is_upper, is_title, num_upper_in_sent: the presence of upper case letters in a word or sentence signals lower robustness among positive predictions
  • is_lemma: words that are already in their lemmatized form are more robust across the board
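One simple way to control for the original prediction is to compute the correlation separately among words originally predicted as entities and among those that were not. The sketch below assumes that stratified approach (the exact method used in the analysis is not stated), uses a hand-rolled Pearson correlation, and runs on invented rows.

```python
import math
from typing import Callable, Dict, List, Tuple

# (word, originally predicted as an entity?, robustness score)
Row = Tuple[str, bool, float]

# Attribute extractors named after the attributes discussed above.
ATTRIBUTES: Dict[str, Callable[[str], float]] = {
    "text_char_len": lambda w: float(len(w)),
    "is_lower": lambda w: float(w.islower()),
    "is_upper": lambda w: float(w.isupper()),
    "is_title": lambda w: float(w.istitle()),
}


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; NaN when it is undefined (too few points or zero variance)."""
    n = len(xs)
    if n < 2:
        return math.nan
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def stratified_correlation(rows: List[Row], attribute: Callable[[str], float]) -> Dict[bool, float]:
    """Correlate an attribute with robustness within each original-prediction group."""
    result = {}
    for was_entity in (True, False):
        group = [(attribute(w), r) for w, e, r in rows if e == was_entity]
        result[was_entity] = pearson([x for x, _ in group], [y for _, y in group])
    return result


# Illustrative rows only.
rows: List[Row] = [
    ("America", True, 0.4),
    ("Great", True, 0.6),
    ("King", True, 0.8),
    ("of", False, 1.0),
    ("the", False, 1.0),
]
by_length = stratified_correlation(rows, ATTRIBUTES["text_char_len"])
```

Stratifying keeps the strong link between "was predicted as an entity" and robustness from masquerading as an effect of the attribute itself.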


While the above experiments by no means constitute a rigorous process for verifying model robustness, they do suggest that:

  • NLP models can be very brittle, by which we mean large changes in outputs can be achieved with small changes to inputs. This intuitively makes sense, given how high-dimensional the input space is and, correspondingly, how sparse the training data is.
  • As a corollary, there are lots of cool ways of slicing NLP inputs and outputs and teasing out interesting relationships in a way that's very different from, say, the tabular and image domains.
  • To the extent that we saw some consistent patterns as to the most effective transformations and most susceptible data points, measuring robustness to simple perturbations does seem to offer insight into the mechanics of a model. This could be very helpful when trying to understand how your model will perform on new and unseen data.

As with all ML systems, the process of making, deploying, and maintaining NLP models can be made much more robust. We hope you enjoyed these experiments as much as we did. And happy Fourth, everyone!

