RI Reading Group

At Robust Intelligence we are deeply interested in adversarial machine learning. Each month we do a reading group on the latest research in the field. This month we thought we’d share :)

Quick intro

Adversarial machine learning, broadly speaking, is the study of techniques that cause a machine learning model to misbehave, i.e. to have high loss at test or inference time. The most commonly studied settings are the ones in which an adversary has been given some degree of access to a trained but fixed model and to the training and test data. In such settings we can ask:

  1. Attacks: What is the adversary’s optimal algorithm, or “attack”, for manipulating the test data?
  2. Defenses: What is the modeler’s optimal training procedure, or “defense”, for producing models which are invulnerable to “attacks”? Such invulnerability is referred to as “adversarial robustness”.
  3. Detection and correction: How can we detect and correct for manipulations to the test data? Correction can also be thought of as a type of defense.

Different constraints on how much the adversary knows about the model and in what ways they can manipulate the test data lead to different and interesting results.

Why we care about adversarial ML

  • There are real-world systems with real value attached to them that match the theoretical settings being studied. Perhaps the most salient example is online transaction fraud, wherein fraudsters artificially manipulate the details of a transaction (such as their location or the device they’re using) to try to fool ML fraud detectors.
  • If you could ideally define the constraints of the problem, such that the adversary could only change the test data in ways that did not change the ground truth labels, then adversarial robustness would lead to generalization and out-of-distribution robustness. Unfortunately that’s a bit of a catch-22 (you’d need to know the ground truth labelling function), but it establishes an intuitive link between performance in the adversarial setting and in non-adversarial settings. Empirically adversarial robustness has been tied to out-of-distribution robustness and transferability.
  • A good adversarial attack will efficiently find the regions of the input space where your model predictions are most variable and sensitive. This can serve as a great tool for understanding the landscape of your model function.

The readings

Our goal in reading academic literature is to develop a principled understanding of the field and to inspire our own algorithms. We try to distill each paper to its core principles, and emphasize creativity and innovation over incremental improvements in academic benchmarks when choosing what to read.

Without further ado, here’s what we read this month.


Devin Willmott, Anit Kumar Sahu, Fatemeh Sheikholeslami, Filipe Condessa, Zico Kolter


High-level: Willmott, et al. present two algorithms for generating universal adversarial noise against a classifier given many training samples but only one or two model queries per sample. Universal adversarial noise refers to a constant noise vector which is applied to every sample indiscriminately.

The essential detail: the two-query algorithm (minus a good amount of detail):

  1. Choose a set of possible perturbations, e.g. a set of normal random vectors.
  2. Take one random vector and a batch of training data.
  3. Take one data point from the batch.
  4. Add and subtract the perturbation from the data point, and compute the delta in the loss between the two perturbed points.
  5. Repeat this for each instance in the batch. Average the loss deltas.
  6. Repeat this for every random vector. Compute a weighted sum of the random vectors, with each weighted by the average loss delta they induced.
  7. Add this weighted sum, an estimate of the loss gradient, to your adversarial noise. Repeat 1-7 as necessary.
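The steps above can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's implementation: `loss_fn`, the step size `lr`, and the perturbation scale `eps` are our own stand-ins, and we draw fresh random directions each round.

```python
import numpy as np

def estimate_gradient(loss_fn, X, deltas, eps=0.05):
    """One round of the two-query gradient estimate.

    loss_fn(x) -> scalar loss for a single (perturbed) input.
    X: batch of training inputs, shape (n, d).
    deltas: candidate perturbation directions, shape (k, d).
    """
    weights = []
    for delta in deltas:
        # Two queries per sample: loss at x + eps*delta and x - eps*delta.
        diffs = [loss_fn(x + eps * delta) - loss_fn(x - eps * delta) for x in X]
        weights.append(np.mean(diffs))  # average loss delta for this direction
    # Weighted sum of directions approximates the loss gradient.
    return np.sum(np.array(weights)[:, None] * deltas, axis=0)

def universal_noise(loss_fn, X, d, steps=10, k=8, lr=0.1, seed=0):
    """Accumulate the gradient estimates into a single universal noise vector."""
    rng = np.random.default_rng(seed)
    noise = np.zeros(d)
    for _ in range(steps):
        deltas = rng.standard_normal((k, d))
        # Estimate the gradient at the currently perturbed inputs.
        grad = estimate_gradient(loss_fn, X + noise, deltas)
        noise += lr * grad  # ascend the loss to make the noise more adversarial
    return noise
```

Note that each call to `estimate_gradient` spends exactly two model queries per batch element, matching the paper's query budget.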

The one-query algorithm uses a Gaussian distribution to produce noise candidates, computes the loss caused by each candidate averaged over a batch of different inputs, and uses the CMA-ES algorithm to iteratively optimize this noise distribution.
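To make the one-query idea concrete, here is a toy sketch that swaps the paper's CMA-ES for a much simpler evolution strategy (mean-only, fixed covariance) so it fits in a few lines; `loss_fn`, `sigma`, and `lr` are our own placeholders. Each candidate costs one query per batch element.

```python
import numpy as np

def one_query_noise(loss_fn, X, d, pop=16, iters=20, sigma=0.1, lr=0.5, seed=0):
    """Simplified stand-in for the paper's CMA-ES search: evolve the mean of a
    Gaussian noise distribution toward candidates that cause high loss."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(d)
    for _ in range(iters):
        # Sample a population of noise candidates around the current mean.
        candidates = mean + sigma * rng.standard_normal((pop, d))
        # One query per (candidate, input) pair; average the loss over the batch.
        scores = np.array([np.mean([loss_fn(x + c) for x in X]) for c in candidates])
        # Move the mean toward the highest-loss (most adversarial) candidates.
        top = candidates[np.argsort(scores)[-pop // 4:]]
        mean = (1 - lr) * mean + lr * top.mean(axis=0)
    return mean
```

Full CMA-ES would also adapt the covariance of the search distribution, which is what makes it effective in higher dimensions; the loop structure is the same.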

Impact: We know that given many queries over a single input, there is enough information in those queries to create a tailored perturbation. Now we also know that given many inputs but one to two queries for each, there is enough information to create a universal perturbation.


Yuhang Wu, Sunpreet S. Arora, Yanhong Wu, and Hao Yang


High-level: Wu, et al. present an algorithm for detecting adversarial examples on DNNs. The algorithm takes advantage of the fact that the noise generated by an adversarial attack is much more variable in the region of an already adversarial input than in the region of a clean input.

The essential detail: the meat of the algorithm is in the pre-processing step:

Choose a “canonical” example $x_c$ of each class c from the training data. Choose a transformation function $t(\cdot)$. Choose an attack algorithm $att(\cdot)$. Choose a layer $m$ of your DNN $f$ to focus on; let $f_m(x)$ be the output of the $m$-th layer on input $x$.

  1. Given a new input $x$, find its predicted class $f(x) = c$ and retrieve the corresponding canonical example $x_c$. Also generate a transformed input $t(x) = x_t$.
  2. Generate $x' = att(x)$. Repeat for $x_c, x_t$ to get $x_c', x_t'$.
  3. Compute $d = f_m(x) - f_m(x')$. Repeat for $x_c, x_t$ to get $d_c, d_t$.
  4. Compute the angle between each pairing of $d, d_c, d_t$.

The idea is that $d$, $d_c$, and $d_t$ will be similar, and thus have small angles between them, when $x$ is originally clean, and very different when $x$ is already adversarial.
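The angle computation is straightforward once you have the three difference vectors. A minimal sketch, where `f_m` and `att` are caller-supplied stand-ins for the layer-$m$ feature map and the chosen attack:

```python
import numpy as np

def angle(u, v):
    """Angle in radians between two vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def detection_features(f_m, att, x, x_c, x_t):
    """Compute the three angles used as detector input.

    f_m : maps an input to its layer-m activation vector.
    att : attack function, returns an adversarial version of its input.
    x, x_c, x_t : the input, its canonical class example, its transformed copy.
    """
    d   = f_m(x)   - f_m(att(x))
    d_c = f_m(x_c) - f_m(att(x_c))
    d_t = f_m(x_t) - f_m(att(x_t))
    return [angle(d, d_c), angle(d, d_t), angle(d_c, d_t)]
```

These three angles are exactly the detector's input features; the label (clean vs. adversarial) comes from the augmented training set described below.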

Left: The distribution of the angle between $d$ and $d_c$ in clean examples versus adversarial examples. Right: The distribution of the angle between $d_c$ and $d_t$. Taken from https://arxiv.org/pdf/2012.15386.pdf.

To actually train a detector, first augment your training data with adversarial examples. Perform the above preprocessing on the augmented training data to get a detection training set, where the detector input is the three angles and the label is whether those angles were generated from a clean or adversarial example.

Impact: Wu, et al.’s research suggests that the neighborhood of an adversarial example looks fundamentally different than that of an in-distribution example, and that there is enough signal in this difference to effectively detect adversarial examples.


Changhao Shi, Chester Holtz & Gal Mishne


High-level: Shi, et al. present a method for training a denoiser for DNNs, which learns to remove adversarial noise.

The essential detail: First, train an auxiliary model on a (somewhat arbitrary) self-supervised task over the training data. Example tasks include autoencoding and predicting the rotation applied to rotated copies of the training data. At inference time, gradient descent is performed on the input with respect to the auxiliary model loss. The idea is that the auxiliary model performs best on in-distribution data, so optimizing an input against the auxiliary loss makes it look more in-distribution.
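The purification loop itself is just gradient descent on the input. A minimal sketch, assuming the caller supplies `aux_grad` (the gradient of the auxiliary loss with respect to the input, which an autograd framework would provide in practice); the toy quadratic loss in the usage note is our own illustration, not the paper's task:

```python
import numpy as np

def purify(x, aux_grad, steps=50, lr=0.1):
    """Test-time purification: descend the auxiliary self-supervised loss
    with respect to the input itself, leaving the models untouched."""
    x = x.copy()
    for _ in range(steps):
        x -= lr * aux_grad(x)  # nudge the input toward in-distribution regions
    return x
```

As a toy example, if the auxiliary loss were $\|x - \mu\|^2$ with minimum at the training mean $\mu$, then `purify(x0, lambda x: 2 * (x - mu))` pulls a perturbed input back toward $\mu$, which is the intuition behind removing adversarial noise.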

Left: simultaneously training a classifier g and an auxiliary model h (with some shared architecture f). Right: purifying an input at test time by minimizing $L_{aux}$. Taken from https://arxiv.org/pdf/2101.09387.pdf.

Impact: Shi, et al. show that self-supervised models can learn the training distribution well enough to be used for correcting deviations from it.


Bogdan Georgiev, Mayukh Mukherjee, Lukas Franken


High-level: Georgiev, et al. present a high-fidelity method for studying the curvature of model decision boundaries. They find that typical adversarial training methods reduce larger-scale curvature in decision boundaries but that fine-grained curvature persists.

The essential detail (very roughly): Given a classifier, choose a training example, place a ball around it, and estimate the percent of that ball that would be misclassified w.r.t. the true label of the training example. Release a bunch of Brownian motion particles from the training example and see how long they take to hit the decision boundary. Holding the misclassified percent of the ball constant, higher variance in the time to hit the boundary roughly corresponds to higher boundary curvature.

The “heat” radiating from a model, corresponding roughly to how long it would take for a brownian particle to hit a decision boundary starting at any point. Taken from https://arxiv.org/pdf/2101.06061.pdf.

Impact: Georgiev, et al. improve the sensitivity with which we can study model landscapes and use this to show that adversarial training is ineffective beyond a certain granularity.


Xiaoyang Wang, Bo Li, Yibo Zhang, Bhavya Kailkhura, Klara Nahrstedt


High-level: Wang, et al. present a generic, RL-based method for selecting features that lead to robust models.

The essential detail: Let a deep Q-learning agent play around with different feature subsets, where an action adds a new feature to the selected set and the reward is the negation of the adversarial loss of a (fixed-architecture) model trained on the features selected so far. To make the problem more tractable, compute a few different rankings of the features using traditional feature-importance metrics, and limit the agent’s action space to popping a feature off the head of one of the ranked lists.
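A toy sketch of the restricted action space, with a tabular epsilon-greedy learner standing in for the paper's deep-Q agent (the `reward_fn`, update rule, and hyperparameters here are all our own simplifications):

```python
import numpy as np

def select_features(reward_fn, rankings, budget, epsilon=0.2, episodes=50, seed=0):
    """Pick `budget` features by popping heads of importance-ranked lists.
    reward_fn(selected) scores a feature subset (in the paper: minus the
    adversarial loss of a model trained on those features)."""
    rng = np.random.default_rng(seed)
    n_lists = len(rankings)
    q = np.zeros(n_lists)  # learned value of drawing from each ranked list
    best, best_r = None, -np.inf
    for _ in range(episodes):
        heads = [0] * n_lists
        selected, used = [], np.zeros(n_lists, dtype=bool)
        while len(selected) < budget:
            # Action = which ranked list to pop from next.
            a = rng.integers(n_lists) if rng.random() < epsilon else int(np.argmax(q))
            lst, h = rankings[a], heads[a]
            while h < len(lst) and lst[h] in selected:
                h += 1  # skip features another list already contributed
            heads[a] = h
            if h >= len(lst):
                continue  # this list is exhausted; choose again
            selected.append(lst[h])
            heads[a] = h + 1
            used[a] = True
        r = reward_fn(selected)
        q[used] += 0.1 * (r - q[used])  # credit the episode reward to lists used
        if r > best_r:
            best, best_r = selected, r
    return best
```

The key design point survives the simplification: restricting actions to list heads shrinks a combinatorial search over all subsets down to a short sequence of small discrete choices.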

Impact: Wang, et al. show that RL can be used to solve the combinatorial problem of feature selection and create a useful tool for training robust models.


Aishan Liu, Xianglong Liu, Jun Guo, Jiakai Wang, Yuqing Ma, Ze Zhao, Xinghai Gao, and Gang Xiao


High-level: Liu, et al. provide a set of 20+ metrics for evaluating the robustness of a DNN model and test data set. Note that the data metrics are meant for an adversarially augmented test set and measure how comprehensively it tests the robustness of the model; they do not measure how robust a model trained on such data would be.

The essential detail: The metrics measure things like neuron coverage induced by the test data, the imperceptibility of adversarial examples in the test data, and the performance (loss) and model behaviour (activations at different layers) in a variety of adversarial settings.
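As a flavor of how simple some of these metrics are to compute, here is one common definition of neuron coverage (our own minimal version; the paper's suite includes many variants):

```python
import numpy as np

def neuron_coverage(activations, threshold=0.0):
    """Neuron coverage of a test set: the fraction of neurons that fire
    (activation > threshold) on at least one test input.

    activations: array of shape (n_inputs, n_neurons), e.g. a hidden
    layer's outputs collected over the whole test set.
    """
    fired = (activations > threshold).any(axis=0)
    return float(fired.mean())
```

Low coverage suggests the test set never exercises parts of the network, so its robustness verdict may be incomplete.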

Impact: Liu, et al.’s suite of metrics is a great jumping-off point for verifying the robustness of a DNN.


Giulio Rossolini, Alessandro Biondi, Giorgio Carlo Buttazzo


High-level: Rossolini, et al. propose a computationally lightweight method for detecting adversarial and out-of-distribution inputs to DNNs. The method learns the distribution of a hidden layer’s activations on training data, then uses it to efficiently flag new inputs that look out-of-distribution.

The essential detail: The authors present four ways of quantifying the typical activation distributions. The first two methods look at the output of each neuron, storing a range and a histogram of seen values. The third method sorts neurons by their output and, for each neuron, counts the frequency with which it appears among the top k neurons. The fourth method learns the average activation of the entire network for each class, and performs kNN on the activations of the new input to see if the originally predicted class matches the kNN-predicted class.
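The first variant is the simplest to sketch. A hedged illustration (our own minimal take on range monitoring, not the authors' code; the class name and scoring rule are assumptions):

```python
import numpy as np

class RangeMonitor:
    """Record the min/max output of each neuron on training data, then flag
    inputs whose hidden-layer activations fall outside the learned ranges."""

    def fit(self, activations):
        # activations: (n_train, n_neurons) hidden-layer outputs on clean data.
        self.lo = activations.min(axis=0)
        self.hi = activations.max(axis=0)
        return self

    def score(self, a):
        # Fraction of neurons outside their training range for one input;
        # a high score suggests an adversarial or out-of-distribution input.
        return float(((a < self.lo) | (a > self.hi)).mean())
```

Since fitting is a single min/max pass and scoring is a few elementwise comparisons, the check adds almost nothing to inference cost, which is the point of the paper.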

Taken from https://arxiv.org/pdf/2101.12100.pdf

Impact: Rossolini, et al. show that lower-dimensional, latent learned representations carry enough signal to allow for effective yet computationally efficient detection of adversarial examples.

More from this month

Fundamental Tradeoffs In Distributionally Adversarial Training: https://arxiv.org/pdf/2101.06309.pdf

Towards Imperceptible Query-limited Adversarial Attacks with Perceptual Feature Fidelity Loss: https://arxiv.org/pdf/2102.00449.pdf

Noise Sensitivity-Based Energy Efficient and Robust Adversary Detection in Neural Networks: https://arxiv.org/pdf/2101.01543.pdf

Admix: Enhancing the Transferability of Adversarial Attacks: https://arxiv.org/pdf/2102.00436.pdf

Model Patching: Closing the Subgroup Performance Gap with Data Augmentation: https://arxiv.org/pdf/2008.06775.pdf

Robust Machine Learning Systems: Challenges, Current Trends, Perspectives, and the Road Ahead: https://arxiv.org/pdf/2101.02559.pdf

Better Robustness by More Coverage: Adversarial Training with Mixup Augmentation for Robust Fine-tuning: https://arxiv.org/pdf/2012.15699.pdf

Small Input Noise Is Enough To Defend Against Query-based Black-box Attacks: https://arxiv.org/pdf/2101.04829.pdf

Meta Adversarial Training: https://arxiv.org/pdf/2101.11453.pdf

Adversarial Attacks for Tabular Data: Application to Fraud Detection and Imbalanced Data: https://arxiv.org/pdf/2101.08030.pdf

Adversarial Learning with Cost-Sensitive Classes: https://arxiv.org/pdf/2101.12372.pdf

Lowkey: Leveraging Adversarial Attacks To Protect Social Media Users From Facial Recognition: https://arxiv.org/pdf/2101.07922.pdf

Robustness Gym: Unifying the NLP Evaluation Landscape: https://arxiv.org/pdf/2101.04840.pdf

Adversarial Attack Attribution: Discovering Attributable Signals in Adversarial ML Attacks: https://arxiv.org/pdf/2101.02899.pdf

Temporally-Transferable Perturbations: Efficient, One-Shot Adversarial Attacks for Online Visual Object Trackers: https://arxiv.org/pdf/2012.15183.pdf

Cortical Features For Defense Against Adversarial Audio Attacks: https://arxiv.org/pdf/2102.00313.pdf