January 27, 2022

Pickle Serialization in Data Science: A Ticking Time Bomb

Engineering

Author

Robust Intelligence

During the last moments of 2021, we learned of a new vulnerability in a previously inconspicuous library called “Log4j”. What started out as a bug report soon escalated into a worldwide national-security event. Disruptions were felt throughout myriad popular services. Google reportedly had 500 engineers going through their codebases for impact analysis. The U.S. federal government issued an emergency directive requiring its agencies to mitigate these vulnerabilities immediately. Attacks by nation-state actors in China, Iran, and North Korea were detected in real time. CVE-2021-44228 received a CVSS score of 10.0, the highest possible severity. Memes were made.

We won’t go into too much technical detail here. However, the spine-chilling crux of the Log4j exploit is that it allowed attackers to execute arbitrary code in privileged processes on many, many servers throughout the world. Crucially, “arbitrary” here means, well, anything.

Should you care?

You’re a data scientist or an ML engineer. You don’t use Log4j (or don’t know that you do...). You don’t even use Java. You wouldn’t be caught dead using any JVM-based language. You just use Python. You’re all set. Well… do you use pickle files? Or any other Python serializers? Does the snippet “model = torch.load(path)” ring any bells?

If you do, or your team does, you might want to bear with us here, as your organization’s internal machines might be exposed to a risk similar to the one that Log4j-using machines faced.

You probably serialize

Modern data science is not done in a vacuum. Practitioners rely heavily on open-source and third-party shared models and benchmark data. This can be a good thing: globally-available and shared artifacts can help companies quickly adopt new and state-of-the-art algorithmic approaches and datasets, which is crucial in today's reality of unprecedentedly speedy progress. Without adopting third-party models, a production system can quickly become obsolete. However, are the sources of these artifacts, and their distribution chain, always trustworthy? Are they safe to use?

To answer that, let’s examine serialization. Serialization is the mechanism by which third-party models and data are typically imported into your process memory. Serialization solves a problem that almost every programmer, especially one dealing with data, encounters daily: writing in-memory objects into files, so they can be stored persistently and shared, and later “deserialized” back into process memory.

Pickle serialization is amazing

To most programmers, using Pickle, or serializers built on top of it like the one PyTorch uses, seems like a form of typical Python magic. It’s so simple! Consider a short script that creates an object of type A and dumps it with pickle: voila, your object of type A is stored under myfile.pkl.

Then, if a second script loads myfile.pkl into a variable a and calls a.print() once the object has been serialized (by the first script), you will notice that a neatly contains your deserialized object, and the script successfully prints “3”.
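A minimal sketch of the two scripts described above, folded into one file so it runs on its own. Only the class name A, the file name myfile.pkl, the print() method, and the printed 3 come from the text; the field name `value`, the helpers `save_object` and `load_object`, and the temporary directory are assumptions made for illustration.

```python
import json
import os
import pickle
import tempfile


class A:
    """Hypothetical class; its print() method relies on the json module."""

    def __init__(self, value):
        self.value = value

    def print(self):
        print(json.dumps(self.value))


def save_object(obj, path):
    # "First script": write the in-memory object to disk.
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    # "Second script": rebuild the object from the file.
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "myfile.pkl")
    save_object(A(3), path)
    a = load_object(path)
    a.print()
```

In a real two-file setup, class A would live in a module that both scripts can reach. Pickle stores only a reference to the class (its module and name), not its code, and imports that module while loading.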

This seems so natural, but let’s stop for a second and appreciate a few things that Pickle is doing for us here. First, those of us who have ever written a custom class whose objects are serializable into a non-pickle format (e.g. JSON, or using a programming language other than Python) might greatly appreciate the fact that, with Pickle, you usually don’t have to write any custom serialization code. Pickle takes care of deciding the order of field serialization, the storage memory layout, recursive calls to serialize non-primitive object fields, and almost everything else. Second, notice that the second script does not need to call import json, but somehow the call to a.print() “magically” manages to use json functionality. How convenient!
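To make the first point concrete, here is a small sketch contrasting hand-written JSON conversion with Pickle on a nested object. The classes `Point` and `Segment` and their methods are hypothetical and appear nowhere in the original scripts.

```python
import json
import pickle


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


class Segment:
    def __init__(self, start, end):
        self.start, self.end = start, end

    # JSON needs hand-written conversion code in both directions,
    # including for the nested Point objects.
    def to_json(self):
        return json.dumps({
            "start": [self.start.x, self.start.y],
            "end": [self.end.x, self.end.y],
        })

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        return cls(Point(*data["start"]), Point(*data["end"]))


seg = Segment(Point(0, 0), Point(3, 4))

# Round trip through JSON: works only because of to_json/from_json.
via_json = Segment.from_json(seg.to_json())

# Round trip through Pickle: no custom code, nested objects included.
via_pickle = pickle.loads(pickle.dumps(seg))
```

Both round trips rebuild an equivalent Segment, but only the JSON path required code written specifically for this class.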

No free lunch

Unfortunately, pickle’s incredibly simple interface comes at a cost. Pickle’s deserializer, which runs whenever we invoke “pickle.load” (or “torch.load”!), is a full-fledged virtual machine, able to run arbitrary code within the process that loads the object. It is expressly built to allow serialized objects to come with arbitrary instructions on how to deserialize them. In other words, Pickle deserialization readily supports running arbitrary code specified by the serializer (the original author of the file). Doing so takes only a few lines of code, by employing the designated “__reduce__” method. For example, a “payload” class whose “__reduce__” method returns “os.system” together with the string ‘echo “boom”’ will run that shell command the moment the file is loaded, whether it was written with “torch.save” and read with “torch.load”, or written with “pickle.dump” and read with “pickle.load”. Feel free to also jump to the Appendix below for a slightly more in-depth view of the Pickle virtual machine’s opcodes.
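A minimal sketch of such a payload, using only the standard pickle module rather than torch. The harmless echo command follows the appendix; the class name `Payload` and the in-memory round trip through `pickle.dumps` and `pickle.loads` are assumptions made to keep the example self-contained.

```python
import os
import pickle


class Payload:
    def __reduce__(self):
        # Tell pickle to "rebuild" this object by calling os.system(...).
        return (os.system, ('echo "boom"',))


# Serializing does not run the command; it only records the instruction.
malicious_bytes = pickle.dumps(Payload())

# Deserializing runs the command inside the loading process.
result = pickle.loads(malicious_bytes)
```

Note that `result` is whatever `os.system` returned, not a `Payload` instance: the loader never needed the class at all, only the bytes.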

Python’s documentation cautions against loose usage of pickle deserialization. The module was meant for efficiency and ease of use, not security, and the manual says so in a scary bright orange warning box: the pickle module is not secure, and you should only unpickle data you trust.

Not use Pickle? That train left.

Unfortunately, although the insecurity of using Pickle to deserialize files of untrusted origin has been known for years, and despite explicit recommendations against it in the documentation, doing just that has become standard practice. Extremely popular libraries like Hugging Face Transformers use pickle (or Torch serialization) freely to share and import models, as do many (most?) of the implementations you would find on artifact-sharing sites like Model Zoo. Deserializing Pickle files of questionable provenance, and thus exposing oneself to arbitrary code execution, has become second nature to data scientists.

Can Pickle be fixed?

Unfortunately, Pickle will not be “fixed” (that is, made secure) in a future version, nor is there a straightforward way to detect or prevent exploits. Pickle’s vulnerability is tightly tied to its impressive usefulness. In technical terms, Pickle’s virtual machine is not a sandbox, and neither is Python’s interpreter, which means that getting a guarantee of safe deserialization for arbitrary files is going to be a formidable challenge.
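One partial mitigation, described in the Python pickle documentation, is to subclass pickle.Unpickler and override find_class so that only an explicit allow-list of globals can be resolved. The sketch below is illustrative only: the allow-list contents and the `Payload` class are assumptions, and the approach narrows the attack surface rather than turning the pickle VM into a sandbox.

```python
import builtins
import io
import os
import pickle

# Hypothetical allow-list: the only (module, name) pairs a pickle may import.
ALLOWED_GLOBALS = {
    ("builtins", "list"),
    ("builtins", "dict"),
    ("builtins", "set"),
}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global by name.
        if (module, name) in ALLOWED_GLOBALS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    def __reduce__(self):
        return (os.system, ('echo "boom"',))


# Plain containers and primitives need no globals and load normally.
safe_value = restricted_loads(pickle.dumps([1, 2, {"a": 3}]))

# The payload needs the system() global, which the allow-list rejects.
try:
    restricted_loads(pickle.dumps(Payload()))
    payload_blocked = False
except pickle.UnpicklingError:
    payload_blocked = True
```

An allow-list like this is impractical for real model files, which reference many classes across several libraries, and that is part of why the problem is hard to solve in general.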

What can we do?

At Robust Intelligence, we run a series of stress tests that also include Pickle file security. Contact us to learn more!

Appendix: Disassembling Pickles

Disassembling the pickle VM opcodes for our “payload” class (that is, making them human-readable) shows what happens on load. Even without knowing opcode semantics, which we will not get into here, we can see that this code loads the “posix.system” function (equivalent to “os.system”) and calls it, passing the string ‘echo “boom”’ as its argument.
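A listing like the one discussed above can be regenerated with the standard pickletools module. This is a minimal sketch: the class name `Payload`, the helper `disassemble`, and the choice of protocol 2 are assumptions, and the exact opcodes vary with the pickle protocol and platform.

```python
import io
import os
import pickle
import pickletools


class Payload:
    def __reduce__(self):
        return (os.system, ('echo "boom"',))


def disassemble(obj, protocol=2):
    """Return the human-readable opcode listing for obj's pickle."""
    out = io.StringIO()
    # pickletools.dis only parses the byte stream; it does not execute it.
    pickletools.dis(pickle.dumps(obj, protocol=protocol), out=out)
    return out.getvalue()


listing = disassemble(Payload())
print(listing)
```

Because `pickletools.dis` never runs the opcodes, it is a safe way to inspect a suspicious file before deciding whether to load it.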
