
Why Python Pickle can't be trusted for storing AI Models - Live Demo

A 10-minute live-coding session explaining the vulnerability and why you should stick to Safetensors for LLM checkpoints.

Welcome to Neural Bits. Each week, I write about practical, production-ready AI/ML Engineering. Join over 6200 engineers and build real-world AI Systems.


AI Engineers working with LLMs might overlook the fact that, a while ago, LLM checkpoints on HuggingFace were uploaded as `.bin` or `.pt` files saved directly using `torch.save` or a similar method.

That was an attack vector: `.bin` and `.pt` files are pickle-serialized, which means arbitrary code can be executed during unpickling.

Imagine: I train a model, inject a `subprocess.run(["rm", "-r", "./"])` call into a custom class, and serialize that alongside the model weights in `checkpoint.pt`.

If you downloaded that checkpoint and loaded it in Python, that command would run on your machine the moment the file was unpickled.
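
A minimal sketch of that mechanism, assuming a harmless `print` call in place of the destructive command. The class name `MaliciousPayload` and the `checkpoint` dictionary are hypothetical, and plain `pickle` stands in here for `torch.save` and `torch.load`:

```python
import pickle


class MaliciousPayload:
    """Hypothetical class showing how an object can hijack unpickling."""

    def __reduce__(self):
        # pickle stores "call this callable with these arguments" and
        # performs the call at load time. A harmless print stands in
        # for something like subprocess.run([...]).
        return (print, ("arbitrary code ran during unpickling",))


# A fake "checkpoint" that bundles weights with the booby-trapped object.
checkpoint = {"weights": [0.1, 0.2, 0.3], "extra": MaliciousPayload()}
blob = pickle.dumps(checkpoint)

# The victim only calls loads(); the print fires before loads() returns.
loaded = pickle.loads(blob)

# The weights come back intact, so nothing looks wrong at first glance.
# The payload slot now holds whatever the injected call returned (None here).
print(loaded["weights"], loaded["extra"])
```

The victim never calls a method on the loaded object: deserialization alone is enough to trigger the injected call.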

Today, Safetensors has become the standard format for storing models on HuggingFace.

This article explains why.


Dangers of Pickle

When saving a model using `torch.save`, PyTorch uses the default Python Pickler to serialize it into a `.pt` checkpoint.

Figure 1: A binary file serialized using Pickle contains malicious code. Source: HiddenLayer, "Pickle Files: The New ML Model Attack Vector".

These checkpoints included not only the model weights but also metadata fields, the optimizer state, the learning rate progression, etc. The danger of Pickle is that it can serialize any Python object, which opens an attack surface: malicious code can be injected into any custom class stored in the checkpoint.

The Role of Python Pickle:

→ `torch.save` uses pickle underneath.
→ Pickle is the standard serializer supported across all PyTorch versions.
→ Can serialize Python Objects
→ Considered unsafe, not type enforced
→ Code can be executed during deserialization.
→ Slower on large models & datasets
→ Can include model architecture and other parameters (Optimizer, LR, Loss Function).
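
One way to see the "code can be executed" point without executing anything is to disassemble a pickle stream with the standard-library `pickletools` module. The sketch below is an illustration rather than a real scanner: the `Payload` class and the helper name `code_running_opcodes` are hypothetical, and the opcode set is my own selection of opcodes that import or call something at load time.

```python
import pickle
import pickletools

# Opcodes that import a callable or invoke one while the stream is loaded.
CODE_RUNNING_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def code_running_opcodes(blob: bytes) -> set:
    """Disassemble a pickle stream (without loading it) and return the
    names of any flagged opcodes it contains."""
    return {
        op.name
        for op, _arg, _pos in pickletools.genops(blob)
        if op.name in CODE_RUNNING_OPCODES
    }


class Payload:
    """Hypothetical booby-trapped object; print stands in for real harm."""

    def __reduce__(self):
        return (print, ("ran on load",))


plain = pickle.dumps({"weights": [0.1, 0.2, 0.3]})
trapped = pickle.dumps({"weights": [0.1, 0.2, 0.3], "extra": Payload()})

print(code_running_opcodes(plain))    # no flagged opcodes: plain data only
print(code_running_opcodes(trapped))  # includes REDUCE, i.e. a call at load time
```

Legitimate checkpoints saved with `torch.save` also rely on these opcodes to rebuild tensors, so their presence alone is not proof of malice. What matters is which callables the stream references, and that is exactly what a pickle file gives you no guarantee about.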

Pickle is not only slow but also unsafe.

Safetensors, on the other hand, addresses both problems.


Why HuggingFace Safetensors

If you’ve been working with LLMs or Stable Diffusion and have downloaded weights from HuggingFace, you might have noticed they usually come with the `.safetensors` extension.

Figure 2: The HuggingFace Safetensors format, which is type-safe. Source

What are the advantages of `.safetensors`?
→ A new serialization format proposed by HuggingFace
→ Type enforced, not flexible (big change)
→ Focus on security, preventing code execution during deserialization.
→ Faster loading times and minimal size overhead on disk.
→ Cannot contain Python code.
→ Serializes model weights as (nBYTES_header, BYTES_header, REST_OF_FILE): an 8-byte header length, a JSON header, then the raw tensor bytes.
→ Zero-copy for the tensor data, that is, everything except the header.
→ Enables lazy loading: the file can be inspected without loading it entirely.
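
The layout in the list above can be sketched with the standard library alone. This is a hand-rolled illustration of the file structure, not the `safetensors` library itself: the helper names (`write_minimal_file`, `read_header`, `read_tensor`) are hypothetical, and it handles only a single 1-D float32 tensor.

```python
import json
import os
import struct
import tempfile


def write_minimal_file(path, name, values):
    """Write one 1-D float32 tensor as: header length, JSON header, raw bytes."""
    data = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
    header = {
        name: {
            "dtype": "F32",
            "shape": [len(values)],
            "data_offsets": [0, len(data)],  # relative to the start of the data
        }
    }
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte unsigned length
        f.write(header_bytes)
        f.write(data)


def read_header(path):
    """Inspect tensor names, dtypes and shapes without touching the weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def read_tensor(path, name):
    """Lazily read a single float32 tensor by seeking to its byte range."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        begin, end = header[name]["data_offsets"]
        f.seek(8 + header_len + begin)
        raw = f.read(end - begin)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.safetensors")
    write_minimal_file(path, "layer.weight", [0.5, 1.5, -2.0])
    print(read_header(path))
    print(read_tensor(path, "layer.weight"))  # [0.5, 1.5, -2.0]
```

Nothing in this file can name a Python callable: the header is plain JSON and the rest is raw numbers, which is why loading it cannot execute code. In practice you would use the `safetensors` package (for example `save_file` and `load_file` from `safetensors.torch`) rather than parsing the bytes by hand.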


If you’ve found this tip useful, share the article and restack it, so others can find it.



Appreciate your Feedback

I’m starting with a few shorter videos on basic concepts, and I plan to scale this up as I introduce major changes to the newsletter.

I’d love to hear your feedback on this video walkthrough.


Thank you!
👋
