Exploiting machine learning Pickle files (trailofbits.com)
63 points by ingve on March 18, 2021 | 21 comments


Something you won't gather from skim-reading the headline is that the author has also created a tool, Fickling: https://github.com/trailofbits/fickling - to aid in playing around with pickle files.

From the article: [Fickling] can help you reverse engineer, test, and even create malicious pickle files.
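The mechanism behind all of this is that a pickle can name any importable callable for unpickling to invoke. A minimal, harmless sketch of the idea (using operator.add as a stand-in for something like os.system):

```python
import operator
import pickle

class Payload:
    # pickle calls __reduce__ to learn how to rebuild an object;
    # whatever callable it names gets invoked by pickle.loads.
    def __reduce__(self):
        # Harmless stand-in: a real attacker would return
        # (os.system, ("...",)) or similar here instead.
        return (operator.add, (20, 22))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # executes operator.add(20, 22)
print(result)
```

Note that loading the blob never gives you a Payload back at all - you get whatever the named callable returns, after it has already run.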


Was surprised to learn that pickle is used for ML models. I was under the impression that it's pretty slow[1]. Maybe it's used here because it's Python-aware and doesn't have trouble saving complex data structures?

[1] https://www.benfrederickson.com/dont-pickle-your-data/


As someone with a toe in the deep learning research space: if you were to look at commonly used ML code, you'd find software engineering problems far bigger than pickling. I think it underscores the distinction between computer science and software engineering - between the theoreticians of the former and the practitioners who actually deploy their work in the latter.

Researchers, especially sleep-deprived grad students, have borderline unreadable code for papers since they don't care about deployment. I'd imagine the enterprise engineers who create development pipelines, however, take such risks into consideration.


Having deployed such code in the past, I have some very unhappy memories of list comprehensions with a bunch of single letter variables.

Nothing like tracking a bug to 1 line and finding out the line does 12 different things.


Yeah, I'm also surprised by this. The standard library itself explicitly warns you in a big red box that pickle is not secure. It also doesn't support NumPy, though NumPy has its own native persistence modules. I'd have expected people to be using something like hdf5 for this.
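For contrast, NumPy's own .npy format is just a small header plus raw array bytes, and np.load refuses pickled object arrays by default - a small sketch, assuming NumPy is installed:

```python
import io
import numpy as np

weights = np.arange(6, dtype=np.float32).reshape(2, 3)

buf = io.BytesIO()
np.save(buf, weights)  # .npy format: plain header + raw array bytes
buf.seek(0)
# allow_pickle=False (the default in modern NumPy) rejects any pickled
# object arrays, so loading an untrusted .npy can't execute code.
restored = np.load(buf, allow_pickle=False)
print(restored)
```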


Soumith and I had discussions about hdf5 at the beginning of the AI renaissance (around 2014?). Torch maintainers were definitely aware of better formats at that time (the original Torch in Lua has high-level File APIs for serialization).

It became muddy when we moved from Caffe / TensorFlow to dynamic models with PyTorch, where it is harder to see how to persist a model (meaning both the executable objects and the weights) efficiently and safely.

At the end of the day, I think "export" and "checkpointing" should be two different things. An "exported" model should be safe to deploy and run on platforms like Azure ML, while a "checkpointed" model should be treated like code, where anything goes. That is probably the niche ONNX should fill (for exporting).


the soopar geniuses at facebook clearly know better! /s

(in truth, because pytorch serialised models can include python code for jit scripts, it's nonobvious what a good way to store python code is -- but torch has recently moved to a zipfile impl as of 1.6: https://pytorch.org/docs/stable/generated/torch.save.html)

>I'd have expected people to be using something like hdf5 for this.

Amusingly, matlab was way ahead of the pack here; matfiles have been hdf5 since r2006b, back when we just called it "matlab 7.3".


yep! in past projects i've actually used the .mat file format for both C and python bits because their wrapping of hdf5 is sufficiently well done.

solved a lot of problems! :)


pickle.dump/load is only slow if your main object holds references to many small nested objects: e.g. a large Python dict with millions of key-value pairs that are small str or int objects.

If your main object only references a few large sub-objects (e.g. a bunch of multi-MB or -GB numpy arrays storing the numerical parameters of a machine learning model), then it can be very fast - basically IO-bottlenecked by writing or reading the bytes to/from disk.
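A rough illustration of that difference (the sizes are arbitrary and the timings machine-dependent, but the gap is usually dramatic):

```python
import pickle
import time

# Many small objects: a dict with a million small int keys/values.
many_small = {i: i for i in range(1_000_000)}
# One large object of comparable byte size: a single 8 MB blob.
one_big = bytes(8_000_000)

def time_pickle(obj):
    # Return how long pickle.dumps takes for this object.
    start = time.perf_counter()
    pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return time.perf_counter() - start

t_small = time_pickle(many_small)
t_big = time_pickle(one_big)
print(f"many small objects: {t_small:.4f}s, one big object: {t_big:.4f}s")
```

The big blob is essentially a memcpy, while the dict forces pickle to traverse and encode a million individual objects.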


Interesting. The link I posted benchmarks a lot (100k) of small 4-field objects, but they aren't nested.


Nesting is actually not such a problem in itself; it just hides the fact that a seemingly simple object on the surface might hold a reference to a large collection of small subobjects that will be slow to pickle.


I’m surprised pickled models are used for sharing with 3rd parties.

But internally in projects I see it used all the time. It’s easy and it works and you trust internal code.


Internal code has a habit of becoming external code.


For most use-cases, the performance cost of loading the model is pretty low compared to either the training cost or the cost of making thousands of inference calls (when used in an API). Maybe it matters if you do AWS Lambda with ML models, but mostly pickle performance is absolutely fine.

BUT the security problems remain, and they weigh much more heavily.


This was exploited by one team during an ML challenge (ai-han-solo) at Defcon CTF Finals 2019. Dropped a meterpreter every time you loaded that team's tensorflow model.


What's even worse is that ML frameworks (including newer ones) don't have built-in authenticity/integrity checking when loading model weights and model architecture. Developers have to build their own solutions, like checking a hash or signature themselves - and very few do.
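Rolling your own check is at least straightforward with the stdlib. A sketch of prepending an HMAC tag to the pickle bytes (the key handling and framing here are made up for illustration, and this only helps if the signing key is kept away from the attacker):

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"shared-secret"  # assumption: distributed out of band

def sign(model_bytes: bytes) -> bytes:
    # Prepend a 32-byte HMAC-SHA256 tag so loaders can verify
    # the blob hasn't been tampered with before unpickling it.
    tag = hmac.new(SECRET_KEY, model_bytes, hashlib.sha256).digest()
    return tag + model_bytes

def verified_load(blob: bytes):
    tag, body = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, body, hashlib.sha256).digest()
    # compare_digest avoids timing side channels in the comparison.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("model failed integrity check; refusing to unpickle")
    return pickle.loads(body)

weights = {"layer0": [0.1, 0.2]}
blob = sign(pickle.dumps(weights))
print(verified_load(blob))
```

The crucial part is that verification happens before pickle.loads ever touches the bytes.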

This threat model of an ML system is also quite interesting; it highlights the various security challenges a typical ML system faces: https://embracethered.com/blog/posts/2020/husky-ai-threat-mo...


Our team is literally running into this right now. XGBoost claims its non-pickle model file implementation (which is just JSON) is "experimental".


been there, done that (json parameter storage). it's slow as molasses and precision loss from storing floats as text can cause issues.


XGBoost also has a binary file format which is neither pickle nor json. It's shared across their Python, Java and R frontends and parsed by the C++ library.


Except the library claims that format will be moved away from, in favor of the JSON system.


why doesn't python have first class support for serde? it's such a basic and core function.
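The closest thing in the stdlib is wiring dataclasses to json by hand - nothing as automatic as Rust's serde derive, but a sketch of the usual pattern (the class here is hypothetical):

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class ModelMeta:
    name: str
    version: int

meta = ModelMeta("resnet", 2)
blob = json.dumps(asdict(meta))           # serialize to a safe text format
restored = ModelMeta(**json.loads(blob))  # deserialize; no code execution
print(restored)
```

Unlike pickle, the worst a malicious JSON blob can do here is fail to match the constructor's fields.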




