Something you won't gather from skim-reading the headline is that the author has also created a tool, Fickling (https://github.com/trailofbits/fickling), to aid in playing around with pickle files.
From the article: [Fickling] can help you reverse engineer, test, and even create malicious pickle files.
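To illustrate why such tooling exists at all: pickle will invoke any callable named in the stream during deserialization, so a crafted payload runs code on load. A minimal, harmless sketch (the `Evil` class name is my own, not something from Fickling):

```python
import pickle

class Evil:
    # pickle calls __reduce__ to ask how to serialize an object; whatever
    # callable it returns is invoked at *load* time with the given args.
    def __reduce__(self):
        return (eval, ("6 * 7",))  # harmless stand-in for e.g. os.system(...)

payload = pickle.dumps(Evil())
result = pickle.loads(payload)   # runs eval("6 * 7"); no Evil instance comes back
print(result)                    # 42
```

A real exploit would return something like `(os.system, ("...",))` instead; the point of a tool like Fickling is that it can decompile and flag such opcodes without ever executing them.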
Was surprised to learn that it's used for ML models. I was under the impression that it's pretty slow[1]. Maybe it's used here because it's Python-aware and doesn't have trouble saving complex data structures?
As someone with a toe in the deep learning research space: if you look at commonly used ML code, you'll find software engineering problems far bigger than just pickling. I think it underscores the distinction between computer science and software engineering, that is, between the theoreticians of the former and those who actually deploy their ideas in the latter.
Researchers, especially sleep-deprived grad students, write borderline unreadable code for papers since they don't care about deployment. I'd imagine the enterprise engineers who build deployment pipelines, however, take such risks into consideration.
Yeah, I'm also surprised by this. The standard library itself explicitly warns you, in a big red box, that pickle is not secure. It also has no special support for NumPy, though NumPy ships its own native persistence functions. I'd have expected people to be using something like hdf5 for this.
Soumith and I had discussions about hdf5 at the beginning of the AI renaissance (around 2014?). The Torch maintainers were definitely aware of better formats at the time (the original Torch in Lua had high-level File APIs for serialization).
It became muddy when we moved from Caffe / TensorFlow to dynamic models with PyTorch, where it is harder to see how to persist a model (meaning both the executable objects and the weights) efficiently and safely.
At the end of the day, I think "export" and "checkpointing" should be two different things. An "exported" model should be safe to deploy and run on platforms like Azure ML, while a "checkpointed" model should be treated like code, where anything goes. That is probably the niche ONNX should fill (for exporting).
the soopar geniuses at facebook clearly know better! /s
(in truth, because pytorch serialised models can include python code for jit scripts, it's non-obvious what a good way to store python code is -- but torch has recently moved to a zipfile impl as of 1.6: https://pytorch.org/docs/stable/generated/torch.save.html)
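Since the new format is just a zip archive (pickled object graph plus tensor data as separate entries), you can tell it apart from a legacy `torch.save` file without torch installed. A rough sketch using only the stdlib; the file names are made up, and the stand-in files below only imitate the two layouts:

```python
import pickle
import zipfile

def looks_like_new_torch_format(path):
    # torch.save >= 1.6 writes a zip archive; the legacy format
    # is a raw pickle stream behind a magic-number preamble.
    return zipfile.is_zipfile(path)

# Stand-ins for the two on-disk layouts:
with zipfile.ZipFile("new_style.pt", "w") as zf:
    zf.writestr("data.pkl", pickle.dumps({"weights": [0.1, 0.2]}))

with open("old_style.pt", "wb") as f:
    f.write(pickle.dumps({"weights": [0.1, 0.2]}))

print(looks_like_new_torch_format("new_style.pt"))  # True
print(looks_like_new_torch_format("old_style.pt"))  # False
```

Note that both layouts still carry a pickle payload inside, so this check says nothing about whether the file is safe to load.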
>I'd have expected people to be using something like hdf5 for this.
Amusingly, matlab was way ahead of the pack here; matfiles have been hdf5 since r2006b, back when we just called it "matlab 7.3".
pickle.dump/load is only slow if your main object has references to many small nested objects: e.g. a large Python dict with millions of key-value pairs that are small Python str or int objects.
If your main object only references a few large sub-objects (e.g. a bunch of multi-MB or GB numpy arrays storing the numerical parameters of a machine learning model), then it can be very fast, basically IO-bottlenecked by writing or reading the bytes to/from disk.
Nesting is actually not such a problem in itself; it just hides the fact that a seemingly simple object on the surface might hold a reference to a large collection of small sub-objects that will be slow to pickle.
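A rough sketch of the effect, using stdlib objects only (the exact timings are machine-dependent, so treat them as illustrative):

```python
import pickle
import time

# Roughly comparable byte counts in two very different shapes:
many_small = {i: str(i) for i in range(1_000_000)}  # ~1M tiny objects
few_large = [b"x" * 8_000_000]                      # one big blob

for name, obj in [("many small objects", many_small), ("one large blob", few_large)]:
    t0 = time.perf_counter()
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    dt = time.perf_counter() - t0
    print(f"{name}: {len(data) / 1e6:.1f} MB pickled in {dt * 1000:.1f} ms")
```

The large-blob case is dominated by a single bulk copy of the buffer, which is why pickling a model that is mostly a few big weight arrays can run at close to disk speed.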
For most use cases, the performance cost of loading the model is pretty low compared to either the training cost or the cost of making thousands of inference calls (when used in an API). Maybe it matters if you run ML models on AWS Lambda, but mostly pickle performance is absolutely fine.
BUT the security problems remain, and they weigh much more heavily.
This was exploited by one team during an ML challenge (ai-han-solo) at Defcon CTF Finals 2019. Dropped a meterpreter every time you loaded that team's tensorflow model.
What's even worse is that ML frameworks (newer ones included) don't support built-in authenticity/integrity checking when loading a model and its architecture. Developers have to build their own solutions, like checking a hash or signature themselves, and very few do.
XGBoost also has a binary file format which is neither pickle nor json. It's shared across their Python, Java and R frontends and parsed by the C++ library.