Python pickle protocol choice?

PythonPython 2.7NumpyPickle

Python Problem Overview


I an using python 2.7 and trying to pickle an object. I am wondering what the real difference is between the pickle protocols.

import numpy as np
import pickle

class Data(object):
  def __init__(self):
    self.a = np.zeros((100, 37000, 3), dtype=np.float32)

d = Data()
print("data size: ", d.a.nbytes / 1000000.0)
print("highest protocol: ", pickle.HIGHEST_PROTOCOL)
pickle.dump(d, open("noProt", "w"))
pickle.dump(d, open("prot0", "w"), protocol=0)
pickle.dump(d, open("prot1", "w"), protocol=1)
pickle.dump(d, open("prot2", "w"), protocol=2)


out >> data size:  44.4
out >> highest protocol:  2

then I found that the saved files have different sizes on disk:

  • noProt: 177.6MB

  • prot0: 177.6MB

  • prot1: 44.4MB

  • prot2: 44.4MB

I know that prot0 is a human readable text file, so I don't want to use it. I guess protocol 0 is the one given by default.

I wonder what's the difference between protocols 1 and 2, is there a reason why I should chose one or another?

What's is the better to use, pickle or cPickle?

Python Solutions


Solution 1 - Python

Use the latest protocol that supports the lowest Python version you want to support reading the data. Newer protocol versions support new language features and include optimisations.

From the pickle module data format documentation:

> There are currently 6 different protocols which can be used for pickling. The higher the protocol used, the more recent the version of Python needed to read the pickle produced. > > * Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python. > * Protocol version 1 is an old binary format which is also compatible with earlier versions of Python. > * Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2. > * Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This was the default protocol in Python 3.0–3.7. > * Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. It is the default protocol starting with Python 3.8. Refer to PEP 3154 for information about improvements brought by protocol 4. > * Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.

and from the [pickle.Pickler(...) class section](

> The optional protocol argument, an integer, tells the pickler to use the given protocol; supported protocols are 0 to HIGHEST_PROTOCOL. If not specified, the default is DEFAULT_PROTOCOL. If a negative number is specified, HIGHEST_PROTOCOL is selected.

So when you want to support loading the pickled data with Python 3.4 or newer, pick protocol 4. If you need to support Python 2.7 still, pick protocol 2, especially if you are using custom classes derived from object (new-style classes) (which any modern code does, these days).

However, if you are exchanging pickled data with other Python versions or otherwise need to maintain backwards compatibility with older Python versions, it's easiest to just stick with the highest protocol version you can lay your hands on:

with open("prot2", 'wb') as pfile:
    pickle.dump(d, pfile, protocol=pickle.HIGHEST_PROTOCOL)

pickle.HIGHEST_PROTOCOL will always be the right version for the current Python version. Because this is a binary format, make sure to use 'wb' as the file mode!

Python 3 no longer distinguishes between cPickle and pickle, always use pickle when using Python 3. It uses a compiled C extension under the hood.

If you are still using Python 2, then cPickle and pickle are mostly compatible, the differences lie in the API offered. For most use-cases, just stick with cPickle; it is faster. Quoting the documentation again:

> First, cPickle can be up to 1000 times faster than pickle because the former is implemented in C. Second, in the cPickle module the callables Pickler() and Unpickler() are functions, not classes. This means that you cannot use them to derive custom pickling and unpickling subclasses. Most applications have no need for this functionality and should benefit from the greatly improved performance of the cPickle module.

Solution 2 - Python

For people using Python 3, there are, as of Python 3.5, five possible protocols to choose from:

There are currently 5 different protocols which can be used for pickling. The higher the protocol used, the more recent the version of Python needed to read the pickle produced [[doc][1]]:

> - Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.

> - Protocol version 1 is an old binary format which is also compatible with earlier versions of Python. > - Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for > information about improvements brought by protocol 2. > - Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This > is the default protocol, and the recommended protocol when > compatibility with other Python 3 versions is required. > - Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data > format optimizations. Refer to PEP 3154 for information about > improvements brought by protocol 4. > - Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.

A general rule is that you should use the highest possible protocol that is backward compatible with what you want to use it for. So if you want it to be backward compatible with Python 2, then protocol version 2 is a good choice, if you want it to be backward compatible with all Python versions then version 1 is good. If you do not care about backward compatibility then using pickle.HIGHEST_PROTOCOL automatically gives you the highest protocol for your Python version.

Also in Python 3, importing pickle automatically imports the C implementation.

Another point to note in terms of compatibility is that, by default protocols 3 and 4 use unicode encoding of strings whereas earlier protocols do not. So in Python 3, if you load a pickled file which was pickled in Python 2, you will probably have to explicitly specify the encoding in order to load it properly.

[1]: https://docs.python.org/3/library/pickle.html#data-stream-format "doc"

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionCobryView Question on Stackoverflow
Solution 1 - PythonMartijn PietersView Answer on Stackoverflow
Solution 2 - Pythonpatapouf_aiView Answer on Stackoverflow