Python pickle protocol choice?

Python Problem Overview

I am using Python 2.7 and trying to pickle an object. I am wondering what the real difference is between the pickle protocols.

import numpy as np
import pickle

class Data(object):
  def __init__(self):
    self.a = np.zeros((100, 37000, 3), dtype=np.float32)

d = Data()
print("data size: ", d.a.nbytes / 1000000.0)
print("highest protocol: ", pickle.HIGHEST_PROTOCOL)
pickle.dump(d, open("noProt", "w"))
pickle.dump(d, open("prot0", "w"), protocol=0)
pickle.dump(d, open("prot1", "w"), protocol=1)
pickle.dump(d, open("prot2", "w"), protocol=2)


out >> data size:  44.4
out >> highest protocol:  2

Then I found that the saved files have different sizes on disk:

noProt: 177.6MB
prot0: 177.6MB
prot1: 44.4MB
prot2: 44.4MB

I know that prot0 is a human-readable text file, so I don't want to use it. I guess protocol 0 is the one used by default.

What is the difference between protocols 1 and 2? Is there a reason why I should choose one over the other?

Which is better to use, pickle or cPickle?

Python Solutions

Solution 1 - Python

Use the latest protocol that supports the lowest Python version you want to support reading the data. Newer protocol versions support new language features and include optimisations.
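
As a rough illustration of those optimisations, the sketch below pickles the same object with every protocol the running interpreter supports and compares the sizes. The payload is made up for the example (it is not the NumPy array from the question), and exact byte counts will vary between Python versions.

```python
import pickle

# Illustrative payload: 1000 floats with long decimal representations.
payload = [i / 7.0 for i in range(1, 1001)]

# Serialize the same object with every protocol this interpreter supports.
sizes = {
    proto: len(pickle.dumps(payload, protocol=proto))
    for proto in range(pickle.HIGHEST_PROTOCOL + 1)
}

for proto, size in sorted(sizes.items()):
    print("protocol %d: %d bytes" % (proto, size))
```

Protocol 0 writes each float as text, while the binary protocols store it in 8 bytes, so the protocol 0 output should come out noticeably larger for data like this.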

From the pickle module data format documentation:

> There are currently 6 different protocols which can be used for pickling. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.
>
> * Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
> * Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
> * Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
> * Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This was the default protocol in Python 3.0–3.7.
> * Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. It is the default protocol starting with Python 3.8. Refer to PEP 3154 for information about improvements brought by protocol 4.
> * Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.

And from the pickle.Pickler(...) class section:

> The optional protocol argument, an integer, tells the pickler to use the given protocol; supported protocols are 0 to HIGHEST_PROTOCOL. If not specified, the default is DEFAULT_PROTOCOL. If a negative number is specified, HIGHEST_PROTOCOL is selected.
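
A small sketch of how the protocol argument behaves, assuming Python 3 (pickle.DEFAULT_PROTOCOL does not exist in Python 2). The helper protocol_of is a hypothetical name introduced for this example; it relies on the fact that pickles written with protocol 2 or newer begin with a PROTO opcode followed by the protocol number.

```python
import pickle

obj = {"answer": 42}

# A negative protocol number selects pickle.HIGHEST_PROTOCOL.
newest = pickle.dumps(obj, protocol=-1)
explicit = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

# Omitting the argument uses pickle.DEFAULT_PROTOCOL.
default = pickle.dumps(obj)


def protocol_of(data):
    """Return the protocol recorded in a pickle's header, or None if absent.

    Pickles written with protocol 2 or newer start with the PROTO opcode
    (byte 0x80) followed by one byte holding the protocol number.
    Protocols 0 and 1 have no such header. (Hypothetical helper.)
    """
    header = bytearray(data[:2])
    if len(header) == 2 and header[0] == 0x80:
        return header[1]
    return None


print(protocol_of(newest), protocol_of(default))
```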

So when you want to support loading the pickled data with Python 3.4 or newer, pick protocol 4. If you need to support Python 2.7 still, pick protocol 2, especially if you are using custom classes derived from object (new-style classes) (which any modern code does, these days).
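
That rule can be sketched as a lookup from the oldest Python version that must read the data to the newest protocol it understands. The table is taken from the documentation list quoted above; PROTOCOL_INTRODUCED and protocol_for are hypothetical names for this example and are not part of the pickle module.

```python
import pickle

# First Python release able to read each protocol, per the quoted list.
# Protocols 0 and 1 predate Python 2.3, so (0, 0) stands in for "any version".
PROTOCOL_INTRODUCED = {
    0: (0, 0),
    1: (0, 0),
    2: (2, 3),
    3: (3, 0),
    4: (3, 4),
    5: (3, 8),
}


def protocol_for(oldest_reader):
    """Return the newest protocol readable by `oldest_reader`, e.g. (2, 7).

    The result is capped at pickle.HIGHEST_PROTOCOL because the running
    interpreter cannot write a protocol it does not know.
    """
    readable = [
        proto
        for proto, introduced in PROTOCOL_INTRODUCED.items()
        if introduced <= tuple(oldest_reader)
        and proto <= pickle.HIGHEST_PROTOCOL
    ]
    return max(readable)


print(protocol_for((2, 7)))  # readers as old as Python 2.7
print(protocol_for((3, 4)))  # readers as old as Python 3.4
```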

However, if you are not exchanging pickled data with other Python versions and do not otherwise need to maintain backwards compatibility with older Python versions, it's easiest to just stick with the highest protocol version you can lay your hands on:

with open("prot2", 'wb') as pfile:
    pickle.dump(d, pfile, protocol=pickle.HIGHEST_PROTOCOL)

pickle.HIGHEST_PROTOCOL will always be the right version for the current Python version. Because this is a binary format, make sure to use 'wb' as the file mode!
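
For completeness, here is a minimal round trip showing the matching read side. The record and the temporary file path are made up for the example.

```python
import os
import pickle
import tempfile

record = {"name": "example", "values": [1.5, 2.5, 3.5]}  # illustrative data

# A throwaway path so the sketch does not clobber real files.
path = os.path.join(tempfile.mkdtemp(), "record.pkl")

# Binary protocols need binary file modes: "wb" to write ...
with open(path, "wb") as pfile:
    pickle.dump(record, pfile, protocol=pickle.HIGHEST_PROTOCOL)

# ... and "rb" to read. load() detects the protocol from the data itself,
# so no protocol argument is needed here.
with open(path, "rb") as pfile:
    restored = pickle.load(pfile)

print(restored == record)
```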

Python 3 no longer distinguishes between cPickle and pickle; always use pickle when using Python 3. It uses a compiled C extension under the hood.

If you are still using Python 2, then cPickle and pickle are mostly compatible; the differences lie in the API offered. For most use cases, just stick with cPickle, because it is faster. Quoting the documentation again:

> First, cPickle can be up to 1000 times faster than pickle because the former is implemented in C. Second, in the cPickle module the callables Pickler() and Unpickler() are functions, not classes. This means that you cannot use them to derive custom pickling and unpickling subclasses. Most applications have no need for this functionality and should benefit from the greatly improved performance of the cPickle module.
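
Code that has to run on both Python 2 and Python 3 often uses an import fallback along these lines. This is a common idiom rather than anything the documentation mandates:

```python
try:
    import cPickle as pickle  # Python 2: the faster C implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically

# Protocol 2 is the newest protocol both Python 2 and Python 3 can read.
data = pickle.dumps([1, 2, 3], protocol=2)
restored = pickle.loads(data)
print(restored)
```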

Solution 2 - Python

For people using Python 3, there are, as of Python 3.8, six possible protocols to choose from:

There are currently 6 different protocols which can be used for pickling. The higher the protocol used, the more recent the version of Python needed to read the pickle produced [[doc][1]]:

> - Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
> - Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
> - Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
> - Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This was the default protocol in Python 3.0–3.7.
> - Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. It is the default protocol starting with Python 3.8. Refer to PEP 3154 for information about improvements brought by protocol 4.
> - Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.

A general rule is that you should use the highest protocol that is still readable by every Python version you need to support. If the data must be readable by Python 2, protocol version 2 is a good choice; if it must be readable by all Python versions, version 1 is good. If you do not care about backward compatibility, pickle.HIGHEST_PROTOCOL automatically gives you the highest protocol for your Python version.

Also in Python 3, importing pickle automatically imports the C implementation.

Another point to note in terms of compatibility is that, by default, protocols 3 and 4 use Unicode encoding of strings whereas earlier protocols do not. So in Python 3, if you load a file that was pickled in Python 2, you will probably have to specify the encoding explicitly in order to load it properly.
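
A sketch of what that looks like in Python 3. The byte string below is a hand-written protocol 2 pickle standing in for a file produced by Python 2, so the exact bytes are an assumption made for illustration. The encoding argument of pickle.loads (and pickle.load) is the documented way to control how Python 2 str objects are decoded.

```python
import pickle

# Hand-written protocol 2 pickle of the Python 2 byte string 'caf\xe9'
# (Latin-1 for "café"): PROTO 2, SHORT_BINSTRING of length 4, BINPUT, STOP.
py2_pickle = b"\x80\x02U\x04caf\xe9q\x00."

# The default encoding is ASCII, which rejects the 0xe9 byte.
try:
    pickle.loads(py2_pickle)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# Decode Python 2 str objects as Latin-1 text ...
as_text = pickle.loads(py2_pickle, encoding="latin1")

# ... or keep them as raw bytes.
as_bytes = pickle.loads(py2_pickle, encoding="bytes")

print(default_failed, as_text, as_bytes)
```

Which encoding is right depends on what the Python 2 program stored: "latin1" maps every byte to a character and is often used for NumPy data, while "bytes" leaves the decision to the caller.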

[1]: https://docs.python.org/3/library/pickle.html#data-stream-format "doc"

Content Type	Original Author	Original Content on Stackoverflow
Question	Cobry	View Question on Stackoverflow
Solution 1 - Python	Martijn Pieters	View Answer on Stackoverflow
Solution 2 - Python	patapouf_ai	View Answer on Stackoverflow