Pickle or json?

Tags: Python, JSON, Pickle

Python Problem Overview


I need to save a small dict to disk, whose keys are of type str and whose values are ints, and then load it back. Something like this:

{'juanjo': 2, 'pedro': 99, 'other': 333}

What is the best option and why? Serialize it with pickle or with simplejson?

I am using Python 2.6.

Python Solutions


Solution 1 - Python

I prefer JSON over pickle for my serialization. Unpickling can run arbitrary code, and using pickle to transfer data between programs or store data between sessions is a security hole. JSON does not introduce a security hole and is standardized, so the data can be accessed by programs in different languages if you ever need to.
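To make the security point concrete, here is a minimal, harmless sketch. The `Payload` class is hypothetical and written only for this illustration; it uses `__reduce__` to tell pickle which callable to invoke on load. A real attacker could name any importable callable instead of the `eval` of a small arithmetic expression used here.

```python
import json
import pickle


class Payload(object):
    """Hypothetical object whose pickle asks the loader to call eval()."""

    def __reduce__(self):
        # On load, pickle calls the returned callable with the returned
        # arguments. Here that is eval("6 * 7"), a harmless stand-in.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it runs the callable stored in the blob.
result = pickle.loads(blob)

# JSON decoding only builds dicts, lists, strings, numbers, booleans and None.
data = json.loads('{"juanjo": 2, "pedro": 99, "other": 333}')
```

After running this, `result` holds the value computed by `eval` rather than a `Payload` instance, which is why pickles from untrusted sources should not be loaded.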

Solution 2 - Python

If you do not have any interoperability requirements (e.g. you are only going to use the data from Python) and a binary format is fine, go with cPickle, which gives you really fast Python object serialization.

If you want interoperability or you want a text format to store your data, go with JSON (or some other appropriate format depending on your constraints).
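As a rough sketch of both options applied to the dict from the question: the file names and the temporary directory below are assumptions made for the example, and the `try`/`except` import is only there so the same code runs on Python 2 (where cPickle exists) and Python 3 (where `pickle` is already C-accelerated).

```python
import json
import os
import tempfile

try:
    import cPickle as pickle  # Python 2: C implementation
except ImportError:
    import pickle             # Python 3: cPickle was merged into pickle

data = {'juanjo': 2, 'pedro': 99, 'other': 333}
workdir = tempfile.mkdtemp()  # scratch directory, chosen only for this example

# Binary, Python-only format. Pickle files must be opened in binary mode.
pickle_path = os.path.join(workdir, 'data.pkl')
with open(pickle_path, 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
with open(pickle_path, 'rb') as f:
    from_pickle = pickle.load(f)

# Text format that other languages (and a text editor) can read.
json_path = os.path.join(workdir, 'data.json')
with open(json_path, 'w') as f:
    json.dump(data, f)
with open(json_path) as f:
    from_json = json.load(f)
```

Both `from_pickle` and `from_json` compare equal to `data`. On Python 2 the JSON keys come back as `unicode` rather than `str`, which still compares equal for ASCII text.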

Solution 3 - Python

You might also find this interesting, with some charts to compare: http://kovshenin.com/archives/pickle-vs-json-which-is-faster/

Solution 4 - Python

If you are primarily concerned with speed and space, use cPickle because cPickle is faster than JSON.

If you are more concerned with interoperability, security, and/or human readability, then use JSON.


The test results referenced in other answers were recorded in 2010. Updated tests from 2016, using cPickle with [protocol 2][2], show:

  • cPickle 3.8x faster loading
  • cPickle 1.5x faster dumping
  • cPickle slightly smaller encoding

You can reproduce this yourself with [this gist][1], which is based on [Konstantin's benchmark][4] referenced in other answers, but uses cPickle with protocol 2 instead of pickle, and json instead of simplejson (since json is faster than simplejson), e.g.

wget https://gist.github.com/jdimatteo/af317ef24ccf1b3fa91f4399902bb534/raw/03e8dbab11b5605bc572bc117c8ac34cfa959a70/pickle_vs_json.py
python pickle_vs_json.py

Results with python 2.7 on a decent 2015 Xeon processor:

Dir	Entries	Method	Time	Length

dump	10	JSON	0.017	1484510
load	10	JSON	0.375	-
dump	10	Pickle	0.011	1428790
load	10	Pickle	0.098	-
dump	20	JSON	0.036	2969020
load	20	JSON	1.498	-
dump	20	Pickle	0.022	2857580
load	20	Pickle	0.394	-
dump	50	JSON	0.079	7422550
load	50	JSON	9.485	-
dump	50	Pickle	0.055	7143950
load	50	Pickle	2.518	-
dump	100	JSON	0.165	14845100
load	100	JSON	37.730	-
dump	100	Pickle	0.107	14287900
load	100	Pickle	9.907	-
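The headline ratios can be rechecked from the table with a few lines of arithmetic. This sketch only copies the 100-entry rows from the table above; the variable names are made up for the example.

```python
# Values copied from the 100-entry rows of the table above.
json_dump, json_load, json_size = 0.165, 37.730, 14845100
pickle_dump, pickle_load, pickle_size = 0.107, 9.907, 14287900

load_speedup = json_load / pickle_load        # about 3.8
dump_speedup = json_dump / pickle_dump        # about 1.5
size_ratio = float(pickle_size) / json_size   # about 0.96

print("load: %.1fx, dump: %.1fx, size: %.0f%% of JSON"
      % (load_speedup, dump_speedup, size_ratio * 100))
```

So the largest gap is in loading, the dump advantage is modest, and the pickle output is only a few percent smaller than the JSON output.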

[Python 3.4 with pickle protocol 3 is even faster.][3]

[1]: https://gist.github.com/jdimatteo/af317ef24ccf1b3fa91f4399902bb534 "this gist"
[2]: https://docs.python.org/2/library/pickle.html#data-stream-format "protocol 2"
[3]: https://stackoverflow.com/a/26860404/1007353
[4]: https://konstantin.blog/2010/pickle-vs-json-which-is-faster/

Solution 5 - Python

JSON or pickle? How about JSON and pickle!

You can use jsonpickle. It is easy to use, and the file on disk is readable because it's JSON.

See the jsonpickle documentation.

Solution 6 - Python

I tried several methods and found that cPickle with the protocol argument of dumps set to the highest protocol, i.e. cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL), is the fastest dump method.

import json
import pickle
import timeit

import cPickle  # Python 2 only
import msgpack
import numpy as np

num_tests = 10

# A 240x320x3 array of floats, roughly the shape of a small RGB image.
obj = np.random.normal(0.5, 1, [240, 320, 3])

# Each entry is (label, statement to time). json and msgpack cannot serialize
# a NumPy array directly, so it is converted to nested lists first; that
# conversion time is included in their measurements.
benchmarks = [
    ('pickle', 'pickle.dumps(obj)'),
    ('cPickle', 'cPickle.dumps(obj)'),
    ('cPickle highest', 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'),
    ('json', 'json.dumps(obj.tolist())'),
    ('msgpack', 'msgpack.packb(obj.tolist())'),
]
setup = 'from __main__ import pickle, cPickle, json, msgpack, obj'

for label, command in benchmarks:
    result = timeit.timeit(command, setup=setup, number=num_tests)
    print("%-15s:   %f seconds" % (label, result))

Output:

pickle         :   0.847938 seconds
cPickle        :   0.810384 seconds
cPickle highest:   0.004283 seconds
json           :   1.769215 seconds
msgpack        :   0.270886 seconds
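One likely reason for the large gap between the default call and HIGHEST_PROTOCOL is that Python 2 defaults to the ASCII protocol 0, while the higher protocols are binary. The sketch below is not the benchmark above: it uses only the standard library and a plain list of floats (an assumption, to avoid the NumPy and msgpack dependencies) and compares encoded sizes rather than timings.

```python
import pickle
import random

random.seed(0)  # fixed seed so repeated runs serialize the same values
values = [random.gauss(0.5, 1) for _ in range(10000)]

# Protocol 0 writes each float as text; the highest protocol writes it
# as 8 raw bytes plus an opcode.
text_blob = pickle.dumps(values, protocol=0)
binary_blob = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)

print("protocol 0: %d bytes" % len(text_blob))
print("highest   : %d bytes" % len(binary_blob))

# Both blobs decode to the same list; only the encoding differs.
same = pickle.loads(text_blob) == pickle.loads(binary_blob) == values
```

Note that Python 3 already defaults to a binary protocol, so passing the protocol explicitly matters most on Python 2.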

Solution 7 - Python

Personally, I generally prefer JSON because the data is human-readable. Of course, if you need to serialize something that JSON won't take, then use pickle.

But for most data storage, you won't need to serialize anything weird and JSON is much easier and always allows you to pop it open in a text editor and check out the data yourself.

The speed of pickle is nice, but for most datasets the difference is negligible; Python generally isn't that fast anyway.
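As an illustration of "something that JSON won't take", here is a small sketch using a set, which has no JSON equivalent. The `record` dict is made up for the example.

```python
import json
import pickle

record = {'name': 'juanjo', 'tags': set(['a', 'b'])}  # a set is not a JSON type

try:
    json.dumps(record)
    json_ok = True
except TypeError:
    json_ok = False  # json refuses objects it has no mapping for

# pickle round-trips the set unchanged.
restored = pickle.loads(pickle.dumps(record))

# A common workaround is to convert to a JSON-friendly type first.
as_json = json.dumps({'name': record['name'], 'tags': sorted(record['tags'])})
```

For a plain dict of strings and ints, as in the question, no such conversion is needed and JSON works directly.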

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type	Original Author	Original Content on Stackoverflow
Question	Juanjo Conti	View Question on Stackoverflow
Solution 1 - Python	Mike Graham	View Answer on Stackoverflow
Solution 2 - Python	Håvard S	View Answer on Stackoverflow
Solution 3 - Python	kovshenin	View Answer on Stackoverflow
Solution 4 - Python	JDiMatteo	View Answer on Stackoverflow
Solution 5 - Python	Paul Hildebrandt	View Answer on Stackoverflow
Solution 6 - Python	Ahmed Abobakr	View Answer on Stackoverflow
Solution 7 - Python	rickcnagy	View Answer on Stackoverflow