Hashing a dictionary?

Python, Hash, Dictionary

Python Problem Overview


For caching purposes I need to generate a cache key from GET arguments which are present in a dict.

Currently I'm using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally) but I'm curious if there's a better way.

Python Solutions


Solution 1 - Python

Using sorted(d.items()) isn't enough to get us a stable repr. Some of the values in d could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:

json.dumps(d, sort_keys=True)

That said, if the hashes need to be stable across different machines or Python versions, I'm not certain that this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself from any changes to the defaults there. I'd appreciate comments.
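
For illustration, here is a minimal sketch of turning that into a cache key; the helper name and the choice of SHA-1 are illustrative, not part of the original answer:

import hashlib
import json

def cache_key(get_args):
    # sort_keys=True makes the serialization independent of insertion order;
    # separators and ensure_ascii pin down defaults that could otherwise change.
    payload = json.dumps(get_args, sort_keys=True, separators=(',', ':'), ensure_ascii=True)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

print(cache_key({'page': 2, 'q': 'python'}))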

Solution 2 - Python

If your dictionary is not nested, you could make a frozenset with the dict's items and use hash():

hash(frozenset(my_dict.items()))

This is much less computationally intensive than generating the JSON string or representation of the dictionary.

UPDATE: Please see the comments below for why this approach might not produce a stable result.
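
A quick usage sketch (variable names are illustrative); note that string hashing depends on PYTHONHASHSEED, so the value is only stable within a single interpreter process:

a = {'x': 1, 'y': 2}
b = {'y': 2, 'x': 1}  # same contents, different insertion order
# frozenset ignores ordering, so both dicts hash identically within this process
assert hash(frozenset(a.items())) == hash(frozenset(b.items()))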

Solution 3 - Python

EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O'Connor's significantly simpler (and faster) solution (which also works for hashing nested dictionaries).

Although an answer has been accepted, the title of the question is "Hashing a python dictionary", and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)

Nested Dictionaries

If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won't work in this case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.

Here is one such mechanism:

import copy

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that
    contains only other hashable types (including any lists, tuples, sets, and
    dictionaries).
    """
    if isinstance(o, (set, tuple, list)):
        return tuple(make_hash(e) for e in o)
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))
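
For illustration, a hypothetical call on a nested dictionary; the exact value varies between interpreter runs because it is built from Python's hash():

settings = {'user': {'name': 'alice', 'roles': ['admin', 'dev']}, 'retries': 3}
print(make_hash(settings))  # a single int; nested dicts and lists are reduced recursively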

Bonus: Hashing Objects and Classes

The hash() function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:

class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789

The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__ attribute:

class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785

Alas, when you attempt to do the same thing with the class itself:

print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'

The class __dict__ property is not a normal dictionary:

print (type(Foo.__dict__)) # type <'dict_proxy'>

Here is a mechanism similar to the previous one that handles classes appropriately:

import copy

DictProxyType = type(object.__dict__)

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that
    contains only other hashable types (including any lists, tuples, sets, and
    dictionaries). In the case where other kinds of objects (like classes) need
    to be hashed, pass in a collection of object attributes that are pertinent.
    For example, a class can be hashed in this fashion:

        make_hash([cls.__dict__, cls.__name__])

    A function can be hashed like so:

        make_hash([fn.__dict__, fn.__code__])
    """
    if type(o) == DictProxyType:
        o2 = {}
        for k, v in o.items():
            if not k.startswith("__"):
                o2[k] = v
        o = o2

    if isinstance(o, (set, tuple, list)):
        return tuple(make_hash(e) for e in o)
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))

You can use this to return a hash tuple of however many elements you'd like:

# -7666086133114527897
print (make_hash(func.__code__))

# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))

# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))

NOTE: all of the above code assumes Python 3.x. I did not test it in earlier versions, although I assume make_hash() will work in, say, 2.7.2. As far as making the examples work, I do know that

func.__code__ 

should be replaced with

func.func_code

Solution 4 - Python

Here is a clearer solution.

def freeze(o):
    if isinstance(o, dict):
        return frozenset({k: freeze(v) for k, v in o.items()}.items())

    if isinstance(o, list):
        return tuple([freeze(v) for v in o])

    return o


def make_hash(o):
    """
    Makes a hash out of anything that contains only list, dict and hashable
    types, including string and numeric types.
    """
    return hash(freeze(o))
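
An illustrative call (the example dict is made up); equal contents hash equally regardless of key order:

config = {'a': 1, 'b': [1, 2, {'c': 3}]}
reordered = {'b': [1, 2, {'c': 3}], 'a': 1}
print(make_hash(config) == make_hash(reordered))  # True: key order does not matter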

Solution 5 - Python

The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions). make_hashable() converts the object into nested tuples, and make_hash_sha256() hashes the repr() of that result and returns it as a base64-encoded SHA256 digest.

import hashlib
import base64

def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()

def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))

    if isinstance(o, dict):
        return tuple(sorted((k,make_hashable(v)) for k,v in o.items()))

    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))

    return o

o = dict(x=1,b=2,c=[3,4,5],d={6,7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))

print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=

Solution 6 - Python

MD5 HASH

The method which resulted in the most stable results for me was using MD5 hashes and json.dumps:

from typing import Dict, Any
import hashlib
import json

def dict_hash(dictionary: Dict[str, Any]) -> str:
    """MD5 hash of a dictionary."""
    dhash = hashlib.md5()
    # We need to sort arguments so {'a': 1, 'b': 2} is
    # the same as {'b': 2, 'a': 1}
    encoded = json.dumps(dictionary, sort_keys=True).encode()
    dhash.update(encoded)
    return dhash.hexdigest()
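
An example call (the input dict is made up); the digest depends only on the JSON text, so it is stable across interpreter restarts and machines:

print(dict_hash({'a': 1, 'b': 2}))  # a 32-character hex digest
print(dict_hash({'b': 2, 'a': 1}))  # identical digest, because keys are sorted first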

Solution 7 - Python

While hash(frozenset(x.items())) and hash(tuple(sorted(x.items()))) work, that's doing a lot of work allocating and copying all the key-value pairs. A hash function really should avoid a lot of memory allocation.

A little bit of math can help here. The problem with most hash functions is that they assume that order matters. To hash an unordered structure, you need a commutative operation. Multiplication doesn't work well as any element hashing to 0 means the whole product is 0. Bitwise & and | tend towards all 0's or 1's. There are two good candidates: addition and xor.

from functools import reduce
from operator import xor

class hashable(dict):
    def __hash__(self):
        return reduce(xor, map(hash, self.items()), 0)

    # Alternative
    def __hash__(self):
        return sum(map(hash, self.items()))

One point: xor works, in part, because dict guarantees keys are unique. And sum works because Python truncates the value returned by __hash__ to the platform word size, so the running total stays bounded.

If you want to hash a multiset, sum is preferable. With xor, {a} would hash to the same value as {a, a, a} because x ^ x ^ x = x.

If you really need the guarantees that SHA makes, this won't work for you. But to use a dictionary in a set, this will work fine; Python containers are resilient to some collisions, and the underlying hash functions are pretty good.
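
A short usage sketch (variable names are illustrative): once __hash__ is defined, instances can be stored in sets or used as dict keys, provided their own values are hashable:

d1 = hashable({'a': 1, 'b': 2})
d2 = hashable({'b': 2, 'a': 1})
print(hash(d1) == hash(d2))  # True: xor (or sum) over items ignores ordering
print(d2 in {d1})            # True: equal dicts with equal hashes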

Solution 8 - Python

Updated from 2013 reply...

None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, this comes out in a machine-dependent order.

How about this instead?

import hashlib

def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    result = hashlib.sha1(
        # sha1() needs bytes in Python 3, so encode the formatted string
        ('%s' % sorted(the_dict.items())).encode('utf-8')
    ).hexdigest()
    return result
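
Illustrative calls (the parameter values are made up); the ignore arguments drop keys that shouldn't affect the hash:

params = {'q': 'python', 'page': 2, 'session_id': 'abc123'}
print(dict_hash(params))
print(dict_hash(params, 'session_id'))  # same dict hashed without the session_id key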

Solution 9 - Python

To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)), I would prefer a quick-and-dirty solution:

from pprint import pformat
h = hash(pformat(dictionary))

It will work even for types like datetime that are not JSON serializable.
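
For example (a hypothetical record; hash() of the resulting string is still per-process):

from datetime import datetime
from pprint import pformat

record = {'when': datetime(2020, 1, 1), 'who': 'alice'}
print(hash(pformat(record)))  # works even though datetime is not JSON serializable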

Solution 10 - Python

You could use the third-party frozendict module to freeze your dict and make it hashable.

from frozendict import frozendict
my_dict = frozendict(my_dict)

For handling nested objects, you could go with:

import collections.abc

def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
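
A hypothetical call, assuming the frozendict package is installed:

nested = {'a': [1, 2], 'b': {'c': 3}}
print(hash(make_hashable(nested)))  # lists become tuples, dicts become frozendicts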

If you want to support more types, use functools.singledispatch (Python 3.7):

import functools

@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

@make_hashable.register
def _(x: collections.abc.Hashable):
    return x

@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})

# add your own types here

Solution 11 - Python

You can use the maps library to do this. Specifically, maps.FrozenMap

import maps
fm = maps.FrozenMap(my_dict)
hash(fm)

To install maps, just do:

pip install maps

It handles the nested dict case too:

import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)

Disclaimer: I am the author of the maps library.

Solution 12 - Python

Use DeepHash from the DeepDiff module.

from deepdiff import DeepHash
obj = {'a': '1', 'b': '2'}
hashes = DeepHash(obj)[obj]

Solution 13 - Python

Answers from this thread with the highest upvotes didn't work for me, as their hashing functions give different results on different machines due to PYTHONHASHSEED.

I adjusted all the hints from this thread and came up with a solution that works for me.

import collections
import hashlib
import json


def simplify_object(o):
    if isinstance(o, dict):
        ordered_dict = collections.OrderedDict(sorted(o.items()))
        for k, v in ordered_dict.items():
            v = simplify_object(v)
            ordered_dict[str(k)] = v
        o = ordered_dict
    elif isinstance(o, (list, tuple, set)):
        o = [simplify_object(el) for el in o]
    else:
        o = str(o).strip()
    return o


def make_hash(o):
    o = simplify_object(o)
    bytes_val = json.dumps(o, sort_keys=True, ensure_ascii=True, default=str)
    hash_val = hashlib.sha1(bytes_val.encode()).hexdigest()
    return hash_val
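
Example usage (the input is made up); note that simplify_object() stringifies leaf values, so 1 and "1" produce the same hash here:

print(make_hash({'a': 1, 'nested': {'b': (2, 3), 'c': [None, 'x']}}))  # a 40-character SHA-1 hex digest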

Solution 14 - Python

One way to approach the problem is to make a tuple of the dictionary's items:

hash(tuple(my_dict.items()))

Solution 15 - Python

I do it like this:

hash(str(my_dict))

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | ThiefMaster | View Question on Stackoverflow
Solution 1 - Python | Jack O'Connor | View Answer on Stackoverflow
Solution 2 - Python | Imran | View Answer on Stackoverflow
Solution 3 - Python | jomido | View Answer on Stackoverflow
Solution 4 - Python | smartnut007 | View Answer on Stackoverflow
Solution 5 - Python | Claudio Fahey | View Answer on Stackoverflow
Solution 6 - Python | Shay Dahan | View Answer on Stackoverflow
Solution 7 - Python | Ben | View Answer on Stackoverflow
Solution 8 - Python | Steve Yeago | View Answer on Stackoverflow
Solution 9 - Python | shirk3y | View Answer on Stackoverflow
Solution 10 - Python | Eric | View Answer on Stackoverflow
Solution 11 - Python | Pedro Cattori | View Answer on Stackoverflow
Solution 12 - Python | JithinM | View Answer on Stackoverflow
Solution 13 - Python | Tom Wojcik | View Answer on Stackoverflow
Solution 14 - Python | Anonymous | View Answer on Stackoverflow
Solution 15 - Python | garbanzio | View Answer on Stackoverflow