How to create a dictionary of two pandas DataFrame columns

PythonDictionaryPandasDataframe

Python Problem Overview


What is the most efficient way to organise the following pandas Dataframe:

data =

Position	Letter
1	        a
2	        b
3	        c
4	        d
5	        e

into a dictionary like alphabet[1 : 'a', 2 : 'b', 3 : 'c', 4 : 'd', 5 : 'e']?

Python Solutions


Solution 1 - Python

In [9]: pd.Series(df.Letter.values,index=df.Position).to_dict()
Out[9]: {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

Speed comparion (using Wouter's method)

In [6]: df = pd.DataFrame(randint(0,10,10000).reshape(5000,2),columns=list('AB'))

In [7]: %timeit dict(zip(df.A,df.B))
1000 loops, best of 3: 1.27 ms per loop

In [8]: %timeit pd.Series(df.A.values,index=df.B).to_dict()
1000 loops, best of 3: 987 us per loop

Solution 2 - Python

I found a faster way to solve the problem, at least on realistically large datasets using: df.set_index(KEY).to_dict()[VALUE]

Proof on 50,000 rows:

df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)

%timeit dict(zip(df.A,df.B))
%timeit pd.Series(df.A.values,index=df.B).to_dict()
%timeit df.set_index('A').to_dict()['B']

Output:

100 loops, best of 3: 7.04 ms per loop  # WouterOvermeire
100 loops, best of 3: 9.83 ms per loop  # Jeff
100 loops, best of 3: 4.28 ms per loop  # Kikohs (me)

Solution 3 - Python

In Python 3.6 the fastest way is still the WouterOvermeire one. Kikohs' proposal is slower than the other two options.

import timeit

setup = '''
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)
'''

timeit.Timer('dict(zip(df.A,df.B))', setup=setup).repeat(7,500)
timeit.Timer('pd.Series(df.A.values,index=df.B).to_dict()', setup=setup).repeat(7,500)
timeit.Timer('df.set_index("A").to_dict()["B"]', setup=setup).repeat(7,500)

Results:

1.1214002349999777 s  # WouterOvermeire
1.1922008498571748 s  # Jeff
1.7034366211428602 s  # Kikohs

Solution 4 - Python

dict (zip(data['position'], data['letter']))

this will give you:

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

Solution 5 - Python

TL;DR

>>> import pandas as pd
>>> df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})
>>> dict(sorted(df.values.tolist())) # Sort of sorted... 
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
>>> from collections import OrderedDict
>>> OrderedDict(df.values.tolist())
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)])

In Long

Explaining solution: dict(sorted(df.values.tolist()))

Given:

df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})

[out]:

 Letter	Position
0	a	1
1	b	2
2	c	3
3	d	4
4	e	5

Try:

# Get the values out to a 2-D numpy array, 
df.values

[out]:

array([['a', 1],
       ['b', 2],
       ['c', 3],
       ['d', 4],
       ['e', 5]], dtype=object)

Then optionally:

# Dump it into a list so that you can sort it using `sorted()`
sorted(df.values.tolist()) # Sort by key

Or:

# Sort by value:
from operator import itemgetter
sorted(df.values.tolist(), key=itemgetter(1))

[out]:

[['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5]]

Lastly, cast the list of list of 2 elements into a dict.

dict(sorted(df.values.tolist())) 

[out]:

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

Answering @sbradbio comment:

If there are multiple values for a specific key and you would like to keep all of them, it's the not the most efficient but the most intuitive way is:

from collections import defaultdict
import pandas as pd

multivalue_dict = defaultdict(list)

df = pd.DataFrame({'Position':[1,2,4,4,4], 'Letter':['a', 'b', 'd', 'e', 'f']})

for idx,row in df.iterrows():
    multivalue_dict[row['Position']].append(row['Letter'])

[out]:

>>> print(multivalue_dict)
defaultdict(list, {1: ['a'], 2: ['b'], 4: ['d', 'e', 'f']})

Solution 6 - Python

Here are two other ways tested with the following df.

df = pd.DataFrame(np.random.randint(0,10,10000).reshape(5000,2),columns=list('AB'))

using to_records()

dict(df.to_records(index=False))

using MultiIndex.from_frame()

dict(pd.MultiIndex.from_frame(df))

Time of each.

24.6 ms ± 847 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.86 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Solution 7 - Python

I like the Wouter method, however the behaviour with duplicate values might not be what is expected and this scenario is not discussed one way or the other by the OP unfortunately. Wouter, will always choose the last value for each key encountered. So in other words, it will keep overwriting the value for each key.

The expected behavior in my mind would be more like https://stackoverflow.com/questions/60085490 where a list is kept for each key.

So for the case of keeping duplicates, let me submit df.groupby('Position')['Letter'].apply(list).to_dict() (Or perhaps even a set instead of a list)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionuser1083734View Question on Stackoverflow
Solution 1 - PythonJeffView Answer on Stackoverflow
Solution 2 - PythonKirellView Answer on Stackoverflow
Solution 3 - PythonpakobillView Answer on Stackoverflow
Solution 4 - Pythonuser18475123View Answer on Stackoverflow
Solution 5 - PythonalvasView Answer on Stackoverflow
Solution 6 - Pythonrhug123View Answer on Stackoverflow
Solution 7 - PythonLaptopProfileView Answer on Stackoverflow