Insert a Pandas Dataframe into mongodb using PyMongo

Tags: Python, MongoDB, Python 2.7, Pandas, PyMongo

Python Problem Overview


What is the quickest way to insert a pandas DataFrame into mongodb using PyMongo?

Attempts

db.myCollection.insert(df.to_dict())

gave an error

> InvalidDocument: documents must have only string keys, the key was Timestamp('2013-11-23 13:31:00', tz=None)


 db.myCollection.insert(df.to_json())

gave an error

> TypeError: 'str' object does not support item assignment


 db.myCollection.insert({id: df.to_json()})

gave an error

> InvalidDocument: documents must have only string keys, key was <built-in function id>


df

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 150 entries, 2013-11-23 13:31:26 to 2013-11-23 13:24:07
Data columns (total 3 columns):
amount    150  non-null values
price     150  non-null values
tid       150  non-null values
dtypes: float64(2), int64(1)

Python Solutions


Solution 1 - Python

This is the quickest way: use the insert_many method from PyMongo 3 together with the 'records' orientation of the to_dict method.

db.collection.insert_many(df.to_dict('records'))
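To see why this works where the plain df.to_dict() from the question fails, it can help to compare the two orientations on a small frame. The sketch below uses made-up sample values shaped like the question's data and needs no database connection.

```python
import pandas as pd

# Made-up sample rows shaped like the DataFrame in the question
index = pd.to_datetime(["2013-11-23 13:31:26", "2013-11-23 13:24:07"])
df = pd.DataFrame(
    {"amount": [0.5, 1.2], "price": [800.0, 801.5], "tid": [1, 2]},
    index=index,
)

# Default orientation: {column: {index_label: value}}.
# The inner keys are Timestamps, which is what triggers InvalidDocument.
by_column = df.to_dict()

# 'records' orientation: one dict per row, keyed by column name.
# Note that the index itself is not part of the records.
records = df.to_dict("records")
```

Each element of records is a dict with string keys only, which is the shape insert_many expects. If the index carries information you need, it has to be turned into a column first.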

Solution 2 - Python

I doubt there is a method that is both the quickest and simple. If you are not worried about data conversion, you can do:

>>> import datetime
>>> import json
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict({'A': {1: datetime.datetime.now()}})
>>> df
                           A
1 2013-11-23 21:14:34.118531

>>> records = list(json.loads(df.T.to_json()).values())  # list() is needed on Python 3
>>> db.myCollection.insert_many(records)  # insert() was removed in PyMongo 4

But if you load the data back (read_mongo is a helper, not shown here, that reads a collection into a DataFrame), you'll get:

>>> df = read_mongo(db, 'myCollection')
>>> df
                     A
0  1385241274118531000
>>> df.dtypes
A    int64
dtype: object
    

so you'll have to convert the 'A' column back to datetimes, along with every other field in your DataFrame that is not an int, float or str. For this example:

>>> df['A'] = pd.to_datetime(df['A'])
>>> df
                           A
0 2013-11-23 21:14:34.118531
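The conversion back can be tried without a database. This is a minimal sketch using the nanosecond integer from the output above; the millisecond variant is an assumption for the case where the JSON step wrote milliseconds instead.

```python
import pandas as pd

# Integer epoch values as they might come back after the JSON round trip
df_ns = pd.DataFrame({"A": [1385241274118531000]})  # nanoseconds, as in the output above
df_ms = pd.DataFrame({"A": [1385241274118]})        # milliseconds (assumed alternative)

# Integers are interpreted as nanoseconds unless a unit is given
df_ns["A"] = pd.to_datetime(df_ns["A"])
df_ms["A"] = pd.to_datetime(df_ms["A"], unit="ms")
```

If the restored dates land in 1970, the unit is most likely wrong.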

Solution 3 - Python

The odo library can do it with:

from odo import odo
odo(df, db.myCollection)

Solution 4 - Python

If your DataFrame has missing data (i.e. None or NaN) and you don't want keys with null values in your documents:

db.collection.insert_many(df.to_dict("records")) will insert keys with null values. To leave those keys out, you can use a modified version of the pandas .to_dict("records") code:

# _maybe_box_datetimelike is a private pandas helper; its name and location vary between pandas versions
from pandas.core.common import _maybe_box_datetimelike
my_list = [dict((k, _maybe_box_datetimelike(v)) for k, v in zip(df.columns, row) if v is not None and v == v) for row in df.values]
db.collection.insert_many(my_list)

The if v is not None and v == v condition that I've added checks that the value is neither None nor NaN (NaN is not equal to itself) before putting it in the row's dictionary. Now .insert_many will only include keys that have values in the documents (and no null data types).
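The same filtering can be sketched with public pandas functions only. This is not the code from the answer above: records_without_nulls is a hypothetical helper name, and the sketch assumes every cell holds a scalar, since pd.notna returns an array for list-like cells.

```python
import numpy as np
import pandas as pd


def records_without_nulls(df):
    """Return one dict per row, omitting keys whose value is None, NaN or NaT."""
    return [
        {key: value for key, value in record.items() if pd.notna(value)}
        for record in df.to_dict("records")
    ]


# Made-up sample data with one missing value in column "a" and one in column "c"
df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"], "c": [None, "z"]})
docs = records_without_nulls(df)
```

The resulting list could then be passed to insert_many on a collection.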

Solution 5 - Python

I think there are cool ideas in this question. In my case I have spent more time taking care of moving large DataFrames. In those cases pandas tends to offer a chunksize option (for example in pandas.DataFrame.to_sql). So I think I can contribute here by adding the function I use for this.

from pymongo import MongoClient


def write_df_to_mongoDB(my_df,
                        database_name='mydatabasename',
                        collection_name='mycollectionname',
                        server='localhost',
                        mongodb_port=27017,
                        chunk_size=100):
    """
    This function takes a DataFrame and writes it to a collection in MongoDB
    (you should provide the database name, the collection, and the server and
    port of the database). The collection is emptied before writing.

    ---------------------------------------------------------------------------
    Parameters / Input
        my_df: the DataFrame to send to MongoDB
        database_name: database name
        collection_name: collection name (to create)
        server: the server where the MongoDB database is hosted
            Example: server = 'XXX.XXX.XX.XX'
        mongodb_port: the port where the database is operating
            For example: mongodb_port = 27017
        chunk_size: the number of records that will be sent at the
            same time to the database. Default is 100.

    Output
        When finished will print "Done"
    ----------------------------------------------------------------------------
    FUTURE modifications.
    1. Write to SQL
    2. Write to csv
    ----------------------------------------------------------------------------
    30/11/2017: Rafael Valero-Fernandez. Documentation
    """
    # To connect
    client = MongoClient(server, int(mongodb_port))
    db = client[database_name]
    collection = db[collection_name]

    # To write
    collection.delete_many({})  # Remove every existing document from the collection
    # my_df = my_df.drop_duplicates(subset=None, keep='last')  # To avoid repetitions
    my_list = my_df.to_dict('records')

    # Insert chunks of the dataframe
    for start in range(0, len(my_list), chunk_size):
        chunk = my_list[start:start + chunk_size]
        collection.insert_many(chunk)  # fill the collection
        print(start + len(chunk))  # number of records written so far

    print('Done')
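The chunking step on its own can be sketched and checked without a database. The helper name iter_chunks is hypothetical and is not part of the function above.

```python
def iter_chunks(records, chunk_size=100):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


# 250 made-up records split into chunks of 100
records = [{"n": i} for i in range(250)]
sizes = [len(chunk) for chunk in iter_chunks(records, chunk_size=100)]
```

Each chunk could be passed to collection.insert_many(chunk). An empty input yields no chunks, which avoids calling insert_many with an empty list.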

Solution 6 - Python

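How about this:

db.myCollection.insert({id: df.to_json()})

Here id must be a variable holding a unique string for that df, not Python's built-in id function, which is what caused the InvalidDocument error in the question.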

Solution 7 - Python

I use the following to insert the DataFrame into a collection in the database. Calling reset_index first moves the index into a regular column, so it is stored with each record.

df.reset_index(inplace=True)
data_dict = df.to_dict("records")
myCollection.insert_many(data_dict)
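The effect of reset_index on the records can be checked without a database. A minimal sketch with made-up values follows; the index name "timestamp" is a hypothetical choice, and an unnamed index would end up in a column called "index".

```python
import pandas as pd

index = pd.to_datetime(["2013-11-23 13:31:26", "2013-11-23 13:24:07"])
df = pd.DataFrame({"price": [800.0, 801.5]}, index=index)
df.index.name = "timestamp"  # hypothetical name for the index column

# Without reset_index the index values are dropped from the records
without_index = df.to_dict("records")

# With reset_index the index becomes an ordinary column in every record
with_index = df.reset_index().to_dict("records")
```

The timestamp values in with_index are pandas Timestamp objects, which subclass datetime.datetime.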

Solution 8 - Python

Just make string keys!

import json
import pandas as pd

# collection is a PyMongo Collection, e.g. db.myCollection
dfData = json.dumps(df.to_dict('records'))
saveData = {'_id': 'a8e42ed79f9dae1cefe8781760231ec0', 'df': dfData}
res = collection.insert_one(saveData)

##### load dfData
data = collection.find_one({'_id': 'a8e42ed79f9dae1cefe8781760231ec0'}).get('df')
dfData = json.loads(data)
df = pd.DataFrame.from_dict(dfData)
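The serialize and restore steps can be tried without MongoDB by keeping the document in a plain dict. This sketch assumes the frame holds only JSON-friendly values (floats and strings here); the _id value and the column names are made up.

```python
import json

import pandas as pd

df = pd.DataFrame({"price": [800.0, 801.5], "side": ["buy", "sell"]})

# Store the whole frame as one JSON string under a string key
payload = json.dumps(df.to_dict("records"))
document = {"_id": "example-id", "df": payload}  # hypothetical _id

# Restore the frame from the stored string
restored = pd.DataFrame.from_dict(json.loads(document["df"]))
```

With this layout the whole frame lives in a single document, so individual rows cannot be queried, and a large frame may hit MongoDB's 16 MB document size limit.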

Solution 9 - Python

If you want to send several at one time:

db.myCollection.insert_many(df.apply(lambda x: x.to_dict(), axis=1).to_list())
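One thing to be aware of with the row-wise apply: each row is first turned into a Series, so columns of different numeric types are upcast to a common type. A small sketch with made-up values, comparing it with to_dict('records'):

```python
import pandas as pd

df = pd.DataFrame({"price": [800.0, 801.5], "tid": [1, 2]})

# Row-wise apply: each row becomes a float64 Series, so tid turns into a float
via_apply = df.apply(lambda x: x.to_dict(), axis=1).to_list()

# Column-aware conversion: tid keeps its integer type
via_records = df.to_dict("records")
```

The two lists compare equal by value, but the integer column is stored as a float in the first one.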

Solution 10 - Python

If you want to make sure that you're not raising InvalidDocument errors, then something like the following is a good idea. This is because mongo does not recognize types such as np.int64, np.float64, etc.

import numpy as np
from pymongo import MongoClient

client = MongoClient()
db = client.test
col = db.col


def createDocsFromDF(df, collection=None, insertToDB=False):
    docs = []
    for i in range(len(df)):
        # positional access, so this works with any index type
        doc = {name: df[name].iloc[i] for name in df.columns if name != 'index'}
        for key, val in doc.items():
            # we have to do this, because mongo does not recognize these np. types
            if isinstance(val, np.integer):
                doc[key] = int(val)
            if isinstance(val, np.floating):
                doc[key] = float(val)
            if isinstance(val, np.bool_):
                doc[key] = bool(val)
        docs.append(doc)
    # Collection objects do not support truth testing, so compare with None
    if insertToDB and collection is not None:
        collection.insert_many(docs)
    return docs
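The type conversion itself can be sketched as a small standalone helper. The name to_native is hypothetical; it relies on the item() method that NumPy scalars provide.

```python
import numpy as np


def to_native(value):
    """Convert a NumPy scalar to the equivalent built-in Python type."""
    if isinstance(value, np.generic):
        return value.item()
    return value


# Made-up document containing NumPy scalars
doc = {"tid": np.int64(7), "price": np.float64(800.5), "ok": np.bool_(True), "name": "x"}
clean = {key: to_native(val) for key, val in doc.items()}
```

Values that are already plain Python objects pass through unchanged.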

Solution 11 - Python

For upserts this worked:

for r in df2.to_dict(orient="records"):
    db['utest-pd'].update_one({'a': r['a']}, {'$set': r}, upsert=True)

It handles one record at a time; update_many did not seem able to work with a different filter value for each record.
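One way to avoid a round trip per record might be PyMongo's bulk_write with one UpdateOne request per record. The sketch below is an assumption about how that could look, not something from the answer above; build_upsert_specs and bulk_upsert are hypothetical helper names, and the key field 'a' follows the example.

```python
def build_upsert_specs(records, key):
    """Pair each record with a filter on key and a $set update."""
    return [({key: record[key]}, {"$set": record}) for record in records]


def bulk_upsert(collection, records, key):
    """Send all upserts in one bulk_write call (collection is a PyMongo Collection)."""
    # Imported here so that build_upsert_specs can be used without PyMongo installed
    from pymongo import UpdateOne

    requests = [
        UpdateOne(flt, update, upsert=True)
        for flt, update in build_upsert_specs(records, key)
    ]
    return collection.bulk_write(requests)


# Made-up records keyed on field "a"
records = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
specs = build_upsert_specs(records, "a")
```

With a real collection the call would be bulk_upsert(db['utest-pd'], df2.to_dict(orient="records"), 'a'). Each record still carries its own filter, but all requests go to the server in one batch.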

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Nyxynyx | View Question on Stackoverflow
Solution 1 - Python | dieguico | View Answer on Stackoverflow
Solution 2 - Python | alko | View Answer on Stackoverflow
Solution 3 - Python | Femto Trader | View Answer on Stackoverflow
Solution 4 - Python | Radical Edward | View Answer on Stackoverflow
Solution 5 - Python | Rafael Valero | View Answer on Stackoverflow
Solution 6 - Python | PasteBT | View Answer on Stackoverflow
Solution 7 - Python | Oshidi | View Answer on Stackoverflow
Solution 8 - Python | Jon Pan | View Answer on Stackoverflow
Solution 9 - Python | Raphael Mazzine | View Answer on Stackoverflow
Solution 10 - Python | bpbirch | View Answer on Stackoverflow
Solution 11 - Python | citynorman | View Answer on Stackoverflow