How to import data from MongoDB to pandas?

Python, MongoDB, Pandas, Pymongo

Python Problem Overview


I have a large amount of data in a collection in MongoDB which I need to analyze. How do I import that data into pandas?

I am new to pandas and numpy.

EDIT: The MongoDB collection contains sensor values tagged with date and time. The sensor values are floats.

Sample Data:

{
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
	"SensorReport"
],
"Readings" : [
	{
		"a" : 0.958069536790466,
		"_types" : [
			"Reading"
		],
		"ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
		"b" : 6.296118156595,
		"_cls" : "Reading"
	},
	{
		"a" : 0.95574014778624,
		"_types" : [
			"Reading"
		],
		"ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
		"b" : 6.29651468650064,
		"_cls" : "Reading"
	},
	{
		"a" : 0.953648289182713,
		"_types" : [
			"Reading"
		],
		"ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
		"b" : 7.29679823731148,
		"_cls" : "Reading"
	},
	{
		"a" : 0.955931884300997,
		"_types" : [
			"Reading"
		],
		"ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
		"b" : 6.29642922525632,
		"_cls" : "Reading"
	},
	{
		"a" : 0.95821381,
		"_types" : [
			"Reading"
		],
		"ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
		"b" : 7.28956613,
		"_cls" : "Reading"
	},
	{
		"a" : 4.95821335,
		"_types" : [
			"Reading"
		],
		"ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
		"b" : 6.28956574,
		"_cls" : "Reading"
	},
	{
		"a" : 9.95821341,
		"_types" : [
			"Reading"
		],
		"ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
		"b" : 0.28956488,
		"_cls" : "Reading"
	},
	{
		"a" : 1.95667927,
		"_types" : [
			"Reading"
		],
		"ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
		"b" : 0.29115237,
		"_cls" : "Reading"
	}
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}

Python Solutions


Solution 1 - Python

pymongo might give you a hand. The following is some code I'm using:

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)

    return conn[db]


def read_mongo(db, collection, query={}, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id (guard against a result set without one)
    if no_id and '_id' in df:
        del df['_id']

    return df
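Hypothetical usage against the question's data ('sensor_db' is a placeholder database name, and 'SensorReport' is assumed as the collection name):

df = read_mongo('sensor_db', 'SensorReport', query={'sensorName': '56847890-0'})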

Solution 2 - Python

You can load your MongoDB data into a pandas DataFrame using this code. It works for me; hopefully it will for you too.

import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.database_name
collection = db.collection_name
data = pd.DataFrame(list(collection.find()))
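The same one-liner accepts a filter and a projection, for example (field names taken from the sample data above):

data = pd.DataFrame(list(collection.find({'reportCount': {'$gte': 8}}, {'_id': 0})))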

Solution 3 - Python

Monary does exactly that, and it's super fast. There is a nice blog post with a quick tutorial and some timings.
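A hedged sketch of a Monary query, assuming its query(db, collection, spec, fields, types) API and a local server; note Monary is no longer maintained, and 'mydb' is a placeholder database name:

import pandas as pd
from monary import Monary

client = Monary()  # defaults to localhost:27017
columns = ['reportCount']
arrays = client.query('mydb', 'SensorReport', {}, columns, ['int32'])  # numpy arrays back
df = pd.DataFrame({col: arr for col, arr in zip(columns, arrays)})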

Solution 4 - Python

As per PEP 20, simple is better than complex:

import pandas as pd

# db is a pymongo MongoClient instance; the angle-bracketed names are placeholders
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())

You can include query conditions just as you would when working with a regular MongoDB database, or even use find_one() to get only one element from the database, etc.

and voila!

Solution 5 - Python

import pandas as pd
from odo import odo

data = odo('mongodb://localhost/db::collection', pd.DataFrame)

Solution 6 - Python

Another option I found very useful is:

from pandas.io.json import json_normalize

cursor = my_collection.find()
df = json_normalize(cursor)

(or json_normalize(list(cursor)), depending on your Python/pandas versions).

This way you get the unfolding of nested MongoDB documents for free.
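In pandas 1.0 and later, json_normalize is available at the top level (the pandas.io.json import still works but is deprecated):

import pandas as pd

cursor = my_collection.find()
df = pd.json_normalize(list(cursor))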

Solution 7 - Python

For dealing with out-of-core (not fitting into RAM) data efficiently (i.e. with parallel execution), you can try the Python Blaze ecosystem: Blaze / Dask / Odo.

Blaze (and Odo) have out-of-the-box functions to deal with MongoDB, as sketched below.
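A minimal sketch of the Blaze side, assuming the Blaze-era API (the project is no longer actively maintained); 'db' and 'collection' are placeholder names:

import pandas as pd
from blaze import Data
from odo import odo

d = Data('mongodb://localhost/db::collection')  # lazy expression over the collection
df = odo(d, pd.DataFrame)                       # materialize it as a DataFrame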

An article which shows what amazing things are possible with the Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 GB of Reddit comments in seconds).

P.S. I'm not affiliated with any of these technologies.

Solution 8 - Python

Using

pandas.DataFrame(list(...))

will consume a lot of memory if the iterator/generator result is large.

It is better to build small chunks and concatenate them at the end:

import pandas as pd


def iterator2dataframes(iterator, chunk_size: int):
  """Turn an iterator into multiple small pandas.DataFrame

  This is a balance between memory and efficiency
  """
  records = []
  frames = []
  for i, record in enumerate(iterator):
    records.append(record)
    if i % chunk_size == chunk_size - 1:
      frames.append(pd.DataFrame(records))
      records = []
  if records:
    frames.append(pd.DataFrame(records))
  return pd.concat(frames)
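Hypothetical usage with a pymongo cursor (collection as in Solution 2); the chunk size here is an arbitrary choice:

df = iterator2dataframes(collection.find(), 10000)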

Solution 9 - Python

http://docs.mongodb.org/manual/reference/mongoexport

Export to CSV and use read_csv, or export to JSON and use DataFrame.from_records().
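A hedged sketch of that route; the database name, field list, and file path below are placeholders:

# First, in a shell:
#   mongoexport --db mydb --collection SensorReport --type=csv \
#       --fields sensorName,reportCount --out report.csv
import pandas as pd

df = pd.read_csv('report.csv')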

Solution 10 - Python

You can also use pymongoarrow -- it's an official library offered by MongoDB for exporting MongoDB data to pandas, NumPy, Parquet files, etc.
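A minimal sketch, assuming a local server ('mydb' is a placeholder database name):

from pymongo import MongoClient
from pymongoarrow.monkey import patch_all

patch_all()  # adds find_pandas_all() and friends to pymongo's Collection

client = MongoClient()
df = client.mydb.SensorReport.find_pandas_all({})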

Solution 11 - Python

Following this great answer by waitingkuo, I would like to add the option of reading in chunks, in line with the chunksize argument of .read_sql() and .read_csv(). I extend the answer from Deu Leung by avoiding going one by one over each 'record' of the 'iterator' / 'cursor'. I will borrow the previous read_mongo function.

import pandas as pd
from pymongo import MongoClient


def read_mongo(db,
               collection, query={},
               host='localhost', port=27017,
               username=None, password=None,
               chunksize=100, no_id=True):
    """ Read from Mongo and store into DataFrame, one chunk at a time """

    # Connect to MongoDB
    client = MongoClient(host=host, port=port)
    db_aux = client[db]

    # Offsets at which each chunk starts. Note cursor.count() was removed in
    # PyMongo 4; count_documents() is the supported way to count matches.
    n_docs = db_aux[collection].count_documents(query)
    skips_variable = list(range(0, n_docs, int(chunksize))) + [n_docs]

    # Iteration to create the dataframe in chunks.
    df = pd.DataFrame()
    for i in range(1, len(skips_variable)):

        # Slice the cursor (pymongo turns this into skip/limit) and build a
        # partial DataFrame
        cursor_aux = db_aux[collection].find(query)[skips_variable[i-1]:skips_variable[i]]
        df_aux = pd.DataFrame(list(cursor_aux))

        if no_id and '_id' in df_aux:
            del df_aux['_id']

        # Concatenate the chunks into a unique df
        df = pd.concat([df, df_aux], ignore_index=True)

    return df

Solution 12 - Python

An approach similar to those of Rafael Valero, waitingkuo, and Deu Leung, using pagination:

def read_mongo(
        db,
        collection, query=None,
        host='localhost', port=27017, username=None, password=None,
        chunksize=100, page_num=1, no_id=True):

    # Connect to MongoDB (reuses _connect_mongo from Solution 1)
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Calculate number of documents to skip
    skips = chunksize * (page_num - 1)

    # query defaults to None rather than {} to avoid the mutable-default-argument
    # pitfall; see (article in Spanish):
    # https://www.toptal.com/python/c%C3%B3digo-buggy-python-los-10-errores-m%C3%A1s-comunes-que-cometen-los-desarrolladores-python/es
    if not query:
        query = {}

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query).skip(skips).limit(chunksize)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id (guard against an empty page)
    if no_id and '_id' in df:
        del df['_id']

    return df
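A hypothetical paging loop on top of this ('mydb' is a placeholder database name):

page_num = 1
while True:
    page = read_mongo('mydb', 'SensorReport', chunksize=1000, page_num=page_num)
    if page.empty:
        break
    print(page_num, len(page))  # replace with your per-page processing
    page_num += 1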

Solution 13 - Python

You can achieve what you want with pdmongo in three lines:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [], "mongodb://localhost:27017/mydb")

If your data is very large, you can run an aggregate query first, filtering out the data you do not want, then mapping the rest to your desired columns.

Here is an example of mapping Readings.a to column a and filtering by the reportCount column:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [{'$match': {'reportCount': {'$gt': 6}}}, {'$unwind': '$Readings'}, {'$project': {'a': '$Readings.a'}}], "mongodb://localhost:27017/mydb")

read_mongo accepts the same arguments as pymongo's aggregate.

Solution 14 - Python

You can use the "pandas.json_normalize" method:

import pandas as pd
display(pd.json_normalize( x ))
display(pd.json_normalize( x , record_path="Readings" ))

It should display two tables, where x is your cursor or, equivalently, the sample document from the question (with small shims so it parses as Python):

from bson import ObjectId
def ISODate(st):
    return st

x = {
    "_cls" : "SensorReport",
    "_id" : ObjectId("515a963b78f6a035d9fa531b"),
    "_types" : [
        "SensorReport"
    ],
    "Readings" : [
        {
            "a" : 0.958069536790466,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
            "b" : 6.296118156595,
            "_cls" : "Reading"
        },
        # ... the remaining seven readings from the sample data above ...
    ],
    "latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
    "sensorName" : "56847890-0",
    "reportCount" : 8
}

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Nithin | View Question on Stackoverflow
Solution 1 - Python | waitingkuo | View Answer on Stackoverflow
Solution 2 - Python | saimadhu.polamuri | View Answer on Stackoverflow
Solution 3 - Python | shx2 | View Answer on Stackoverflow
Solution 4 - Python | Cy Bu | View Answer on Stackoverflow
Solution 5 - Python | fengwt | View Answer on Stackoverflow
Solution 6 - Python | Ikar Pohorský | View Answer on Stackoverflow
Solution 7 - Python | Dennis Golomazov | View Answer on Stackoverflow
Solution 8 - Python | Deo Leung | View Answer on Stackoverflow
Solution 9 - Python | Jeff | View Answer on Stackoverflow
Solution 10 - Python | Rachelle Palmer Frick | View Answer on Stackoverflow
Solution 11 - Python | Rafael Valero | View Answer on Stackoverflow
Solution 12 - Python | Jordy Cuan | View Answer on Stackoverflow
Solution 13 - Python | pakallis | View Answer on Stackoverflow
Solution 14 - Python | Ahmad Zuhair | View Answer on Stackoverflow