How to create a large pandas DataFrame from an SQL query without running out of memory?

Tags: Python, SQL, Pandas, Big Data

Python Problem Overview


I am having trouble querying a table of more than 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory.

This works:

import pandas.io.sql as psql
sql = "SELECT TOP 1000000 * FROM MyTable" 
data = psql.read_frame(sql, cnxn)

...but this does not work:

sql = "SELECT TOP 2000000 * FROM MyTable" 
data = psql.read_frame(sql, cnxn)

It returns this error:

File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples
(pandas\lib.c:42733) Memory Error

I have read here that a similar problem exists when creating a dataframe from a csv file, and that the work-around is to use the 'iterator' and 'chunksize' parameters like this:

read_csv('exp4326.csv', iterator=True, chunksize=1000)

Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I use some other method to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work just to execute a SELECT * query. Surely there is a simpler approach.

Python Solutions


Solution 1 - Python

As mentioned in a comment, starting from pandas 0.15 you have a chunksize argument in read_sql to read and process the query result chunk by chunk:

sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql , engine, chunksize=5):
    print(chunk)

Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
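For instance, here is a minimal sketch of processing each chunk as it arrives instead of keeping them all around; the connection string and table name below are placeholders, not from the original answer:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; adjust to your own server and driver
engine = create_engine("mssql+pyodbc://user:password@my_dsn")

total_rows = 0
for chunk in pd.read_sql_query("SELECT * FROM MyTable", engine, chunksize=100000):
    # Each chunk is an ordinary DataFrame; process it and let it go out of scope
    total_rows += len(chunk)

print("Processed %d rows" % total_rows)

Because only one chunk is alive at a time, peak memory stays at roughly one chunk's worth of rows (plus whatever the database driver buffers; see Solutions 4 and 5 for server-side cursors).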

Solution 2 - Python

Update: Make sure to check out Solution 1 above, as pandas now has built-in support for chunked loading.

You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:

import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # ORDER BY must come before LIMIT/OFFSET, and a stable sort key keeps the paging consistent
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)

It might also be the case that the whole dataframe is simply too large to fit in memory; in that case you will have no option but to restrict the number of rows or columns you're selecting.
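Note that the question uses MS SQL Server, which does not understand LIMIT/OFFSET. On SQL Server 2012 and later the equivalent paging clause is OFFSET ... FETCH, so the query string in the loop above would look roughly like this (an untested sketch, assuming an indexed ID column):

# SQL Server 2012+ paging; ORDER BY is mandatory before OFFSET/FETCH
sql = ("SELECT * FROM MyTable ORDER BY ID "
       "OFFSET %d ROWS FETCH NEXT %d ROWS ONLY" % (offset, chunk_size))

The rest of the loop stays the same.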

Solution 3 - Python

Code solution and remarks.

import pandas as pd

# Create empty list
dfl = []

# Create empty dataframe
dfs = pd.DataFrame()

# Start chunking
for chunk in pd.read_sql(query, con=conct, chunksize=10000000):

    # Append each chunk of the SQL result set to the list
    dfl.append(chunk)

# Concatenate the list of chunks into one dataframe
dfs = pd.concat(dfl, ignore_index=True)

However, my memory analysis tells me that even though the memory is released after each chunk is extracted, the list keeps growing and holds on to that memory, so there is no net gain in free RAM.

Would love to hear what the author / others have to say.
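One way to keep memory flat, if you do not actually need the full dataframe in RAM, is to process or persist each chunk and discard it instead of appending it to a list. Here is a rough sketch, reusing query and conct from the snippet above and writing to a hypothetical CSV file:

import pandas as pd

out_path = "mytable_dump.csv"  # hypothetical output file

first = True
for chunk in pd.read_sql(query, con=conct, chunksize=100000):
    # Append each chunk to disk and drop it, so peak memory stays around one chunk
    chunk.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
    first = False

This trades RAM for disk; if you truly need everything in one in-memory dataframe, the list-and-concat pattern above is unavoidable.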

Solution 4 - Python

The best way I found to handle this is to leverage the SQLAlchemy stream_results connection option:

conn = engine.connect().execution_options(stream_results=True)

and then passing the conn object to pandas:

pd.read_sql("SELECT *...", conn, chunksize=10000)

This will ensure that the cursor is handled server-side rather than client-side.

Solution 5 - Python

You can use server-side cursors (a.k.a. stream results):

import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    conn = engine.connect().execution_options(
        stream_results=True)

    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", conn, chunksize=1000):
        print(f"Got dataframe w/{len(chunk_dataframe)} rows")
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()

As mentioned in the comments by others, using the chunksize argument alone, as in pd.read_sql("SELECT * FROM users", engine, chunksize=1000), does not solve the problem, as it still loads the whole result into memory and then gives it to you chunk by chunk.

More explanation here

Solution 6 - Python

If you just want to limit the number of rows in the output, you can use the chunksize argument and take only the first chunk:

import pandas as pd

# Take only the first chunk (here up to 1,000,000 rows) from the chunked reader
data = next(pd.read_sql(sql, cnxn, chunksize=1000000))
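Since the question targets MS SQL Server, another option is to cap the rows in the query itself with TOP, as in the original post, and skip chunking entirely (a sketch, reusing cnxn from the question):

import pandas as pd

# SQL Server TOP syntax, as in the original question; adjust the row count to what fits in memory
sql = "SELECT TOP 1000000 * FROM MyTable"
data = pd.read_sql(sql, cnxn)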

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | slizb | View Question on Stackoverflow
Solution 1 - Python | Kamil Sindi | View Answer on Stackoverflow
Solution 2 - Python | ThePhysicist | View Answer on Stackoverflow
Solution 3 - Python | flying_fluid_four | View Answer on Stackoverflow
Solution 4 - Python | Julien Kervizic | View Answer on Stackoverflow
Solution 5 - Python | Ehsan Fathi | View Answer on Stackoverflow
Solution 6 - Python | DharsanB | View Answer on Stackoverflow