python pandas to_sql with sqlalchemy : how to speed up exporting to MS SQL?

Python, Sql, Pandas, Sqlalchemy, Pyodbc

Python Problem Overview


I have a dataframe with about 155,000 rows and 12 columns. If I export it to csv with dataframe.to_csv, the output is an 11MB file (which is produced instantly).

If, however, I export to a Microsoft SQL Server with the to_sql method, it takes between 5 and 6 minutes! No columns are text: only int, float, bool and dates. I have seen cases where ODBC drivers set nvarchar(max) and that slows down the data transfer, but that cannot be the case here.

Any suggestions on how to speed up the export process? Taking 6 minutes to export 11 MB of data makes the ODBC connection practically unusable.

Thanks!

My code is:

import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)
conn = engine.connect()

metadata = MetaData(conn)

my_data_frame.to_sql(TableName,engine)

Python Solutions


Solution 1 - Python

I recently had the same problem and felt like adding an answer to this for others. to_sql seems to send an INSERT query for every row, which makes it really slow. But since pandas 0.24.0 there is a method parameter in DataFrame.to_sql() where you can define your own insertion function, or just use method='multi' to tell pandas to pass multiple rows in a single INSERT query, which makes it a lot faster.

Note that your database may have a parameter limit. In that case you also have to define a chunksize.

So the solution should simply look like this:

my_data_frame.to_sql(TableName, engine, chunksize=<yourParameterLimit>, method='multi')

If you do not know your database's parameter limit, just try it without the chunksize parameter. It will either run or give you an error telling you the limit.
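For SQL Server in particular, a single statement is capped at roughly 2,100 parameters, and with method='multi' every row consumes one parameter per column (plus one for the index if you write it). Below is a minimal sketch of deriving a safe chunksize from that limit, reusing my_data_frame, engine and TableName from the question; the 2,100 figure is the commonly cited SQL Server cap, so check it against your own database:

# Derive a chunksize that keeps each multi-row INSERT under the parameter limit
PARAM_LIMIT = 2100  # commonly cited SQL Server cap on parameters per statement
rows_per_chunk = PARAM_LIMIT // (len(my_data_frame.columns) + 1)  # +1 for the index column

my_data_frame.to_sql(TableName, engine, chunksize=rows_per_chunk, method='multi')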

Solution 2 - Python

The DataFrame.to_sql method sends insert statements to your ODBC connector, which then treats them as regular inserts.

When this is slow, it is not the fault of pandas.

Saving the output of the DataFrame.to_sql method to a file, then replaying that file over an ODBC connector will take the same amount of time.

The proper way of bulk importing data into a database is to generate a CSV file and then use a load command, which in the MS flavour of SQL databases is called BULK INSERT.

For example:

BULK INSERT mydatabase.myschema.mytable
FROM 'mydatadump.csv';

The syntax reference is as follows:

BULK INSERT 
   [ database_name . [ schema_name ] . | schema_name . ] [ table_name | view_name ] 
      FROM 'data_file' 
     [ WITH 
    ( 
   [ [ , ] BATCHSIZE = batch_size ] 
   [ [ , ] CHECK_CONSTRAINTS ] 
   [ [ , ] CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ] 
   [ [ , ] DATAFILETYPE = 
      { 'char' | 'native'| 'widechar' | 'widenative' } ] 
   [ [ , ] FIELDTERMINATOR = 'field_terminator' ] 
   [ [ , ] FIRSTROW = first_row ] 
   [ [ , ] FIRE_TRIGGERS ] 
   [ [ , ] FORMATFILE = 'format_file_path' ] 
   [ [ , ] KEEPIDENTITY ] 
   [ [ , ] KEEPNULLS ] 
   [ [ , ] KILOBYTES_PER_BATCH = kilobytes_per_batch ] 
   [ [ , ] LASTROW = last_row ] 
   [ [ , ] MAXERRORS = max_errors ] 
   [ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ] 
   [ [ , ] ROWS_PER_BATCH = rows_per_batch ] 
   [ [ , ] ROWTERMINATOR = 'row_terminator' ] 
   [ [ , ] TABLOCK ] 
   [ [ , ] ERRORFILE = 'file_name' ] 
    )] 
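Tying this back to pandas, a minimal sketch of the round trip might look like the following: dump the dataframe to a CSV, then issue BULK INSERT through pyodbc. The connection string, the dbo schema and the file path are assumptions; note that BULK INSERT reads the file on the SQL Server machine, so the path has to be visible to the server, not just to the Python client.

import pyodbc

# BULK INSERT reads the file server-side, so write it somewhere the SQL Server service can reach
csv_path = r"\\fileshare\exports\mydatadump.csv"
my_data_frame.to_csv(csv_path, header=False, index=False)

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydatabase;Trusted_Connection=yes"
)
cursor = conn.cursor()
cursor.execute(
    "BULK INSERT mydatabase.dbo.mytable "
    "FROM '" + csv_path + "' "
    "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')"
)
conn.commit()
conn.close()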

Solution 3 - Python

You can use this: what makes it faster is the method parameter of pandas to_sql. I hope this helps.

In my experience this went from a seemingly infinite run time to about 8 seconds.


import time
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv('test.csv')

conn = create_engine(<connection_string>)

start_time = time.time()
df.to_sql('table_name', conn, method='multi', index=False, if_exists='replace')
print("--- %s seconds ---" % (time.time() - start_time))

Solution 4 - Python

You can use d6tstack, which has fast pandas-to-SQL functionality because it uses native DB import commands. It supports MS SQL, Postgres and MySQL.

import d6tstack.utils

uri_psql = 'postgresql+psycopg2://usr:pwd@localhost/db'
d6tstack.utils.pd_to_psql(df, uri_psql, 'table')

uri_mssql = 'mssql+pymssql://usr:pwd@localhost/db'
d6tstack.utils.pd_to_mssql(df, uri_mssql, 'table', 'schema')  # experimental

Also useful for importing multiple CSVs with data schema changes and/or preprocessing with pandas before writing to the db; see further down in the examples notebook.

import glob
import d6tstack.combine_csv

d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'),
    apply_after_read=apply_fun).to_psql_combine(uri_psql, 'table')  # apply_fun: your pandas preprocessing function

Solution 5 - Python

Why is pandas.DataFrame.to_sql slow?

When uploading data from pandas to Microsoft SQL Server, most of the time is actually spent converting from pandas into Python objects and then into the representation needed by the MS SQL ODBC driver. One of the reasons pandas is much faster for analytics than basic Python code is that it works on lean native arrays of integers / floats / … that don't have the same overhead as their respective Python counterparts. The to_sql method converts all these lean columns into many individual Python objects and thus doesn't get the usual performance benefits that other pandas operations have.

Use turbodbc.Cursor.executemanycolumns to speed this up

Given a pandas.DataFrame, you can use turbodbc and pyarrow to insert the data with less conversion overhead than the conversion to Python objects incurs.

import pyarrow as pa
import turbodbc

cursor = …  # cursor to a MS SQL connection initiated with turbodbc
df = …  # the pd.DataFrame to be inserted

# Convert the pandas.DataFrame to a pyarrow.Table, most of the columns
# will be zero-copy and thus this is quite fast.
table = pa.Table.from_pandas(df)

# Insert into the database
cursor.executemanycolumns("INSERT INTO my_table VALUES (?, ?, ?)",
                           table)

Why is this faster?

Instead of the conversion of pd.DataFrame -> collection of Python objects -> ODBC data structures, we are doing a conversion path pd.DataFrame -> pyarrow.Table -> ODBC structure. This is more performant due to:

  • Most of the columns of a pandas.DataFrame can be converted to columns of the pyarrow.Table without copying. The columns of the table will reference the same memory. So no actual conversion is done.
  • The conversion is done fully in native code with native types. This means that at no stage do we incur the overhead of Python objects, as long as we don't have object-typed columns.

Solution 6 - Python

I was running out of time and memory (more than 18GB allocated for a DataFrame loaded from a 120MB CSV) with this line:

df.to_sql('my_table', engine, if_exists='replace', method='multi', dtype={"text_field": db.String(64), "text_field2": db.String(128), "intfield1": db.Integer(), "intfield2": db.Integer(), "floatfield": db.Float()})

Here is the code that helped me to import and track progress of insertions at the same time:

import sqlalchemy as db

engine = db.create_engine('mysql://user:password@localhost:3306/database_name', echo=False)
connection = engine.connect()
metadata = db.MetaData()

# Define the target table explicitly so column types and indexes are under your control
my_table = db.Table('my_table', metadata,
              db.Column('text_field', db.String(64), index=True),
              db.Column('text_field2', db.String(128), index=True),
              db.Column('intfield1', db.Integer()),
              db.Column('intfield2', db.Integer()),
              db.Column('floatfield', db.Float())
             )
metadata.create_all(engine)

# Convert the DataFrame to a list of row dicts for executemany-style inserts
kw_dict = df.reset_index().sort_values(by="intfield2", ascending=False).to_dict(orient="records")

batch_size = 10000
for batch_start in range(0, len(kw_dict), batch_size):
    print("Inserting {}-{}".format(batch_start, batch_start + batch_size))
    connection.execute(my_table.insert(), kw_dict[batch_start:batch_start + batch_size])

Solution 7 - Python

My solution to this problem is below if it helps anyone. From what I've read, the pandas to_sql method loads one record at a time.

You can make a bulk insert statement that loads 1000 lines and commits that transaction instead of committing a single row each time. This increases the speed massively.

import pandas as pd
from sqlalchemy import create_engine
import pymssql
import os

connect_string  = [your connection string]
engine = create_engine(connect_string,echo=False)
connection = engine.raw_connection()
cursor = connection.cursor()

def load_data(report_name):
    # my report_name variable is also my sql server table name, so I use it to build the table name string
    sql_table_name = 'AR_' + str(report_name)
    global chunk  # to QC chunks that fail for some reason
    # report_full_path_new points at the CSV file to load (defined elsewhere)
    for chunk in pd.read_csv(report_full_path_new, chunksize=1000):
        chunk.replace('\'', '\'\'', inplace=True, regex=True)  # double up single quotes in the data to escape them in SQL
        chunk.fillna('NULL', inplace=True)

        my_data = str(chunk.to_records(index=False).tolist())  # convert data to string
        my_data = my_data[1:-1]  # clean up the ends
        my_data = my_data.replace('\"', '\'').replace('\'NULL\'', 'NULL')  # convert blanks to NULLs

        sql = """
        INSERT INTO {0}
        VALUES {1}
        """.format(sql_table_name, my_data)

        cursor.execute(sql)
        # you must call commit() to persist your data if you don't set autocommit to True
        connection.commit()

Solution 8 - Python

With SQLAlchemy >= 1.3, set fast_executemany=True while creating the engine object (see the SQLAlchemy documentation for the mssql+pyodbc dialect).
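A minimal sketch, reusing the server, database and table names from the question; the ODBC driver name in the URL is an assumption, so adjust it to whatever driver you have installed:

from sqlalchemy import create_engine

# fast_executemany makes pyodbc send batched parameter arrays instead of one round trip per row
engine = create_engine(
    'mssql+pyodbc://myserver/mydatabase?driver=ODBC+Driver+17+for+SQL+Server',
    fast_executemany=True
)
my_data_frame.to_sql('mytable', engine, index=False, if_exists='append')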

Solution 9 - Python

As said in other answers, the slowdowns and/or timeouts happen because pandas inserts many single rows over and over. The high volume of INSERT commands is slow and can overload the target database.

Using method='multi' tells pandas to upload in chunks. This is much faster and won't time out as easily.

from sqlalchemy import create_engine

# config is assumed to hold your connection details
sqlEngine = create_engine('mysql+mysqlconnector://' + config['user'] + ':' + config['pass'] + '@' + config['host'] + '/' + config['dbname'])
dbConnection = sqlEngine.connect()
df.to_sql('table_name', con=dbConnection, method='multi', if_exists='append', index=False)
dbConnection.close()

Solution 10 - Python

Based on this answer by Aseem.

You can use the cursor's copy_from method to simulate a bulk load. This was tested on Postgres; try it with your DB:

import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
from io import StringIO

ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)  # don't forget to add a password if needed

my_data_frame.head(0).to_sql(TableName, engine, if_exists='replace', index=False)  # create an empty table - just for structure
conn = engine.raw_connection()
cur = conn.cursor()
output = StringIO()
my_data_frame.to_csv(output, sep='\t', header=False, index=False)  # a CSV that will be used for the bulk load
output.seek(0)
cur.copy_from(output, TableName, null="")  # null values become ''; copy_from is a psycopg2 (Postgres) cursor method
conn.commit()
cur.close()
conn.close()

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: Pythonista anonymous (View Question on Stackoverflow)
Solution 1: NemesisMF (View Answer on Stackoverflow)
Solution 2: firelynx (View Answer on Stackoverflow)
Solution 3: glenn6452 (View Answer on Stackoverflow)
Solution 4: citynorman (View Answer on Stackoverflow)
Solution 5: Uwe L. Korn (View Answer on Stackoverflow)
Solution 6: anatoly (View Answer on Stackoverflow)
Solution 7: Joe (View Answer on Stackoverflow)
Solution 8: Vishal Gupta (View Answer on Stackoverflow)
Solution 9: embulldogs99 (View Answer on Stackoverflow)
Solution 10: Roee Anuar (View Answer on Stackoverflow)