python pandas to_sql with sqlalchemy : how to speed up exporting to MS SQL?
PythonSqlPandasSqlalchemyPyodbcPython Problem Overview
I have a dataframe with ca 155,000 rows and 12 columns. If I export it to csv with dataframe.to_csv , the output is an 11MB file (which is produced instantly).
If, however, I export to a Microsoft SQL Server with the to_sql method, it takes between 5 and 6 minutes! No columns are text: only int, float, bool and dates. I have seen cases where ODBC drivers set nvarchar(max) and this slows down the data transfer, but it cannot be the case here.
Any suggestions on how to speed up the export process? Taking 6 minutes to export 11 MBs of data makes the ODBC connection practically unusable.
Thanks!
My code is:
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"
engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)
conn = engine.connect()
metadata = MetaData(conn)
my_data_frame.to_sql(TableName,engine)
Python Solutions
Solution 1 - Python
I recently had the same problem and feel like to add an answer to this for others.
to_sql
seems to send an INSERT
query for every row which makes it really slow. But since 0.24.0
there is a method
parameter in pandas.to_sql()
where you can define your own insertion function or just use method='multi'
to tell pandas to pass multiple rows in a single INSERT query, which makes it a lot faster.
Note that your Database may has a parameter limit. In that case you also have to define a chunksize.
So the solution should simply look like to this:
my_data_frame.to_sql(TableName, engine, chunksize=<yourParameterLimit>, method='multi')
If you do not know your database parameter limit, just try it without the chunksize parameter. It will run or give you an error telling you your limit.
Solution 2 - Python
The DataFrame.to_sql
method generates insert statements to your ODBC connector which then is treated by the ODBC connector as regular inserts.
When this is slow, it is not the fault of pandas.
Saving the output of the DataFrame.to_sql
method to a file, then replaying that file over an ODBC connector will take the same amount of time.
The proper way of bulk importing data into a database is to generate a csv file and then use a load command, which in the MS flavour of SQL databases is called BULK INSERT
For example:
BULK INSERT mydatabase.myschema.mytable
FROM 'mydatadump.csv';
The syntax reference is as follows:
BULK INSERT
[ database_name . [ schema_name ] . | schema_name . ] [ table_name | view_name ]
FROM 'data_file'
[ WITH
(
[ [ , ] BATCHSIZE = batch_size ]
[ [ , ] CHECK_CONSTRAINTS ]
[ [ , ] CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
[ [ , ] DATAFILETYPE =
{ 'char' | 'native'| 'widechar' | 'widenative' } ]
[ [ , ] FIELDTERMINATOR = 'field_terminator' ]
[ [ , ] FIRSTROW = first_row ]
[ [ , ] FIRE_TRIGGERS ]
[ [ , ] FORMATFILE = 'format_file_path' ]
[ [ , ] KEEPIDENTITY ]
[ [ , ] KEEPNULLS ]
[ [ , ] KILOBYTES_PER_BATCH = kilobytes_per_batch ]
[ [ , ] LASTROW = last_row ]
[ [ , ] MAXERRORS = max_errors ]
[ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ]
[ [ , ] ROWS_PER_BATCH = rows_per_batch ]
[ [ , ] ROWTERMINATOR = 'row_terminator' ]
[ [ , ] TABLOCK ]
[ [ , ] ERRORFILE = 'file_name' ]
)]
Solution 3 - Python
You can use this: what makes it faster is the method
parameter of pandas to_sql
. I hope this help helps.
The result of this on my experience was from infinite time to 8 secs.
df = pd.read_csv('test.csv')
conn = create_engine(<connection_string>)
start_time = time.time()
df.to_sql('table_name', conn, method='multi',index=False, if_exists='replace')
print("--- %s seconds ---" % (time.time() - start_time))
Solution 4 - Python
You can use d6tstack which has fast pandas to SQL functionality because it uses native DB import commands. It supports MS SQL, Postgres and MYSQL
uri_psql = 'postgresql+psycopg2://usr:pwd@localhost/db'
d6tstack.utils.pd_to_psql(df, uri_psql, 'table')
uri_mssql = 'mssql+pymssql://usr:pwd@localhost/db'
d6tstack.utils.pd_to_mssql(df, uri_mssql, 'table', 'schema') # experimental
Also useful for importing multiple CSV with data schema changes and/or preprocess with pandas before writing to db, see further down in examples notebook
d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'),
apply_after_read=apply_fun).to_psql_combine(uri_psql, 'table')
Solution 5 - Python
pandas.DataFrame.to_sql
slow?
Why is When uploading data from pandas
to Microsoft SQL Server, most time is actually spent in converting from pandas
to Python objects to the representation needed by the MS SQL ODBC driver. One of the reasons pandas
is much faster for analytics than basic Python code is that it works on lean native arrays of integers / floats / … that don't have the same overhead as their respective Python counterparts. The to_sql
method is actually converting all these lean columns to many individual Python objects and thus doesn't get the usual performance treatment as the other pandas
operations have.
turbodbc.Cursor.insertmanycolumns
to speed this up
Use Given a pandas.DataFrame
, you can use turbodbc
and pyarrow
to insert the data with less conversion overhead than happening with the conversion to Python objects.
import pyarrow as pa
import turbodbc
cursor = … # cursor to a MS SQL connection initiated with turbodbc
df = … # the pd.DataFrame to be inserted
# Convert the pandas.DataFrame to a pyarrow.Table, most of the columns
# will be zero-copy and thus this is quite fast.
table = pa.Table.from_pandas(table)
# Insert into the database
cursor.executemanycolumns("INSERT INTO my_table VALUES (?, ?, ?)",
table)
Why is this faster?
Instead of the conversion of pd.DataFrame
-> collection of Python objects -> ODBC data structures, we are doing a conversion path pd.DataFrame
-> pyarrow.Table
-> ODBC structure. This is more performant due to:
- Most of the columns of a
pandas.DataFrame
can be converted to columns of thepyarrow.Table
without copying. The columns of the table will reference the same memory. So no actual conversion is done. - The conversion is done fully in native code with native types. This means that at no stage we occur the overhead of Python objects as long as we don't have
object
typed columns.
Solution 6 - Python
I was running out of time and memory (more than 18GB allocated for a DataFrame loaded from 120MB CSV) with this line:
df.to_sql('my_table', engine, if_exists='replace', method='multi', dtype={"text_field": db.String(64), "text_field2": db.String(128), "intfield1": db.Integer(), "intfield2": db.Integer(), "floatfield": db.Float()})
Here is the code that helped me to import and track progress of insertions at the same time:
import sqlalchemy as db
engine = db.create_engine('mysql://user:password@localhost:3306/database_name', echo=False)
connection = engine.connect()
metadata = db.MetaData()
my_table = db.Table('my_table', metadata,
db.Column('text_field', db.String(64), index=True),
db.Column('text_field2', db.String(128), index=True),
db.Column('intfield1', db.Integer()),
db.Column('intfield2', db.Integer()),
db.Column('floatfield', db.Float())
)
metadata.create_all(engine)
kw_dict = df.reset_index().sort_values(by="intfield2", ascending=False).to_dict(orient="records")
batch_size=10000
for batch_start in range(0, len(kw_dict), batch_size):
print("Inserting {}-{}".format(batch_start, batch_start + batch_size))
connection.execute(my_table.insert(), kw_dict[batch_start:batch_start + batch_size])
Solution 7 - Python
My solution to this problem is below if this helps anyone. From what I've read, pandas tosql method loads one record at a time.
You can make a bulk insert statement that loads 1000 lines and commits that transaction instead of committing a single row each time. This increases the speed massively.
import pandas as pd
from sqlalchemy import create_engine
import pymssql
import os
connect_string = [your connection string]
engine = create_engine(connect_string,echo=False)
connection = engine.raw_connection()
cursor = connection.cursor()
def load_data(report_name):
# my report_name variable is also my sql server table name so I use that variable to create table name string
sql_table_name = 'AR_'+str(report_name)
global chunk # to QC chunks that fail for some reason
for chunk in pd.read_csv(report_full_path_new,chunksize=1000):
chunk.replace('\'','\'\'',inplace=True,regex=True) #replace single quotes in data with double single quotes to escape it in mysql
chunk.fillna('NULL',inplace=True)
my_data = str(chunk.to_records(index=False).tolist()) # convert data to string
my_data = my_data[1:-1] # clean up the ends
my_data = my_data.replace('\"','\'').replace('\'NULL\'','NULL') #convert blanks to NULLS for mysql
sql_table_name = [your sql server table name]
sql = """
INSERT INTO {0}
VALUES {1}
""".format(sql_table_name,my_data)
cursor.execute(sql)
# you must call commit() to persist your data if you don't set autocommit to True
connection.commit()
Solution 8 - Python
With SQLAlchemy>=1.3
, while creating engine
object, set fast_executemany=True
. Reference
Solution 9 - Python
As said in other answers, the reason for the slowdown and/or time out is because pandas is inserting many single rows over and over. The high volume of insert commands is slow and/or may be overloading the target database.
using method='multi' tells pandas to upload in chunks. This is much faster and won't time out as easily.
sqlEngine=create_engine('mysql+mysqlconnector://'+config['user']+':'+config['pass']+'@'+config['host']+'/'+config['dbname'])
dbConnection=sqlEngine.connect()
df.to_sql('table_name',con=dbConnection,method='multi',if_exists='append',index=False)
dbConnection.close()
Solution 10 - Python
Based on this answer - Aseem.
You can use the copy_from method to simulate a bulk load with a cursor object. This was tested on Postgres, try it with your DB:
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
from StringIO import StringIO
ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"
engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database) #don't forget to add a password if needed
my_data_frame.head(0).to_sql(TableName, engine, if_exists='replace', index=False) # create an empty table - just for structure
conn = engine.raw_connection()
cur = conn.cursor()
output = StringIO()
my_data_frame.to_csv(output, sep='\t', header=False, index=False) # a CSV that will be used for the bulk load
output.seek(0)
cur.copy_from(output, TableName, null="") # null values become ''
conn.commit()
conn.close()
cur.close()