Executing an SQL query over a pandas dataset

Python Problem Overview

I have a pandas data set, called 'df'.

How can I do something like below;

df.query("select * from df")

Thank you.

For those who know R, there is a library called sqldf where you can execute SQL code in R, my question is basically, is there some library like sqldf in python

Python Solutions

Solution 1 - Python

This is not what pandas.query is supposed to do. You can look at package pandasql (same like sqldf in R )

import pandas as pd
import pandasql as ps

df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan],
               [1234, 'Customer A', np.nan, '333 Street'],
               [1233, 'Customer B', '444 Street', '333 Street'],
              [1233, 'Customer B', '444 Street', '666 Street']], columns=
['ID', 'Customer', 'Billing Address', 'Shipping Address'])

q1 = """SELECT ID FROM df """

print(ps.sqldf(q1, locals()))

     ID
0  1234
1  1234
2  1233
3  1233

Update 2020-07-10 > > > update the pandasql

ps.sqldf("select * from df")

Solution 2 - Python

After some time of using this I realised the easiest way is to just do

from pandasql import sqldf

output = sqldf("select * from df")

Works like a charm where df is a pandas dataframe You can install pandasql: https://pypi.org/project/pandasql/

Solution 3 - Python

You can use DataFrame.query(condition) to return a subset of the data frame matching condition like this:

df = pd.DataFrame(np.arange(9).reshape(3,3), columns=list('ABC'))
df
   A  B  C
0  0  1  2
1  3  4  5
2  6  7  8

df.query('C < 6')
   A  B  C
0  0  1  2
1  3  4  5


df.query('2*B <= C')
   A  B  C
0  0  1  2


df.query('A % 2 == 0')
   A  B  C
0  0  1  2
2  6  7  8

This is basically the same effect as an SQL statement, except the SELECT * FROM df WHERE is implied.

Solution 4 - Python

There's actually a new package that I just released, called dataframe_sql. This gives you the ability to query pandas dataframes using SQL just as you want to. You can find the package here

Solution 5 - Python

Much better solution is to use duckdb

pip install duckdb

import pandas as pd
import duckdb
test_df = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})

duckdb.query("SELECT * FROM test_df where i>2").df() # returns a result dataframe

Performance improvement over pandasql: test data NYC yellow cabs ~120mb of csv data

nyc = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv',low_memory=False)

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

pysqldf("SELECT * FROM nyc where trip_distance>10")
# wall time 16.1s

duckdb.query("SELECT * FROM nyc where trip_distance>10").df()
# wall time 183ms

A improvement of speed of roughly 100x

This article gives good details and claims 1000x improvement over pandasql: https://duckdb.org/2021/05/14/sql-on-pandas.html

Solution 6 - Python

Or, you can use the tools that do what they do best:

Install postgresql
Connect to the database: >from sqlalchemy import create_engine
>import urllib.parse
>engconnect = "{0}://{1}:{2}@{3}:{4}/{5}".format(dialect,user_uenc, pw_uenc, host,port, dbname)
>dbengine = create_engine(engconnect)
>database = dbengine.connect()
Dump the dataframe into postgres

>df.to_sql('mytablename', database, if_exists='replace')

Write your query with all the SQL nesting your brain can handle. >myquery = "select distinct * from mytablename"
Create a dataframe by running the query:

>newdf = pd.read_sql(myquery, database)

Solution 7 - Python

I think a better solution than pandassql would be duckdb. The way it handles the table name mapping to a dataframe object is a little cleaner imo. I have not evaluated performance though.

Solution 8 - Python

There is also FugueSQL

pip install fugue[sql]

import pandas as pd
from fugue_sql import fsql

comics_df = pd.DataFrame({'book': ['Secret Wars 8',
                                   'Tomb of Dracula 10',
                                   'Amazing Spider-Man 252',
                                   'New Mutants 98',
                                   'Eternals 1',
                                   'Amazing Spider-Man 300',
                                   'Department of Truth 1'],
                          'publisher': ['Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Image'],
                          'grade': [9.6, 5.0, 7.5, 8.0, 9.2, 6.5, 9.8],
                          'value': [400, 2500, 300, 600, 400, 750, 175]})

# which of my books are graded above 8.0?
query = """
SELECT book, publisher, grade, value FROM comics_df
WHERE grade > 8.0
PRINT
"""

fsql(query).run()

Output

PandasDataFrame
book:str                                                      |publisher:str|grade:double|value:long
--------------------------------------------------------------+-------------+------------+----------
Secret Wars 8                                                 |Marvel       |9.6         |400       
Eternals 1                                                    |Marvel       |9.2         |400       
Department of Truth 1                                         |Image        |9.8         |175       
Total count: 3

References

^{https://fugue-tutorials.readthedocs.io/tutorials/beginner/beginner_sql.html}

^{https://www.kdnuggets.com/2021/10/query-pandas-dataframes-sql.html}

Solution 9 - Python

Another solution is RBQL which provides SQL-like query language that allows using Python expression inside SELECT and WHERE statements. It also provides a convenient %rbql magic command to use in Jupyter/IPyhon:

# Get some test data:
!pip install vega_datasets
from vega_datasets import data
my_cars_df = data.cars()
# Install and use RBQL:
!pip install rbql
%load_ext rbql
%rbql SELECT * FROM my_cars_df WHERE a.Horsepower > 100 ORDER BY a.Weight_in_lbs DESC

In this example my_cars_df is a Pandas Dataframe.

You can try it in this demo Google Colab notebook.

Content Type	Original Author	Original Content on Stackoverflow
Question	Miguel Santos	View Question on Stackoverflow
Solution 1 - Python	BENY	View Answer on Stackoverflow
Solution 2 - Python	Miguel Santos	View Answer on Stackoverflow
Solution 3 - Python	user1717828	View Answer on Stackoverflow
Solution 4 - Python	Zach Brookler	View Answer on Stackoverflow
Solution 5 - Python	Leo Liu	View Answer on Stackoverflow
Solution 6 - Python	alphacrash	View Answer on Stackoverflow
Solution 7 - Python	user3418197	View Answer on Stackoverflow
Solution 8 - Python	SchemeSonic	View Answer on Stackoverflow
Solution 9 - Python	mechatroner	View Answer on Stackoverflow