Executing an SQL query over a pandas dataset
PythonSqlitePandasPython Problem Overview
I have a pandas data set, called 'df'.
How can I do something like below;
df.query("select * from df")
Thank you.
For those who know R, there is a library called sqldf where you can execute SQL code in R, my question is basically, is there some library like sqldf in python
Python Solutions
Solution 1 - Python
This is not what pandas.query
is supposed to do. You can look at package pandasql
(same like sqldf
in R )
import pandas as pd
import pandasql as ps
df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan],
[1234, 'Customer A', np.nan, '333 Street'],
[1233, 'Customer B', '444 Street', '333 Street'],
[1233, 'Customer B', '444 Street', '666 Street']], columns=
['ID', 'Customer', 'Billing Address', 'Shipping Address'])
q1 = """SELECT ID FROM df """
print(ps.sqldf(q1, locals()))
ID
0 1234
1 1234
2 1233
3 1233
Update 2020-07-10
>
>
> update the pandasql
ps.sqldf("select * from df")
Solution 2 - Python
After some time of using this I realised the easiest way is to just do
from pandasql import sqldf
output = sqldf("select * from df")
Works like a charm where df
is a pandas dataframe
You can install pandasql: https://pypi.org/project/pandasql/
Solution 3 - Python
You can use DataFrame.query(condition)
to return a subset of the data frame matching condition
like this:
df = pd.DataFrame(np.arange(9).reshape(3,3), columns=list('ABC'))
df
A B C
0 0 1 2
1 3 4 5
2 6 7 8
df.query('C < 6')
A B C
0 0 1 2
1 3 4 5
df.query('2*B <= C')
A B C
0 0 1 2
df.query('A % 2 == 0')
A B C
0 0 1 2
2 6 7 8
This is basically the same effect as an SQL statement, except the SELECT * FROM df WHERE
is implied.
Solution 4 - Python
There's actually a new package that I just released, called dataframe_sql. This gives you the ability to query pandas dataframes using SQL just as you want to. You can find the package here
Solution 5 - Python
Much better solution is to use duckdb
pip install duckdb
import pandas as pd
import duckdb
test_df = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
duckdb.query("SELECT * FROM test_df where i>2").df() # returns a result dataframe
Performance improvement over pandasql: test data NYC yellow cabs ~120mb of csv data
nyc = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv',low_memory=False)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM nyc where trip_distance>10")
# wall time 16.1s
duckdb.query("SELECT * FROM nyc where trip_distance>10").df()
# wall time 183ms
A improvement of speed of roughly 100x
This article gives good details and claims 1000x improvement over pandasql: https://duckdb.org/2021/05/14/sql-on-pandas.html
Solution 6 - Python
Or, you can use the tools that do what they do best:
-
Install postgresql
-
Connect to the database: >from sqlalchemy import create_engine
>import urllib.parse
>engconnect = "{0}://{1}:{2}@{3}:{4}/{5}".format(dialect,user_uenc, pw_uenc, host,port, dbname)
>dbengine = create_engine(engconnect)
>database = dbengine.connect() -
Dump the dataframe into postgres
>df.to_sql('mytablename', database, if_exists='replace')
-
Write your query with all the SQL nesting your brain can handle. >myquery = "select distinct * from mytablename"
-
Create a dataframe by running the query:
>newdf = pd.read_sql(myquery, database)
Solution 7 - Python
I think a better solution than pandassql
would be duckdb. The way it handles the table name mapping to a dataframe object is a little cleaner imo. I have not evaluated performance though.
Solution 8 - Python
There is also FugueSQL
pip install fugue[sql]
import pandas as pd
from fugue_sql import fsql
comics_df = pd.DataFrame({'book': ['Secret Wars 8',
'Tomb of Dracula 10',
'Amazing Spider-Man 252',
'New Mutants 98',
'Eternals 1',
'Amazing Spider-Man 300',
'Department of Truth 1'],
'publisher': ['Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Image'],
'grade': [9.6, 5.0, 7.5, 8.0, 9.2, 6.5, 9.8],
'value': [400, 2500, 300, 600, 400, 750, 175]})
# which of my books are graded above 8.0?
query = """
SELECT book, publisher, grade, value FROM comics_df
WHERE grade > 8.0
PRINT
"""
fsql(query).run()
Output
PandasDataFrame
book:str |publisher:str|grade:double|value:long
--------------------------------------------------------------+-------------+------------+----------
Secret Wars 8 |Marvel |9.6 |400
Eternals 1 |Marvel |9.2 |400
Department of Truth 1 |Image |9.8 |175
Total count: 3
References
https://fugue-tutorials.readthedocs.io/tutorials/beginner/beginner_sql.html
https://www.kdnuggets.com/2021/10/query-pandas-dataframes-sql.html
Solution 9 - Python
Another solution is RBQL which provides SQL-like query language that allows using Python expression inside SELECT and WHERE statements. It also provides a convenient %rbql
magic command to use in Jupyter/IPyhon:
# Get some test data:
!pip install vega_datasets
from vega_datasets import data
my_cars_df = data.cars()
# Install and use RBQL:
!pip install rbql
%load_ext rbql
%rbql SELECT * FROM my_cars_df WHERE a.Horsepower > 100 ORDER BY a.Weight_in_lbs DESC
In this example my_cars_df
is a Pandas Dataframe.
You can try it in this demo Google Colab notebook.