How to read a Parquet file into Pandas DataFrame?
Problem Overview
How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.
Python Solutions
Solution 1 - Python
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The pandas documentation explains:
> These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
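Since the question also mentions S3: both engines can read directly from an S3 path if the s3fs package is installed. A minimal sketch (the bucket and key below are placeholders):
import pandas as pd
# requires s3fs in addition to pyarrow or fastparquet
df = pd.read_parquet('s3://my-bucket/example.parquet', engine='pyarrow')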
Solution 2 - Python
Update: since I answered this, there has been a lot of work on this topic; look at Apache Arrow for better reading and writing of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python
It will create Python objects, which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
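If you go with pyarrow instead, the multithreaded read described in that blog post is available through read_table. A minimal sketch (the file path is a placeholder):
import pyarrow.parquet as pq
# use_threads=True (the default) parallelises the read across CPU cores
table = pq.read_table('example.parquet', use_threads=True)
df = table.to_pandas()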
Solution 3 - Python
Aside from pandas, Apache pyarrow also provides a way to read a parquet file into a dataframe.
The code is simple; just type:
import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()
For more information, see the Apache pyarrow documentation on Reading and Writing Single Files.
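Because Parquet is a columnar format, pyarrow can also read just a subset of columns, which avoids loading the whole file into memory. A small sketch (the column names are placeholders):
import pyarrow.parquet as pq
# only the listed columns are read and deserialised
df = pq.read_table(source=your_file_path, columns=['col_a', 'col_b']).to_pandas()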
Solution 4 - Python
Parquet
Step 1: Data to play with
import pandas as pd

df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20, 10, 22, 21, 22],
})
Step 2: Save as Parquet
df.to_parquet('sample.parquet')
Step 3: Read from Parquet
df = pd.read_parquet('sample.parquet')
Solution 5 - Python
Assuming the .parquet file is named data:
parquet_file = '../data.parquet'
Then use DataFrame.to_parquet (this function requires either the fastparquet or pyarrow library) to write an existing dataframe, here called parquet_df, to that path:
parquet_df.to_parquet(parquet_file)
Then use pandas.read_parquet() to get a dataframe back:
new_parquet_df = pd.read_parquet(parquet_file)
Solution 6 - Python
When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction of an 8GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle. (Although pickle can handle tuples, whereas parquet does not.)
df.to_parquet('df.parquet.brotli', compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
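If you want to check the trade-off on your own data, one rough sketch is to write the same frame with a few codecs and compare the resulting file sizes (exact ratios depend heavily on the data):
import os

for codec in ['brotli', 'gzip', 'snappy']:
    path = f'df.parquet.{codec}'
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path), 'bytes')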
Solution 7 - Python
Parquet files can be quite large, so consider reading them with dask.
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    # read one parquet file into a pandas DataFrame
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])

# materialise the whole dataset as a single in-memory pandas DataFrame
result = df.compute()
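Note that dask can also read a directory of parquet files directly with dd.read_parquet, without wiring up fastparquet and delayed by hand; a simpler sketch along the same lines:
import dask.dataframe as dd

# dask discovers the individual files and only loads them on compute()
ddf = dd.read_parquet('data/*.parquet')
pandas_df = ddf.compute()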