How can I filter lines on load in Pandas read_csv function?
Problem Overview
How can I filter which lines of a CSV are loaded into memory using pandas? This seems like an option one should find in read_csv. Am I missing something?
Example: we have a CSV with a timestamp column, and we'd like to load just the lines with a timestamp greater than a given constant.
Python Solutions
Solution 1 - Python
There isn't an option to filter the rows before the CSV file is loaded into a pandas object.
You can either load the file and then filter using df[df['field'] > constant], or if you have a very large file and you are worried about running out of memory, then use an iterator and apply the filter as you concatenate chunks of your file, e.g.:
import pandas as pd

# Read the file in chunks of 1000 rows instead of all at once
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
# Keep only the matching rows from each chunk, then stitch them together
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])
You can vary the chunksize to suit your available memory. See the pandas docs on iterating through files chunk by chunk for more details.
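For files that do fit in memory, the simpler load-then-filter variant mentioned above is just the following (a minimal sketch; 'file.csv', 'field', and constant are placeholders):

import pandas as pd

# Load the whole file, then keep only the rows that satisfy the condition
df = pd.read_csv('file.csv')
df = df[df['field'] > constant]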
Solution 2 - Python
I didn't find a straightforward way to do it within the context of read_csv. However, read_csv returns a DataFrame, which can be filtered by selecting rows with a boolean vector df[bool_vec]:
filtered = df[(df['timestamp'] > targettime)]
This selects all rows in df (assuming df is any DataFrame, such as the result of a read_csv call, that contains at least a datetime column timestamp) for which the values in the timestamp column are greater than the value of targettime. Similar question.
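A fuller sketch of that pattern (the file name, column name, and cutoff value here are assumptions for illustration):

import pandas as pd

# parse_dates turns the 'timestamp' column into datetime64, so the comparison
# below is a real datetime comparison rather than a string comparison
df = pd.read_csv('data.csv', parse_dates=['timestamp'])
targettime = pd.Timestamp('2020-01-01')
filtered = df[df['timestamp'] > targettime]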
Solution 3 - Python
If the filtered range is contiguous (as it usually is with time(stamp) filters), then the fastest solution is to hard-code the range of rows. Simply combine the skiprows=range(1, start_row) and nrows=end_row parameters. Then the import takes seconds where the accepted solution would take minutes. A few experiments to find the initial start_row are not a huge cost given the savings on import time. Note that we keep the header row by using range(1, ...).
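A minimal sketch of this approach; the file name is a placeholder, and the start_row/end_row values are assumed to be known from prior experimentation:

import pandas as pd

start_row = 1000000  # first data row of the wanted range (found by experimenting)
end_row = 50000      # how many rows to read from there (nrows is a count, not an index)

# range(1, start_row) skips rows 1..start_row-1 but keeps row 0, the header
df = pd.read_csv('file.csv', skiprows=range(1, start_row), nrows=end_row)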
Solution 4 - Python
An alternative to the accepted answer is to apply read_csv() to a StringIO, obtained by filtering the input file.
import pandas as pd
from io import StringIO

with open(<file>) as f:
    # lines already end with '\n', so join with '' rather than '\n'
    text = "".join(line for line in f if <condition>)

df = pd.read_csv(StringIO(text))
This solution is often faster than the accepted answer when the filtering condition retains only a small portion of the lines.
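For instance, a concrete version that preserves the header row (the file name and filter condition are assumptions for illustration):

import pandas as pd
from io import StringIO

with open('log.csv') as f:
    header = next(f)  # keep the column names, which the filter would otherwise drop
    text = header + "".join(line for line in f if line.startswith('2020-'))

df = pd.read_csv(StringIO(text))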
Solution 5 - Python
Suppose you have the below dataframe:
+----+--------+
| Id | Name |
+----+--------+
| 1 | Sarath |
| 2 | Peter |
| 3 | James |
+----+--------+
If you need to filter for the record where Id = 1, then you can use the below code:
import pandas as pd

# sep='|' because this file is pipe-delimited
df = pd.read_csv('Filename.csv', sep='|')
df = df[df["Id"] == 1]
This will produce the below output.
+----+--------+
| Id | Name |
+----+--------+
| 1 | Sarath |
+----+--------+
Solution 6 - Python
If you are on Linux, you can use grep.
# imports work on either Python 2 or Python 3
import subprocess
import pandas as pd
from time import time  # not needed, just for timing
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO       # Python 3

def zgrep_data(f, string):
    '''f is the file path, string is the pattern you are filtering for'''
    grep = 'grep'  # change to 'zgrep' for gzipped files
    print('{} for {} from {}'.format(grep, string, f))
    start_time = time()
    out = subprocess.check_output([grep, string, f])
    if isinstance(out, bytes):  # check_output returns bytes on Python 3
        out = out.decode()
    grep_data = StringIO(out)
    if string == '':
        # an empty pattern matches every line, so the header row is included
        data = pd.read_csv(grep_data, sep=',', header=0)
    else:
        # grep drops the header line, so read the first row of the original
        # file separately to recover the column names; may need to change
        # depending on how the data is stored
        columns = pd.read_csv(f, sep=',', nrows=1, header=None).values.tolist()[0]
        data = pd.read_csv(grep_data, sep=',', names=columns, header=None)
    print('{} finished for {} - {} seconds'.format(grep, f, time() - start_time))
    return data
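A hypothetical call, assuming a comma-separated data.csv in which every wanted row contains the substring 2017-01:

df = zgrep_data('data.csv', '2017-01')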
Solution 7 - Python
You can specify the nrows parameter.
import pandas as pd
df = pd.read_csv('file.csv', nrows=100)
This code works well as of pandas version 0.20.3. Note that nrows only limits how many rows are read from the top of the file; it does not filter rows by value.