pandas - find first occurrence

PythonPandasDataframeGroup ByFind

Python Problem Overview


Suppose I have a structured dataframe as follows:

df = pd.DataFrame({"A":['a','a','a','b','b'],
                   "B":[1]*5})

The A column has previously been sorted. I wish to find the first row index of where df[df.A!='a']. The end goal is to use this index to break the data frame into groups based on A.

Now I realise that there is a groupby functionality. However, the dataframe is quite large and this is a simplified toy example. Since A has been sorted already, it would be faster if I can just find the 1st index of where df.A!='a'. Therefore it is important that whatever method that you use the scanning stops once the first element is found.

Python Solutions


Solution 1 - Python

idxmax and argmax will return the position of the maximal value or the first position if the maximal value occurs more than once.

use idxmax on df.A.ne('a')

df.A.ne('a').idxmax()

3

or the numpy equivalent

(df.A.values != 'a').argmax()

3

However, if A has already been sorted, then we can use searchsorted

df.A.searchsorted('a', side='right')

array([3])

Or the numpy equivalent

df.A.values.searchsorted('a', side='right')

3

Solution 2 - Python

I found there is first_valid_index function for Pandas DataFrames that will do the job, one could use it as follows:

df[df.A!='a'].first_valid_index()

3

However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:

df.loc[df.A!='a','A'].index[0]

Below I compare the total time(sec) of repeating calculations 100 times for these two options and all the codes above:

                      total_time_sec	ratio wrt fastest algo
searchsorted numpy:	       0.0007	     1.00
argmax numpy:	           0.0009	     1.29
for loop:	               0.0045	     6.43
searchsorted pandas:       0.0075	    10.71
idxmax pandas:	           0.0267	    38.14
index[0]:	               0.0295	    42.14
first_valid_index pandas:  0.1181	   168.71

Notice numpy's searchsorted is the winner and first_valid_index shows worst performance. Generally, numpy algorithms are faster, and the for loop does not do so bad, but it's just because the dataframe has very few entries.

For a dataframe with 10,000 entries where the desired entries are closer to the end the results are different, with searchsorted delivering the best performance:

                     total_time_sec	ratio wrt fastest algo
searchsorted numpy:	       0.0007	    1.00
searchsorted pandas:	   0.0076	   10.86
argmax numpy:	           0.0117	   16.71
index[0]:	               0.0815	  116.43
idxmax pandas:	           0.0904	  129.14
first_valid_index pandas:  0.1691	  241.57
for loop:	               9.6504	13786.29

The code to produce these results is below:

import timeit

# code snippet to be executed only once 
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''

# code snippets whose execution time is to be measured   
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]

mycode_set.append( '''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")

mycode_set.append( '''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")

mycode_set.append(  '''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")
           
mycode_set.append( '''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")

mycode_set.append( '''df.A.values.searchsorted('a', side='right')''' )
message.append("searchsorted numpy: ")
           
mycode_set.append( '''for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        ans = index
        break
        ''')
message.append("for loop: ")

total_time_in_sec = []
for i in range(len(mycode_set)):
    mycode = mycode_set[i]
    total_time_in_sec.append(np.round(timeit.timeit(setup = mysetup,\
         stmt = mycode, number = 100),4))

output = pd.DataFrame(total_time_in_sec, index = message, \
                      columns = ['total_time_sec' ])
output["ratio wrt fastest algo"] = \
np.round(output.total_time_sec/output["total_time_sec"].min(),2)

output = output.sort_values(by = "total_time_sec")
display(output)

For the larger dataframe:

mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''

Solution 3 - Python

Using pandas groupby() to group by column or list of columns. Then first() to get the first value in each group.

import pandas as pd

df = pd.DataFrame({"A":['a','a','a','b','b'],
                   "B":[1]*5})

#Group df by column and get the first value in each group                   
grouped_df = df.groupby("A").first()

#Reset indices to match format
first_values = grouped_df.reset_index()

print(first_values)
>>>    A  B
   0   a  1
   1   b  1

Solution 4 - Python

If you just want to find the first instance without going through the entire dataframe, you can go the for-loop way.

df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        print(index)
        break

The index is the row number of the 1st index of where df.A!='a'

Solution 5 - Python

For multiple conditions:

Let's say we have:

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

And we want to find the first item different than a and c, we do:

n = np.logical_and(s.values != 'a', s.values != 'c').argmax()

Times:

import numpy as np
import pandas as pd
from datetime import datetime

ITERS = 1000

def pandas_multi_condition(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = s[(s != 'a') & (s != 'c')].index[0]
    print(n)
    print(datetime.now() - ts)

def numpy_bitwise_and(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
    print(n)
    print(datetime.now() - ts)

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

print('pandas_multi_condition():')
pandas_multi_condition(s)
print()
print('numpy_bitwise_and():')
numpy_bitwise_and(s)

Output:

pandas_multi_condition():
4
0:00:01.144767

numpy_bitwise_and():
4
0:00:00.019013

Solution 6 - Python

You can iterate by dataframe rows (it is slow) and create your own logic to get values that you wanted:

def getMaxIndex(df, col)
    max = -999999
    rtn_index = 0
    for index, row in df.iterrows():
            if row[col] > max:
                max = row[col]
                rtn_index = index
    return rtn_index 

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionsachinrukView Question on Stackoverflow
Solution 1 - PythonpiRSquaredView Answer on Stackoverflow
Solution 2 - PythonAnna K.View Answer on Stackoverflow
Solution 3 - PythonGonçalo PeresView Answer on Stackoverflow
Solution 4 - PythonVaishaliView Answer on Stackoverflow
Solution 5 - PythonAlaa M.View Answer on Stackoverflow
Solution 6 - PythonAndré de Mattos FerrazView Answer on Stackoverflow