Getting a list of indices where pandas boolean series is True
Python Problem Overview
I have a pandas Series with boolean entries. I would like to get a list of the indices where the values are True.
For example, the input pd.Series([True, False, True, True, False, False, False, True]) should yield the output [0, 2, 3, 7].
I can do it with a list comprehension, but is there something cleaner or faster?
Python Solutions
Solution 1 - Python
Using boolean indexing:
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([True, False, True, True, False, False, False, True])
>>> s[s].index
Int64Index([0, 2, 3, 7], dtype='int64')
If you need a np.array object, get the .values:
>>> s[s].index.values
array([0, 2, 3, 7])
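Since the question asks for a list rather than an Index or an array, here is a minimal sketch of the last step. It assumes the same example series as above; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])

# Boolean indexing keeps only the True rows; .index holds their labels.
true_labels = s[s].index

# Index.tolist() converts the labels into a plain Python list.
result = true_labels.tolist()
print(result)
```

Calling .tolist() on the Index avoids a detour through a NumPy array when a list is all you need.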
Using np.nonzero:
>>> np.nonzero(s)
(array([0, 2, 3, 7]),)
Using np.flatnonzero:
>>> np.flatnonzero(s)
array([0, 2, 3, 7])
Using np.where:
>>> np.where(s)[0]
array([0, 2, 3, 7])
Using np.argwhere:
>>> np.argwhere(s).ravel()
array([0, 2, 3, 7])
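One caveat worth illustrating: the NumPy functions return integer positions, which only coincide with the index labels when the series has the default 0..n-1 index. The sketch below uses a made-up string index (an assumption for illustration, not part of the question) and maps positions back to labels.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a non-default index.
s = pd.Series([True, False, True], index=["a", "b", "c"])

# Work on the underlying boolean ndarray to get integer positions.
mask = s.to_numpy()
positions = np.flatnonzero(mask)

# Translate positions back into index labels.
labels = s.index[positions]

print(positions.tolist())
print(labels.tolist())
```

If you want labels rather than positions, the pandas-based variants (s[s].index or s.index[s]) give them directly.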
Using pd.Series.index:
>>> s.index[s]
Int64Index([0, 2, 3, 7], dtype='int64')
Using Python's built-in filter:
>>> [*filter(s.get, s.index)]
[0, 2, 3, 7]
Using a list comprehension:
>>> [i for i in s.index if s[i]]
[0, 2, 3, 7]
Solution 2 - Python
As an addition to rafaelc's answer, here are the corresponding timings (from fastest to slowest) for the following setup:
import numpy as np
import pandas as pd
s = pd.Series([x > 0.5 for x in np.random.random(size=1000)])
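If you want to reproduce this kind of comparison outside IPython, something along these lines should work with the standard timeit module. This is only a sketch: the seed, the vectorized way of building the series, the choice of three candidates, and the repeat counts are my own assumptions, and the absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
s = pd.Series(rng.random(1000) > 0.5)

# A few of the approaches, wrapped as zero-argument callables.
candidates = {
    "np.flatnonzero": lambda: np.flatnonzero(s.to_numpy()),
    "s.index[s]": lambda: s.index[s],
    "s[s].index": lambda: s[s].index,
}

# Sanity check input: collect each result as a list of plain ints.
results = {name: [int(i) for i in fn()] for name, fn in candidates.items()}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats of 100 calls each, converted to seconds per call.
    timings[name] = min(timeit.repeat(fn, number=100, repeat=3)) / 100

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>15}: {seconds * 1e6:8.1f} us per call")
```

Taking the minimum over repeats is a common way to reduce noise from other processes; it is a different statistic from the mean ± std. dev. that IPython reports.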
Using np.where:
>>> timeit np.where(s)[0]
12.7 µs ± 77.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Using np.flatnonzero:
>>> timeit np.flatnonzero(s)
18 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Using pd.Series.index:
The time difference from boolean indexing really surprised me, since boolean indexing is the more commonly used approach.
>>> timeit s.index[s]
82.2 µs ± 38.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using boolean indexing:
>>> timeit s[s].index
1.75 ms ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you need a np.array object, get the .values:
>>> timeit s[s].index.values
1.76 ms ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you want a slightly easier-to-read version (not in the original answer):
>>> timeit s[s==True].index
1.89 ms ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using pd.Series.where (not in the original answer):
>>> timeit s.where(s).dropna().index
2.22 ms ± 3.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> timeit s.where(s == True).dropna().index
2.37 ms ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using pd.Series.mask (not in the original answer):
>>> timeit s.mask(s).dropna().index
2.29 ms ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> timeit s.mask(s == True).dropna().index
2.44 ms ± 5.82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Note that mask replaces the entries where the condition is True, so s.mask(s).dropna().index as timed here returns the indices of the False values; s.mask(~s).dropna().index selects the True ones.
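To make the difference between where and mask concrete, here is a small sketch on a five-element series of my own choosing. where keeps values where the condition holds, while mask blanks them out, so the two select complementary index sets unless the condition is inverted.

```python
import pandas as pd

s = pd.Series([True, False, True, True, False])

# where: keep entries where the condition is True, NaN elsewhere.
kept_by_where = s.where(s).dropna().index.tolist()

# mask: replace entries where the condition is True with NaN,
# so only the False entries survive dropna().
kept_by_mask = s.mask(s).dropna().index.tolist()

# Inverting the condition makes mask select the True entries.
kept_by_inverted_mask = s.mask(~s).dropna().index.tolist()

print(kept_by_where, kept_by_mask, kept_by_inverted_mask)
```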
Using a list comprehension:
>>> timeit [i for i in s.index if s[i]]
13.7 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using Python's built-in filter:
>>> timeit [*filter(s.get, s.index)]
14.2 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using np.nonzero (did not work out of the box for me):
>>> timeit np.nonzero(s)
ValueError: Length of passed values is 1, index implies 1000.
Using np.argwhere (did not work out of the box for me):
>>> timeit np.argwhere(s).ravel()
ValueError: Length of passed values is 1, index implies 1000.
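A possible workaround for the two failing calls, sketched under the assumption that the error comes from how NumPy and pandas interact when a Series is passed directly (I have not checked which version combinations are affected): convert the series to a plain ndarray first, so that no pandas object is involved in the NumPy call.

```python
import numpy as np
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])

# Plain boolean ndarray; NumPy no longer sees a Series.
mask = s.to_numpy()

# nonzero returns a tuple with one array per dimension.
via_nonzero = np.nonzero(mask)[0]

# argwhere returns an (n, 1) array for 1-D input; ravel flattens it.
via_argwhere = np.argwhere(mask).ravel()

print(via_nonzero.tolist(), via_argwhere.tolist())
```

As with the other NumPy variants, the results are integer positions, not index labels.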
Solution 3 - Python
Also works: s.where(lambda x: x).dropna().index, and it has the advantage of being easy to chain: if your series is being computed on the fly, you don't need to assign it to a variable.
Note that if s is computed from r, i.e. s = cond(r), then you can also use: r.where(lambda x: cond(x)).dropna().index.
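A short sketch of the chained form, with a made-up r and cond (both hypothetical, chosen only to have something to run). The .loc variant at the end is not part of the answer above; it is an alternative that also accepts a callable and skips the NaN round trip.

```python
import pandas as pd

# Hypothetical source series and condition.
r = pd.Series([3, -1, 4, -1, 5])


def cond(x):
    """Return a boolean Series marking the positive entries."""
    return x > 0


# Chained form from the answer: no intermediate boolean series is named.
chained = r.where(lambda x: cond(x)).dropna().index.tolist()

# Alternative (my addition): .loc with a callable selects the same rows.
via_loc = r.loc[lambda x: cond(x)].index.tolist()

print(chained, via_loc)
```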