Filtering multiple items in a multi-index Python Panda dataframe

PythonFilterPandasIndexing

Python Problem Overview


I have the following table:

Note: Both NSRCODE and PBL_AWI are index's

Note: the % Of area column would be filled out just have not done so yet.

NSRCODE  PBL_AWI          Area           % Of Area
CM       BONS             44705.492941
         BTNN            253854.591990
         FONG             41625.590370
         FONS             16814.159680
         Lake             57124.819333
         River             1603.906642
         SONS            583958.444751
         STNN             45603.837177
         clearcut        106139.013930
         disturbed       127719.865675
         lowland         118795.578059
         upland         2701289.270193
LBH      BFNN            289207.169650
         BONS           9140084.716743
         BTNI             33713.160390
         BTNN          19748004.789040
         FONG           1687122.469691
         FONS           5169959.591270
         FTNI            317251.976160
         FTNN           6536472.869395
         Lake            258046.508310
         River            44262.807900
         SONS           4379097.677405
         burn regen      744773.210860
         clearcut         54066.756790
         disturbed       597561.471686
         lowland       12591619.141842
         upland        23843453.638117

How do I filter out item in the "PBL_AWI" index? For example I want to keep ['Lake', 'River', 'Upland']

Python Solutions


Solution 1 - Python

You can get_level_values in conjunction with Boolean slicing.

In [50]:

print df[np.in1d(df.index.get_level_values(1), ['Lake', 'River', 'Upland'])]
                          Area
NSRCODE PBL_AWI               
CM      Lake      57124.819333
        River      1603.906642
LBH     Lake     258046.508310
        River     44262.807900

The same idea can be expressed in many different ways, such as df[df.index.get_level_values('PBL_AWI').isin(['Lake', 'River', 'Upland'])]

Note that you have 'upland' in your data instead of 'Upland'

Solution 2 - Python

Another (maybe cleaner) way might be this one:

print(df[df.index.isin(['Lake', 'River', 'Upland'], level=1)])

The parameter level specifies the index number (starting with 0) or index name (here: level='PBL_AWI')

Solution 3 - Python

Also (from here):

def filter_by(df, constraints):
    """Filter MultiIndex by sublevels."""
    indexer = [constraints[name] if name in constraints else slice(None)
               for name in df.index.names]
    return df.loc[tuple(indexer)] if len(df.shape) == 1 else df.loc[tuple(indexer),]

pd.Series.filter_by = filter_by
pd.DataFrame.filter_by = filter_by

... to be used as

df.filter_by({'PBL_AWI' : ['Lake', 'River', 'Upland']})

(untested with Panels and higher dimension elements, but I do expect it to work)

Solution 4 - Python

df.filter(regex=...,axis=...) is even more succinct, because it works on both index=0 and column=1 axis. You do not need to worry about levels, and you can be lazy with regex. Complete example for filter on index:

df.filter(regex='Lake|River|Upland',axis=0)

enter image description here

if you transpose it, and try to filter on columns (axis=1 by default), it works as well:

df.T.filter(regex='Lake|River|Upland')

enter image description here

Now, with regex you can also easily fix upper lower case issue with Upland:

upland = re.compile('Upland', re.IGNORECASE)
df.filter(regex=upland ,axis=0)

enter image description here

This is the command to read above input table:

df = pd.read_csv(io.StringIO(inpute_table), sep="\s{2,}").set_index(['NSRCODE', 'PBL_AWI'])

Solution 5 - Python

A simpler approach using .loc would be

df.loc[(slice(None),['Lake', 'River', 'Upland']),:]

or for Series

df.loc[(slice(None),['Lake', 'River', 'Upland'])]

slice(None) means no filtering on the first level index. We can filter the second level index using a list of values ['Lake', 'River', 'Upland']

Solution 6 - Python

This is an answer to a slight variant of the question asked that might save someone else a little time. If you are looking for a wildcard type match to a label whose exact value you don't know, you can use something like this:

q_labels = [ label for label in df.index.levels[1] if label.startswith('Q') ]
new_df = df[ df.index.isin(q_labels, level=1) ]

Solution 7 - Python

You can use isin inside query:

In [9]: df.query("PBL_AWI.isin(['Lake', 'River', 'Upland']).values")
Out[9]: 
                     Area
NSRCODE PBL_AWI          
CM      Lake     57124.82
        River     1603.91
LBH     Lake    258046.51
        River    44262.81

However, due to case sensitivity, 'upland' (lower case) won't be found. Therefore I recommend using fullmatch and set case=False:

In [10]: df.query("PBL_AWI.str.fullmatch('Lake|River|Upland', case=False).values")
Out[10]: 
                       Area
NSRCODE PBL_AWI            
CM      Lake       57124.82
        River       1603.91
        upland   2701289.27
LBH     Lake      258046.51
        River      44262.81
        upland  23843453.64

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionTristan ForwardView Question on Stackoverflow
Solution 1 - PythonCT ZhuView Answer on Stackoverflow
Solution 2 - PythonNateView Answer on Stackoverflow
Solution 3 - PythonPietro BattistonView Answer on Stackoverflow
Solution 4 - PythonjediView Answer on Stackoverflow
Solution 5 - PythonNeeraz LakkapragadaView Answer on Stackoverflow
Solution 6 - PythonDaniel CotterView Answer on Stackoverflow
Solution 7 - PythonrachwaView Answer on Stackoverflow