Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

Tags: Python, Pandas, Dataframe, Boolean, Filtering

Python Problem Overview


I want to filter my dataframe with an or condition to keep rows with a particular column's values that are outside the range [-0.25, 0.25]. I tried:

df = df[(df['col'] < -0.25) or (df['col'] > 0.25)]

But I get the error:

> Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

Python Solutions


Solution 1 - Python

Python's or and and operators require truth values. For a pandas Series a single truth value is ambiguous, so you should use the "bitwise" operators | (or) and & (and) instead:

df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]

These operators are overloaded for this kind of data structure to yield the element-wise or (or and).
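
As a minimal sketch of the element-wise behaviour, assuming a small made-up DataFrame (the values are invented for illustration; only the column name 'col' comes from the question):

```python
import pandas as pd

# Hypothetical sample data; 'col' mirrors the column name used in the question.
df = pd.DataFrame({"col": [-0.5, -0.1, 0.0, 0.2, 0.3]})

# Each comparison yields a boolean Series; | combines them element by element.
outside = (df["col"] < -0.25) | (df["col"] > 0.25)
filtered = df[outside]

print(outside.tolist())          # [True, False, False, False, True]
print(filtered["col"].tolist())  # [-0.5, 0.3]
```

The mask has one boolean per row, which is exactly what the indexing operation needs, so no single truth value is ever requested.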


Just to add some more explanation to this statement:

The exception is thrown when you want to get the bool of a pandas.Series:

>>> import pandas as pd
>>> x = pd.Series([1])
>>> bool(x)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What you hit was a place where the operator implicitly converted the operands to bool (you used or but it also happens for and, if and while):

>>> x or x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> x and x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> while x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Besides these four constructs, several Python functions hide bool calls (like any, all, filter, ...). These are normally not problematic with pandas.Series, but I mention them for completeness.


In your case the exception isn't really helpful, because it doesn't mention the right alternatives. For and and or you can use (if you want element-wise comparisons):

  • numpy.logical_or:

      >>> import numpy as np
      >>> np.logical_or(x, y)
    

    or simply the | operator:

      >>> x | y
    
  • numpy.logical_and:

      >>> np.logical_and(x, y)
    

    or simply the & operator:

      >>> x & y
    

If you're using the operators then make sure you set your parentheses correctly because of the operator precedence.

There are several logical numpy functions which should work on pandas.Series.
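
A short sketch of how some of these NumPy functions line up with the operators, assuming two small boolean Series made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical boolean Series covering all four input combinations.
x = pd.Series([True, True, False, False])
y = pd.Series([True, False, True, False])

either = np.logical_or(x, y)        # same result as x | y
both = np.logical_and(x, y)         # same result as x & y
exactly_one = np.logical_xor(x, y)  # same result as x ^ y
negated = np.logical_not(x)         # same result as ~x

print(either.tolist())       # [True, True, True, False]
print(both.tolist())         # [True, False, False, False]
print(exactly_one.tolist())  # [False, True, True, False]
print(negated.tolist())      # [False, False, True, True]
```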


The alternatives mentioned in the exception are better suited if you encountered it when doing if or while. I'll briefly explain each of these:

  • If you want to check if your Series is empty:

      >>> x = pd.Series([])
      >>> x.empty
      True
      >>> x = pd.Series([1])
      >>> x.empty
      False
    

    Python normally interprets the length of containers (like list, tuple, ...) as truth-value if it has no explicit boolean interpretation. So if you want the python-like check, you could do: if x.size or if not x.empty instead of if x.

  • If your Series contains one and only one boolean value:

      >>> x = pd.Series([100])
      >>> (x > 50).bool()
      True
      >>> (x < 50).bool()
      False
    
  • If you want to check the first and only item of your Series (like .bool() but works even for non-boolean contents):

      >>> x = pd.Series([100])
      >>> x.item()
      100
    
  • If you want to check if all or any item is not-zero, not-empty or not-False:

      >>> x = pd.Series([0, 1, 2])
      >>> x.all()   # because one element is zero
      False
      >>> x.any()   # because one (or more) elements are non-zero
      True
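
To show how these alternatives fit into an if statement, here is a small sketch. The helper name describe and its labels are hypothetical, chosen only for illustration:

```python
import pandas as pd

def describe(s):
    """Return a short label for a numeric Series (hypothetical helper)."""
    # Check emptiness first: all() on an empty Series is True.
    if s.empty:
        return "empty"
    # Reduce the boolean Series to one value before handing it to `if`.
    if (s > 0).all():
        return "all positive"
    if (s > 0).any():
        return "some positive"
    return "none positive"

print(describe(pd.Series([], dtype=float)))  # empty
print(describe(pd.Series([1, 2, 3])))        # all positive
print(describe(pd.Series([-1, 0, 2])))       # some positive
print(describe(pd.Series([-1, 0])))          # none positive
```

Each branch hands if a plain Python-style boolean, so the ambiguity never arises.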
    

Solution 2 - Python

pandas uses the bitwise operators & and |, and each condition should be wrapped in parentheses.

For example, the following works:

data_query = data[(data['year'] >= 2005) & (data['year'] <= 2010)]

But the same query without proper parentheses does not:

data_query = data[(data['year'] >= 2005 & data['year'] <= 2010)]
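
A minimal sketch of why the second form fails, assuming a made-up integer 'year' column: & binds more tightly than the comparisons, so Python first computes 2005 & data['year'] and then evaluates a chained comparison, which needs the truth value of a Series.

```python
import pandas as pd

# Hypothetical data; only the column name 'year' comes from the answer.
data = pd.DataFrame({"year": [2003, 2005, 2008, 2010, 2012]})

# Parenthesized: each comparison is evaluated first, then combined with &.
good = data[(data["year"] >= 2005) & (data["year"] <= 2010)]
print(good["year"].tolist())  # [2005, 2008, 2010]

# Unparenthesized: parsed as data["year"] >= (2005 & data["year"]) <= 2010,
# a chained comparison that implicitly calls bool() on a Series.
try:
    data[data["year"] >= 2005 & data["year"] <= 2010]
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```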

Solution 3 - Python

For boolean logic, use & and |.

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))

>>> df
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

>>> df.loc[(df.C > 0.25) | (df.C < -0.25)]
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

To see what is happening, you get a column of booleans for each comparison, e.g.

df.C > 0.25
0     True
1    False
2    False
3     True
4     True
Name: C, dtype: bool

When you have multiple criteria, you get one boolean column per criterion, which is why combining them with and or or is ambiguous. These operators need a single truth value for each operand, so you first have to reduce each column to one boolean value, for example by checking whether any value or all values in the column are True.

# Any value in either column is True?
(df.C > 0.25).any() or (df.C < -0.25).any()
True

# Are all values in either column True?
(df.C > 0.25).all() or (df.C < -0.25).all()
False

One convoluted way to achieve the same thing is to zip all of these columns together, and perform the appropriate logic.

>>> df[[any([a, b]) for a, b in zip(df.C > 0.25, df.C < -0.25)]]
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

For more details, refer to Boolean Indexing in the docs.

Solution 4 - Python

Alternatively, you could use the operator module. More detailed information is available in the Python docs.

import operator
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[operator.or_(df.C > 0.25, df.C < -0.25)]

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863
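
One place where the function form may be handy is combining a list of conditions whose length is not known in advance. The following is a sketch under that assumption, using functools.reduce with made-up values for column C:

```python
from functools import reduce
import operator

import pandas as pd

# Hypothetical values, rounded for readability.
df = pd.DataFrame({"C": [0.98, -0.98, -0.10, 1.45, 0.44]})

# Any number of boolean Series can be folded together with operator.or_.
conditions = [df.C > 0.25, df.C < -0.25]
mask = reduce(operator.or_, conditions)

print(mask.tolist())               # [True, True, False, True, True]
print(df.loc[mask, "C"].tolist())  # [0.98, -0.98, 1.45, 0.44]
```

Swapping in operator.and_ would require every condition to hold instead.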

Solution 5 - Python

This excellent answer explains very well what is happening and provides a solution. I would like to add another solution that might be suitable in similar cases: using the query method:

df = df.query("(col > 0.25) or (col < -0.25)")

See also <http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-query>.

(Some tests with a dataframe I'm currently working with suggest that this method is a bit slower than using the bitwise operators on series of booleans: 2 ms vs. 870 µs)
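
As a small sketch of the query approach, assuming made-up data and hypothetical bounds lo and hi, the @ prefix lets the query string refer to local Python variables, and the result matches the bitwise-operator version:

```python
import pandas as pd

# Hypothetical sample data; 'col' mirrors the column name used in the question.
df = pd.DataFrame({"col": [-0.5, -0.1, 0.0, 0.2, 0.3]})
lo, hi = -0.25, 0.25  # hypothetical bounds

# Inside the query string, `or` is evaluated element-wise.
by_query = df.query("col < @lo or col > @hi")

# Equivalent boolean-mask version for comparison.
by_mask = df[(df["col"] < lo) | (df["col"] > hi)]

print(by_query["col"].tolist())  # [-0.5, 0.3]
print(by_query.equals(by_mask))  # True
```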

A word of warning: at least one situation where this is not straightforward is when column names happen to be Python expressions. I had columns named WT_38hph_IP_2, WT_38hph_input_2 and log2(WT_38hph_IP_2/WT_38hph_input_2) and wanted to perform the following query: "(log2(WT_38hph_IP_2/WT_38hph_input_2) > 1) and (WT_38hph_IP_2 > 20)"

I obtained the following exception cascade:

  • KeyError: 'log2'
  • UndefinedVariableError: name 'log2' is not defined
  • ValueError: "log2" is not a supported function

I guess this happened because the query parser was trying to make something from the first two columns instead of identifying the expression with the name of the third column.

A possible workaround is proposed here.

Solution 6 - Python

I was getting the error with this command:

if df != '':
    pass

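But the error went away when I changed it to the following. Note that is not compares identity rather than values, so this condition is always true for a DataFrame; if the goal is to skip an empty DataFrame, if not df.empty: is the explicit check: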

if df is not '':
    pass

Solution 7 - Python

If you have more than one value:

df['col'].all()

If it's only a single value:

df['col'].item()

Solution 8 - Python

This is quite a common question for beginners when making multiple conditions in Pandas. Generally speaking, there are two possible conditions causing this error:

Condition 1: Python Operator Precedence

A paragraph in the Boolean indexing section of the pandas documentation (Indexing and selecting data) explains this:

> Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.
>
> By default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).

# Wrong
df['col'] < -0.25 | df['col'] > 0.25

# Right
(df['col'] < -0.25) | (df['col'] > 0.25)

There are some ways to get rid of the parentheses, which I will cover later.


Condition 2: Improper Operator/Statement

As explained in the quotation above, you need to use | for or, & for and, and ~ for not.

# Wrong
(df['col'] < -0.25) or (df['col'] > 0.25)

# Right
(df['col'] < -0.25) | (df['col'] > 0.25)

Another possible situation is that you are using a boolean Series in an if statement.

# Wrong
if pd.Series([True, False]):
    pass

It's clear that a Python if statement accepts a boolean-like expression rather than a pandas Series. You should use pandas.Series.any or one of the other methods listed in the error message to convert the Series to a single value according to your need.

For example:

# Right
if df['col'].eq(0).all():
    # If you want all column values equal to zero
    print('do something')

# Right
if df['col'].eq(0).any():
    # If you want at least one column value equal to zero
    print('do something')

Let's talk about ways to avoid the parentheses in the first situation.

  1. Use Pandas mathematical functions

Pandas defines many mathematical methods, including ones for comparison such as Series.lt() and Series.gt().

As a result, you can use

df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]

# is equal to

df = df[df['col'].lt(-0.25) | df['col'].gt(0.25)]

  2. Use pandas.Series.between()

If you want to select rows in between two values, you can use pandas.Series.between

  • df['col'].between(left, right) is equal to
    (left <= df['col']) & (df['col'] <= right);
  • df['col'].between(left, right, inclusive='left') is equal to
    (left <= df['col']) & (df['col'] < right);
  • df['col'].between(left, right, inclusive='right') is equal to
    (left < df['col']) & (df['col'] <= right);
  • df['col'].between(left, right, inclusive='neither') is equal to
    (left < df['col']) & (df['col'] < right).

df = df[(df['col'] > -0.25) & (df['col'] < 0.25)]

# is equal to

df = df[df['col'].between(-0.25, 0.25, inclusive='neither')]
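
For the original question, which keeps rows outside the range, between can be negated with ~. This is a sketch with made-up values that relies on the default behaviour of including both endpoints:

```python
import pandas as pd

# Hypothetical sample data that includes both boundary values.
df = pd.DataFrame({"col": [-0.5, -0.25, 0.0, 0.25, 0.3]})

# between() includes both endpoints by default, so its negation
# keeps only values strictly outside [-0.25, 0.25].
inside = df["col"].between(-0.25, 0.25)
outside = df[~inside]

# Equivalent explicit form with two comparisons.
explicit = df[(df["col"] < -0.25) | (df["col"] > 0.25)]

print(outside["col"].tolist())  # [-0.5, 0.3]
print(outside.equals(explicit))  # True
```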

  3. Use pandas.DataFrame.query()

The document referenced earlier has a chapter, The query() Method, that explains this well.

pandas.DataFrame.query() lets you select rows of a DataFrame with a condition string. Within the query string, you can use both the bitwise operators (& and |) and their boolean cousins (and and or). Moreover, you can omit the parentheses, but I don't recommend it for readability reasons.

df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]

# is equal to

df = df.query('col < -0.25 or col > 0.25')

  4. Use pandas.DataFrame.eval()

pandas.DataFrame.eval() evaluates a string describing operations on DataFrame columns. Thus, we can use this method to build our multiple conditions. The syntax is the same as for pandas.DataFrame.query().

df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]

# is equal to

df = df[df.eval('col < -0.25 or col > 0.25')]

pandas.DataFrame.query() and pandas.DataFrame.eval() can do more than I describe here; I recommend reading their documentation and having fun with them.

Solution 9 - Python

You need to use the bitwise operators | instead of or and & instead of and in pandas; you can't simply use Python's boolean operators.

For more complex filtering, create a mask and apply it to the dataframe.
Put all of your conditions in the mask and apply it.
For example:

mask = (df["col1"]>=df["col2"]) & (df["col1"]<=df["col2"])
df_new = df[mask]
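
A slightly larger sketch of the mask idea, where naming each condition keeps a complex filter readable. The data and the column names price, qty and region are hypothetical:

```python
import pandas as pd

# Hypothetical order data; column names are placeholders for illustration.
df = pd.DataFrame({
    "price": [5, 20, 35, 50],
    "qty": [10, 0, 3, 7],
    "region": ["EU", "US", "EU", "US"],
})

# Build each condition as its own named boolean Series.
in_stock = df["qty"] > 0
mid_price = df["price"].between(10, 40)
is_eu = df["region"] == "EU"

# Combine the named masks with & and |, then apply the result once.
mask = in_stock & (mid_price | is_eu)
df_new = df[mask]

print(df_new.index.tolist())  # [0, 2]
```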

Solution 10 - Python

I'll try to benchmark the three most common ways (also mentioned above):

from timeit import repeat

setup = """
import numpy as np;
import random;
x = np.linspace(0,100);
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
"""
stmts = 'x[(x > lb) * (x <= ub)]', 'x[(x > lb) & (x <= ub)]', 'x[np.logical_and(x > lb, x <= ub)]'

for _ in range(3):
    for stmt in stmts:
        t = min(repeat(stmt, setup, number=100_000))
        print('%.4f' % t, stmt)
    print()

result:

0.4808 x[(x > lb) * (x <= ub)]
0.4726 x[(x > lb) & (x <= ub)]
0.4904 x[np.logical_and(x > lb, x <= ub)]

0.4725 x[(x > lb) * (x <= ub)]
0.4806 x[(x > lb) & (x <= ub)]
0.5002 x[np.logical_and(x > lb, x <= ub)]

0.4781 x[(x > lb) * (x <= ub)]
0.4336 x[(x > lb) & (x <= ub)]
0.4974 x[np.logical_and(x > lb, x <= ub)]

However, * is not supported for pandas Series, and NumPy arrays are much faster than pandas DataFrames (the DataFrame version is around 1000 times slower; note the number argument in each benchmark):

from timeit import repeat

setup = """
import numpy as np;
import random;
import pandas as pd;
x = pd.DataFrame(np.linspace(0,100));
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
"""
stmts = 'x[(x > lb) & (x <= ub)]', 'x[np.logical_and(x > lb, x <= ub)]'

for _ in range(3):
    for stmt in stmts:
        t = min(repeat(stmt, setup, number=100))
        print('%.4f' % t, stmt)
    print()

result:

0.1964 x[(x > lb) & (x <= ub)]
0.1992 x[np.logical_and(x > lb, x <= ub)]

0.2018 x[(x > lb) & (x <= ub)]
0.1838 x[np.logical_and(x > lb, x <= ub)]

0.1871 x[(x > lb) & (x <= ub)]
0.1883 x[np.logical_and(x > lb, x <= ub)]

Note: adding one line of code x = x.to_numpy() will need about 20 µs.

For those who prefer %timeit:

import numpy as np
import pandas as pd
import random
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
lb, ub
x = pd.DataFrame(np.linspace(0,100))

def asterik(x):
    x = x.to_numpy()
    return x[(x > lb) * (x <= ub)]

def and_symbol(x):
    x = x.to_numpy()
    return x[(x > lb) & (x <= ub)]

def numpy_logical(x):
    x = x.to_numpy()
    return x[np.logical_and(x > lb, x <= ub)]

for i in range(3):
    %timeit asterik(x)
    %timeit and_symbol(x)
    %timeit numpy_logical(x)
    print('\n')

result:

23 µs ± 3.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.6 µs ± 9.53 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
31.3 µs ± 8.9 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


21.4 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.9 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
21.7 µs ± 500 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


25.1 µs ± 3.71 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
36.8 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.2 µs ± 5.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Solution 11 - Python

I encountered the same error and was stalled on a pyspark dataframe for a few days. I was able to resolve it by filling na values with 0, since I was comparing integer values from 2 fields.

Solution 12 - Python

One minor thing, which wasted my time.

Put each condition (if comparing using "==" or "!=") in parentheses; failing to do so also raises this exception. This will work:

df[(some condition) conditional operator (some conditions)]

This will not:

df[some condition conditional-operator some condition]

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | obabs | View Question on Stackoverflow
Solution 1 - Python | MSeifert | View Answer on Stackoverflow
Solution 2 - Python | Nipun | View Answer on Stackoverflow
Solution 3 - Python | Alexander | View Answer on Stackoverflow
Solution 4 - Python | Cảnh Toàn Nguyễn | View Answer on Stackoverflow
Solution 5 - Python | bli | View Answer on Stackoverflow
Solution 6 - Python | Mehdi Rostami | View Answer on Stackoverflow
Solution 7 - Python | Humza Sami | View Answer on Stackoverflow
Solution 8 - Python | Ynjxsjmh | View Answer on Stackoverflow
Solution 9 - Python | Hemanth Kollipara | View Answer on Stackoverflow
Solution 10 - Python | Muhammad Yasirroni | View Answer on Stackoverflow
Solution 11 - Python | iretex | View Answer on Stackoverflow
Solution 12 - Python | satinder singh | View Answer on Stackoverflow