Compare two columns using pandas
PythonPandasIf StatementDataframePython Problem Overview
Using this as a starting point:
a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
Out[8]:
one two three
0 10 1.2 4.2
1 15 70 0.03
2 8 5 0
I want to use something like an if
statement within pandas.
if df['one'] >= df['two'] and df['one'] <= df['three']:
df['que'] = df['one']
Basically, check each row via the if
statement, create new column.
The docs say to use .all
but there is no example...
Python Solutions
Solution 1 - Python
You could use np.where. If cond
is a boolean array, and A
and B
are arrays, then
C = np.where(cond, A, B)
defines C to be equal to A
where cond
is True, and B
where cond
is False.
import numpy as np
import pandas as pd
a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df['que'] = np.where((df['one'] >= df['two']) & (df['one'] <= df['three'])
, df['one'], np.nan)
yields
one two three que
0 10 1.2 4.2 10
1 15 70 0.03 NaN
2 8 5 0 NaN
If you have more than one condition, then you could use np.select instead.
For example, if you wish df['que']
to equal df['two']
when df['one'] < df['two']
, then
conditions = [ (df['one'] >= df['two']) & (df['one'] <= df['three']),
df['one'] < df['two']]
choices = [df['one'], df['two']]
df['que'] = np.select(conditions, choices, default=np.nan)
yields
one two three que
0 10 1.2 4.2 10
1 15 70 0.03 70
2 8 5 0 NaN
If we can assume that df['one'] >= df['two']
when df['one'] < df['two']
is
False, then the conditions and choices could be simplified to
conditions = [
df['one'] < df['two'],
df['one'] <= df['three']]
choices = [df['two'], df['one']]
(The assumption may not be true if df['one']
or df['two']
contain NaNs.)
Note that
a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
defines a DataFrame with string values. Since they look numeric, you might be better off converting those strings to floats:
df2 = df.astype(float)
This changes the results, however, since strings compare character-by-character, while floats are compared numerically.
In [61]: '10' <= '4.2'
Out[61]: True
In [62]: 10 <= 4.2
Out[62]: False
Solution 2 - Python
You can use .equals
for columns or entire dataframes.
df['col1'].equals(df['col2'])
If they're equal, that statement will return True
, else False
.
Solution 3 - Python
You could use apply() and do something like this
df['que'] = df.apply(lambda x : x['one'] if x['one'] >= x['two'] and x['one'] <= x['three'] else "", axis=1)
or if you prefer not to use a lambda
def que(x):
if x['one'] >= x['two'] and x['one'] <= x['three']:
return x['one']
return ''
df['que'] = df.apply(que, axis=1)
Solution 4 - Python
One way is to use a Boolean series to index the column df['one']
. This gives you a new column where the True
entries have the same value as the same row as df['one']
and the False
values are NaN
.
The Boolean series is just given by your if
statement (although it is necessary to use &
instead of and
):
>>> df['que'] = df['one'][(df['one'] >= df['two']) & (df['one'] <= df['three'])]
>>> df
one two three que
0 10 1.2 4.2 10
1 15 70 0.03 NaN
2 8 5 0 NaN
If you want the NaN
values to be replaced by other values, you can use the fillna
method on the new column que
. I've used 0
instead of the empty string here:
>>> df['que'] = df['que'].fillna(0)
>>> df
one two three que
0 10 1.2 4.2 10
1 15 70 0.03 0
2 8 5 0 0
Solution 5 - Python
Wrap each individual condition in parentheses, and then use the &
operator to combine the conditions:
df.loc[(df['one'] >= df['two']) & (df['one'] <= df['three']), 'que'] = df['one']
You can fill the non-matching rows by just using ~
(the "not" operator) to invert the match:
df.loc[~ ((df['one'] >= df['two']) & (df['one'] <= df['three'])), 'que'] = ''
You need to use &
and ~
rather than and
and not
because the &
and ~
operators work element-by-element.
The final result:
df
Out[8]:
one two three que
0 10 1.2 4.2 10
1 15 70 0.03
2 8 5 0
Solution 6 - Python
Use np.select
if you have multiple conditions to be checked from the dataframe and output a specific choice in a different column
conditions=[(condition1),(condition2)]
choices=["choice1","chocie2"]
df["new column"]=np.select=(condtion,choice,default=)
Note: No of conditions and no of choices should match, repeat text in choice if for two different conditions you have same choices
Solution 7 - Python
Use lambda expression:
df[df.apply(lambda x: x['col1'] != x['col2'], axis = 1)]
Solution 8 - Python
I'd like to add this answer for those who are trying to compare the equality of values in two columns that have NaN
values, and get False
when both values are NaN
. By definition, NaN
!= NaN
(See: https://stackoverflow.com/questions/29528092/numpy-isnanvalue-not-the-same-as-value-numpy-nan).
If you want the two NaN
comparison to return True
, you can use:
df['compare'] = (df["col_1"] == df["col_2"]) | (df["col_1"].isna() & df["col_2"].isna())
Solution 9 - Python
You can use the Series method where
:
df['que'] = df['one'].where((df['one'] >= df['two']) & (df['one'] <= df['three']))
Result:
one two three que
0 10 1.2 4.2 10
1 15 70 0.03 NaN
2 8 5 0 NaN
Solution 10 - Python
I think the closest to the OP's intuition is an inline if statement:
df['que'] = (df['one'] if ((df['one'] >= df['two']) and (df['one'] <= df['three']))