pandas create new column based on values from other columns / apply a function of multiple columns, row-wise

Python, Pandas, Numpy, Apply

Python Problem Overview


I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.

I've tried different methods from other questions but still can't seem to find the right answer for my problem. The critical piece is that if the person is counted as Hispanic, they can't be counted as anything else. Even if they have a "1" in another ethnicity column, they are still counted as Hispanic, not as two or more races. Similarly, if the sum of all the ERI columns is greater than 1, they are counted as two or more races and can't be counted as a unique ethnicity (except for Hispanic). Hopefully this makes sense. Any help will be greatly appreciated.

It's almost like doing a for loop through each row: if a record meets a criterion, it is added to one list and eliminated from the original.

From the dataframe below I need to calculate a new column based on the following spec in SQL:

CRITERIA

IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”

Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”

Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”

DATAFRAME

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined
0      MOST    JEFF      E             0          0             0              0             0          1       White
1    CRUISE     TOM      E             0          0             0              1             0          0       White
2      DEPP  JOHNNY                    0          0             0              0             0          1     Unknown
3     DICAP     LEO                    0          0             0              0             0          1     Unknown
4    BRANDO  MARLON      E             0          0             0              0             0          0       White
5     HANKS     TOM                    0          0             0              0             0          1     Unknown
6    DENIRO  ROBERT      E             0          1             0              0             0          1       White
7    PACINO      AL      E             0          0             0              0             0          1       White
8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White
9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White
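To try the answers below on the same data, the table above can be rebuilt roughly as follows. This is only a sketch: the blank rno_cd cells are assumed to be missing values (NaN), which the question does not state explicitly.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
# Assumption: blank rno_cd cells are treated as NaN.
df = pd.DataFrame({
    "lname": ["MOST", "CRUISE", "DEPP", "DICAP", "BRANDO",
              "HANKS", "DENIRO", "PACINO", "WILLIAMS", "EASTWOOD"],
    "fname": ["JEFF", "TOM", "JOHNNY", "LEO", "MARLON",
              "TOM", "ROBERT", "AL", "ROBIN", "CLINT"],
    "rno_cd": ["E", "E", np.nan, np.nan, "E", np.nan, "E", "E", "E", "E"],
    "eri_afr_amer": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "eri_asian":    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    "eri_hawaiian": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    "eri_hispanic": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "eri_nat_amer": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "eri_white":    [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
    "rno_defined": ["White", "White", "Unknown", "Unknown", "White",
                    "Unknown", "White", "White", "White", "White"],
})

print(df.shape)  # (10, 10)
```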

Python Solutions


Solution 1 - Python

OK, two steps to this - first is to write a function that does the translation you want - I've put an example together based on your pseudo-code:

def label_race(row):
    # Hispanic takes precedence over every other flag
    if row['eri_hispanic'] == 1:
        return 'Hispanic'
    # More than one non-Hispanic flag set
    if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1:
        return 'Two Or More'
    # Exactly one non-Hispanic flag set
    if row['eri_nat_amer'] == 1:
        return 'A/I AK Native'
    if row['eri_asian'] == 1:
        return 'Asian'
    if row['eri_afr_amer'] == 1:
        return 'Black/AA'
    if row['eri_hawaiian'] == 1:
        return 'Haw/Pac Isl.'
    if row['eri_white'] == 1:
        return 'White'
    # No flag set at all
    return 'Other'

You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".

Next, use the apply function in pandas to apply the function - e.g.

df.apply (lambda row: label_race(row), axis=1)

Note the axis=1 specifier: it means the function is applied to each row rather than to each column. The results are here:

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White

If you're happy with those results, then run it again, saving the results into a new column in your original dataframe.

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

The resultant dataframe looks like this (scroll to the right to see the new column):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label
0      MOST    JEFF      E             0          0             0              0             0          1       White         White
1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic
2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White
3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White
4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other
5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White
6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More
7    PACINO      AL      E             0          0             0              0             0          1       White         White
8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.
9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White
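As a side note on the axis=1 argument used above, here is a minimal sketch of the difference between the two axes. The frame toy and its columns a and b are made up purely for illustration.

```python
import pandas as pd

# Hypothetical two-column frame, not the data from the question.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# axis=0 (the default): the function receives each column as a Series.
per_column = toy.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series indexed by column name.
per_row = toy.apply(lambda row: row["a"] + row["b"], axis=1)

print(per_column.tolist())  # [3, 7]
print(per_row.tolist())     # [4, 6]
```

With axis=1 the result has one value per row, which is why it can be assigned straight back as a new column.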

Solution 2 - Python

Since this is the first Google result for 'pandas new column from others', here's a simple example:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

If you get the SettingWithCopyWarning you can do it this way also:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

Source: https://stackoverflow.com/a/12555510/243392

And if your column name includes spaces you can use syntax like this:

df = df.assign(**{'some column name': col.values})

See also the pandas documentation for DataFrame.apply and DataFrame.assign.

Solution 3 - Python

The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:


First, define conditions:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

Now, define the corresponding outputs:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

Finally, using numpy.select:

import numpy as np

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object
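The order of the conditions matters: when several conditions are true for a row, numpy.select uses the first one in the list, which is what makes it behave like the if/else ladder. Here is a small sketch on a made-up three-row frame (demo, with only three of the flag columns) that shows this precedence.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: Hispanic plus another flag, two non-Hispanic flags, no flags.
demo = pd.DataFrame({
    "eri_hispanic": [1, 0, 0],
    "eri_asian":    [1, 1, 0],
    "eri_white":    [0, 1, 0],
})

# Highest-priority condition first, as in the if/else ladder.
conditions = [
    demo["eri_hispanic"] == 1,
    demo[["eri_asian", "eri_white"]].sum(axis=1) > 1,
    demo["eri_asian"] == 1,
    demo["eri_white"] == 1,
]
outputs = ["Hispanic", "Two Or More", "Asian", "White"]

# Rows matching no condition get the default value.
labels = np.select(conditions, outputs, default="Other")
print(labels.tolist())  # ['Hispanic', 'Two Or More', 'Other']
```

The first row also satisfies the Asian condition, but it is labelled Hispanic because that condition comes first.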

Why should numpy.select be used over apply? Here are some performance checks:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.

Solution 4 - Python

.apply() takes a function as its first parameter; pass in the label_race function like so:

df['race_label'] = df.apply(label_race, axis=1)

You don't need to make a lambda function to pass in a function.
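A short sketch of the same point on a made-up frame: passing the function directly gives the same result as wrapping it in a lambda, and extra keyword arguments given to apply are forwarded to the function. The names total, scaled_total and factor are hypothetical and only serve the illustration.

```python
import pandas as pd

def total(row):
    # Hypothetical row-wise function: adds the two columns of one row.
    return row["a"] + row["b"]

def scaled_total(row, factor):
    # Same idea, with an extra parameter supplied through apply's keyword arguments.
    return (row["a"] + row["b"]) * factor

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

direct = toy.apply(total, axis=1)                     # function passed as-is
wrapped = toy.apply(lambda row: total(row), axis=1)   # redundant lambda wrapper
scaled = toy.apply(scaled_total, axis=1, factor=10)   # factor is forwarded to scaled_total

print(direct.equals(wrapped))  # True
print(scaled.tolist())         # [40, 60]
```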

Solution 5 - Python

Try this:

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'] = df['race_label'].fillna('Other')  # assign back; inplace fillna on a selected column is deprecated in recent pandas

Output:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White 

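Use .loc instead of apply; the boolean-mask assignments are vectorized.

.loc works in a simple manner: it masks the rows that satisfy the condition, then assigns the value to the selected rows. Because later assignments overwrite earlier ones, the rules are listed from lowest to highest priority.

For more details, see the .loc docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html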
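To make the ordering explicit, here is a minimal sketch on a made-up three-row frame (demo, with only three of the flag columns). It starts from a default label instead of calling fillna at the end, which is a small variation on the code above rather than the answer's exact method.

```python
import pandas as pd

# Hypothetical rows: Hispanic plus Asian, Asian plus White, White only.
demo = pd.DataFrame({
    "eri_hispanic": [1, 0, 0],
    "eri_asian":    [1, 1, 0],
    "eri_white":    [0, 1, 1],
})

# Lowest-priority rule first; each later assignment overwrites earlier ones.
demo["race_label"] = "Other"
demo.loc[demo["eri_white"] == 1, "race_label"] = "White"
demo.loc[demo["eri_asian"] == 1, "race_label"] = "Asian"
demo.loc[demo[["eri_asian", "eri_white"]].sum(axis=1) > 1, "race_label"] = "Two Or More"
demo.loc[demo["eri_hispanic"] == 1, "race_label"] = "Hispanic"

print(demo["race_label"].tolist())  # ['Hispanic', 'Two Or More', 'White']
```

Reversing the order of the assignments would change the labels, so the sequence has to mirror the if/else ladder read from bottom to top.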

Performance metrics:

Accepted Answer:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

>1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Proposed Answer:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'] = df['race_label'].fillna('Other')  # assign back instead of inplace fillna

df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

> 24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Solution 6 - Python

As @user3483203 pointed out, numpy.select is the best approach.

Store your conditional statements and the corresponding actions in two lists:

conds = [
    (df['eri_hispanic'] == 1),
    (df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1)),
    (df['eri_nat_amer'] == 1),
    (df['eri_asian'] == 1),
    (df['eri_afr_amer'] == 1),
    (df['eri_hawaiian'] == 1),
    (df['eri_white'] == 1),
]

actions = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']

You can now use np.select using these lists as its arguments

df['label_race'] = np.select(conds,actions,default='Other')

Reference: https://numpy.org/doc/stable/reference/generated/numpy.select.html

Solution 7 - Python

Yet another (easily generalizable) approach, whose cornerstone is pandas.DataFrame.idxmax. First, the easily generalizable preamble.

# Indeed, all your conditions boil down to the following
_gt_1_key = 'two_or_more'
_lt_1_key = 'other'

# The "dictionary-based" if-else statements
labels = {
    _gt_1_key     : 'Two Or More',
    'eri_hispanic': 'Hispanic',
    'eri_nat_amer': 'A/I AK Native',
    'eri_asian'   : 'Asian',
    'eri_afr_amer': 'Black/AA',
    'eri_hawaiian': 'Haw/Pac Isl.',
    'eri_white'   : 'White',  
    _lt_1_key     : 'Other',
}

# The output-driving 1-0 matrix
mat = df.filter(regex='^eri_').copy()  # `~.copy` to avoid `SettingWithCopyWarning`

... and, finally, in a vectorized fashion:

mat[_gt_1_key] = gt1 = mat.sum(axis=1)
mat[_lt_1_key] = gt1.eq(0).astype(int)
race_label     = mat.idxmax(axis=1).map(labels)

where

>>> race_label
0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

which is a pandas.Series that you can store in df with df['race_label'] = race_label.
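One caveat: the sum above includes eri_hispanic, so a row with the Hispanic flag plus another flag would come out as 'Two Or More', whereas the question wants Hispanic to take precedence. The sample data has no such row, so the output is unaffected. A possible variation, sketched here on a made-up frame, weights the Hispanic flag above any possible non-Hispanic sum so that idxmax always picks it. The weighting trick and the names demo, non_hispanic and score are assumptions of this sketch, not part of the original answer.

```python
import pandas as pd

# Hypothetical rows: Hispanic plus Asian, Asian plus White, Asian only, no flags.
demo = pd.DataFrame({
    "eri_hispanic": [1, 0, 0, 0],
    "eri_asian":    [1, 1, 1, 0],
    "eri_white":    [0, 1, 0, 0],
})

labels = {
    "eri_hispanic": "Hispanic",
    "two_or_more": "Two Or More",
    "eri_asian": "Asian",
    "eri_white": "White",
    "other": "Other",
}

non_hispanic = demo.drop(columns="eri_hispanic")

# Weight the Hispanic flag above the largest possible non-Hispanic sum.
score = pd.DataFrame({"eri_hispanic": demo["eri_hispanic"] * (non_hispanic.shape[1] + 1)})
score = score.join(non_hispanic)
score["two_or_more"] = non_hispanic.sum(axis=1)
score["other"] = (score.max(axis=1) == 0).astype(int)

# idxmax returns the first column holding the row maximum, so a single flag
# (value 1) wins the tie against two_or_more (also 1) by coming earlier.
race_label = score.idxmax(axis=1).map(labels)
print(race_label.tolist())  # ['Hispanic', 'Two Or More', 'Asian', 'Other']
```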

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Dave | View Question on Stackoverflow
Solution 1 - Python | Thomas Kimber | View Answer on Stackoverflow
Solution 2 - Python | Brian Burns | View Answer on Stackoverflow
Solution 3 - Python | user3483203 | View Answer on Stackoverflow
Solution 4 - Python | Gabrielle Simard-Moore | View Answer on Stackoverflow
Solution 5 - Python | Mohamed Thasin ah | View Answer on Stackoverflow
Solution 6 - Python | Sai Pardhu | View Answer on Stackoverflow
Solution 7 - Python | keepAlive | View Answer on Stackoverflow