summing two columns in a pandas dataframe

PythonPandas

Python Problem Overview


when I use this syntax it creates a series rather than adding a column to my new dataframe sum.

My code:

sum = data['variance'] = data.budget + data.actual

My dataframe data currently has everything except the budget - actual column. How do I create a variance column?

	cluster  date                  budget  actual budget - actual
0 	a 	     2014-01-01  00:00:00  11000   10000       1000
1 	a 	     2014-02-01  00:00:00  1200    1000
2 	a 	     2014-03-01  00:00:00  200 	   100
3 	b 	     2014-04-01  00:00:00  200 	   300
4 	b 	     2014-05-01  00:00:00  400 	   450
5 	c 	     2014-06-01  00:00:00  700 	   1000
6 	c 	     2014-07-01  00:00:00  1200    1000
7 	c 	     2014-08-01  00:00:00  200 	   100
8 	c 	     2014-09-01  00:00:00  200 	   300

Python Solutions


Solution 1 - Python

I think you've misunderstood some python syntax, the following does two assignments:

In [11]: a = b = 1

In [12]: a
Out[12]: 1

In [13]: b
Out[13]: 1

So in your code it was as if you were doing:

sum = df['budget'] + df['actual']  # a Series
# and
df['variance'] = df['budget'] + df['actual']  # assigned to a column

The latter creates a new column for df:

In [21]: df
Out[21]:
  cluster                 date  budget  actual
0       a  2014-01-01 00:00:00   11000   10000
1       a  2014-02-01 00:00:00    1200    1000
2       a  2014-03-01 00:00:00     200     100
3       b  2014-04-01 00:00:00     200     300
4       b  2014-05-01 00:00:00     400     450
5       c  2014-06-01 00:00:00     700    1000
6       c  2014-07-01 00:00:00    1200    1000
7       c  2014-08-01 00:00:00     200     100
8       c  2014-09-01 00:00:00     200     300
    
In [22]: df['variance'] = df['budget'] + df['actual']

In [23]: df
Out[23]:
  cluster                 date  budget  actual  variance
0       a  2014-01-01 00:00:00   11000   10000     21000
1       a  2014-02-01 00:00:00    1200    1000      2200
2       a  2014-03-01 00:00:00     200     100       300
3       b  2014-04-01 00:00:00     200     300       500
4       b  2014-05-01 00:00:00     400     450       850
5       c  2014-06-01 00:00:00     700    1000      1700
6       c  2014-07-01 00:00:00    1200    1000      2200
7       c  2014-08-01 00:00:00     200     100       300
8       c  2014-09-01 00:00:00     200     300       500

As an aside, you shouldn't use sum as a variable name as the overrides the built-in sum function.

Solution 2 - Python

df['variance'] = df.loc[:,['budget','actual']].sum(axis=1)

Solution 3 - Python

Same thing can be done using lambda function. Here I am reading the data from a xlsx file.

import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name = 4)
print df

Output:

  cluster Unnamed: 1      date  budget  actual
0       a 2014-01-01  00:00:00   11000   10000
1       a 2014-02-01  00:00:00    1200    1000
2       a 2014-03-01  00:00:00     200     100
3       b 2014-04-01  00:00:00     200     300
4       b 2014-05-01  00:00:00     400     450
5       c 2014-06-01  00:00:00     700    1000
6       c 2014-07-01  00:00:00    1200    1000
7       c 2014-08-01  00:00:00     200     100
8       c 2014-09-01  00:00:00     200     300

Sum two columns into 3rd new one.

df['variance'] = df.apply(lambda x: x['budget'] + x['actual'], axis=1)
print df

Output:

  cluster Unnamed: 1      date  budget  actual  variance
0       a 2014-01-01  00:00:00   11000   10000     21000
1       a 2014-02-01  00:00:00    1200    1000      2200
2       a 2014-03-01  00:00:00     200     100       300
3       b 2014-04-01  00:00:00     200     300       500
4       b 2014-05-01  00:00:00     400     450       850
5       c 2014-06-01  00:00:00     700    1000      1700
6       c 2014-07-01  00:00:00    1200    1000      2200
7       c 2014-08-01  00:00:00     200     100       300
8       c 2014-09-01  00:00:00     200     300       500

Solution 4 - Python

You could also use the .add() function:

 df.loc[:,'variance'] = df.loc[:,'budget'].add(df.loc[:,'actual'])

Solution 5 - Python

This is the most elegant solution which follows DRY and work absolutely great.

dataframe_name['col1', 'col2', 'col3'].sum(axis = 1, skipna = True)

Thank you.

Solution 6 - Python

If "budget" has any NaN values but you don't want it to sum to NaN then try:

def fun (b, a):
    if math.isnan(b):
        return a
    else:
        return b + a

f = np.vectorize(fun, otypes=[float])

df['variance'] = f(df['budget'], df_Lp['actual'])

Solution 7 - Python

eval lets you sum and create columns right away:

In [12]: data.eval('variance = budget + actual', inplace=True)

In [13]: data
Out[13]: 
        cluster      date  budget  actual  variance
0 a  2014-01-01  00:00:00   11000   10000     21000
1 a  2014-02-01  00:00:00    1200    1000      2200
2 a  2014-03-01  00:00:00     200     100       300
3 b  2014-04-01  00:00:00     200     300       500
4 b  2014-05-01  00:00:00     400     450       850
5 c  2014-06-01  00:00:00     700    1000      1700
6 c  2014-07-01  00:00:00    1200    1000      2200
7 c  2014-08-01  00:00:00     200     100       300
8 c  2014-09-01  00:00:00     200     300       500

Since inplace=True you don't need to assign it back to data.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionyoshiserryView Question on Stackoverflow
Solution 1 - PythonAndy HaydenView Answer on Stackoverflow
Solution 2 - PythonpylistView Answer on Stackoverflow
Solution 3 - PythonRishi BansalView Answer on Stackoverflow
Solution 4 - PythonArchieView Answer on Stackoverflow
Solution 5 - PythonSahaj Raj MallaView Answer on Stackoverflow
Solution 6 - PythonR. CoxView Answer on Stackoverflow
Solution 7 - PythonrachwaView Answer on Stackoverflow