summing two columns in a pandas dataframe

Python Problem Overview

when I use this syntax it creates a series rather than adding a column to my new dataframe sum.

My code:

sum = data['variance'] = data.budget + data.actual

My dataframe data currently has everything except the budget - actual column. How do I create a variance column?

	cluster  date                  budget  actual budget - actual
0 	a 	     2014-01-01  00:00:00  11000   10000       1000
1 	a 	     2014-02-01  00:00:00  1200    1000
2 	a 	     2014-03-01  00:00:00  200 	   100
3 	b 	     2014-04-01  00:00:00  200 	   300
4 	b 	     2014-05-01  00:00:00  400 	   450
5 	c 	     2014-06-01  00:00:00  700 	   1000
6 	c 	     2014-07-01  00:00:00  1200    1000
7 	c 	     2014-08-01  00:00:00  200 	   100
8 	c 	     2014-09-01  00:00:00  200 	   300

Python Solutions

Solution 1 - Python

I think you've misunderstood some python syntax, the following does two assignments:

In [11]: a = b = 1

In [12]: a
Out[12]: 1

In [13]: b
Out[13]: 1

So in your code it was as if you were doing:

sum = df['budget'] + df['actual']  # a Series
# and
df['variance'] = df['budget'] + df['actual']  # assigned to a column

The latter creates a new column for df:

In [21]: df
Out[21]:
  cluster                 date  budget  actual
0       a  2014-01-01 00:00:00   11000   10000
1       a  2014-02-01 00:00:00    1200    1000
2       a  2014-03-01 00:00:00     200     100
3       b  2014-04-01 00:00:00     200     300
4       b  2014-05-01 00:00:00     400     450
5       c  2014-06-01 00:00:00     700    1000
6       c  2014-07-01 00:00:00    1200    1000
7       c  2014-08-01 00:00:00     200     100
8       c  2014-09-01 00:00:00     200     300
    
In [22]: df['variance'] = df['budget'] + df['actual']

In [23]: df
Out[23]:
  cluster                 date  budget  actual  variance
0       a  2014-01-01 00:00:00   11000   10000     21000
1       a  2014-02-01 00:00:00    1200    1000      2200
2       a  2014-03-01 00:00:00     200     100       300
3       b  2014-04-01 00:00:00     200     300       500
4       b  2014-05-01 00:00:00     400     450       850
5       c  2014-06-01 00:00:00     700    1000      1700
6       c  2014-07-01 00:00:00    1200    1000      2200
7       c  2014-08-01 00:00:00     200     100       300
8       c  2014-09-01 00:00:00     200     300       500

As an aside, you shouldn't use sum as a variable name as the overrides the built-in sum function.

Solution 2 - Python

df['variance'] = df.loc[:,['budget','actual']].sum(axis=1)

Solution 3 - Python

Same thing can be done using lambda function. Here I am reading the data from a xlsx file.

import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name = 4)
print df

Output:

  cluster Unnamed: 1      date  budget  actual
0       a 2014-01-01  00:00:00   11000   10000
1       a 2014-02-01  00:00:00    1200    1000
2       a 2014-03-01  00:00:00     200     100
3       b 2014-04-01  00:00:00     200     300
4       b 2014-05-01  00:00:00     400     450
5       c 2014-06-01  00:00:00     700    1000
6       c 2014-07-01  00:00:00    1200    1000
7       c 2014-08-01  00:00:00     200     100
8       c 2014-09-01  00:00:00     200     300

Sum two columns into 3rd new one.

df['variance'] = df.apply(lambda x: x['budget'] + x['actual'], axis=1)
print df

Output:

  cluster Unnamed: 1      date  budget  actual  variance
0       a 2014-01-01  00:00:00   11000   10000     21000
1       a 2014-02-01  00:00:00    1200    1000      2200
2       a 2014-03-01  00:00:00     200     100       300
3       b 2014-04-01  00:00:00     200     300       500
4       b 2014-05-01  00:00:00     400     450       850
5       c 2014-06-01  00:00:00     700    1000      1700
6       c 2014-07-01  00:00:00    1200    1000      2200
7       c 2014-08-01  00:00:00     200     100       300
8       c 2014-09-01  00:00:00     200     300       500

Solution 4 - Python

You could also use the .add() function:

 df.loc[:,'variance'] = df.loc[:,'budget'].add(df.loc[:,'actual'])

Solution 5 - Python

This is the most elegant solution which follows DRY and work absolutely great.

dataframe_name['col1', 'col2', 'col3'].sum(axis = 1, skipna = True)

Thank you.

Solution 6 - Python

If "budget" has any NaN values but you don't want it to sum to NaN then try:

def fun (b, a):
    if math.isnan(b):
        return a
    else:
        return b + a

f = np.vectorize(fun, otypes=[float])

df['variance'] = f(df['budget'], df_Lp['actual'])

Solution 7 - Python

eval lets you sum and create columns right away:

In [12]: data.eval('variance = budget + actual', inplace=True)

In [13]: data
Out[13]: 
        cluster      date  budget  actual  variance
0 a  2014-01-01  00:00:00   11000   10000     21000
1 a  2014-02-01  00:00:00    1200    1000      2200
2 a  2014-03-01  00:00:00     200     100       300
3 b  2014-04-01  00:00:00     200     300       500
4 b  2014-05-01  00:00:00     400     450       850
5 c  2014-06-01  00:00:00     700    1000      1700
6 c  2014-07-01  00:00:00    1200    1000      2200
7 c  2014-08-01  00:00:00     200     100       300
8 c  2014-09-01  00:00:00     200     300       500

Since inplace=True you don't need to assign it back to data.

Content Type	Original Author	Original Content on Stackoverflow
Question	yoshiserry	View Question on Stackoverflow
Solution 1 - Python	Andy Hayden	View Answer on Stackoverflow
Solution 2 - Python	pylist	View Answer on Stackoverflow
Solution 3 - Python	Rishi Bansal	View Answer on Stackoverflow
Solution 4 - Python	Archie	View Answer on Stackoverflow
Solution 5 - Python	Sahaj Raj Malla	View Answer on Stackoverflow
Solution 6 - Python	R. Cox	View Answer on Stackoverflow
Solution 7 - Python	rachwa	View Answer on Stackoverflow