Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?

PythonPandasDataframeFor LoopIteration

Python Problem Overview


I have the following dataframe:

Index_Date    A   B     C    D
================================
2015-01-31    10   10   Nan   10
2015-02-01     2    3   Nan   22 
2015-02-02    10   60   Nan  280
2015-02-03    10  100   Nan  250

Require:

Index_Date    A   B    C     D
================================
2015-01-31    10   10    10   10
2015-02-01     2    3    23   22
2015-02-02    10   60   290  280
2015-02-03    10  100  3000  250

Column C is derived for 2015-01-31 by taking value of D.

Then I need to use the value of C for 2015-01-31 and multiply by the value of A on 2015-02-01 and add B.

I have attempted an apply and a shift using an if else by this gives a key error.

Python Solutions


Solution 1 - Python

First, create the derived value:

df.loc[0, 'C'] = df.loc[0, 'D']

Then iterate through the remaining rows and fill the calculated values:

for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']


  Index_Date   A   B    C    D
0 2015-01-31  10  10   10   10
1 2015-02-01   2   3   23   22
2 2015-02-02  10  60  290  280

Solution 2 - Python

Given a column of numbers:

lst = []
cols = ['A']
for a in range(100, 105):
    lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df

	A
0	100
1	101
2	102
3	103
4	104

You can reference the previous row with shift:

df['Change'] = df.A - df.A.shift(1)
df

	A	Change
0	100	NaN
1	101	1.0
2	102	1.0
3	103	1.0
4	104	1.0

Solution 3 - Python

numba

For recursive calculations which are not vectorisable, numba, which uses JIT-compilation and works with lower level objects, often yields large performance improvements. You need only define a regular for loop and use the decorator @njit or (for older versions) @jit(nopython=True):

For a reasonable size dataframe, this gives a ~30x performance improvement versus a regular for loop:

from numba import jit

@jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res

df['C'] = calculator_nb(*df[list('ABD')].values.T)

n = 10**5
df = pd.concat([df]*n, ignore_index=True)

# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T)  # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T)     # 444 ms per loop

Solution 4 - Python

Applying the recursive function on numpy arrays will be faster than the current answer.

df = pd.DataFrame(np.repeat(np.arange(2, 6),3).reshape(4,3), columns=['A', 'B', 'D'])
new = [df.D.values[0]]
for i in range(1, len(df.index)):
    new.append(new[i-1]*df.A.values[i]+df.B.values[i])
df['C'] = new

Output

      A  B  D    C
   0  1  1  1    1
   1  2  2  2    4
   2  3  3  3   15
   3  4  4  4   64
   4  5  5  5  325

Solution 5 - Python

Although it has been a while since this question was asked, I will post my answer hoping it helps somebody.

Disclaimer: I know this solution is not standard, but I think it works well.

import pandas as pd
import numpy as np

data = np.array([[10, 2, 10, 10],
                 [10, 3, 60, 100],
                 [np.nan] * 4,
                 [10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
               A    B     C    D
 =================================
 2015-01-31    10   10    NaN  10
 2015-02-01    2    3     NaN  22 
 2015-02-02    10   60    NaN  280
 2015-02-03    10   100   NaN  250

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
               A    B     C     D
 =================================
 2015-01-31    10   10    10    10
 2015-02-01    2    3     23    22 
 2015-02-02    10   60    290   280
 2015-02-03    10   100   3000  250

So basically we use a apply from pandas and the help of a global variable that keeps track of the previous calculated value.


Time comparison with a for loop:

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']

%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']

3.2 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value

%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)

1.82 s ± 64.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So 0.57 times faster on average.

Solution 6 - Python

It's an old question but the solution below (without a for loop) might be helpful:

def new_fun(df):
    prev_value = df.iloc[0]["C"]
    def func2(row):
        # non local variable ==> will use pre_value from the new_fun function
        nonlocal prev_value
        new_value =  prev_value * row['A'] + row['B']
        prev_value = row['C']
        return new_value
    # This line might throw a SettingWithCopyWarning warning
    df.iloc[1:]["C"] = df.iloc[1:].apply(func2, axis=1)
    return df

df = new_fun(df)

Solution 7 - Python

In general, the key to avoiding an explicit loop would be to join (merge) 2 instances of the dataframe on rowindex-1==rowindex.

Then you would have a big dataframe containing rows of r and r-1, from where you could do a df.apply() function.

However the overhead of creating the large dataset may offset the benefits of parallel processing...

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionctrl-alt-deleteView Question on Stackoverflow
Solution 1 - PythonStefanView Answer on Stackoverflow
Solution 2 - PythonkztdView Answer on Stackoverflow
Solution 3 - PythonjppView Answer on Stackoverflow
Solution 4 - Pythonuser4322543View Answer on Stackoverflow
Solution 5 - PythoniiprView Answer on Stackoverflow
Solution 6 - PythonWazaaView Answer on Stackoverflow
Solution 7 - PythonMartin AlleyView Answer on Stackoverflow