Pandas: filling missing values by mean in each group

Python Pandas Pandas Groupby Imputation Fillna

Python Problem Overview

This should be straightforward, but the closest thing I've found is this post: https://stackoverflow.com/questions/18265930/pandas-filling-missing-values-within-a-group, and I still can't solve my problem....

Suppose I have the following dataframe

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})

  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3

and I'd like to fill in "NaN" with mean value in each "name" group, i.e.

      name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3

I'm not sure where to go after:

grouped = df.groupby('name').mean()

Thanks a bunch.

Python Solutions

Solution 1 - Python

One way would be to use transform:

>>> df
  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3

Solution 2 - Python

`fillna` + `groupby` + `transform` + `mean`

This seems intuitive:

df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))

The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.

Solution 3 - Python

@DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:

df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A','A', 'B','B','B','B', 'C','C','C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)

... gives ...

  category name  other_value value
0        X    A         10.0   1.0
1        X    A          NaN   NaN
2        X    B          NaN   NaN
3        X    B         20.0   2.0
4        X    B         30.0   3.0
5        X    B         10.0   1.0
6        Y    C         30.0   3.0
7        Y    C          NaN   NaN
8        Y    C         30.0   3.0

In this generalized case we would like to group by category and name, and impute only on value.

This can be solved as follows:

df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))

Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.

Performance test by increasing the dataset by doing ...

big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df

... confirms that this increases the speed proportional to how many columns you don't have to impute:

import pandas as pd
from datetime import datetime

def generate_data():
    ...

t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)

# 0:00:00.016012

t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
    .transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)

# 0:00:00.030022

On a final note you can generalize even further if you want to impute more than one column, but not all:

df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']\
    .transform(lambda x: x.fillna(x.mean()))

Solution 4 - Python

Shortcut:

> Groupby + Apply + Lambda + Fillna + Mean

>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
    0

This solution still works if you want to group by multiple columns to replace missing values.

>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3], 
    'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})  

    
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
       
>>> df
        value name   class
	0    1.0    A     p
	1    1.0    A     p
	2    2.0    B     q
	3    2.0    B     q
	4    3.0    B     r
	5    3.0    B     r
	6    3.5    C     s
	7    4.0    C     s
	8    3.0    C     s

Solution 5 - Python

I'd do it this way

df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')

Solution 6 - Python

The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:

df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
    lambda x: x.fillna(x.mean()))

Solution 7 - Python

def groupMeanValue(group):
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

dft = df.groupby("name").transform(groupMeanValue)

Solution 8 - Python

To summarize all above concerning the efficiency of the possible solution I have a dataset with 97 906 rows and 48 columns. I want to fill in 4 columns with the median of each group. The column I want to group has 26 200 groups.

The first solution

start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds

The second solution

start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds

The next solution I only performed on a subset since it was running too long.

start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds

The following solution follows the same logic as above.

start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds

So it's quite important to choose the right method. Bear in mind that I noticed once a column was not a numeric the times were going up exponentially (makes sense as I was computing the median).

Solution 9 - Python

df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)

Solution 10 - Python

You can also use "dataframe or table_name".apply(lambda x: x.fillna(x.mean())).

Content Type	Original Author	Original Content on Stackoverflow
Question	BlueFeet	View Question on Stackoverflow
Solution 1 - Python	DSM	View Answer on Stackoverflow
Solution 2 - Python	jpp	View Answer on Stackoverflow
Solution 3 - Python	André C. Andersen	View Answer on Stackoverflow
Solution 4 - Python	Ashish Anand	View Answer on Stackoverflow
Solution 5 - Python	piRSquared	View Answer on Stackoverflow
Solution 6 - Python	Philipp Schwarz	View Answer on Stackoverflow
Solution 7 - Python	Prajit Patil	View Answer on Stackoverflow
Solution 8 - Python	Sam	View Answer on Stackoverflow
Solution 9 - Python	Vino Vincent	View Answer on Stackoverflow
Solution 10 - Python	Hardik Pachgade	View Answer on Stackoverflow

Pandas: filling missing values by mean in each group

Python Problem Overview

Python Solutions

Solution 1 - Python

Solution 2 - Python

`fillna` + `groupby` + `transform` + `mean`

Solution 3 - Python

Solution 4 - Python

Solution 5 - Python

Solution 6 - Python

Solution 7 - Python

Solution 8 - Python

Solution 9 - Python

Solution 10 - Python

git: Your branch and 'origin/master' have diverged - how to throw away local commits?

Does a dot have to be escaped in a character class (square brackets) of a regular expression?

Attributions

Python Problem Overview

Python Solutions

Solution 1 - Python

Solution 2 - Python

fillna + groupby + transform + mean

Solution 3 - Python

Solution 4 - Python

Solution 5 - Python

Solution 6 - Python

Solution 7 - Python

Solution 8 - Python

Solution 9 - Python

Solution 10 - Python

git: Your branch and 'origin/master' have diverged - how to throw away local commits?

Does a dot have to be escaped in a character class (square brackets) of a regular expression?

Attributions

`fillna` + `groupby` + `transform` + `mean`