Pandas - Compute z-score for all columns

Python, Pandas, Indexing, Statistics

Python Problem Overview


I have a dataframe with a single column of IDs; all other columns hold numerical values for which I want to compute z-scores. Here's a subsection of it:

ID      Age    BMI    Risk Factor
PT 6    48     19.3    4
PT 8    43     20.9    NaN
PT 2    39     18.1    3
PT 9    41     19.5    NaN

Some of my columns contain NaN values, which I do not want to include in the z-score calculations, so I intend to use a solution offered for this question: https://stackoverflow.com/questions/23451244/how-to-zscore-normalize-pandas-column-with-nans

df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)

I'm interested in applying this solution to all of my columns except the ID column to produce a new dataframe which I can save as an Excel file using

df2.to_excel("Z-Scores.xlsx")

So basically: how can I compute z-scores for each column (ignoring NaN values) and push everything into a new dataframe?

SIDENOTE: there is a concept in pandas called "indexing" which intimidates me because I do not understand it well. If indexing is a crucial part of solving this problem, please dumb down your explanation of indexing.

Python Solutions


Solution 1 - Python

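Build a list from the columns and remove the column you don't want to calculate the z-score for. Note that in the output below the sample was split on whitespace, so `PT 6` became ID `PT` and Age `6`, and `Risk Factor` became two columns; the values are shifted accordingly, but the method is unaffected: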

In [66]:
cols = list(df.columns)
cols.remove('ID')
df[cols]
 
Out[66]:
   Age  BMI  Risk  Factor
0    6   48  19.3       4
1    8   43  20.9     NaN
2    2   39  18.1       3
3    9   41  19.5     NaN
In [68]:
# now iterate over the remaining columns and create a new zscore column
for col in cols:
    col_zscore = col + '_zscore'
    df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
df
Out[68]:
   ID  Age  BMI  Risk  Factor  Age_zscore  BMI_zscore  Risk_zscore  \
0  PT    6   48  19.3       4   -0.093250    1.569614    -0.150946   
1  PT    8   43  20.9     NaN    0.652753    0.074744     1.459148   
2  PT    2   39  18.1       3   -1.585258   -1.121153    -1.358517   
3  PT    9   41  19.5     NaN    1.025755   -0.523205     0.050315   

   Factor_zscore  
0              1  
1            NaN  
2             -1  
3            NaN  
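
To get the separate dataframe the question asks for, rather than extra columns on `df`, the same formula can be applied to all numeric columns at once. This is a minimal sketch, assuming the sample rows from the question are typed in by hand. The name `df2` follows the question, and the `to_excel` call is left commented out because it needs an Excel writer such as openpyxl installed.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, entered by hand for illustration.
df = pd.DataFrame({
    "ID": ["PT 6", "PT 8", "PT 2", "PT 9"],
    "Age": [48, 43, 39, 41],
    "BMI": [19.3, 20.9, 18.1, 19.5],
    "Risk Factor": [4, np.nan, 3, np.nan],
})

cols = [c for c in df.columns if c != "ID"]

# mean() and std() skip NaN by default, so missing values do not
# affect the statistics and simply stay NaN in the result.
zscores = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

# New dataframe: the ID column followed by the z-score columns.
df2 = pd.concat([df[["ID"]], zscores], axis=1)
print(df2)

# df2.to_excel("Z-Scores.xlsx", index=False)  # requires an Excel writer, e.g. openpyxl
```

The original `df` is left unchanged, so it can still be saved or inspected alongside `df2`.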

Solution 2 - Python

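Using SciPy's zscore function (by default it propagates NaN, so a column containing NaN comes back as all NaN unless you pass nan_policy='omit'):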

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])
df

|    |   A |   B |   C |
|---:|----:|----:|----:|
|  0 | 163 | 163 | 159 |
|  1 | 120 | 153 | 181 |
|  2 | 130 | 199 | 108 |
|  3 | 108 | 188 | 157 |
|  4 | 109 | 171 | 119 |

from scipy.stats import zscore
df.apply(zscore)

|    |         A |         B |         C |
|---:|----------:|----------:|----------:|
|  0 |  1.83447  | -0.708023 |  0.523362 |
|  1 | -0.297482 | -1.30804  |  1.3342   |
|  2 |  0.198321 |  1.45205  | -1.35632  |
|  3 | -0.892446 |  0.792025 |  0.449649 |
|  4 | -0.842866 | -0.228007 | -0.950897 |

If not all the columns of your data frame are numeric, then you can apply the Z-score function only to the numeric columns using the select_dtypes function:

# Note that `select_dtypes` returns a data frame; `.columns` keeps only its column labels
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols].apply(zscore)

|    |         A |         B |         C |
|---:|----------:|----------:|----------:|
|  0 |  1.83447  | -0.708023 |  0.523362 |
|  1 | -0.297482 | -1.30804  |  1.3342   |
|  2 |  0.198321 |  1.45205  | -1.35632  |
|  3 | -0.892446 |  0.792025 |  0.449649 |
|  4 | -0.842866 | -0.228007 | -0.950897 |
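
For columns that contain NaN, a pandas-only variant of the `select_dtypes` approach may be convenient, since `mean()` and `std()` skip missing values by default. The sketch below uses made-up data (the column names `name`, `x` and `y` are hypothetical) and cross-checks the result against NumPy's `nanmean` and `nanstd`.

```python
import numpy as np
import pandas as pd

# Made-up data: one text column and two numeric columns, one with NaN.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, np.nan, 30.0, np.nan],
})

numeric = df.select_dtypes(include="number")  # drops the text column

# pandas skips NaN when computing mean and std.
z = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Same calculation with NumPy's NaN-aware functions, for comparison.
arr = numeric.to_numpy()
z_np = (arr - np.nanmean(arr, axis=0)) / np.nanstd(arr, axis=0)

print(z)
print(np.allclose(z.to_numpy(), z_np, equal_nan=True))
```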

Solution 3 - Python

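If you want to calculate the z-score for all of the columns (assuming they are all numeric), you can just use the following. Note that `df.std()` defaults to `ddof=1` (sample standard deviation), so the results differ slightly from `scipy.stats.zscore`, which uses `ddof=0`: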

df_zscore = (df - df.mean())/df.std()
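
The choice of `ddof` changes every z-score by the same constant factor. A small sketch with made-up numbers, comparing the pandas default (`ddof=1`) with the population version (`ddof=0`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0]})
n = len(df)

z_sample = (df - df.mean()) / df.std()            # pandas default, ddof=1
z_population = (df - df.mean()) / df.std(ddof=0)  # population standard deviation

# The two differ by a constant factor of sqrt((n - 1) / n).
# The middle value has z = 0 in both, so its ratio is NaN and is dropped.
ratio = (z_sample / z_population)["A"].dropna()
print(ratio.unique(), np.sqrt((n - 1) / n))
```

For large `n` the factor approaches 1, so the distinction matters mostly for short columns.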

Solution 4 - Python

Here's another way of getting the z-score, using a custom function:

In [6]: import pandas as pd; import numpy as np

In [7]: np.random.seed(0) # Fixes the random seed

In [8]: df = pd.DataFrame(np.random.randn(5,3), columns=["randomA", "randomB","randomC"])

In [9]: df # watch output of dataframe
Out[9]:
    randomA   randomB   randomC
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

## Create custom function to compute Zscore 
In [10]: def z_score(df):
   ....:         df = df.copy()  # work on a copy so the caller's column names are not changed
   ....:         df.columns = [x + "_zscore" for x in df.columns.tolist()]
   ....:         return ((df - df.mean())/df.std(ddof=0))
   ....:

## make sure you filter or select columns of interest before passing dataframe to function
In [11]: z_score(df) # compute Zscore
Out[11]:
   randomA_zscore  randomB_zscore  randomC_zscore
0        0.798350       -0.106335        0.731041
1        1.505002        1.939828       -1.577295
2       -0.407899       -0.875374       -0.545799
3       -1.207392       -0.463464        1.292230
4       -0.688061       -0.494655        0.099824

Result reproduced using scipy.stats zscore

In [12]: from scipy.stats import zscore

In [13]: df.apply(zscore) # (Credit: Manuel)
Out[13]:
    randomA   randomB   randomC
0  0.798350 -0.106335  0.731041
1  1.505002  1.939828 -1.577295
2 -0.407899 -0.875374 -0.545799
3 -1.207392 -0.463464  1.292230
4 -0.688061 -0.494655  0.099824
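
A variant of the custom function that never touches its input can be written with `add_suffix`, and the result joined back onto the original columns. This is only a sketch: the type hints and the name `result` are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def z_score(frame: pd.DataFrame) -> pd.DataFrame:
    """Return z-scores (ddof=0) in new `*_zscore` columns; `frame` is left untouched."""
    z = (frame - frame.mean()) / frame.std(ddof=0)
    return z.add_suffix("_zscore")

np.random.seed(0)  # same seed as in the example above
df = pd.DataFrame(np.random.randn(5, 3), columns=["randomA", "randomB", "randomC"])

result = df.join(z_score(df))  # original columns followed by z-score columns
print(result.columns.tolist())
```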

Solution 5 - Python

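For the z-score, we can call scipy.stats.zscore directly on an array, as in its documentation, instead of going through `apply`. Here `cols` is the list of numeric column names; axis=0 standardizes each column, whereas axis=1 would standardize each row:

from scipy.stats import zscore
df_zscore = zscore(df[cols].to_numpy(), axis=0)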
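
The effect of the `axis` argument is easy to get backwards. The sketch below uses a hand-written NumPy helper (`zscore_np` is a hypothetical name, not a SciPy function) purely to show which direction each axis standardizes.

```python
import numpy as np

def zscore_np(a, axis=0):
    """Hand-written z-score (ddof=0), used here only to illustrate `axis`."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=axis, keepdims=True)) / a.std(axis=axis, keepdims=True)

# Rows are observations, columns are variables.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])

by_column = zscore_np(data, axis=0)  # each column has mean 0
by_row = zscore_np(data, axis=1)     # each row has mean 0

print(by_column.mean(axis=0))
print(by_row.mean(axis=1))
```

For a dataframe whose columns are the variables, as in the question, standardizing along `axis=0` is the per-column calculation.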

Solution 6 - Python

The almost one-liner solution:

# .ix was removed in pandas 1.0; .iloc selects by position (all columns after the first)
df2 = (df.iloc[:,1:] - df.iloc[:,1:].mean()) / df.iloc[:,1:].std()
df2['ID'] = df['ID']
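
On the question's sidenote about indexing: the index is simply the set of row labels. One option, sketched here with the question's sample rows typed in by hand, is to make `ID` the row labels so that every remaining column is numeric and the arithmetic applies to the whole frame. The intermediate name `values` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["PT 6", "PT 8", "PT 2", "PT 9"],
    "Age": [48, 43, 39, 41],
    "BMI": [19.3, 20.9, 18.1, 19.5],
    "Risk Factor": [4, np.nan, 3, np.nan],
})

# Use the IDs as row labels; only numeric columns remain as data.
values = df.set_index("ID")

# ddof=0 as in the question's formula; NaN values are skipped.
df2 = (values - values.mean()) / values.std(ddof=0)

# Rows can now be looked up by ID.
print(df2.loc["PT 2"])
```

Calling `df2.reset_index()` afterwards turns `ID` back into an ordinary column if that is preferred.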

Solution 7 - Python

When we are dealing with time-series, calculating z-scores (or anomalies - not the same thing, but you can adapt this code easily) is a bit more complicated. For example, you have 10 years of temperature data measured weekly. To calculate z-scores for the whole time-series, you have to know the means and standard deviations for each day of the year. So, let's get started:

Assume you have a pandas DataFrame. First of all, you need a DatetimeIndex. If you don't have one yet but do have a column with dates, make that column your index; pandas will try to guess the date format. You can check what you currently have with:

type(df.index)

If you don't have one, let's make it.

df.index = pd.DatetimeIndex(df[datecolumn])
df = df.drop(datecolumn,axis=1)

The next step is to calculate the mean and standard deviation for each group of days. For this, we use the groupby method (the old top-level pd.groupby function no longer exists in current pandas).

mean = df.groupby(df.index.dayofyear).mean()      # skips NaN, like np.nanmean
std = df.groupby(df.index.dayofyear).std(ddof=0)  # ddof=0 matches np.nanstd

Finally, we loop through all the dates, performing the calculation (value - mean)/stddev; however, as mentioned, for time-series this is not so straightforward.

df2 = df.copy() #keep a copy for future comparisons 
for y in np.unique(df.index.year):
    for d in np.unique(df.index.dayofyear):
        mask = (df.index.year==y) & (df.index.dayofyear==d)  # rows for this year and day
        df2.loc[mask] = (df.loc[mask] - mean.loc[d])/std.loc[d]  # .ix was removed in pandas 1.0
df2.index.name = 'date' #this is just to look nicer

df2 #this is your z-score dataset.

The logic inside the for loops is: for a given year we have to match each dayofyear to its mean and stdev. We run this for all the years in your time-series.
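
The same per-day-of-year standardization can also be sketched without explicit loops, using `groupby(...).transform`. The data below is synthetic and the column name `temp` is hypothetical; three non-leap years are used so that every day of the year has the same number of observations.

```python
import numpy as np
import pandas as pd

# Synthetic daily data over three years, for illustration only.
rng = np.random.default_rng(0)
idx = pd.date_range("2001-01-01", "2003-12-31", freq="D")
df = pd.DataFrame({"temp": rng.normal(size=len(idx))}, index=idx)

# Group rows by day of year and standardize each value against the
# mean and (population) standard deviation of its own group.
grouped = df.groupby(df.index.dayofyear)
df2 = (df - grouped.transform("mean")) / grouped.transform(lambda s: s.std(ddof=0))
df2.index.name = "date"

print(df2.head())
```

`transform` returns a result aligned to the original rows, which is what lets the subtraction and division work element by element.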

Solution 8 - Python

To calculate a z-score for an entire column quickly, do as follows:

from scipy.stats import zscore
import pandas as pd

df = pd.DataFrame({'num_1': [1,2,3,4,5,6,7,8,9,3,4,6,5,7,3,2,9]})
df['num_1_zscore'] = zscore(df['num_1'])

print(df)  # in a Jupyter notebook, display(df) also works

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stack Overflow |
|:---|:---|:---|
| Question | Slavatron | View Question on Stack Overflow |
| Solution 1 - Python | EdChum | View Answer on Stack Overflow |
| Solution 2 - Python | Manuel | View Answer on Stack Overflow |
| Solution 3 - Python | Joe Bathelt | View Answer on Stack Overflow |
| Solution 4 - Python | Surya | View Answer on Stack Overflow |
| Solution 5 - Python | ibozkurt79 | View Answer on Stack Overflow |
| Solution 6 - Python | Josh Chartier | View Answer on Stack Overflow |
| Solution 7 - Python | Deninhos | View Answer on Stack Overflow |
| Solution 8 - Python | Brad G Grounds | View Answer on Stack Overflow |