Counting non-zero values in each column of a DataFrame in Python

Python, Pandas, Dataframe, Pandas Groupby

Python Problem Overview


I have a pandas DataFrame in which the first column is "user_id" and the rest of the columns are tags ("Tag_0" to "Tag_122").

I have the data in the following format:

UserId	Tag_0	Tag_1
7867688	0	5
7867688	0	3
7867688	3	0
7867688	3.5	3.5
7867688	4	4
7867688	3.5	0

My aim is to achieve Sum(Tag)/Count(NonZero(Tags)) for each user_id

df.groupby('user_id').sum() gives me sum(tag), but I am clueless about counting non-zero values.

Is it possible to achieve Sum(Tag)/Count(NonZero(Tags)) in one command?

In MySQL I could achieve this as follows:

select user_id, sum(tag)/count(nullif(tag,0)) from table group by 1

Any help would be appreciated.

Python Solutions


Solution 1 - Python

My favorite way of getting the number of non-zeros in each column is

df.astype(bool).sum(axis=0)

For the number of non-zeros in each row use

df.astype(bool).sum(axis=1)

(Thanks to Skulas)

If you have NaNs in your df, you should make these zero first; otherwise they will be counted as non-zero (NaN is truthy when cast to bool).

df.fillna(0).astype(bool).sum(axis=1)

(Thanks to SirC)
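Putting this together with the question's sample data, a minimal sketch (column names and values taken from the table above, with one NaN added to show the fillna step):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, np.nan],
})

# Zero out NaNs first, then count truthy (non-zero) entries per column
counts = df.fillna(0).astype(bool).sum(axis=0)
print(counts)  # Tag_0 -> 4, Tag_1 -> 4
```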

Solution 2 - Python

Why not use np.count_nonzero?

  1. To count the non-zeros of an entire dataframe, np.count_nonzero(df)
  2. To count the non-zeros in each column, np.count_nonzero(df, axis=0)
  3. To count the non-zeros in each row, np.count_nonzero(df, axis=1)

It works with dates too.
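For instance, on the tag columns from the question's sample data, the three calls behave like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})

whole = np.count_nonzero(df)           # 8: non-zeros in the entire frame
per_col = np.count_nonzero(df, axis=0) # [4, 4]: one count per column
per_row = np.count_nonzero(df, axis=1) # [1, 1, 1, 2, 2, 1]: one count per row
```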

Solution 3 - Python

To count non-zero values, just do (column != 0).sum(), where column is the data you want to count. column != 0 returns a boolean array, and True counts as 1 and False as 0, so summing it gives you the number of elements that match the condition.

So to get your desired result, do

df.groupby('user_id').apply(lambda group: group.sum() / (group != 0).sum())
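Applied to the sample data (selecting only the tag columns explicitly so the id column itself is not averaged; the column names here follow the sample table), this might look like:

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [7867688] * 6,
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})

tags = ["Tag_0", "Tag_1"]
# For each user: sum of each tag column divided by its non-zero count
result = df.groupby("UserId").apply(
    lambda g: g[tags].sum() / (g[tags] != 0).sum()
)
print(result)  # Tag_0 -> 3.5, Tag_1 -> 3.875
```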

Solution 4 - Python

I know this question is old but it seems OP's aim is different from the question title:

> My aim is to achieve Sum(Tag)/Count(NonZero(Tags)) for each user_id...


For OP's aim, we could replace 0 with NaN and use groupby + mean (this works because mean skips NaN by default):

import numpy as np

out = df.replace(0, np.nan).groupby('UserId', as_index=False).mean()

Output:

    UserId  Tag_0  Tag_1
0  7867688    3.5  3.875

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Harsh Singal | View Question on Stackoverflow
Solution 1 - Python | The Unfun Cat | View Answer on Stackoverflow
Solution 2 - Python | Sarah | View Answer on Stackoverflow
Solution 3 - Python | BrenBarn | View Answer on Stackoverflow
Solution 4 - Python | user7864386 | View Answer on Stackoverflow