unique combinations of values in selected columns in pandas data frame and count

PythonPandas

Python Problem Overview


I have my data in pandas data frame as follows:

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                   'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

So, my data looks like this

----------------------------
index         A        B
0           yes      yes
1           yes       no
2           yes       no
3           yes       no
4            no      yes
5            no      yes
6           yes       no
7           yes      yes
8           yes      yes
9            no       no
-----------------------------

I would like to transform it to another data frame. The expected output can be shown in the following python script:

output = pd.DataFrame({'A':['no','no','yes','yes'],'B':['no','yes','no','yes'],'count':[1,2,4,3]})

So, my expected output looks like this

--------------------------------------------
index      A       B       count
--------------------------------------------
0         no       no        1
1         no      yes        2
2        yes       no        4
3        yes      yes        3
--------------------------------------------

Actually, I can achieve to find all combinations and count them by using the following command: mytable = df1.groupby(['A','B']).size()

However, it turns out that such combinations are in a single column. I would like to separate each value in a combination into different column and also add one more column for the result of counting. Is it possible to do that? May I have your suggestions? Thank you in advance.

Python Solutions


Solution 1 - Python

You can groupby on cols 'A' and 'B' and call size and then reset_index and rename the generated column:

In [26]:

df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[26]:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

update

A little explanation, by grouping on the 2 columns, this groups rows where A and B values are the same, we call size which returns the number of unique groups:

In[202]:
df1.groupby(['A','B']).size()

Out[202]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

So now to restore the grouped columns, we call reset_index:

In[203]:
df1.groupby(['A','B']).size().reset_index()

Out[203]: 
     A    B  0
0   no   no  1
1   no  yes  2
2  yes   no  4
3  yes  yes  3

This restores the indices but the size aggregation is turned into a generated column 0, so we have to rename this:

In[204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})

Out[204]: 
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

groupby does accept the arg as_index which we could have set to False so it doesn't make the grouped columns the index, but this generates a series and you'd still have to restore the indices and so on....:

In[205]:
df1.groupby(['A','B'], as_index=False).size()

Out[205]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

Solution 2 - Python

In Pandas 1.1.0 you can use the method value_counts with DataFrames:

df.value_counts() # or df[['A', 'B']].value_counts()

Result:

A    B
yes  no     4
     yes    3
no   yes    2
     no     1
dtype: int64

Convert index to columns and sort by value counts:

df.value_counts(ascending=True).reset_index(name='count')

Result:

     A    B  count
0   no   no      1
1   no  yes      2
2  yes  yes      3
3  yes   no      4

Solution 3 - Python

Slightly related, I was looking for the unique combinations and I came up with this method:

def unique_columns(df,columns):

    result = pd.Series(index = df.index)

    groups = meta_data_csv.groupby(by = columns)
    for name,group in groups:
       is_unique = len(group) == 1
       result.loc[group.index] = is_unique
    
    assert not result.isnull().any()
    
    return result

And if you only want to assert that all combinations are unique:

df1.set_index(['A','B']).index.is_unique

Solution 4 - Python

I haven't done time test with this but it was fun to try. Basically convert two columns to one column of tuples. Now convert that to a dataframe, do 'value_counts()' which finds the unique elements and counts them. Fiddle with zip again and put the columns in order you want. You can probably make the steps more elegant but working with tuples seems more natural to me for this problem

b = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

b['count'] = pd.Series(zip(*[b.A,b.B]))
df = pd.DataFrame(b['count'].value_counts().reset_index())
df['A'], df['B'] = zip(*df['index'])
df = df.drop(columns='index')[['A','B','count']]

Solution 5 - Python

Based on the accepted answer and @Bryan P's comment relating to the differences between count() and size(), I opted for count() for cleaner code as below :

df1.groupby(['A','B']).count().reset_index()

Solution 6 - Python

Placing @EdChum's very nice answer into a function count_unique_index. The unique method only works on pandas series, not on data frames. The function below reproduces the behavior of the unique function in R:

> unique returns a vector, data frame or array like x but with duplicate elements/rows removed.

And adds a count of the occurrences as requested by the OP.

def count_unique_index(df, by):                                                                                                                                                 
    return df.groupby(by).size().reset_index().rename(columns={0:'count'})                                                                                                      

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],                                                                                             
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})   
                                                                                                                                                                                 
count_unique_index(df1, ['A','B'])                                                                                                                                              
     A    B  count                                                                                                                                                                  
0   no   no      1                                                                                                                                                                  
1   no  yes      2                                                                                                                                                                  
2  yes   no      4                                                                                                                                                                  
3  yes  yes      3

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionRatchainant ThammasudjaritView Question on Stackoverflow
Solution 1 - PythonEdChumView Answer on Stackoverflow
Solution 2 - PythonMykola ZotkoView Answer on Stackoverflow
Solution 3 - PythonMartin AlexanderssonView Answer on Stackoverflow
Solution 4 - PythonMikeB2019xView Answer on Stackoverflow
Solution 5 - PythonTendaiMView Answer on Stackoverflow
Solution 6 - PythonPaul RougieuxView Answer on Stackoverflow