Python: Random selection per group

Tags: Python, Random, Pandas

Python Problem Overview


Say that I have a dataframe that looks like:

Name Group_Id
AAA  1
ABC  1
CCC  2
XYZ  2
DEF  3 
YYH  3

How could I randomly select one (or more) row for each Group_Id? Say that I want one random draw per Group_Id, I would get:

Name Group_Id
AAA  1
XYZ  2
DEF  3
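To try the answers below, the example frame from the question can be rebuilt as in the following sketch. The column names are taken from the question; the variable name df is the one the answers assume.

```python
import pandas as pd

# Example frame from the question: six names spread over three groups.
df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
    "Group_Id": [1, 1, 2, 2, 3, 3],
})
print(df)
```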

Python Solutions


Solution 1 - Python

From pandas 0.16.x onwards, pd.DataFrame.sample returns a random sample of items from an axis of the object. Applied per group, it draws one row from each Group_Id:

In [664]: df.groupby('Group_Id').apply(lambda x: x.sample(1)).reset_index(drop=True)
Out[664]:
  Name  Group_Id
0  ABC         1
1  XYZ         2
2  DEF         3
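As a possible alternative, assuming pandas 1.1 or later is available, the groupby object has its own sample method, so the apply step can be skipped. This is a minimal sketch on the question's data; random_state is set only to make the draw repeatable and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
    "Group_Id": [1, 1, 2, 2, 3, 3],
})

# DataFrameGroupBy.sample draws n rows from every group.
# random_state is fixed here only so repeated runs give the same rows.
sampled = df.groupby("Group_Id").sample(n=1, random_state=0)
print(sampled.reset_index(drop=True))
```

Passing frac instead of n draws a fraction of each group rather than a fixed number of rows.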

Solution 2 - Python

import numpy as np

size = 2        # sample size
replace = True  # with replacement
# Pick `size` index labels from each group and look the rows up by label
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace), :]
df.groupby('Group_Id', as_index=False).apply(fn)
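The same idea (draw index labels per group with replacement, then look the rows up) can be sketched with NumPy's Generator interface and a plain loop over the groups. This is an illustrative variant, not the answer's code; the names rng, picked and result are made up for the example, and the seed is there only for repeatability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
    "Group_Id": [1, 1, 2, 2, 3, 3],
})

rng = np.random.default_rng(0)  # seeded only for repeatability
size = 2                        # rows to draw per group

# Draw `size` index labels from each group with replacement,
# then select all drawn rows from the original frame by label.
picked = np.concatenate([
    rng.choice(g.index, size=size, replace=True)
    for _, g in df.groupby("Group_Id")
])
result = df.loc[picked]
print(result)
```

Because the draw is with replacement, the same row can appear more than once within a group.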

Solution 3 - Python

There are two ways to do this very simply, one without using anything except basic pandas syntax:

df[['x','y']].groupby('x').agg(pd.DataFrame.sample)

This takes 14.4 ms on a 50k-row dataset.

The other, slightly faster method, involves numpy.

df[['x','y']].groupby('x').agg(np.random.choice)

This takes 10.9 ms on the same 50k-row dataset.

Generally speaking, when using pandas, it's preferable to stick with its native syntax, especially for beginners.

Solution 4 - Python

Using groupby and random.choice in an elegant one-liner (after import random):

import random
df.groupby('Group_Id').apply(lambda x: x.iloc[random.choice(range(0, len(x)))])

Solution 5 - Python

To randomly select just one row per group, shuffle the whole frame and keep the first row of each group:

df.sample(frac = 1.0).groupby('Group_Id').head(1)
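The shuffle-then-head approach can be sketched step by step as follows. The intermediate names shuffled and one_per_group are introduced only for illustration, and random_state is fixed solely so the example is repeatable.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
    "Group_Id": [1, 1, 2, 2, 3, 3],
})

# sample(frac=1.0) returns every row, in random order.
shuffled = df.sample(frac=1.0, random_state=42)

# head(1) keeps the first row encountered for each group,
# which after shuffling is a randomly chosen one.
one_per_group = shuffled.groupby("Group_Id").head(1)
print(one_per_group.sort_values("Group_Id"))
```

Using head(k) instead of head(1) keeps up to k rows per group, without replacement.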

Solution 6 - Python

Solution 7 - Python

The solutions above that sample without replacement raise an error if a group has fewer rows than the desired sample size n. Capping the sample size at the group's length addresses this problem:

n = 10
df.groupby('Group_Id').apply(lambda x: x.sample(min(n,len(x)))).reset_index(drop=True)
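A minimal sketch of the same capping idea as a reusable helper, written with a loop over the groups instead of apply. The function name sample_up_to and the uneven example frame are hypothetical and only serve the illustration.

```python
import pandas as pd

def sample_up_to(df, by, n, random_state=None):
    """Return at most n randomly chosen rows from each group of df[by]."""
    parts = [
        # Cap the sample size at the group's length so small groups do not fail.
        g.sample(min(n, len(g)), random_state=random_state)
        for _, g in df.groupby(by)
    ]
    return pd.concat(parts).reset_index(drop=True)

# Group 1 has three rows, group 2 has only one.
uneven = pd.DataFrame({
    "Name": ["AAA", "ABC", "ABD", "CCC"],
    "Group_Id": [1, 1, 1, 2],
})
result = sample_up_to(uneven, "Group_Id", n=2, random_state=0)
print(result)
```

With n=2, group 1 contributes two rows and group 2 contributes its single row.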

Solution 8 - Python

A very pandas-ish way:

n = 1  # number of rows to draw per group (example value)
takesamp = lambda d: d.sample(n)
df = df.groupby('Group_Id').apply(takesamp)

Solution 9 - Python

Using random.choice, you can do something like this:

import random
name_group = {'AAA': 1, 'ABC': 1, 'CCC': 2, 'XYZ': 2, 'DEF': 3, 'YYH': 3}

names = list(name_group)  # create a list out of the keys in the name_group dict

first_name = random.choice(names)
first_group = name_group[first_name]
print(first_name, first_group)

> random.choice(seq)
>
> Return a random element from the non-empty sequence seq. If seq is empty, raises IndexError.

Solution 10 - Python

You can use a combination of pandas.groupby, pandas.concat and random.sample:

import pandas as pd
import random

df = pd.DataFrame({
        'Name': ['AAA', 'ABC', 'CCC', 'XYZ', 'DEF', 'YYH'],
        'Group_ID': [1,1,2,2,3,3]
     })

grouped = df.groupby('Group_ID')
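# .loc replaces the removed .ix indexer; random.sample needs a real sequence
df_sampled = pd.concat([d.loc[random.sample(list(d.index), 1)] for _, d in grouped]).reset_index(drop=True)
print(df_sampled)

Output (the sampled rows vary from run to run):

  Name  Group_ID
0  AAA         1
1  XYZ         2
2  DEF         3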

Solution 11 - Python

I found another one:

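import numpy as np

size = 2
# Assumes the rows of each group are contiguous in df (e.g. df is sorted by Group_Id)
count_s = df.groupby('Group_Id', sort=False).size()
# Position of each group's first row: cumulative count of all preceding groups
offsets = count_s.cumsum().shift(fill_value=0)
df.iloc[np.concatenate([offset + np.random.choice(count, size)
                        for count, offset in zip(count_s, offsets)])]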

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type          Original Author  Original Content on Stack Overflow
Question              Plug4            View Question on Stack Overflow
Solution 1 - Python   Zero             View Answer on Stack Overflow
Solution 2 - Python   behzad.nouri     View Answer on Stack Overflow
Solution 3 - Python   mikkokotila      View Answer on Stack Overflow
Solution 4 - Python   grasshopper      View Answer on Stack Overflow
Solution 5 - Python   ihadanny         View Answer on Stack Overflow
Solution 6 - Python   user3826929      View Answer on Stack Overflow
Solution 7 - Python   Reveille         View Answer on Stack Overflow
Solution 8 - Python   Selah            View Answer on Stack Overflow
Solution 9 - Python   gravetii         View Answer on Stack Overflow
Solution 10 - Python  YS-L             View Answer on Stack Overflow
Solution 11 - Python  ansev            View Answer on Stack Overflow