Is there any numpy group by function?

Tags: Python, Arrays, Numpy

Python Problem Overview


Is there any function in numpy to group the array below by its first column?

I couldn't find any good answer on the internet.

>>> a
array([[  1, 275],
       [  1, 441],
       [  1, 494],
       [  1, 593],
       [  2, 679],
       [  2, 533],
       [  2, 686],
       [  3, 559],
       [  3, 219],
       [  3, 455],
       [  4, 605],
       [  4, 468],
       [  4, 692],
       [  4, 613]])

Wanted output:

array([[[275, 441, 494, 593]],
       [[679, 533, 686]],
       [[559, 219, 455]],
       [[605, 468, 692, 613]]], dtype=object)

Python Solutions


Solution 1 - Python

Inspired by Eelco Hoogendoorn's library, but without his library, and using the fact that the first column of your array is always increasing (if not, sort first with a = a[a[:, 0].argsort()]):

>>> np.split(a[:,1], np.unique(a[:, 0], return_index=True)[1][1:])
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]

I didn't "timeit" ([EDIT] see below), but this is probably the fastest way to solve the problem (a reusable wrapper is sketched right after this list):

  • No native Python loop
  • The result lists are numpy arrays, so if you need to run other numpy operations on them, no conversion is needed
  • Complexity looks O(n) (O(n log(n)) with the sort)
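
As a minimal sketch, here is that wrapper (the helper name groupby_first_col is mine, not part of the original answer); it sorts first in case the first column is not already increasing:

import numpy as np

def groupby_first_col(a):
    # Sort rows by the first column so equal keys end up contiguous.
    a = a[a[:, 0].argsort()]
    # Start index of each new key; drop the leading 0 so np.split
    # does not emit an empty first chunk.
    split_idx = np.unique(a[:, 0], return_index=True)[1][1:]
    return np.split(a[:, 1], split_idx)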

[EDIT Sept 2021] I ran timeit on my MacBook M1 for a table of 10k random integers. The durations are for 1000 calls.

>>> a = np.random.randint(5, size=(10000, 2))  # 5 different "groups"

# Only the sort
>>> a = a[a[:, 0].argsort()]
⏱ 116.9 ms

# Group by on the already sorted table
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 35.5 ms

# Total sort + groupby
>>> a = a[a[:, 0].argsort()]
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 153.0 ms 👑

# With numpy-indexed package (cf Eelco answer)
>>> npi.group_by(a[:, 0]).split(a[:, 1])
⏱ 353.3 ms

# With pandas (cf Piotr answer)
>>> df = pd.DataFrame(a, columns=["key", "val"]) # no timer for this line
>>> df.groupby("key").val.apply(pd.Series.tolist) 
⏱ 362.3 ms

# With defaultdict, the python native way (cf Piotr answer)
>>> d = defaultdict(list)
for key, val in a:
    d[key].append(val)
⏱ 3543.2 ms

# With numpy_groupies (cf Michael answer)
>>> aggregate(a[:,0], a[:,1], "array", fill_value=[])
⏱ 376.4 ms

Second timeit scenario, with 500 different groups instead of 5. I'm surprised about pandas: I ran it several times, but it just behaves badly in this scenario.

>>> a = np.random.randint(500, size=(10000, 2))

just the sort  141.1 ms
already_sorted 392.0 ms
sort+groupby   542.4 ms
pandas        2695.8 ms
numpy-indexed  800.6 ms
defaultdict   3707.3 ms
numpy_groupies 836.7 ms

[EDIT] I improved the answer thanks to ns63sr's answer and Behzad Shayegh (cf. comment). Thanks also to TMBailey for noticing that the complexity of argsort is O(n log(n)).

Solution 2 - Python

The numpy_indexed package (disclaimer: I am its author) aims to fill this gap in numpy. All operations in numpy-indexed are fully vectorized, and no O(n^2) algorithms were harmed during the making of this library.

import numpy_indexed as npi
npi.group_by(a[:, 0]).split(a[:, 1])

Note that it is usually more efficient to compute relevant properties directly over such groups (i.e., group_by(keys).mean(values)) rather than first splitting into a list / jagged array.
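
For example, a minimal sketch of that pattern, using the same a as in the question (check the numpy_indexed documentation for the exact return convention):

import numpy_indexed as npi

# Reduce per group instead of materializing the groups:
# returns the unique keys together with the per-group means.
unique_keys, group_means = npi.group_by(a[:, 0]).mean(a[:, 1])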

Solution 3 - Python

Numpy is not very handy here because the desired output is not an array of integers (it is an array of list objects).

I suggest either the pure Python way...

from collections import defaultdict

%%timeit
d = defaultdict(list)
for key, val in a:
    d[key].append(val)
10.7 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# result:
defaultdict(list,
        {1: [275, 441, 494, 593],
         2: [679, 533, 686],
         3: [559, 219, 455],
         4: [605, 468, 692, 613]})

...or the pandas way:

import pandas as pd

%%timeit
df = pd.DataFrame(a, columns=["key", "val"])
df.groupby("key").val.apply(pd.Series.tolist)
979 µs ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# result:
key
1    [275, 441, 494, 593]
2         [679, 533, 686]
3         [559, 219, 455]
4    [605, 468, 692, 613]
Name: val, dtype: object
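
If you need the list-of-arrays shape from the question rather than a dict or a pandas Series, either result is easy to convert; a minimal sketch (the names grouped and series are mine):

import numpy as np

# From the defaultdict, in ascending key order:
grouped = [np.array(d[key]) for key in sorted(d)]

# From the pandas result, i.e. series = df.groupby("key").val.apply(pd.Series.tolist):
grouped = [np.array(lst) for lst in series]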

Solution 4 - Python

n = np.unique(a[:, 0])
np.array([list(a[a[:, 0] == i, 1]) for i in n])

outputs:

array([[275, 441, 494, 593], [679, 533, 686], [559, 219, 455],
       [605, 468, 692, 613]], dtype=object)
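
Note that on newer NumPy releases (1.24+), building a ragged array without an explicit dtype raises an error instead of a warning; a minimal adaptation, assuming the groups have unequal lengths so that an object array of lists is produced:

n = np.unique(a[:, 0])
# dtype=object keeps each group as a Python list inside the array.
np.array([list(a[a[:, 0] == i, 1]) for i in n], dtype=object)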

Solution 5 - Python

Simplifying the answer of Vincent J and considering the comment of HS-nebula, one can use return_index=True instead of return_counts=True and get rid of the cumsum:

np.split(a[:,1], np.unique(a[:,0], return_index=True)[1])[1:]

(The first index returned by np.unique is always 0, so the split's first chunk is empty and the trailing [1:] drops it.)

Output

[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]

Solution 6 - Python

I used np.unique() followed by np.extract():

unique = np.unique(a[:, 0])
answer = []
for element in unique:
    # Boolean mask of the rows belonging to the current group
    present = a[:, 0] == element
    # Pull out the matching values from the last column
    answer.append(np.extract(present, a[:, -1]))
print(answer)

[array([275, 441, 494, 593]), array([679, 533, 686]), array([559, 219, 455]), array([605, 468, 692, 613])]

Solution 7 - Python

Given X as the array of items you want grouped and y (a 1D array) as the corresponding group labels, the following function does the grouping with numpy:

def groupby(X, y):
    y = np.asarray(y)
    X = np.asarray(X)
    y_uniques = np.unique(y)
    return [X[y==yi] for yi in y_uniques]

So, groupby(a[:,1], a[:,0]) returns [array([275, 441, 494, 593]), array([679, 533, 686]), array([559, 219, 455]), array([605, 468, 692, 613])]

Solution 8 - Python

We might also find it useful to generate a dict:

def groupby(X): 
    X = np.asarray(X) 
    x_uniques = np.unique(X) 
    return {xi:X[X==xi] for xi in x_uniques} 

Let's try it out:

X=[1,1,2,2,3,3,3,3,4,5,6,7,7,8,9,9,1,1,1]
groupby(X)                                                                                                      
Out[9]: 
{1: array([1, 1, 1, 1, 1]),
 2: array([2, 2]),
 3: array([3, 3, 3, 3]),
 4: array([4]),
 5: array([5]),
 6: array([6]),
 7: array([7, 7]),
 8: array([8]),
 9: array([9, 9])}

Note that this by itself is not super compelling, but if we make X an object or namedtuple and then provide a groupby function, it becomes more interesting. I will put that in later.

Solution 9 - Python

Late to the party, but anyway: if you plan not only to group the arrays but also to do operations on them like sum, mean and so on, and you're doing this with speed in mind, you might also want to consider numpy_groupies. All those group operations are optimized and JIT-compiled with numba, and they easily outperform the other solutions mentioned here.

from numpy_groupies.aggregate_numpy import aggregate
aggregate(a[:,0], a[:,1], "array", fill_value=[])
>>> array([array([], dtype=int64), array([275, 441, 494, 593]),
           array([679, 533, 686]), array([559, 219, 455]),
           array([605, 468, 692, 613])], dtype=object)
aggregate(a[:,0], a[:,1], "sum")
>>> array([   0, 1803, 1898, 1233, 2378])
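
Other reductions follow the same call pattern; for instance a per-group mean, "mean" being one of numpy_groupies' built-in aggregations (a sketch, same a as above):

aggregate(a[:, 0], a[:, 1], "mean")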

Solution 10 - Python

It becomes pretty apparent that a = a[a[:, 0].argsort()] is the bottleneck of all the competitive grouping algorithms, big thanks to Vincent J for clarifying this. Over 80% of the processing time is spent in this argsort call, and there's no easy way to replace or optimise it. The numba package can speed up a lot of algorithms and, hopefully, argsort will attract some of that effort in the future. The remaining part of the grouping can be improved significantly, assuming the indices in the first column are small.

TL;DR

The remaining part of the majority of grouping methods relies on np.unique, which is quite slow and excessive in cases where the group values are small. It's more efficient to replace it with np.bincount, which can later be improved further with numba. Here are some results of how the remaining part could be improved:

def _custom_return(unique_id, a, split_idx, return_groups):
    '''Choose if you want to also return unique ids'''
    if return_groups:
        return unique_id, np.split(a[:,1], split_idx)
    else: 
        return np.split(a[:,1], split_idx)

def numpy_groupby_index(a, return_groups=False):
    '''Code refactor of method of Vincent J'''
    u, idx = np.unique(a[:,0], return_index=True) 
    return _custom_return(u, a, idx[1:], return_groups)

def numpy_groupby_counts(a, return_groups=False):
    '''Use cumsum of counts instead of index'''
    u, counts = np.unique(a[:,0], return_counts=True)
    idx = np.cumsum(counts)
    return _custom_return(u, a, idx[:-1], return_groups)

def numpy_groupby_diff(a, return_groups=False):
    '''No use of any np.unique options'''
    u = np.unique(a[:,0])
    idx = np.flatnonzero(np.diff(a[:,0])) + 1
    return _custom_return(u, a, idx, return_groups)

def numpy_groupby_bins(a, return_groups=False):  
    '''Replace np.unique by np.bincount'''
    bins = np.bincount(a[:,0])
    nonzero_bins_idx = bins != 0
    nonzero_bins = bins[nonzero_bins_idx]
    idx = np.cumsum(nonzero_bins[:-1])
    return _custom_return(np.flatnonzero(nonzero_bins_idx), a, idx, return_groups)

def numba_groupby_bins(a, return_groups=False):  
    '''Replace np.bincount by numba_bincount'''
    bins = numba_bincount(a[:,0])
    nonzero_bins_idx = bins != 0
    nonzero_bins = bins[nonzero_bins_idx]
    idx = np.cumsum(nonzero_bins[:-1])
    return _custom_return(np.flatnonzero(nonzero_bins_idx), a, idx, return_groups)

So numba_bincount works in the same way as np.bincount and it's defined like so:

from numba import njit

@njit
def _numba_bincount(a, counts, m):
    for i in range(m):
        counts[a[i]] += 1

def numba_bincount(arr): #just a refactor of Python count
    M = np.max(arr)
    counts = np.zeros(M + 1, dtype=int)
    _numba_bincount(arr, counts, len(arr))
    return counts

Usage:

a = np.array([[1,275],[1,441],[1,494],[1,593],[2,679],[2,533],[2,686],[3,559],[3,219],[3,455],[4,605],[4,468],[4,692],[4,613]])
a = a[a[:, 0].argsort()]
>>> numpy_groupby_index(a, return_groups=False)
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]
>>> numpy_groupby_index(a, return_groups=True)
(array([1, 2, 3, 4]),
 [array([275, 441, 494, 593]),
  array([679, 533, 686]),
  array([559, 219, 455]),
  array([605, 468, 692, 613])])
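
The other variants are drop-in replacements and are called the same way, e.g. (same sorted a as above):

>>> numba_groupby_bins(a, return_groups=True)  # should match numpy_groupby_index's output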

Performance tests

It takes ~30 seconds to sort 100M items on my computer (with 10 distinct IDs). Let's test how much time the methods for the remaining part take to run:

%matplotlib inline
benchit.setparams(rep=3)

sizes = [3*10**(i//2) if i%2 else 10**(i//2) for i in range(17)]
N = sizes[-1]
x1 = np.random.randint(0,10, size=N)
x2 = np.random.normal(loc=500, scale=200, size=N).astype(int)
a = np.transpose([x1, x2])

arr = a[a[:, 0].argsort()]
fns = [numpy_groupby_index, numpy_groupby_counts, numpy_groupby_diff, numpy_groupby_bins, numba_groupby_bins]
in_ = {s/1000000: (arr[:s], ) for s in sizes}
t = benchit.timings(fns, in_, multivar=True, input_name='Millions of events')
t.plot(logx=True, figsize=(12, 6), fontsize=14)

[Benchmark plot: runtime of each grouping function versus millions of events]

No doubt the numba-powered bincount is the new winner for datasets containing small IDs. It speeds up the grouping of already-sorted data by ~5x, which is ~10% of the total runtime.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type          Original Author      Original Content on Stackoverflow
Question              John Dow             View Question on Stackoverflow
Solution 1 - Python   Vincent J            View Answer on Stackoverflow
Solution 2 - Python   Eelco Hoogendoorn    View Answer on Stackoverflow
Solution 3 - Python   Piotr                View Answer on Stackoverflow
Solution 4 - Python   Gioelelm             View Answer on Stackoverflow
Solution 5 - Python   ns63sr               View Answer on Stackoverflow
Solution 6 - Python   user2251346          View Answer on Stackoverflow
Solution 7 - Python   Guido Mocha          View Answer on Stackoverflow
Solution 8 - Python   WestCoastProjects    View Answer on Stackoverflow
Solution 9 - Python   Michael              View Answer on Stackoverflow
Solution 10 - Python  mathfux              View Answer on Stackoverflow