Remove duplicate rows of a numpy array

Python, NumPy

Python Problem Overview


How can I remove duplicate rows from a two-dimensional NumPy array?

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])

The answer should be as follows:

ans = array([[1,8,3,3,4],
             [1,8,9,9,4]])

If there are two rows that are the same, then I would like to remove one "duplicate" row.

Python Solutions


Solution 1 - Python

You can use np.unique. Because you want unique rows rather than unique elements, each row has to be treated as a single item, for example by converting it to a tuple:

import numpy as np

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])

Applying np.unique directly to the data array flattens it and returns the unique elements:

>>> np.unique(data)
array([1, 3, 4, 8, 9])

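That is the set of unique elements, not the set of unique rows. Passing a list of row tuples to np.unique does not help either, because NumPy converts the list back into a 2-D array and flattens it again. Converting each row to a tuple and deduplicating the tuples with a set does work:

new_array = [tuple(row) for row in data]
uniques = np.array(sorted(set(new_array)))

which gives:

>>> uniques
array([[1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4]])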

UPDATE

Since NumPy 1.13, np.unique accepts an axis argument, so the unique rows can be obtained directly with np.unique(data, axis=0).
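As a minimal sketch of the axis-based call, the example below also uses return_index to recover the order in which rows first appear, since np.unique returns rows in sorted order. The input array and the variable names are illustrative and not part of the original answer.

```python
import numpy as np

data = np.array([[1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4]])

# Unique rows, returned in lexicographically sorted order.
unique_sorted = np.unique(data, axis=0)

# return_index gives the index of the first occurrence of each unique row;
# sorting those indices restores the original order of appearance.
_, first_idx = np.unique(data, axis=0, return_index=True)
unique_in_order = data[np.sort(first_idx)]

print(unique_sorted)
print(unique_in_order)
```

Here unique_sorted starts with the row [1, 8, 3, 3, 4], while unique_in_order keeps [1, 8, 9, 9, 4] first because that row appears first in data.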

Solution 2 - Python

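One approach uses lex-sorting: sort the rows so that duplicates become adjacent, then keep every row that differs from the one before it -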

# Perform lex sort and get sorted data
sorted_idx = np.lexsort(data.T)
sorted_data =  data[sorted_idx,:]

# Get unique row mask
row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))

# Get unique rows
out = sorted_data[row_mask]

Sample run -

In [199]: data
Out[199]: 
array([[1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 0, 3, 4],
       [1, 8, 9, 9, 4]])

In [200]: sorted_idx = np.lexsort(data.T)
     ...: sorted_data =  data[sorted_idx,:]
     ...: row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))
     ...: out = sorted_data[row_mask]
     ...: 

In [201]: out
Out[201]: 
array([[1, 8, 0, 3, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4]])
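The steps above can be wrapped in a small function. This is a sketch under a few assumptions: the function name unique_rows_lexsort is hypothetical, the empty-input guard is an addition, and the result is in lexsort order (last column is the primary sort key), not in the original row order.

```python
import numpy as np

def unique_rows_lexsort(data):
    """Return the unique rows of a 2-D array, in lexsort order."""
    data = np.asarray(data)
    if data.shape[0] == 0:
        # Nothing to deduplicate; avoid building a mask longer than the data.
        return data.copy()
    # lexsort treats the LAST key as the primary sort key, so rows end up
    # ordered by the last column first; duplicates are adjacent either way.
    sorted_data = data[np.lexsort(data.T)]
    # A row is kept if it differs from the row before it in any column.
    changed = np.any(np.diff(sorted_data, axis=0) != 0, axis=1)
    row_mask = np.concatenate(([True], changed))
    return sorted_data[row_mask]

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4],
                 [1, 8, 3, 3, 4],
                 [1, 8, 0, 3, 4],
                 [1, 8, 9, 9, 4]])

out = unique_rows_lexsort(data)
print(out)
```

The first row of the sorted data is always kept, which is why the mask starts with True.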

Runtime tests -

This section times the approaches proposed in the solutions on this page (unique_based is the function from Solution 3).

In [34]: data = np.random.randint(0,10,(10000,10))

In [35]: def tuple_based(data):
    ...:     new_array = [tuple(row) for row in data]
    ...:     return np.unique(new_array)
    ...: 
    ...: def lexsort_based(data):                 
    ...:     sorted_data =  data[np.lexsort(data.T),:]
    ...:     row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))
    ...:     return sorted_data[row_mask]
    ...: 
    ...: def unique_based(a):
    ...:     a = np.ascontiguousarray(a)
    ...:     unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    ...:     return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))
    ...: 

In [36]: %timeit tuple_based(data)
10 loops, best of 3: 63.1 ms per loop

In [37]: %timeit lexsort_based(data)
100 loops, best of 3: 8.92 ms per loop

In [38]: %timeit unique_based(data)
10 loops, best of 3: 29.1 ms per loop
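To repeat the comparison outside IPython, something like the following sketch may help. It uses the standard timeit module, adds np.unique(data, axis=0) as a further candidate (an addition, requiring NumPy 1.13 or later), and checks that the functions agree before timing them. The seed, the repeat count, and the helper as_row_set are arbitrary choices made for illustration; timings will vary with hardware and NumPy version.

```python
import timeit
import numpy as np

def lexsort_based(data):
    sorted_data = data[np.lexsort(data.T), :]
    row_mask = np.append([True], np.any(np.diff(sorted_data, axis=0), 1))
    return sorted_data[row_mask]

def unique_based(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)] * a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

def axis_based(data):
    # Not part of the original timings; requires NumPy 1.13 or later.
    return np.unique(data, axis=0)

def as_row_set(rows):
    # Hypothetical helper: compare results regardless of row order.
    return {tuple(r) for r in rows.tolist()}

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
data = rng.integers(0, 10, size=(10000, 10))

# The functions may order rows differently, so compare them as sets.
expected = as_row_set(axis_based(data))
for func in (lexsort_based, unique_based, axis_based):
    assert as_row_set(func(data)) == expected
    seconds = timeit.timeit(lambda: func(data), number=10) / 10
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```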

Solution 3 - Python

A simple solution can be:

import numpy as np

def unique_rows(a):
    # View each row as one structured element so np.unique compares whole rows.
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    # Convert the structured result back to a plain 2-D array.
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])

print(unique_rows(data))
# prints:
# [[1 8 3 3 4]
#  [1 8 9 9 4]]
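To make the view trick in unique_rows easier to follow, the sketch below runs the same steps one at a time and prints the intermediate shape. The intermediate variable names are illustrative and not from the original answer.

```python
import numpy as np

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4]])

a = np.ascontiguousarray(data)

# One unnamed field per column; NumPy names them f0, f1, ... automatically.
row_dtype = np.dtype([('', a.dtype)] * a.shape[1])

# Reinterpret each row's bytes as a single structured element: (3, 5) -> (3, 1).
rows_as_records = a.view(row_dtype)
print(rows_as_records.shape)

# np.unique now compares whole rows, because each element is one row.
unique_records = np.unique(rows_as_records)

# View the records as plain integers again and restore the column count.
unique_rows = unique_records.view(a.dtype).reshape(-1, a.shape[1])
print(unique_rows)
```

The call to np.ascontiguousarray matters because the view reinterprets memory directly, so each row must occupy one contiguous run of bytes.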

You can check this for many more solutions for this problem

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Roman | View Question on Stack Overflow
Solution 1 - Python | Srivatsan | View Answer on Stack Overflow
Solution 2 - Python | Divakar | View Answer on Stack Overflow
Solution 3 - Python | omerbp | View Answer on Stack Overflow