What is the most efficient way to check if a value exists in a NumPy array?

Python Problem Overview

I have a very large NumPy array

I want to check to see if a value exists in the 1st column of the array. I've got a bunch of homegrown ways (e.g. iterating through each row and checking), but given the size of the array I'd like to find the most efficient method.

Thanks!

Python Solutions

Solution 1 - Python

How about

if value in my_array[:, col_num]:
    do_whatever

Edit: I think __contains__ is implemented in such a way that this is the same as @detly's version

Solution 2 - Python

The most obvious to me would be:

np.any(my_array[:, 0] == value)

Solution 3 - Python

To check multiple values, you can use numpy.in1d(), which is an element-wise function version of the python keyword in. If your data is sorted, you can use numpy.searchsorted():

import numpy as np
data = np.array([1,4,5,5,6,8,8,9])
values = [2,3,4,6,7]
print np.in1d(values, data)

index = np.searchsorted(data, values)
print data[index] == values

Solution 4 - Python

Fascinating. I needed to improve the speed of a series of loops that must perform matching index determination in this same way. So I decided to time all the solutions here, along with some riff's.

Here are my speed tests for Python 2.7.10:

import timeit
timeit.timeit('N.any(N.in1d(sids, val))', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

18.86137104034424

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = [20010401010101+x for x in range(1000)]')

15.061666011810303

timeit.timeit('N.in1d(sids, val)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

11.613027095794678

timeit.timeit('N.any(val == sids)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

7.670552015304565

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

5.610057830810547

timeit.timeit('val == sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

1.6632978916168213

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = set([20010401010101+x for x in range(1000)])')

0.0548710823059082

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = dict(zip([20010401010101+x for x in range(1000)],[True,]*1000))')

0.054754018783569336

Very surprising! Orders of magnitude difference!

To summarize, if you just want to know whether something's in a 1D list or not:

19s N.any(N.in1d(numpy array))
15s x in (list)
8s N.any(x == numpy array)
6s x in (numpy array)
.1s x in (set or a dictionary)

If you want to know where something is in the list as well (order is important):

12s N.in1d(x, numpy array)
2s x == (numpy array)

Solution 5 - Python

Adding to @HYRY's answer in1d seems to be fastest for numpy. This is using numpy 1.8 and python 2.7.6.

In this test in1d was fastest, however 10 in a look cleaner:

a = arange(0,99999,3)
%timeit 10 in a
%timeit in1d(a, 10)

10000 loops, best of 3: 150 µs per loop
10000 loops, best of 3: 61.9 µs per loop

Constructing a set is slower than calling in1d, but checking if the value exists is a bit faster:

s = set(range(0, 99999, 3))
%timeit 10 in s

10000000 loops, best of 3: 47 ns per loop

Solution 6 - Python

The most convenient way according to me is:

(Val in X[:, col_num])

where Val is the value that you want to check for and X is the array. In your example, suppose you want to check if the value 8 exists in your the third column. Simply write

(8 in X[:, 2])

This will return True if 8 is there in the third column, else False.

Solution 7 - Python

If you are looking for a list of integers, you may use indexing for doing the work. This also works with nd-arrays, but seems to be slower. It may be better when doing this more than once.

def valuesInArray(values, array):
    values = np.asanyarray(values)
    array = np.asanyarray(array)
    assert array.dtype == np.int and values.dtype == np.int
    
    matches = np.zeros(array.max()+1, dtype=np.bool_)
    matches[values] = True
    
    res = matches[array]
    
    return np.any(res), res
    
    
array = np.random.randint(0, 1000, (10000,3))
values = np.array((1,6,23,543,222))

matched, matches = valuesInArray(values, array)

By using numba and njit, I could get a speedup of this by ~x10.

Content Type	Original Author	Original Content on Stackoverflow
Question	thegreatt	View Question on Stackoverflow
Solution 1 - Python	agf	View Answer on Stackoverflow
Solution 2 - Python	detly	View Answer on Stackoverflow
Solution 3 - Python	HYRY	View Answer on Stackoverflow
Solution 4 - Python	Lukas Mandrake	View Answer on Stackoverflow
Solution 5 - Python	Joelmob	View Answer on Stackoverflow
Solution 6 - Python	Loochie	View Answer on Stackoverflow
Solution 7 - Python	Simon Klein	View Answer on Stackoverflow