Find closest value in a vector with binary search

R Problem Overview

As a silly toy example, suppose

x=4.5
w=c(1,2,4,6,7)

I wonder if there is a simple R function that finds the index of the closest match to x in w. So if foo is that function, foo(w,x) would return 3. The function match is the right idea, but seems to apply only for exact matches.

Solutions here (e.g. which.min(abs(w - x)), which(abs(w-x)==min(abs(w-x))), etc.) are all O(n) instead of log(n) (I'm assuming that w is already sorted).

R Solutions

Solution 1 - R

R>findInterval(4.5, c(1,2,4,5,6))
[1] 3

will do that with price-is-right matching (closest without going over).

Solution 2 - R

You can use data.table to do a binary search:

dt = data.table(w, val = w) # you'll see why val is needed in a sec
setattr(dt, "sorted", "w")  # let data.table know that w is sorted

Note that if the column w isn't already sorted, then you'll have to use setkey(dt, w) instead of setattr(.).

# binary search and "roll" to the nearest neighbour
dt[J(x), roll = "nearest"]
#     w val
#1: 4.5   4

In the final expression the val column will have the you're looking for.

# or to get the index as Josh points out
# (and then you don't need the val column):
dt[J(x), .I, roll = "nearest", by = .EACHI]
#     w .I
#1: 4.5  3

# or to get the index alone
dt[J(x), roll = "nearest", which = TRUE]
#[1] 3

Solution 3 - R

See match.closest() from the MALDIquant package:

> library(MALDIquant)
> match.closest(x, w)
[1] 3

Solution 4 - R

x = 4.5
w = c(1,2,4,6,7)

closestLoc = which(min(abs(w-x)))
closestVal = w[which(min(abs(w-x)))]

# On my phone- please pardon typos

If your vector is lengthy, try a 2-step approach:

x = 4.5
w = c(1,2,4,6,7)

sdev = sapply(w,function(v,x) abs(v-x), x = x)
closestLoc = which(min(sdev))

for maddeningly long vectors (millions of rows!, warning- this will actually be slower for data which is not very, very, very large.)

require(doMC)
registerDoMC()

closestLoc = which(min(foreach(i = w) %dopar% {
   abs(i-x)
}))

This example is just to give you a basic idea of leveraging parallel processing when you have huge data. Note, I do not recommend you use it for simple & fast functions like abs().

Solution 5 - R

To do this on character vectors, Martin Morgan suggested this function on R-help:

bsearch7 <-
     function(val, tab, L=1L, H=length(tab))
{
     b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
     i0 <- seq_along(val)
     repeat {
         updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
         tabM <- tab[M]
         val0 <- val[i0]
         i <- tabM < val0
         updt[i] <- M[i] + 1L
         i <- tabM > val0
         updt[i] <- M[i] - 1L
         b[i0 + i * length(val)] <- updt
         i0 <- which(b[i0, "H"] >= b[i0, "L"])
         if (!length(i0)) break;
     }
     b[,"L"] - 1L
}

Solution 6 - R

NearestValueSearch = function(x, w){
  ## A simple binary search algo
  ## Assume the w vector is sorted so we can use binary search
  left = 1
  right = length(w)
  while(right - left > 1){
    middle = floor((left + right) / 2)
    if(x < w[middle]){
      right = middle
    }
    else{
      left = middle
    }
  }
  if(abs(x - w[right]) < abs(x - w[left])){
    return(right)
  }
  else{
    return(left)
  }
}


x = 4.5
w = c(1,2,4,6,7)
NearestValueSearch(x, w) # return 3

Solution 7 - R

Based on @neal-fultz answer, here is a simple function that uses findInterval():

get_closest_index <- function(x, vec){
  # vec must be sorted
  iv <- findInterval(x, vec)
  dist_left <- x - vec[ifelse(iv == 0, NA, iv)]
  dist_right <- vec[iv + 1] - x
  ifelse(! is.na(dist_left) & (is.na(dist_right) | dist_left < dist_right), iv, iv + 1)
}
values <- c(-15, -0.01, 3.1, 6, 10, 100)
grid <- c(-2, -0.1, 0.1, 3, 7)
get_closest_index(values, grid)
#> [1] 1 2 4 5 5 5

^{Created on 2020-05-29 by the reprex package (v0.3.0)}

Solution 8 - R

You can always implement custom binary search algorithm to find the closest value. Alternately, you can leverage standard implementation of libc bsearch(). You can use other binary search implementations as well, but it does not change the fact that you have to implement the comparing function carefully to find the closest element in array. The issue with standard binary search implementation is that it is meant for exact comparison. That means your improvised comparing function needs to do some kind of exactification to figure out if an element in array is close-enough. To achieve it, the comparing function needs to have awareness of other elements in the array, especially following aspects:

position of the current element (one which is being compared with the key).
the distance with key and how it compares with neighbors (previous or next element).

To provide this extra knowledge in comparing function, the key needs to be packaged with additional information (not just the key value). Once the comparing function have awareness on these aspects, it can figure out if the element itself is closest. When it knows that it is the closest, it returns "match".

The the following C code finds the closest value.

#include <stdio.h>
#include <stdlib.h>

struct key {
        int key_val;
        int *array_head;
        int array_size;
};

int compar(const void *k, const void *e) {
        struct key *key = (struct key*)k;
        int *elem = (int*)e;
        int *arr_first = key->array_head;
        int *arr_last = key->array_head + key->array_size -1;
        int kv = key->key_val;
        int dist_left;
        int dist_right;

        if (kv == *elem) {
                /* easy case: if both same, got to be closest */
                return 0;
        } else if (key->array_size == 1) {
                /* easy case: only element got to be closest */
                return 0;
        } else if (elem == arr_first) {
                /* element is the first in array */
                if (kv < *elem) {
                        /* if keyval is less the first element then
                         * first elem is closest.
                         */
                        return 0;
                } else {
                        /* check distance between first and 2nd elem.
                         * if distance with first elem is smaller, it is closest.
                         */
                        dist_left = kv - *elem;
                        dist_right = *(elem+1) - kv;
                        return (dist_left <= dist_right) ? 0:1;
                }
        } else if (elem == arr_last) {
                /* element is the last in array */
                if (kv > *elem) {
                        /* if keyval is larger than the last element then
                         * last elem is closest.
                         */
                        return 0;
                } else {
                        /* check distance between last and last-but-one.
                         * if distance with last elem is smaller, it is closest.
                         */
                        dist_left = kv - *(elem-1);
                        dist_right = *elem - kv;
                        return (dist_right <= dist_left) ? 0:-1;
                }
        }

        /* condition for remaining cases (other cases are handled already):
         * - elem is neither first or last in the array
         * - array has atleast three elements.
         */

        if (kv < *elem) {
                /* keyval is smaller than elem */

                if (kv <= *(elem -1)) {
                        /* keyval is smaller than previous (of "elem") too.
                         * hence, elem cannot be closest.
                         */
                        return -1;
                } else {
                        /* check distance between elem and elem-prev.
                         * if distance with elem is smaller, it is closest.
                         */
                        dist_left = kv - *(elem -1);
                        dist_right = *elem - kv;
                        return (dist_right <= dist_left) ? 0:-1;
                }
        }

        /* remaining case: (keyval > *elem) */

        if (kv >= *(elem+1)) {
                /* keyval is larger than next (of "elem") too.
                 * hence, elem cannot be closest.
                 */
                return 1;
        }

        /* check distance between elem and elem-next.
         * if distance with elem is smaller, it is closest.
         */
        dist_right = *(elem+1) - kv;
        dist_left = kv - *elem;
        return (dist_left <= dist_right) ? 0:1;
}


int main(int argc, char **argv) {
        int arr[] = {10, 20, 30, 40, 50, 60, 70};
        int *found;
        struct key k;

        if (argc < 2) {
                return 1;
        }

        k.key_val = atoi(argv[1]);
        k.array_head = arr;
        k.array_size = sizeof(arr)/sizeof(int);

        found = (int*)bsearch(&k, arr, sizeof(arr)/sizeof(int), sizeof(int),
                compar);

        if(found) {
                printf("found closest: %d\n", *found);
        } else {
                printf("closest not found. absurd! \n");
        }

        return 0;
}

Needless to say that bsearch() in above example should never fail (unless the array size is zero).

If you implement your own custom binary search, essentially you have to embed same comparing logic in the main body of binary search code (instead of having this logic in comparing function in above example).

Content Type	Original Author	Original Content on Stackoverflow
Question	zkurtz	View Question on Stackoverflow
Solution 1 - R	Neal Fultz	View Answer on Stackoverflow
Solution 2 - R	eddi	View Answer on Stackoverflow
Solution 3 - R	Sam Firke	View Answer on Stackoverflow
Solution 4 - R	jackStinger	View Answer on Stackoverflow
Solution 5 - R	Neal Fultz	View Answer on Stackoverflow
Solution 6 - R	LC94	View Answer on Stackoverflow
Solution 7 - R	const-ae	View Answer on Stackoverflow
Solution 8 - R	Pralay	View Answer on Stackoverflow