What are the pitfalls in implementing binary search?


Algorithm Problem Overview


Binary search is harder to implement than it looks. "Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky…" — Donald Knuth.

Which bugs are most likely to be introduced into a new binary search implementation?

Algorithm Solutions


Solution 1 - Algorithm

This question was just asked again recently. Apart from Knuth's quote that "Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky", there is the staggering historical fact (see TAOCP, Volume 3, section 6.2.1) that binary search was first published in 1946 but the first published binary search without bugs was in 1962. And there is Bentley's experience that when he assigned binary search in courses for professional programmers at places like Bell Labs and IBM and gave them two hours, everyone reported they had got it right, and on examining their code 90% of them had bugs — year after year.

Perhaps the basic reason why so many programmers make mistakes with binary search, besides Sturgeon's Law, is that they are not careful enough: Programming Pearls quotes this as the "write your code, throw it over the wall, and have Quality Assurance or Testing deal with the bugs" approach. And there is a lot of scope for error. Not just the overflow error that several of the other answers here mention, but logical errors.

Below are some examples of binary-search errors. This is by no means exhaustive. (As Tolstoy writes in Anna Karenina — "All happy families are alike; every unhappy family is unhappy in its own way" — every incorrect binary search program is incorrect in its own way.)

Pattis

The following Pascal code is taken from the paper "Textbook errors in binary searching" (1988) by Richard E. Pattis. He looked at twenty textbooks and came up with this binary search (BTW, the arrays here are indexed starting from 1, as is conventional in Pascal):

PROCEDURE BinarySearch (A         : anArray,
                        Size      : anArraySize,
                        Key       : INTEGER,
                        VAR Found : BOOLEAN;
                        VAR Index : anArrayIndex);
Var Low, High : anArrayIndex;
BEGIN         
   LOW := 1;
   High := Size;
   
   REPEAT
      Index := (Low + High) DIV 2;
      If Key < A[Index]
         THEN High := Index - 1
         ELSE Low  := Index + 1
   UNTIL (Low > High) OR (Key = A[Index]);

   FOUND := (Low <= High)
END;

Looks ok? This has more than one error. Before reading further, see if you can find them all. You should be able to guess what the code does even if you're seeing Pascal for the first time.


He describes five errors that many programs have, and in particular the above has:

Error 1: It doesn't run in O(log n) time, where n = Size. In their enthusiasm for Proper Programming Practice, some programmers write binary search as a function/procedure, and pass it an array. (This is not specific to Pascal; imagine in C++ passing a vector by value rather than by reference.) This is Θ(n) time just to pass the array to the procedure, which defeats the whole purpose. Even worse, some authors apparently give a recursive binary search, which passes an array each time, giving a running time that is Θ(n log n). (This is not far-fetched; I've actually seen code like this.)
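
To make Error 1 concrete, here is a hedged C++ sketch (mine, not from Pattis's paper); the function names are hypothetical and only the signatures matter:

#include <cstddef>
#include <vector>

// Passing the vector by value copies Theta(n) elements before the search even
// starts, which defeats the point of an O(log n) algorithm; pass by const reference instead.
bool containsSlow(std::vector<int> a, int key);          // Theta(n) copy on every call
bool containsFast(const std::vector<int>& a, int key);   // no copy

// Worse still: a recursive version that passes the whole vector by value at every
// level copies Theta(n) elements at each of Theta(log n) levels, i.e. Theta(n log n).
bool containsRecursive(std::vector<int> a, int key, std::size_t lo, std::size_t hi) {
    if (lo >= hi) return false;
    std::size_t mid = lo + (hi - lo) / 2;
    if (a[mid] == key) return true;
    return a[mid] < key ? containsRecursive(a, key, mid + 1, hi)
                        : containsRecursive(a, key, lo, mid);
}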

Error 2: It fails when size = 0. This may be ok. But depending on the intended application, the list/table being searched may shrink to 0, and it must be handled somewhere.

Error 3: It gives a wrong answer. Whenever the final iteration of the loop starts with Low=High (e.g. when Size=1), it sets Found:=False, even if Key is in the array.

Error 4: It causes an error whenever Key is less than the smallest element of the array. (After Index becomes 1, it sets High to 0, etc.; causes an out-of-bounds error.)

Error 5: It causes an error whenever Key is greater than the largest element of the array. (After Index becomes Size, it sets Low to Size+1 etc.; causes an out-of-bounds error.)

He also points out that some obvious ways to "fix" these errors turn out to be wrong as well. Real-life code also often has this property, when a programmer wrote something incorrect, found an error, and then "fixed" it until it seemed correct without thinking carefully enough.

Of the 20 textbooks he tried, only 5 had a correct binary search. In the remaining 15 (he says 16, ironically), he found 11 instances of Error 1, six instances of Error 2, two each of Errors 3 and 4, and one of Error 5. These numbers add up to much more than 15, because several of them had multiple errors.


More examples

Binary search is used for more than just searching an array to see if it contains a value, so here is one more example for now. I may update this list as I think of more.

Suppose you have an increasing (non-decreasing) function f: R->R, and (because you want a root of f, say), you want to find the largest t such that f(t) < 0. See how many bugs you can find in the following:

float high = INF, low = 0;
while(high != low) {
   float mid = (low + high)/2;
   if(f(mid)>0) high=mid;
   else low=mid;
}
printf("%f", high);

(Some of them: there may be no such t in [0, INF]; if f is 0 on an interval then this is wrong; never compare floating-point numbers for equality; with floats, the loop may never even terminate; etc.)
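
For contrast, here is one hedged way to write the real-valued version: a minimal C++ sketch, assuming f is non-decreasing and that the caller has already found a finite bracket [lo, hi] with f(lo) < 0 <= f(hi). The function f below is a made-up example, and the fixed iteration count sidesteps the floating-point-equality and termination issues above.

#include <cstdio>

// Hypothetical non-decreasing function, for illustration only (root at sqrt(2)).
static double f(double x) { return x * x - 2.0; }

int main() {
    double lo = 0.0, hi = 2.0;   // caller must first find a bracket with f(lo) < 0 and f(hi) >= 0
    // Invariant: f(lo) < 0 and f(hi) >= 0. A fixed number of halvings replaces the
    // floating-point equality test; ~100 iterations exhaust double precision anyway.
    for (int i = 0; i < 100; ++i) {
        double mid = lo + (hi - lo) / 2.0;
        if (f(mid) >= 0.0) hi = mid;   // keep the invariant on hi
        else               lo = mid;   // keep the invariant on lo
    }
    std::printf("%.12f\n", lo);        // lo approximates the largest t with f(t) < 0
    return 0;
}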

I used to make several of those errors — the first few dozen times I wrote binary search (which was during programming contests with time pressure), about 30% of the time there was a bug somewhere — until I found the simple way to write it correctly. I haven't got a binary search wrong since (as I recall). The trick is very simple:

> Maintain an invariant.

Find/decide and make explicit some invariant property that your "low" and "high" variables satisfy throughout the loop: before, during and after. Make sure it is never violated. Of course you also need to think about the termination condition. This is explained in good detail in Chapter 4 of Programming Pearls which derives a binary search program from semi-formal methods.

For instance, to slightly abstract out the condition being examined, suppose you want to find the largest integer value x for which some condition poss(x) is true. Even this explicitness of problem definition is more than many programmers start with. (For instance, poss(x) may be a[x] ≤ v for some value v; this is to find out how many elements in the sorted array a are greater than v, say.) Then, one way to write binary search is:

int lo=0, hi=n;
//INVARIANT: poss(lo) is true, poss(hi) is false
//Check and ensure invariant before starting binary search
assert(poss(lo)==true);
assert(poss(hi)==false);
while(hi-lo>1) {
    int mid = lo + (hi-lo)/2;
    if(poss(mid)) lo = mid;
    else hi = mid;
}
printf("%d \n",lo);

You can add more assert statements and other checks, but the basic idea is that because you update lo to mid only when you know that poss(mid) is true, you maintain the invariant that poss(lo) is always true. Similarly, you set hi to mid only when poss(mid) is false, so you maintain the invariant that poss(hi) is always false. Think about the termination condition separately. (Note that when hi-lo is 1, mid is the same as lo. So don't write the loop as while(hi>lo), or you'll have an infinite loop.) At the end of the loop, you're guaranteed that hi-lo is at most 1, and because you always maintained your invariant (poss(lo) is true and poss(hi) is false), it cannot be 0. Also, again because of your invariant, you know that lo is the value to return/print/use. There are other ways to write binary search, of course, but maintaining an invariant is a trick/discipline that always helps.
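
As a concrete (hedged) instantiation of the pattern above, here is a C++ sketch for poss(x) = "a[x] ≤ v" on a sorted array. The sentinels lo = -1 and hi = n stand in for the assertions (think a[-1] = -infinity and a[n] = +infinity), so the invariant holds before the loop without any out-of-bounds access; the function name is mine:

#include <cstdio>
#include <vector>

// Sketch for poss(x) = "a[x] <= v" on a sorted array: find the largest such index.
// lo = -1 and hi = n act as sentinels, so the invariant "poss(lo) true, poss(hi) false"
// holds before the loop without ever reading outside the array.
int lastIndexAtMost(const std::vector<int>& a, int v) {
    int lo = -1, hi = static_cast<int>(a.size());
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;   // lo < mid < hi, so a[mid] is a real element
        if (a[mid] <= v) lo = mid;      // poss(mid) true: invariant on lo preserved
        else             hi = mid;      // poss(mid) false: invariant on hi preserved
    }
    return lo;                          // -1 means "no element is <= v"
}

int main() {
    std::vector<int> a = {1, 3, 3, 5, 8};
    std::printf("%d %d\n", lastIndexAtMost(a, 3), lastIndexAtMost(a, 0));   // prints "2 -1"
    return 0;
}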

Solution 2 - Algorithm

Here are some I can think of:

  • Off-by-one errors, when determining the boundary of the next interval
  • Handling of duplicate items, if you are supposed to return the first equal item in the array but instead return a subsequent equal item (see the sketch after this list)
  • Numerical underflows/overflows when computing indices, with huge arrays
  • Recursive vs non-recursive implementation, a design choice you should consider

Are these what you have in mind?
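
For the duplicate-handling point, here is a hedged C++ sketch of a lower_bound-style search (the helper name is mine): it always returns the smallest index whose element is >= the key, so an equal run is handled consistently.

#include <cstddef>
#include <cstdio>
#include <vector>

// Returns the smallest index i with a[i] >= key (a.size() if there is none), so for
// duplicates it always lands on the FIRST equal element; searches the half-open range [lo, hi).
std::size_t lowerBound(const std::vector<int>& a, int key) {
    std::size_t lo = 0, hi = a.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key) lo = mid + 1;   // everything up to and including mid is < key
        else              hi = mid;       // a[mid] >= key, so the answer is at most mid
    }
    return lo;
}

int main() {
    std::vector<int> a = {1, 2, 2, 2, 9};
    std::size_t i = lowerBound(a, 2);
    if (i < a.size() && a[i] == 2) std::printf("first 2 at index %zu\n", i);   // index 1
    return 0;
}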

Solution 3 - Algorithm

Read Joshua Bloch's write-up, "Nearly All Binary Searches and Mergesorts are Broken": Java's binary search implementation hid a bug for almost a decade before anybody found it.

The bug is integer overflow. It didn't cause people problems because hardly anyone was searching big enough data structures.
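
A minimal C++ sketch of the same pitfall (the specific bounds are made up; the point is that low + high can exceed INT_MAX even though each index is valid on its own):

#include <climits>
#include <cstdio>

int main() {
    // Suppose a huge array is being searched and both bounds are already large.
    int low = 1500000000, high = 2000000000;   // each fits in int (INT_MAX is ~2.1e9)

    // int mid = (low + high) / 2;             // low + high = 3.5e9: signed overflow, undefined behaviour
    int mid = low + (high - low) / 2;          // same mathematical value, no overflow
    std::printf("mid = %d (INT_MAX = %d)\n", mid, INT_MAX);

    // Java's fix in Arrays.binarySearch used an unsigned shift instead: mid = (low + high) >>> 1;
    return 0;
}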

Solution 4 - Algorithm

One crucial reason people can't implement binary search correctly is that we don't characterize it well: it is a well-defined problem, but usually one doesn't define it precisely.

One universal rule is to learn from failure. Here, thinking about 'invalid' cases helps clarify the problem. What if the input is empty? What if the input contains duplicates? Should I implement it with one conditional test per iteration, or two (plus an extra test for early termination)? There are also technical issues such as numerical overflow/underflow when computing indices, and other tricks.

Errors that can be avoided by characterizing the problem well include off-by-one errors and the handling of duplicate items, as @Zach Scrivena pointed out.

Many people view binary search as "find a target value in a sorted array". That is how binary search is commonly used, not binary search per se.

I'll try to give a relatively rigorous definition/formulation of binary search, and show one way to avoid off-by-one errors and duplicate issues by conforming to one specific convention, which, of course, is not new.

# (my) definition of binary search:
input:
    L: a 'partially sorted' array,
    key: a function taking an item of L as argument
prerequisite:
    by 'partially sorted' I mean: if we apply the key function to every item of L, we get a
    new array of booleans, L_1, that can be partitioned into two blocks, left and right,
    with every item in the left block being False and every item in the right block being True.
    (left and/or right may be empty)
output:
    the index of the first item in the right block of L_1 (as defined in the prerequisite),
    or equivalently, the index of the first item in L such that key(item) == True

This definition naturally resolves the duplicate issue.

There are commonly two ways to denote array ranges, [] (closed) and [) (half-open); I prefer the latter. An equivalent of the [) approach is to use a (start, count) pair instead.

# Algorithm: binary search
# input: L: a 'partially sorted' array; key: a function taking an item of L as argument
while L is not empty:
    mid = left + (right - left)/2  # or mid = left + count/2
    if key(mid item) is True:
        recede right               # if True, recede right
    else:
        forward left               # if False, forward left
return left

Thus, if you make your "if True, recede (the end)" and "if False, forward (the start)" parts right, you are almost done. I call it the "FFTR rule" of binary search. If you are going to find the first item, as in the definition above, the answer is the left index; if you are going to find the last item instead, the roles are reversed. If you conform to the [) fashion, then a possible implementation is:

while left < right:
    mid = left + (right - left) // 2
    if key(L[mid]) == True:
        right = mid
    else:
        left = mid + 1
return left

Let's validate it further, by first showing convergence and then correctness.

convergence:

Whenever left == right, we exit the loop (this also covers the case where the range is empty from the start).

So, inside the loop, if we denote

	diff = right - left,

	lfstep = 1 + diff/2, 'lfstep' for 'left index forward step size',

	rbstep = diff - diff/2, 'rbstep' for 'right index recede (step back) size',

it can be shown that lfstep and rbstep are always positive (since diff > 0 inside the loop), so left
and right must eventually become equal.

Both lfstep and rbstep are asymptotically half of the current subarray size, so the running time
is logarithmic.

correctness:

if the input array is empty:
	return the left index;
else:
	if key(mid item) is True:
		"recede right"
		mid and all items after mid are True, so we can reduce the search range
        to [left, mid). To validate this, there are two possible cases:
		case 1:
			mid is the first item such that key(item) is True, so there are no True items
            in the new search range [left, mid); the test is always False, and we
            forward left at each iteration until the search range is empty, that is
            [finalleft, mid) with finalleft == mid; since we return finalleft, we are correctly done!
		case 2:
			there is an item before mid such that key(item) is True;
			in this case we have just reduced to a smaller instance of the same problem.
	else:
		"forward left"
		mid and all items before mid are False; since we want the first True one,
		we can safely reduce the search range to [mid+1, right) without changing the result.
		

An equivalent (start, count) version:

while count > 0:
    step = count // 2
    mid = start + step
    if key(L[mid]) == True:
        count = step           # recede the end: keep searching [start, start+step)
    else:
        start = mid + 1        # forward the start
        count -= step + 1
return start

To summarize, if you conform to the [) convention:

1. define the key function for your problem,
2. implement your binary search with the "FFTR rule" --
    "recede (the end) if True (key(item) == True), else forward (the start)".
	Example:
		to find a value target and return its index, or -1 if not found:
		key = lambda x: x >= target
		if L[found_index] == target: return found_index
		else: return -1
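
A hedged C++ rendering of this recipe (the helper name findIndex and the test values are mine, for illustration): key(x) is x >= target, the search runs over the half-open range [left, right), and a final check turns "first index where key is True" into "index of target or -1".

#include <cstddef>
#include <cstdio>
#include <vector>

// "Find a value target, return its index or -1 if not found", written as
// "first index where key(item) is True" with key(x) = (x >= target), over [left, right).
int findIndex(const std::vector<int>& L, int target) {
    std::size_t left = 0, right = L.size();
    while (left < right) {
        std::size_t mid = left + (right - left) / 2;
        if (L[mid] >= target) right = mid;       // key True: recede the end
        else                  left = mid + 1;    // key False: forward the start
    }
    if (left < L.size() && L[left] == target) return static_cast<int>(left);
    return -1;
}

int main() {
    std::vector<int> L = {2, 4, 4, 7};
    std::printf("%d %d\n", findIndex(L, 4), findIndex(L, 5));   // prints "1 -1"
    return 0;
}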

As for early termination via an extra equality test per iteration, I don't think it's worth the cost, but you can try.

Solution 5 - Algorithm

Failing to consider that, when calculating the midpoint between two indices, summing the high and low values may result in integer overflow.


Solution 6 - Algorithm

The pipelining architecture of modern processors is much more suited to do linear searches than to do binary searches which have lots of decisions and branches.

Therefore a common performance bug, and a premature optimization (the root of all evil, as you know), is using binary search at all when a simple linear search would be faster and, of course, simpler.

Depending on the number of reads, even sorting the data at all can make things slower. A break-even point between linear and binary can easily be at 1000 elements for simple keys (e.g. 32 bit integers).
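
If you want to know where that break-even point lies for your own data, here is a rough, hedged measurement sketch in C++; the sizes, repetition count, and use of std::find versus std::lower_bound are illustrative choices, and the actual numbers depend on the compiler, CPU, and key type.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

// Time 'reps' lookups over a sorted vector, either linearly (std::find) or by binary search
// (std::lower_bound); returns elapsed microseconds and accumulates results into 'sink'.
static long long timeSearchesUs(const std::vector<int>& a, bool binary, int reps, long long& sink) {
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        int key = r % static_cast<int>(a.size());
        sink += binary ? (std::lower_bound(a.begin(), a.end(), key) - a.begin())
                       : (std::find(a.begin(), a.end(), key) - a.begin());
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main() {
    long long sink = 0;   // printed at the end so the searches aren't optimized away
    for (int n : {16, 64, 256, 1024, 4096}) {
        std::vector<int> a(n);
        std::iota(a.begin(), a.end(), 0);   // sorted keys 0..n-1
        long long lin = timeSearchesUs(a, false, 100000, sink);
        long long bin = timeSearchesUs(a, true, 100000, sink);
        std::printf("n=%d linear=%lldus binary=%lldus\n", n, lin, bin);
    }
    std::printf("(checksum %lld)\n", sink);
    return 0;
}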

Solution 7 - Algorithm

> One bug which I encountered while implementing Binary Search is having integer range overflow while calculating the mid element. As we all know, the steps of a Binary Search implementation are:

  1. Find the Mid element.
  2. Check if:
  • Target element is greater than the mid element, search from mid+1 to end index
  • Target element is smaller than the mid element, search from start to mid-1 index
  • Target element is equal to the mid element, return the index.

In all of the above steps, the mid element is recalculated. The mid element is commonly computed as mid = (start+end)/2; now, if the start and end indices are large enough (for a very big array), their sum can overflow the int range. So, a better way of doing it is: mid = start + (end-start)/2

If you check carefully, this ultimately gives us the same value, but it gets rid of the overflow issue we might otherwise encounter.

> Also, as we know, binary search can be applied to a sorted list only. Suppose the problem states that the input list is sorted, but does not mention whether it is in ascending or descending order. In that case, you might need to add an extra check to determine the direction.
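
A hedged C++ sketch of that extra check (the function name is mine): compare the two ends to detect the direction, then flip the comparison accordingly.

#include <cstddef>
#include <cstdio>
#include <vector>

// For a sorted list, comparing the two ends is enough to detect the direction
// (if they are equal, every element is equal and the direction does not matter).
int search(const std::vector<int>& a, int key) {
    if (a.empty()) return -1;
    bool ascending = a.front() <= a.back();
    std::size_t lo = 0, hi = a.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == key) return static_cast<int>(mid);
        bool goRight = ascending ? (a[mid] < key) : (a[mid] > key);
        if (goRight) lo = mid + 1;
        else         hi = mid;
    }
    return -1;   // not found
}

int main() {
    std::vector<int> up = {1, 3, 5, 7}, down = {7, 5, 3, 1};
    std::printf("%d %d\n", search(up, 5), search(down, 5));   // prints "2 1"
    return 0;
}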

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: joeforker
Solution 1 - Algorithm: ShreevatsaR
Solution 2 - Algorithm: Zach Scrivena
Solution 3 - Algorithm: Dan Dyer
Solution 4 - Algorithm: qeatzy
Solution 5 - Algorithm: tvanfosson
Solution 6 - Algorithm: Peter G.
Solution 7 - Algorithm: Himan