Calculating mid in binary search


Algorithm Problem Overview


I was reading an algorithms book which had the following algorithm for binary search:

public class BinSearch {
  static int search(int[] A, int K) {
    int l = 0;
    int u = A.length - 1;
    int m;
    while (l <= u) {
      m = (l + u) / 2;
      if (A[m] < K) {
        l = m + 1;
      } else if (A[m] == K) {
        return m;
      } else {
        u = m - 1;
      }
    }
    return -1;
  }
}

The author says "The error is in the assignment m = (l+u)/2; it can lead to overflow and should be replaced by m = l + (u-l)/2."

I can't see how that would cause an overflow. When I run the algorithm in my mind for a few different inputs, I don't see the mid's value going out of the array index.

So, in which cases would the overflow occur?

Algorithm Solutions


Solution 1 - Algorithm

This post covers this famous bug in a lot of detail. As others have said, it's an overflow issue. The fix recommended in the linked post is as follows:

int mid = low + ((high - low) / 2);

// Alternatively
int mid = (low + high) >>> 1;
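
The two fixes can be compared side by side with the naive formula. The following is a minimal sketch, not code from the linked post; the class name MidFixes and the sample values are illustrative assumptions, and both fixes assume 0 <= low <= high.

```java
class MidFixes {
    // Naive midpoint: low + high can wrap around past Integer.MAX_VALUE.
    static int naiveMid(int low, int high) {
        return (low + high) / 2;
    }

    // Fix 1: high - low cannot overflow when 0 <= low <= high.
    static int subtractMid(int low, int high) {
        return low + ((high - low) / 2);
    }

    // Fix 2: the sum may wrap, but >>> reads the 32 bits as unsigned,
    // so the shifted result is the true midpoint for non-negative inputs.
    static int shiftMid(int low, int high) {
        return (low + high) >>> 1;
    }

    public static void main(String[] args) {
        int low = 1_500_000_000;
        int high = 2_000_000_000;
        System.out.println(naiveMid(low, high));    // -397483648: the sum wrapped
        System.out.println(subtractMid(low, high)); // 1750000000
        System.out.println(shiftMid(low, high));    // 1750000000
    }
}
```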

It is also probably worth mentioning that if negative indices are allowed, or if it's not even an array that's being searched (for example, searching an integer range for a value satisfying some condition), the code above may not be correct either. In this case, something as ugly as

(low < 0 && high > 0) ? (low + high) / 2 : low + (high - low) / 2

may be necessary. One good example is searching for the median of an unsorted array, without modifying it or using additional space, by simply performing a binary search on the whole Integer.MIN_VALUE to Integer.MAX_VALUE range.
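
To see why the subtraction form is not enough on a signed range, here is a small sketch that evaluates both expressions on the full int range. The class and method names (SignedMid, subtractMid, signAwareMid) are hypothetical and chosen only for illustration.

```java
class SignedMid {
    // Fine for 0 <= low <= high, but high - low overflows when the
    // range spans more than Integer.MAX_VALUE values.
    static int subtractMid(int low, int high) {
        return low + (high - low) / 2;
    }

    // The conditional from the text: when low is negative and high is
    // positive, low + high cannot overflow, so the plain average is used.
    static int signAwareMid(int low, int high) {
        return (low < 0 && high > 0) ? (low + high) / 2 : low + (high - low) / 2;
    }

    public static void main(String[] args) {
        int low = Integer.MIN_VALUE;
        int high = Integer.MAX_VALUE;
        // high - low wraps to -1, so the "midpoint" stays stuck at low.
        System.out.println(subtractMid(low, high));  // -2147483648
        System.out.println(signAwareMid(low, high)); // 0
    }
}
```

One caveat that may be worth checking for your own ranges: with low = Integer.MIN_VALUE and high = 0 the test high > 0 is false and high - low still wraps, so testing high >= 0 instead would cover that case.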

Solution 2 - Algorithm

The following C++ program can show you how an overflow can happen with a 32-bit unsigned integer:

#include <iostream>
using namespace std;

int main ()
{
  unsigned int  low = 33,  
                high = 4294967290, 
                mid;

  cout << "The value of low is " << low << endl;
  cout << "The value of high is " << high << endl;

  mid = (low + high) / 2;

  cout << "The value of mid is " << mid << endl;
  
  return 0;
}

If you run it on a Mac:

$ g++ try.cpp && ./a.out
The value of low is 33
The value of high is 4294967290
The value of mid is 13

The value of mid might be expected to be 2147483661, but low + high overflowed: a 32-bit unsigned integer cannot hold the true sum 4294967323, so it wrapped around to 4294967323 - 2^32 = 27, and mid became 27 / 2 = 13.

When the calculation of mid is changed to

mid = low + (high - low) / 2;

Then it will show

The value of mid is 2147483661

The simple answer is, the addition l + u can overflow, and has undefined behavior in some languages, as described in a blog post by Joshua Bloch, about a bug in the Java library for the implementation of binary search.

Some readers may not understand what it is about:

l + (u - l) / 2

Note that in some code, the variable names are different, and it is

low + (high - low) / 2

The answer is: say you have two numbers, 200 and 210, and you want the "middle number". And say that adding two numbers overflows whenever the result is greater than 255, so that the behavior is undefined (or the value wraps around). What can you do? A simple way is to add half of the difference between them to the smaller value. The difference between 200 and 210 is 10 (you can think of it as the "distance" or "length" between them), so you just add 10 / 2 = 5 to 200 and get 205. You never need to add 200 and 210 together. That is how we reach the calculation: (u - l) is the difference, (u - l) / 2 is half of it, and adding that to l gives l + (u - l) / 2.

It is like, if we are looking at two trees, one is 200 feet tall and one is 210 feet tall, what is the "midpoint" or the "mean"? We don't have to add them together first. We can just tell the difference is 10 feet, and we can add half of that, which is 5, to 200, and we know it is 205 feet.
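
The 200 and 210 example can be tried out in Java by emulating an unsigned 8-bit register, where anything above 255 wraps around. This is only a sketch: the helper wrap8 and the class name EightBitMid are made up for the illustration, and wrap-around is assumed as the overflow behavior.

```java
class EightBitMid {
    // Hypothetical helper: keep only the low 8 bits, the way an
    // unsigned 8-bit register would.
    static int wrap8(int value) {
        return value & 0xFF;
    }

    // Adds first, so the sum is wrapped before it is halved.
    static int naiveMid8(int l, int u) {
        return wrap8(l + u) / 2;
    }

    // Adds half of the difference to the smaller value; no
    // intermediate result exceeds 255 when 0 <= l <= u <= 255.
    static int safeMid8(int l, int u) {
        return wrap8(l + wrap8(u - l) / 2);
    }

    public static void main(String[] args) {
        System.out.println(naiveMid8(200, 210)); // 77, because 410 wraps to 154
        System.out.println(safeMid8(200, 210));  // 205
    }
}
```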

To put this in historical perspective, Robert Sedgewick mentioned that the first binary search was stated in 1946, and it wasn't correct until 1964. Jon Bentley described in his book Programming Pearls in 1988 that more than 90% of professional programmers could not write it correctly given a couple of hours. But even Jon Bentley himself had that overflow bug for 20 years. A study published in 1988 showed that accurate code for binary search was found in only 5 out of 20 textbooks. In 2006, Joshua Bloch wrote that blog post about the bug in calculating the mid value. So it took 60 years for this code to be correct. But now, next time in a job interview, remember to write it correctly within those 5 minutes.

Solution 3 - Algorithm

The problem is that (l+u) is evaluated first, and could overflow int, so (l+u)/2 would return the wrong value.
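
Applied to the method from the question, the change is confined to one line. A minimal sketch, with the class renamed to BinSearchFixed purely to keep it apart from the original:

```java
class BinSearchFixed {
    // Returns an index of K in the sorted array A, or -1 if K is absent.
    static int search(int[] A, int K) {
        int l = 0;
        int u = A.length - 1;
        while (l <= u) {
            // u - l never overflows because 0 <= l <= u here.
            int m = l + (u - l) / 2;
            if (A[m] < K) {
                l = m + 1;
            } else if (A[m] == K) {
                return m;
            } else {
                u = m - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] sorted = {1, 3, 5, 7, 9};
        System.out.println(search(sorted, 7)); // 3
        System.out.println(search(sorted, 4)); // -1
    }
}
```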

Solution 4 - Algorithm

Jeff suggested a really good post to read about this bug; here is a summary if you want a quick overview.

In Programming Pearls Bentley says that the analogous line "sets m to the average of l and u, truncated down to the nearest integer." On the face of it, this assertion might appear correct, but it fails for large values of the int variables low and high. Specifically, it fails if the sum of low and high is greater than the maximum positive int value (2^31 - 1). The sum overflows to a negative value, and the value stays negative when divided by two. In C this causes an array index out of bounds with unpredictable results. In Java, it throws ArrayIndexOutOfBoundsException.
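
The Java behavior described above can be reproduced without allocating a huge array, since any negative index is rejected regardless of the array's length. The sketch below makes that assumption explicit; the small array stands in for the large one, and the names NegativeMidDemo, naiveMid and indexIsRejected are hypothetical.

```java
class NegativeMidDemo {
    // The midpoint as computed in the buggy version.
    static int naiveMid(int low, int high) {
        return (low + high) / 2;
    }

    // Returns true if reading a[index] throws ArrayIndexOutOfBoundsException.
    static boolean indexIsRejected(int[] a, int index) {
        try {
            int ignored = a[index];
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int low = 1 << 30;   // 1073741824
        int high = low + 2;  // 1073741826
        int mid = naiveMid(low, high);
        System.out.println(mid);                              // -1073741823
        System.out.println(indexIsRejected(new int[4], mid)); // true
    }
}
```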

Solution 5 - Algorithm

Here is an example: suppose you had a very big array of size 2,000,000,010 (2 × 10^9 + 10), with the left index at 2,000,000,000 and the right index at 2,000,000,001.

Computing lo + hi gives 2,000,000,000 + 2,000,000,001 = 4,000,000,001. Since the maximum value of a 32-bit signed integer is 2,147,483,647, you won't get 4,000,000,001; the addition overflows.

But low + ((high - low) / 2) will work. 2,000,000,000 + ((2,000,000,001 - 2,000,000,000) / 2) = 2,000,000,000
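
The arithmetic above can be checked in Java by computing the sum once as a long (which has room for it) and once as an int. This is a sketch for illustration; the class and method names are not from the answer.

```java
class Solution5Numbers {
    // The mathematically correct sum, computed in 64 bits.
    static long trueSum(int lo, int hi) {
        return (long) lo + hi;
    }

    // The same sum in 32 bits; Java wraps around silently on overflow.
    static int wrappedSum(int lo, int hi) {
        return lo + hi;
    }

    static int safeMid(int lo, int hi) {
        return lo + ((hi - lo) / 2);
    }

    public static void main(String[] args) {
        int lo = 2_000_000_000;
        int hi = 2_000_000_001;
        System.out.println(trueSum(lo, hi));        // 4000000001
        System.out.println(wrappedSum(lo, hi));     // -294967295 (off by 2^32)
        System.out.println(wrappedSum(lo, hi) / 2); // -147483647
        System.out.println(safeMid(lo, hi));        // 2000000000
    }
}
```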

Solution 6 - Algorithm

The potential overflow is in the l+u addition itself.

This was actually a bug in early versions of binary search in the JDK.

Solution 7 - Algorithm

The following statement for calculating mid may overflow the int range:

mid = (start + end) /2

Suppose the given ordered input list is very large, so that start + end exceeds the int range (-2^31 to 2^31 - 1). The sum then overflows; in Java it wraps around to a negative number, and using the resulting mid as an array index throws an exception. To counter this, the following statement is used instead:

mid = start + (end-start)/2

Mathematically both expressions give the same value, but the second one avoids the overflow.

Solution 8 - Algorithm

> int mid = (l + h) / 2; can lead to an integer overflow problem.
>
> (l + h) gets evaluated into a large negative integer value and its half is returned. Now, if we are searching for an element in an array, it would lead to an "index out of range" error.

However, the issue is resolved as follows:

  • int mid = l + (h - l) / 2;
  • Bit manipulation, for faster computation: int mid = ((unsigned int)l + (unsigned int)h) >> 1;

> where >> is the right shift operator.

Hope this helps :)

Solution 9 - Algorithm

To avoid overflow, you can also do this: int midIndex = (int) (startIndex/2.0 + endIndex / 2.0);

Dividing both indices by 2.0 gives two doubles that are each less than or equal to Integer.MAX_VALUE / 2, so their sum is a double less than or equal to Integer.MAX_VALUE. The same holds at the other end for Integer.MIN_VALUE. Finally, you convert the sum to an int, and overflow is prevented ;)
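
A quick way to convince yourself is to compare the double-based midpoint against the subtraction form for a few non-negative index pairs. The sketch below does that; the class name DoubleMid is an assumption for illustration.

```java
class DoubleMid {
    // Halve each index as a double, add, then truncate back to int.
    static int doubleMid(int startIndex, int endIndex) {
        return (int) (startIndex / 2.0 + endIndex / 2.0);
    }

    // Reference formula for 0 <= startIndex <= endIndex.
    static int subtractMid(int startIndex, int endIndex) {
        return startIndex + (endIndex - startIndex) / 2;
    }

    public static void main(String[] args) {
        int start = 2_000_000_000;
        int end = 2_000_000_001;
        System.out.println(doubleMid(start, end));   // 2000000000
        System.out.println(subtractMid(start, end)); // 2000000000
    }
}
```

This works because a double represents every int (and every int divided by two) exactly. The same trick would not carry over to long indices, since a double has only 53 bits of precision.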

Solution 10 - Algorithm

This answer gives a practical example of why the l + (r-l)/2 calculation is necessary.

In case you are curious how the two are equivalent mathematically, here is the proof. The key is adding 0 then splitting that into l/2 - l/2.

(l+r)/2 =
l/2 + r/2 =
l/2 + r/2 + 0 =
l/2 + r/2 + (l/2 - l/2) =
(l/2 + l/2) + (r/2 - l/2) =
l + (r-l)/2
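
Note that the derivation treats / as exact division. Under integer division the intermediate step l/2 + r/2 can lose a unit (3/2 + 5/2 is 1 + 2 = 3, while (3 + 5)/2 is 4), yet the first and last expressions still agree whenever 0 <= l <= r and nothing overflows. A small brute-force sketch, with hypothetical names, checks both observations:

```java
class MidEquivalence {
    // True if (l + r) / 2 == l + (r - l) / 2 for every 0 <= l <= r <= limit.
    static boolean formulasAgree(int limit) {
        for (int l = 0; l <= limit; l++) {
            for (int r = l; r <= limit; r++) {
                if ((l + r) / 2 != l + (r - l) / 2) {
                    return false;
                }
            }
        }
        return true;
    }

    // The intermediate step of the proof, evaluated with integer division.
    static int halvesSum(int l, int r) {
        return l / 2 + r / 2;
    }

    public static void main(String[] args) {
        System.out.println(formulasAgree(200)); // true
        System.out.println(halvesSum(3, 5));    // 3, not 4
    }
}
```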

Solution 11 - Algorithm

I have created this video with an example where number overflow will happen.

https://youtu.be/fMgenZq7qls

Usually, for a simple binary search that finds an element in an array, this won't happen because of array size limits in languages like Java; but where the problem space is not limited to an array, the problem can occur. Please see my video for a practical example.
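
As one illustration of a search whose space is not an array (this is not the example from the video, just an assumed scenario), consider finding the integer square root of a long by binary search over the whole range 0 to Integer.MAX_VALUE. Once lo climbs past 2^30 while hi is still near the top, lo + hi no longer fits in an int, so the sketch uses the subtraction form. The class name IntSqrt and the input bound are choices made for this example.

```java
class IntSqrt {
    // Largest x in [0, Integer.MAX_VALUE] with x * x <= n.
    // Assumes 0 <= n < Integer.MAX_VALUE squared, so mid + 1 never overflows.
    static int isqrt(long n) {
        long limit = (long) Integer.MAX_VALUE * Integer.MAX_VALUE;
        if (n < 0 || n >= limit) {
            throw new IllegalArgumentException("n out of supported range");
        }
        int lo = 0;
        int hi = Integer.MAX_VALUE;
        int best = 0;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2; // lo + hi could exceed int here
            if ((long) mid * mid <= n) {
                best = mid;   // mid is a candidate; look for a larger one
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(isqrt(16));                         // 4
        System.out.println(isqrt(4_000_000_000_000_000_000L)); // 2000000000
    }
}
```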

Solution 12 - Algorithm

It is a very subtle error and easy to miss the first time. Most articles on the internet don't seem to clearly explain how this error occurs and how the optimized formula prevents overflow.

After a lot of digging I found this article, which has an excellent and detailed explanation of how the error occurs when the mid = (left+right)/2 formula is used and how it is overcome using mid = low + ((high - low) / 2). Most importantly, they explain it with an example, which makes it so much easier to understand.

It also explains why mid = low + ((high - low) / 2) doesn't cause an overflow.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Bharat | View Question on Stackoverflow
Solution 1 - Algorithm | Jeff Foster | View Answer on Stackoverflow
Solution 2 - Algorithm | nonopolarity | View Answer on Stackoverflow
Solution 3 - Algorithm | murgatroid99 | View Answer on Stackoverflow
Solution 4 - Algorithm | Vipin | View Answer on Stackoverflow
Solution 5 - Algorithm | Sambhav Khare | View Answer on Stackoverflow
Solution 6 - Algorithm | Nemo | View Answer on Stackoverflow
Solution 7 - Algorithm | Himan | View Answer on Stackoverflow
Solution 8 - Algorithm | ckaus | View Answer on Stackoverflow
Solution 9 - Algorithm | Simon | View Answer on Stackoverflow
Solution 10 - Algorithm | Chris Redford | View Answer on Stackoverflow
Solution 11 - Algorithm | Vikas Verma | View Answer on Stackoverflow
Solution 12 - Algorithm | Rahul | View Answer on Stackoverflow