How does this float square root approximation work?

C++C Optimization Floating Point Ieee 754

C++ Problem Overview

I found a rather strange but working square root approximation for floats; I really don't get it. Can someone explain me why this code works?

float sqrt(float f)
{
    const int result = 0x1fbb4000 + (*(int*)&f >> 1);
    return *(float*)&result;   
}

I've test it a bit and it outputs values off of std::sqrt() by about 1 to 3%. I know of the Quake III's fast inverse square root and I guess it's something similar here (without the newton iteration) but I'd really appreciate an explanation of how it works.

(nota: I've tagged it both [tag:c] and [tag:c++] since it's both valid-ish (see comments) C and C++ code)

C++ Solutions

Solution 1 - C++

(*(int*)&f >> 1) right-shifts the bitwise representation of f. This almost divides the exponent by two, which is approximately equivalent to taking the square root.¹

Why almost? In IEEE-754, the actual exponent is e - 127.² To divide this by two, we'd need e/2 - 64, but the above approximation only gives us e/2 - 127. So we need to add on 63 to the resulting exponent. This is contributed by bits 30-23 of that magic constant (0x1fbb4000).

I'd imagine the remaining bits of the magic constant have been chosen to minimise the maximum error across the mantissa range, or something like that. However, it's unclear whether it was determined analytically, iteratively, or heuristically.

It's worth pointing out that this approach is somewhat non-portable. It makes (at least) the following assumptions:

The platform uses single-precision IEEE-754 for float.
The endianness of float representation.
That you will be unaffected by undefined behaviour due to the fact this approach violates C/C++'s strict-aliasing rules.

Thus it should be avoided unless you're certain that it gives predictable behaviour on your platform (and indeed, that it provides a useful speedup vs. sqrtf!).

_{1. sqrt(a^b) = (a^b)^0.5 = a^(b/2)}

_{2. See e.g. https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Exponent_encoding</sub>}

Solution 2 - C++

See Oliver Charlesworth’s explanation of why this almost works. I’m addressing an issue raised in the comments.

Since several people have pointed out the non-portability of this, here are some ways you can make it more portable, or at least make the compiler tell you if it won’t work.

First, C++ allows you to check std::numeric_limits<float>::is_iec559 at compile time, such as in a static_assert. You can also check that sizeof(int) == sizeof(float), which will not be true if int is 64-bits, but what you really want to do is use uint32_t, which if it exists will always be exactly 32 bits wide, will have well-defined behavior with shifts and overflow, and will cause a compilation error if your weird architecture has no such integral type. Either way, you should also static_assert() that the types have the same size. Static assertions have no run-time cost and you should always check your preconditions this way if possible.

Unfortunately, the test of whether converting the bits in a float to a uint32_t and shifting is big-endian, little-endian or neither cannot be computed as a compile-time constant expression. Here, I put the run-time check in the part of the code that depends on it, but you might want to put it in the initialization and do it once. In practice, both gcc and clang can optimize this test away at compile time.

You do not want to use the unsafe pointer cast, and there are some systems I’ve worked on in the real world where that could crash the program with a bus error. The maximally-portable way to convert object representations is with memcpy(). In my example below, I type-pun with a union, which works on any actually-existing implementation. (Language lawyers object to it, but no successful compiler will ever break that much legacy code silently.) If you must do a pointer conversion (see below) there is alignas(). But however you do it, the result will be implementation-defined, which is why we check the result of converting and shifting a test value.

Anyway, not that you’re likely to use it on a modern CPU, here’s a gussied-up C++14 version that checks those non-portable assumptions:

#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <limits>
#include <vector>

using std::cout;
using std::endl;
using std::size_t;
using std::sqrt;
using std::uint32_t;

template <typename T, typename U>
  inline T reinterpret(const U x)
/* Reinterprets the bits of x as a T.  Cannot be constexpr
 * in C++14 because it reads an inactive union member.
 */
{
  static_assert( sizeof(T)==sizeof(U), "" );
  union tu_pun {
    U u = U();
    T t;
  };
  
  const tu_pun pun{x};
  return pun.t;
}

constexpr float source = -0.1F;
constexpr uint32_t target = 0x5ee66666UL;

const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
const bool is_little_endian = after_rshift == target;

float est_sqrt(const float x)
/* A fast approximation of sqrt(x) that works less well for subnormal numbers.
 */
{
  static_assert( std::numeric_limits<float>::is_iec559, "" );
  assert(is_little_endian); // Could provide alternative big-endian code.
  
 /* The algorithm relies on the bit representation of normal IEEE floats, so
  * a subnormal number as input might be considered a domain error as well?
  */
  if ( std::isless(x, 0.0F) || !std::isfinite(x) )
	return std::numeric_limits<float>::signaling_NaN();
  
  constexpr uint32_t magic_number = 0x1fbb4000UL;
  const uint32_t raw_bits = reinterpret<uint32_t,float>(x);
  const uint32_t rejiggered_bits = (raw_bits >> 1U) + magic_number;
  return reinterpret<float,uint32_t>(rejiggered_bits);
}

int main(void)
{  
  static const std::vector<float> test_values{
    4.0F, 0.01F, 0.0F, 5e20F, 5e-20F, 1.262738e-38F };
  
  for ( const float& x : test_values ) {
	const double gold_standard = sqrt((double)x);
	const double estimate = est_sqrt(x);
	const double error = estimate - gold_standard;
	
	cout << "The error for (" << estimate << " - " << gold_standard << ") is "
	     << error;

	if ( gold_standard != 0.0 && std::isfinite(gold_standard) ) {
	  const double error_pct = error/gold_standard * 100.0;
	  cout << " (" << error_pct << "%).";
	} else
	  cout << '.';

    cout << endl;
  }

  return EXIT_SUCCESS;
}

Update

Here is an alternative definition of reinterpret<T,U>() that avoids type-punning. You could also implement the type-pun in modern C, where it’s allowed by standard, and call the function as extern "C". I think type-punning is more elegant, type-safe and consistent with the quasi-functional style of this program than memcpy(). I also don’t think you gain much, because you still could have undefined behavior from a hypothetical trap representation. Also, clang++ 3.9.1 -O -S is able to statically analyze the type-punning version, optimize the variable is_little_endian to the constant 0x1, and eliminate the run-time test, but it can only optimize this version down to a single-instruction stub.

But more importantly, this code isn’t guaranteed to work portably on every compiler. For example, some old computers can’t even address exactly 32 bits of memory. But in those cases, it should fail to compile and tell you why. No compiler is just suddenly going to break a huge amount of legacy code for no reason. Although the standard technically gives permission to do that and still say it conforms to C++14, it will only happen on an architecture very different from we expect. And if our assumptions are so invalid that some compiler is going to turn a type-pun between a float and a 32-bit unsigned integer into a dangerous bug, I really doubt the logic behind this code will hold up if we just use memcpy() instead. We want that code to fail at compile time, and to tell us why.

#include <cassert>
#include <cstdint>
#include <cstring>

using std::memcpy;
using std::uint32_t;

template <typename T, typename U> inline T reinterpret(const U &x)
/* Reinterprets the bits of x as a T.  Cannot be constexpr
 * in C++14 because it modifies a variable.
 */
{
  static_assert( sizeof(T)==sizeof(U), "" );
  T temp;
  
  memcpy( &temp, &x, sizeof(T) );
  return temp;
}

constexpr float source = -0.1F;
constexpr uint32_t target = 0x5ee66666UL;

const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
extern const bool is_little_endian = after_rshift == target;

However, Stroustrup et al., in the C++ Core Guidelines, recommend a reinterpret_cast instead:

#include <cassert>

template <typename T, typename U> inline T reinterpret(const U x)
/* Reinterprets the bits of x as a T.  Cannot be constexpr
 * in C++14 because it uses reinterpret_cast.
 */
{
  static_assert( sizeof(T)==sizeof(U), "" );
  const U temp alignas(T) alignas(U) = x;
  return *reinterpret_cast<const T*>(&temp);
}

The compilers I tested can also optimize this away to a folded constant. Stroustrup’s reasoning is [sic]:

> Accessing the result of an reinterpret_cast to a different type from the objects declared type is still undefined behavior, but at least we can see that something tricky is going on.

Update

From the comments: C++20 introduces std::bit_cast, which converts an object representation to a different type with unspecified, not undefined, behavior. This doesn’t guarantee that your implementation will use the same format of float and int that this code expects, but it doesn’t give the compiler carte blanche to break your program arbitrarily because there’s technically undefined behavior in one line of it. It can also give you a constexpr conversion.

Solution 3 - C++

Let y = sqrt(x),

it follows from the properties of logarithms that log(y) = 0.5 * log(x) (1)

Interpreting a normal float as an integer gives INT(x) = Ix = L * (log(x) + B - σ) (2)

where L = 2^N, N the number of bits of the significand, B is the exponent bias, and σ is a free factor to tune the approximation.

Combining (1) and (2) gives: Iy = 0.5 * (Ix + (L * (B - σ)))

Which is written in the code as (*(int*)&x >> 1) + 0x1fbb4000;

Find the σ so that the constant equals 0x1fbb4000 and determine whether it's optimal.

Solution 4 - C++

Adding a wiki test harness to test all float.

The approximation is within 4% for many float, but very poor for sub-normal numbers. YMMV

Worst:1.401298e-45 211749.20%
Average:0.63%
Worst:1.262738e-38 3.52%
Average:0.02%

Note that with argument of +/-0.0, the result is not zero.

printf("% e % e\n", sqrtf(+0.0), sqrt_apx(0.0));  //  0.000000e+00  7.930346e-20
printf("% e % e\n", sqrtf(-0.0), sqrt_apx(-0.0)); // -0.000000e+00 -2.698557e+19

Test code

#include <float.h>
#include <limits.h>
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

float sqrt_apx(float f) {
  const int result = 0x1fbb4000 + (*(int*) &f >> 1);
  return *(float*) &result;
}

double error_value = 0.0;
double error_worst = 0.0;
double error_sum = 0.0;
unsigned long error_count = 0;

void sqrt_test(float f) {
  if (f == 0) return;
  volatile float y0 = sqrtf(f);
  volatile float y1 = sqrt_apx(f);
  double error = (1.0 * y1 - y0) / y0;
  error = fabs(error);
  if (error > error_worst) {
    error_worst = error;
    error_value = f;
  }
  error_sum += error;
  error_count++;
}

void sqrt_tests(float f0, float f1) {
  error_value = error_worst = error_sum = 0.0;
  error_count = 0;
  for (;;) {
    sqrt_test(f0);
    if (f0 == f1) break;
    f0 = nextafterf(f0, f1);
  }
  printf("Worst:%e %.2f%%\n", error_value, error_worst*100.0);
  printf("Average:%.2f%%\n", error_sum / error_count);
  fflush(stdout);
}

int main() {
  sqrt_tests(FLT_TRUE_MIN, FLT_MIN);
  sqrt_tests(FLT_MIN, FLT_MAX);
  return 0;
}

Content Type	Original Author	Original Content on Stackoverflow
Question	YSC	View Question on Stackoverflow
Solution 1 - C++	Oliver Charlesworth	View Answer on Stackoverflow
Solution 2 - C++	Davislor	View Answer on Stackoverflow
Solution 3 - C++	Michael Foukarakis	View Answer on Stackoverflow
Solution 4 - C++	chux - Reinstate Monica	View Answer on Stackoverflow

How does this float square root approximation work?

C++ Problem Overview

C++ Solutions

Solution 1 - C++

Solution 2 - C++

Update

Update

Solution 3 - C++

Solution 4 - C++

Wordpress Docker won't increase upload limit

Visual Studio error D8016: '/ZI' and '/Gy' command-line options are incompatible

Attributions