How to perform a bitwise operation on floating point numbers

C++, Floating Point, Genetic Algorithm, Bitwise Operators

C++ Problem Overview


I tried this:

float a = 1.4123;
a = a & (1 << 3);

I get a compiler error saying that the operand of & cannot be of type float.

When I do:

float a = 1.4123;
a = (int)a & (1 << 3);

The program compiles and runs. The only problem is that the bitwise operation is applied to the integer obtained by truncating the number, not to the bits of the float itself.

The following is also not allowed.

float a = 1.4123;
a = (void*)a & (1 << 3);

I don't understand why int can be cast to void* but not float.

I am doing this to solve the problem described in the Stack Overflow question "How to solve linear equations using a genetic algorithm?"

C++ Solutions


Solution 1 - C++

At the language level, there's no such thing as a "bitwise operation on floating-point numbers". Bitwise operations in C/C++ work on the value representation of a number, and the value representation of floating-point numbers is not defined in C/C++ (unsigned integers are an exception in this regard: they are required to use a pure binary representation, so shifts and masks on them have well-defined results). Floating-point numbers don't have bits at the level of value representation, which is why you can't apply bitwise operations to them.

All you can do is analyze the bit content of the raw memory occupied by the floating-point number. For that you can either use a union as suggested below (valid in C, but formally undefined behavior in C++ when you read a member other than the one last written) or reinterpret the floating-point object as an array of unsigned char objects, shown here with C++ cast syntax:

float f = 5;
unsigned char *c = reinterpret_cast<unsigned char *>(&f);
// inspect memory from c[0] to c[sizeof f - 1]

And please, don't try to reinterpret a float object as an int object, as other answers suggest. That doesn't make much sense, and is not guaranteed to work in compilers that follow strict-aliasing rules in optimization. The correct way to inspect memory content in C++ is by reinterpreting it as an array of [signed/unsigned] char.
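As a minimal sketch of the unsigned char approach described above, the helper below walks the bytes of a float and formats them as hex. The function name float_bytes_hex is made up for this illustration, and the sketch assumes 8-bit bytes; the byte order in the result depends on the platform's endianness.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the original answer): returns the object
// representation of f as a hex string, one byte at a time, in memory order.
// Assumes 8-bit bytes, so every byte prints as exactly two hex digits.
std::string float_bytes_hex(float f)
{
    // Inspecting any object through unsigned char* is allowed by the
    // aliasing rules.
    const unsigned char *c = reinterpret_cast<const unsigned char *>(&f);
    std::string out;
    char buf[3];
    for (std::size_t i = 0; i < sizeof f; ++i) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c[i]));
        out += buf;
    }
    return out;
}
```

Calling float_bytes_hex(5.0f) yields 2 * sizeof(float) hex digits; which digits come first depends on whether the machine is little-endian or big-endian.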

Also note that you technically aren't guaranteed that floating-point representation on your system is IEEE754 (although in practice it is unless you explicitly allow it not to be, and then only with respect to -0.0, ±infinity and NaN).
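If code depends on the IEEE 754 layout, that assumption can be checked at compile time. The following is only a sketch: std::numeric_limits<float>::is_iec559 is the standard library's flag for IEC 60559 (IEEE 754) conformance, while the function name has_ieee754_float and the 32-bit size check are assumptions added for this example.

```cpp
#include <limits>

// Hypothetical helper: reports whether float claims IEC 60559 (IEEE 754)
// conformance on this implementation.
constexpr bool has_ieee754_float()
{
    return std::numeric_limits<float>::is_iec559;
}

// Stop compilation early if the layout assumptions do not hold.
static_assert(has_ieee754_float(), "this code assumes IEEE 754 floats");
static_assert(sizeof(float) == 4, "this code assumes a 32-bit float");
```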

Solution 2 - C++

If you are trying to change the bits in the floating-point representation, you could do something like this:

union fp_bit_twiddler {
    float f;
    int i;
} q;
q.f = a;
q.i &= (1 << 3);
a = q.f;

As AndreyT notes, accessing a union like this invokes undefined behavior, and the compiler could grow arms and strangle you. Do what he suggests instead.

Solution 3 - C++

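float a = 1.4123f;
// Note: reading a float through an unsigned int* violates the strict-aliasing
// rule (undefined behavior), as Solution 1 points out.
unsigned int* inta = reinterpret_cast<unsigned int*>(&a);
*inta = *inta & (1u << 3);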

Solution 4 - C++

You can work around the strict-aliasing rule and perform bitwise operations on a float type-punned as a uint32_t (if your implementation defines it, which most do) without undefined behavior by using memcpy():

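#include <cstdint>
#include <cstring>

float a = 1.4123f;
std::uint32_t b;
static_assert(sizeof a == sizeof b, "float must be 32 bits wide");

std::memcpy(&b, &a, sizeof b);
// perform bitwise operation
b &= 1u << 3;
std::memcpy(&a, &b, sizeof a);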
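The memcpy() approach can be wrapped in a pair of small helpers so the bit manipulation reads like ordinary integer code. This is a sketch rather than a definitive implementation: the names float_to_bits, bits_to_float and clear_sign_bit are invented for this example, and the sign-bit mask assumes the IEEE 754 binary32 layout.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "float must be 32 bits wide");

// Hypothetical helper: copy the bytes of a float into a uint32_t.
std::uint32_t float_to_bits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Hypothetical helper: the inverse of float_to_bits.
float bits_to_float(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Clears bit 31, the sign bit in the IEEE 754 binary32 layout (assumed),
// which yields the absolute value.
float clear_sign_bit(float f)
{
    return bits_to_float(float_to_bits(f) & 0x7FFFFFFFu);
}
```

On an IEEE 754 platform, float_to_bits(1.0f) is 0x3F800000 and clear_sign_bit(-2.5f) is 2.5f. Compilers typically optimize the memcpy() calls away, so the helpers usually cost nothing at run time.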

Solution 5 - C++

Have a look at the following. Inspired by fast inverse square root:

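#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    static_assert(sizeof(float) == sizeof(std::int32_t), "float must be 32 bits wide");
    float x, td = 2.0f;
    std::int32_t ti;
    // Copy the bytes rather than dereferencing a cast pointer,
    // which would violate the strict-aliasing rule.
    std::memcpy(&ti, &td, sizeof ti);
    std::cout << "Cast int: " << ti << std::endl;
    ti = ti >> 4; // shifts the sign, exponent and mantissa bits together
    std::memcpy(&x, &ti, sizeof x);
    std::cout << "Recast float: " << x << std::endl;
    return 0;
}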

Solution 6 - C++

FWIW, there is a real use case for bit-wise operations on floating point (I just ran into it recently) - shaders written for OpenGL implementations that only support older versions of GLSL (1.2 and earlier did not have support for bit-wise operators), and where there would be loss of precision if the floats were converted to ints.

The bit-wise operations can be implemented on floating point numbers using remainders (modulo) and inequality checks. For example:

float A = 0.625; // value to check; i.e., 160/256
float mask = 0.25; // bit to check; i.e., 1/4
bool result = (mod(A, 2.0 * mask) >= mask); // true if bit 0.25 is on in A (false for these values)

The above assumes that A is in the range [0, 1) and that mask has only one "bit" set, but it could be generalized for more complex cases.

This idea is based on some of the information in the question "Is it possible to implement bitwise operators using integer arithmetic?"

If there is not even a built-in mod function, then that can also be implemented fairly easily. For example:

float mod(float num, float den)
{
    return num - den * floor(num / den);
}
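The same idea can be tried out in C++ before porting it to a shader. The sketch below is an illustration under the assumptions stated above (values in [0, 1), masks that are single powers of two); the names float_mod, bit_is_set and float_and are made up for this example, and the AND only examines a fixed number of binary digits.

```cpp
#include <cmath>

// Mirrors the floor-based mod() shown above.
float float_mod(float num, float den)
{
    return num - den * std::floor(num / den);
}

// True if the binary digit worth `mask` (a single power of two such as
// 0.5f or 0.25f) is set in a, for a in [0, 1).
bool bit_is_set(float a, float mask)
{
    return float_mod(a, 2.0f * mask) >= mask;
}

// Bitwise AND of two values in [0, 1), looking only at the first `bits`
// binary digits after the point.
float float_and(float a, float b, int bits)
{
    float result = 0.0f;
    float mask = 0.5f;
    for (int i = 0; i < bits; ++i) {
        if (bit_is_set(a, mask) && bit_is_set(b, mask)) {
            result += mask;
        }
        mask *= 0.5f;
    }
    return result;
}
```

For example, 0.625 is 0.101 in binary and 0.75 is 0.110, so float_and(0.625f, 0.75f, 8) gives 0.5f (0.100 in binary).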

Solution 7 - C++

@mobrule:

Better:

#include <stdint.h>
...
union fp_bit_twiddler {
    float f;
    uint32_t u;
} q;

/* mutatis mutandis ... */

For these values int will likely be ok, but generally, you should use unsigned ints for bit shifting to avoid the effects of arithmetic shifts. And the uint32_t will work even on systems whose ints are not 32 bits.
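To make the difference concrete, here is a small sketch contrasting the two shifts. The function names are invented for this example. Right-shifting a negative signed value is implementation-defined before C++20 and an arithmetic shift (the sign bit is copied in) from C++20 on, whereas an unsigned shift always fills with zeros.

```cpp
#include <cstdint>

// Logical right shift: zeros are shifted in from the left because the
// operand is unsigned. n must be less than 32.
std::uint32_t shift_right_logical(std::uint32_t bits, unsigned n)
{
    return bits >> n;
}

// Right shift of a signed value. For negative inputs the result is
// implementation-defined before C++20; typical compilers copy the sign
// bit in (arithmetic shift). n must be less than 32.
std::int32_t shift_right_signed(std::int32_t value, unsigned n)
{
    return value >> n;
}
```

With the sign bit set, shift_right_logical(0x80000000u, 4) gives 0x08000000, while the signed version applied to the same bit pattern would typically keep the top bits set.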

Solution 8 - C++

The Python recipe "Floating point bitwise operations" implements bitwise operations on floating-point numbers by representing each number in binary that extends infinitely to the left as well as to the right of the fractional point. Because floating-point numbers have a signed zero on most architectures, it uses ones' complement for representing negative numbers (well, actually it just pretends to do so and uses a few tricks to achieve the appearance).

I'm sure it can be adapted to work in C++, but care must be taken so as to not let the right shifts overflow when equalizing the exponents.

Solution 9 - C++

Bitwise operators should NOT be used on floats, as float representations are hardware specific, no matter how similar they look on whatever hardware you might have. Which project/job do you want to risk on "well it worked on my machine"? Instead, for C++, you can get a similar "feel" for the bit shift operators by overloading the shift operators (the same operators used for streams) on an "object" wrapper for a float:

// Simple object wrapper for float type as templates want classes.
class Float
{
    float m_f;
public:
    Float( const float & f )
    : m_f( f )
    {
    }

    operator float() const
    {
        return m_f;
    }
};

float operator>>( const Float & left, int right )
{
    float temp = left;
    for( ; right > 0; --right )
    {
        temp /= 2.0f;
    }
    return temp;
}

float operator<<( const Float & left, int right )
{
    float temp = left;
    for( ; right > 0; --right )
    {
        temp *= 2.0f;
    }
    return temp;
}

int main( int argc, char ** argv )
{
    int a1 = 40 >> 2;   // 10
    int a2 = 40 << 2;   // 160
    int a3 = 13 >> 2;   // 3
    int a4 = 256 >> 2;  // 64
    int a5 = 255 >> 2;  // 63

    float f1 = Float( 40.0f ) >> 2;   // 10.0f
    float f2 = Float( 40.0f ) << 2;   // 160.0f
    float f3 = Float( 13.0f ) >> 2;   // 3.25f
    float f4 = Float( 256.0f ) >> 2;  // 64.0f
    float f5 = Float( 255.0f ) >> 2;  // 63.75f
}

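Unlike the integer shifts, the float versions keep the fractional part (for example, 13.0f shifted right by 2 gives 3.25f rather than 3). You can throw that remainder away if your desired implementation calls for it.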
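As an alternative to the loops, the standard library function std::ldexp scales a floating-point value by a power of two in one call. The sketch below is a swapped-in technique, not part of the answer above, and the names shift_left and shift_right are hypothetical.

```cpp
#include <cmath>

// std::ldexp(x, n) computes x * 2^n, so a "left shift" by n is a
// multiplication by 2^n and a "right shift" is a division by 2^n.
float shift_left(float x, int n)
{
    return std::ldexp(x, n);
}

float shift_right(float x, int n)
{
    return std::ldexp(x, -n);
}
```

Here shift_right(13.0f, 2) is 3.25f; wrapping the result in std::floor drops the remainder and matches the integer result of 13 >> 2.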

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Rohit Banga | View Question on Stack Overflow
Solution 1 - C++ | AnT | View Answer on Stack Overflow
Solution 2 - C++ | mob | View Answer on Stack Overflow
Solution 3 - C++ | Chap | View Answer on Stack Overflow
Solution 4 - C++ | Patrick Roberts | View Answer on Stack Overflow
Solution 5 - C++ | Justin | View Answer on Stack Overflow
Solution 6 - C++ | djulien | View Answer on Stack Overflow
Solution 7 - C++ | Tim Schaeffer | View Answer on Stack Overflow
Solution 8 - C++ | Pyry Pakkanen | View Answer on Stack Overflow
Solution 9 - C++ | Kit10 | View Answer on Stack Overflow