Efficient way to OR adjacent bits in 64-bit integer

C++ | Bit Manipulation

C++ Problem Overview


What I want to do is take a 64-bit unsigned integer consisting of pairs of bits and create from it a 32-bit integer containing 0 where both bits in the corresponding pair are 0, and 1 otherwise. In other words, convert something that looks like:

01 00 10 11

into something that looks like this

1 0 1 1

The two obvious solutions are a brute-force loop, or a lookup table for each byte followed by eight lookups combined into a final result with OR and bit shifting, but I'm sure there should be an efficient means of bit-twiddling this. I will be doing this for 64-bit integers in C++, but if anyone knows of efficient ways to do this for shorter integers I'm sure I can figure out how to scale it up.
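
For reference, the brute-force loop mentioned above might look something like this (a minimal sketch added for comparison; the function name is arbitrary):

#include <cstdint>

uint32_t or_pairs_naive(uint64_t n)
{
    uint32_t out = 0;
    for (int i = 0; i < 32; ++i)    // one output bit per input pair
        if ((n >> (2 * i)) & 3)     // is either bit of pair i set?
            out |= 1u << i;
    return out;
}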

C++ Solutions


Solution 1 - C++

Here is a portable C++ implementation. It seems to work during my brief testing. The deinterleave code is based on this SO question.

uint64_t calc(uint64_t n)
{
	// (odd | even)
	uint64_t x = (n & 0x5555555555555555ull) | ((n & 0xAAAAAAAAAAAAAAAAull) >> 1);

	// deinterleave
	x = (x | (x >> 1)) & 0x3333333333333333ull;
	x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;
	x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;
	x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;
	x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;

	return x;
}

gcc, clang, and msvc all compile this down to about 30 instructions.

From the comments, there is a modification that can be made.

  • Change the first line to use a single bitmask operation to select only the "odd" bits.

The possibly (?) improved code is:

uint64_t calc(uint64_t n)
{
	// (odd | even)
	uint64_t x = (n | (n >> 1)) & 0x5555555555555555ull; // single bits

	// deinterleave (same as before)
	x = (x | (x >> 1)) & 0x3333333333333333ull; // bit pairs
	x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full; // nibbles
	x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull; // octets
	x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull; // halfwords
	x = (x | (x >> 16)) & 0x00000000FFFFFFFFull; // words

	return x;
}
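
As a quick sanity check (my addition), the example pattern from the question maps as expected with either version of calc:

#include <cassert>
#include <cstdint>

int main()
{
    // 01 00 10 11  ->  1 0 1 1
    assert(calc(0b01001011u) == 0b1011u);
    assert(calc(0u) == 0u);
}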

Solution 2 - C++

Probably the fastest solution for x86 architectures with the BMI2 instruction set:

#include <stdint.h>
#include <x86intrin.h>

uint32_t calc (uint64_t a)
{
   return _pext_u64(a, 0x5555555555555555ull) |
          _pext_u64(a, 0xaaaaaaaaaaaaaaaaull);
}

This compiles to 5 instructions total.
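
Note that this needs BMI2 enabled at compile time (e.g. -mbmi2 or -march=haswell with GCC/Clang). If the build also has to cover targets without BMI2, one possible arrangement (my sketch, dispatching on the compiler-defined __BMI2__ macro and falling back to the portable version from Solution 1) is:

#include <stdint.h>
#if defined(__BMI2__)
#include <x86intrin.h>
#endif

uint32_t or_adjacent_bits(uint64_t a)
{
#if defined(__BMI2__)
    // fast path: two parallel bit extracts, one for each half of every pair
    return _pext_u64(a, 0x5555555555555555ull) |
           _pext_u64(a, 0xaaaaaaaaaaaaaaaaull);
#else
    // portable fallback: OR the pairs, then compress the even bit positions
    uint64_t x = (a | (a >> 1)) & 0x5555555555555555ull;
    x = (x | (x >> 1))  & 0x3333333333333333ull;
    x = (x | (x >> 2))  & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x >> 4))  & 0x00FF00FF00FF00FFull;
    x = (x | (x >> 8))  & 0x0000FFFF0000FFFFull;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;
    return (uint32_t)x;
#endif
}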

Solution 3 - C++

If you do not have pext and you still want to do better than the trivial way, then this extraction can be expressed as a logarithmic number of bit moves (logarithmic in the word length, if you generalize it):

// OR adjacent bits, destroys the odd bits but it doesn't matter
x = (x | (x >> 1)) & rep8(0x55);
// gather the even bits with delta swaps
x = bitmove(x, rep8(0x44), 1);   // make pairs
x = bitmove(x, rep8(0x30), 2);   // make nibbles
x = bitmove(x, rep4(0x0F00), 4); // make bytes
x = bitmove(x, rep2(0x00FF0000), 8); // make words
res = (uint32_t)(x | (x >> 16)); // final step is simpler

With:

uint64_t bitmove(uint64_t x, uint64_t mask, unsigned step) {
    // move the masked bits right by step, clearing their old positions so the
    // copies left behind by earlier moves cannot collide with later ones
    return (x & ~mask) | ((x & mask) >> step);
}

repk is just so I could write shorter constants. rep8(0x44) = 0x4444444444444444 etc.
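
Assembled into something compilable (my sketch: the rep helpers are written as repeating multiplications, the function name is mine, and bitmove is the clearing version defined above):

#include <cstdint>

constexpr uint64_t rep8(uint64_t b) { return b * 0x0101010101010101ull; } // repeat a byte 8 times
constexpr uint64_t rep4(uint64_t w) { return w * 0x0001000100010001ull; } // repeat a 16-bit pattern 4 times
constexpr uint64_t rep2(uint64_t d) { return d * 0x0000000100000001ull; } // repeat a 32-bit pattern twice

constexpr uint64_t bitmove(uint64_t x, uint64_t mask, unsigned step)
{
    return (x & ~mask) | ((x & mask) >> step);
}

uint32_t or_pairs_delta(uint64_t x)
{
    x = (x | (x >> 1)) & rep8(0x55);     // OR adjacent bits; results land in the even positions
    x = bitmove(x, rep8(0x44), 1);       // make pairs
    x = bitmove(x, rep8(0x30), 2);       // make nibbles
    x = bitmove(x, rep4(0x0F00), 4);     // make bytes
    x = bitmove(x, rep2(0x00FF0000), 8); // make 16-bit words
    return (uint32_t)(x | (x >> 16));    // final step is simpler
}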

Also if you do have pext, you can do it with only one of them, which is probably faster and at least shorter:

_pext_u64(x | (x >> 1), rep8(0x55));

Solution 4 - C++

Okay, let's make this more hacky then (might be buggy):

uint64_t x;

uint64_t even_bits = x & 0xAAAAAAAAAAAAAAAAull;
uint64_t odd_bits  = x & 0x5555555555555555ull;

Now, my original solution did this:

// wrong
even_bits >>= 1;
uint64_t solution = even_bits | odd_bits;

However, as JackAidley pointed out, while this aligns the bits together, it doesn't remove the spaces from the middle!

Thankfully, we can use a very helpful _pext instruction from the BMI2 instruction set.

> u64 _pext_u64(u64 a, u64 m) - Extract bits from a at the corresponding bit locations specified by mask m to contiguous low bits in dst; the remaining upper bits in dst are set to zero.

solution = _pext_u64(solution, 0x5555555555555555ull); // gather the bits sitting at the even positions

Alternatively, instead of using & and >> to separate out the bits, you might just use _pext twice on the original number with the provided masks (which would split it up into two contiguous 32-bit numbers), and then simply or the results.

If you don't have access to BMI2, though, I'm pretty sure the gap removal would still involve a loop; a bit simpler one than your original idea, perhaps.
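
For illustration (my sketch, not from the answer), such a gap-removal loop over solution, which at this point holds one OR result in every even bit position, could look like:

#include <stdint.h>

// scalar stand-in for _pext_u64(solution, 0x5555555555555555ull):
// gather the bits at positions 0, 2, 4, ... into contiguous low bits
uint32_t compress_even_bits(uint64_t solution)
{
    uint32_t out = 0;
    for (int i = 0; i < 32; ++i)
        out |= (uint32_t)((solution >> (2 * i)) & 1) << i;
    return out;
}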

Solution 5 - C++

Slight improvement over the LUT approach (4 lookups instead of 8):

Compute the bitwise OR and clear every other bit. Then intertwine the bits of each pair of bytes to yield four bytes. Finally, reorder the bits in those four bytes (mapped onto the quadword) by means of a 256-entry lookup table:

Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAAAAAull; // OR in pairs
Q|= Q >> 9; // Intertwine 4 words into 4 bytes
B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes
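
A compilable sketch of this scheme (my reading of the pseudocode; names are mine, the full 64-bit mask is assumed, and the 256-entry table is generated from the bit permutation rather than written out by hand):

#include <array>
#include <cstdint>

// After "Q |= Q >> 9", bytes 0, 2, 4 and 6 each hold one 16-bit word's results:
// results 0..3 of that word sit in the odd bit slots, results 4..7 in the even
// slots. The table puts them back into plain order.
static const std::array<uint8_t, 256> LUT = [] {
    std::array<uint8_t, 256> t{};
    for (int b = 0; b < 256; ++b) {
        uint8_t out = 0;
        for (int i = 0; i < 4; ++i) {
            out |= uint8_t(((b >> (2 * i + 1)) & 1) << i);   // results 0..3 from the odd slots
            out |= uint8_t(((b >> (2 * i)) & 1) << (i + 4)); // results 4..7 from the even slots
        }
        t[b] = out;
    }
    return t;
}();

uint32_t or_pairs_lut4(uint64_t Q)
{
    Q = (Q | (Q << 1)) & 0xAAAAAAAAAAAAAAAAull;  // OR each pair into its high bit
    Q |= Q >> 9;                                 // fold each 16-bit word's results into its low byte
    return  (uint32_t)LUT[(Q >>  0) & 0xFF]          // B0 = LUT[B0]
         | ((uint32_t)LUT[(Q >> 16) & 0xFF] <<  8)   // B1 = LUT[B2]
         | ((uint32_t)LUT[(Q >> 32) & 0xFF] << 16)   // B2 = LUT[B4]
         | ((uint32_t)LUT[(Q >> 48) & 0xFF] << 24);  // B3 = LUT[B6]
}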

Solution 6 - C++

The hard part seems to be packing the bits after the ORing. The ORing itself is done by:

ored  = (x | (x>>1)) & 0x5555555555555555;

(assuming integers large enough that we don't have to use suffixes). Then we can pack them in steps, first packing them two-by-two, then four-by-four, and so on:

pack2  = ((ored*3) >> 1) & 0x3333333333333333;
pack4  = ((pack2*5) >> 2) & 0x0F0F0F0F0F0F0F0F;
pack8  = ((pack4*17) >> 4) & 0x00FF00FF00FF00FF;
pack16 = ((pack8*257) >> 8) & 0x0000FFFF0000FFFF;
pack32 = ((pack16*65537) >> 16) & 0xFFFFFFFF;
// (or cast to uint32_t instead of the final & 0xFFF...)

What happens in the packing is that the multiplication combines the data with a shifted copy of itself. In your example, the first multiplication looks like this (I denote the zeros produced by the masking in ored as o, and the other 0s, which come from the original data, as 0):

 o1o0o1o1
     x 11
----------
 o1o0o1o1
o1o0o1o1
----------
o11001111
  ^^  ^^ 
oo10oo11o < these are the bits we want to keep.

We could have done that by oring as well:

ored = (ored | (ored>>1)) & 0x3333333333333333;
ored = (ored | (ored>>2)) & 0x0F0F0F0F0F0F0F0F;
ored = (ored | (ored>>4)) & 0x00FF00FF00FF00FF;
ored = (ored | (ored>>8)) & 0x0000FFFF0000FFFF;
ored = (ored | (ored>>16)) & 0xFFFFFFFF;

// ored = ((uint32_t)ored | (uint32_t)(ored>>16));  // helps some compilers make better code, esp. on x86
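
Putting the multiply-based chain into a compilable function (my assembly of the snippets above; each multiplication by 2^k + 1 adds a copy of the value shifted left by k, which is the same as ORing here because the set bits never overlap):

#include <cstdint>

uint32_t or_pairs_mul(uint64_t x)
{
    uint64_t ored   = (x | (x >> 1)) & 0x5555555555555555ull;
    uint64_t pack2  = ((ored   *     3) >> 1) & 0x3333333333333333ull;
    uint64_t pack4  = ((pack2  *     5) >> 2) & 0x0F0F0F0F0F0F0F0Full;
    uint64_t pack8  = ((pack4  *    17) >> 4) & 0x00FF00FF00FF00FFull;
    uint64_t pack16 = ((pack8  *   257) >> 8) & 0x0000FFFF0000FFFFull;
    return (uint32_t)((pack16 * 65537) >> 16);
}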

Solution 7 - C++

I made some vectorized versions (godbolt link still with some big design-notes comments) and did some benchmarks when this question was new. I was going to spend more time on it, but never got back to it. Posting what I have so I can close this browser tab. >.< Improvements welcome.

I don't have a Haswell I could test on, so I couldn't benchmark the pextr version against this. I'm sure it's faster, though, since it's only 4 fast instructions.

 *** Sandybridge (i5-2500k, so no hyperthreading)
 *** 64bit, gcc 5.2 with -O3 -fno-tree-vectorize results:
 TODO: update benchmarks for latest code changes

   total cycles, and insn/clock, for the test-loop
   This measures only throughput, not latency,
   and a bottleneck on one execution port might make a function look worse in a microbench
   than it will do when mixed with other code that can keep the other ports busy.

Lower numbers in the first column are better: 
these are total cycle counts in Megacycles, and correspond to execution time 
but they take frequency scaling / turbo out of the mix.
(We're not cache / memory bound at all, so low core clock = fewer cycles for cache miss doesn't matter).

     AVX                  no AVX
887.519Mc  2.70Ipc      887.758Mc  2.70Ipc    use_orbits_shift_right
1140.68Mc  2.45Ipc      1140.47Mc  2.46Ipc    use_orbits_mul  (old version that right-shifted after each)
718.038Mc  2.79Ipc      716.452Mc  2.79Ipc    use_orbits_x86_lea
767.836Mc  2.74Ipc      1027.96Mc  2.53Ipc    use_orbits_sse2_shift
619.466Mc  2.90Ipc      816.698Mc  2.69Ipc    use_orbits_ssse3_shift
845.988Mc  2.72Ipc      845.537Mc  2.72Ipc    use_orbits_ssse3_shift_scalar_mmx (gimped by stupid compiler)
583.239Mc  2.92Ipc      686.792Mc  2.91Ipc    use_orbits_ssse3_interleave_scalar
547.386Mc  2.92Ipc      730.259Mc  2.88Ipc    use_orbits_ssse3_interleave

The fastest (for throughput in a loop) with    AVX is orbits_ssse3_interleave
The fastest (for throughput in a loop) without AVX is orbits_ssse3_interleave_scalar
but orbits_x86_lea comes very close.

AVX for non-destructive 3-operand vector insns helps a lot
Maybe a bit less important on IvB and later, where mov-elimination handles mov uops at register-rename time

// Tables generated with the following commands:
// for i in avx.perf{{2..4},{6..10}};do awk '/cycles   / {c=$1; gsub(",", "", c); }  /insns per cy/ {print c / 1000000 "Mc  " $4"Ipc"}' *"$i"*;done | column -c 50 -x
//  Include 0 and 1 for hosts with pextr
// 5 is omitted because it's not written

The almost certainly best version (with BMI2) is:

#include <stdint.h>
#include <immintrin.h>  // for _pext_u64 (needs BMI2, e.g. -mbmi2)
#define LOBITS64 0x5555555555555555ull
#define HIBITS64 0xaaaaaaaaaaaaaaaaull

uint32_t orbits_1pext (uint64_t a) {
    // a|a<<1 compiles more efficiently on x86 than a|a>>1, because of LEA for non-destructive left-shift
    return _pext_u64( a | a<<1, HIBITS64);
}

This compiles to:

    lea     rax, [rdi+rdi]
    or      rdi, rax
    movabs  rax, -6148914691236517206
    pext    rax, rdi, rax
    ret

So it's only 4 uops, and the critical path latency is 5c = 3(pext) + 1(or) + 1(lea). (Intel Haswell). The throughput should be one result per cycle (with no loop overhead or loading/storing). The mov imm for the constant can be hoisted out of a loop, though, since it isn't destroyed. This means throughput-wise that we only need 3 fused-domain uops per result.

The mov r, imm64 isn't ideal. (A 1uop broadcast-immediate 32 or 8bit to a 64bit reg would be ideal, but there's no such instruction). Having the constant in data memory is an option, but inline in the instruction stream is nice. A 64b constant takes a lot of uop-cache space, which makes the version that does pext with two different masks even worse. Generating one mask from the other with a not could help with that, though: movabs / pext / not / pext / or, but that's still 5 insns compared to the 4 enabled by the lea trick.
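
To illustrate that last alternative (my sketch; the function name is made up, and whether the compiler really forms the second mask with a not rather than another movabs is up to it):

uint32_t orbits_2pext_not(uint64_t a) {
    uint64_t lo = LOBITS64;     // one 64-bit immediate; the other mask is just its complement
    return _pext_u64(a, lo) | _pext_u64(a, ~lo);
}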


The best version (with AVX) is:

#include <immintrin.h>

/* Yves Daoust's idea, operating on nibbles instead of bytes:
   original:
   Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs
   Q|= Q >> 9; // Intertwine 4 words into 4 bytes
   B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes

   To operate on nibbles,
   Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs, same as before
   Q|= Q>>5  // Intertwine 8 nibbles into 8 bytes
   // pshufb as a LUT to re-order the bits within each nibble (to undo the interleave)
   // right-shift and OR to combine nibbles
   // pshufb as a byte-shuffle to put the 4 bytes we want into the low 4
*/
uint32_t orbits_ssse3_interleave(uint64_t scalar_a)
{
    // do some of this in GP regs if not doing two 64b elements in parallel.
    // esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports
    // but VEX-encoded SSE saves mov instructions

    __m128i a = _mm_cvtsi64_si128(scalar_a);
    // element size doesn't matter, any bits shifted out of element boundaries would have been masked off anyway.
    __m128i lshift = _mm_slli_epi64(a, 1);
    lshift = _mm_or_si128(lshift, a);
    lshift = _mm_and_si128(lshift, _mm_set1_epi32(0xaaaaaaaaUL));
    // a = bits:   h  g  f  e  d  c  b  a  (same thing in other bytes)
    // lshift =    hg 0 fe  0  dc 0  ba 0
    // lshift =    s  0  r  0  q  0  p  0

    // lshift =    s 0 r 0 q 0 p 0
    __m128i rshift = _mm_srli_epi64(lshift, 5);  // again, element size doesn't matter, we're keeping only the low nibbles
    // rshift =              s 0 r 0 q 0 p 0  (the last zero ORs with the top bit of the low nibble in the next byte over)
    __m128i nibbles = _mm_or_si128(rshift, lshift);
    nibbles = _mm_and_si128(nibbles, _mm_set1_epi8(0x0f) );  // have to zero the high nibbles: the sign bit affects pshufb

    // nibbles =   0 0 0 0 q s p r
    // pshufb ->   0 0 0 0 s r q p
    const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8( // setr: first arg goes in the low byte, indexed by 0b0000
	0b0000,
	0b0100,
	0b0001,
	0b0101,
	0b1000,
	0b1100,
	0b1001,
	0b1101,
	0b0010,
	0b0110,
	0b0011,
	0b0111,
	0b1010,
	0b1110,
	0b1011,
	0b1111 );
    __m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);

    // want            00 00 00 00 AB CD EF GH from:

    // ord_nibbles   = 0A0B0C0D0E0F0G0H
    //                  0A0B0C0D0E0F0G0 H(shifted out)
    __m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
    // merged_nibbles= 0A AB BC CD DE EF FG GH.  We want every other byte of this.
    //                 7  6  5  4  3  2  1  0
    // pshufb is the most efficient way.  Mask and then packuswb would work, but uses the shuffle port just like pshufb
    __m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(-1,-1,-1,-1, 14,12,10,8,
								      -1,-1,-1,-1,  6, 4, 2,0) );
    return _mm_cvtsi128_si32(ord_bytes); // movd the low32 of the vector
    // _mm_extract_epi32(ord_bytes, 2); // If operating on two inputs in parallel: SSE4.1 PEXTRD the result from the upper half of the reg.
}

The best version without AVX is a slight modification that only works with one input at a time, only using SIMD for the shuffling. In theory using MMX instead of SSE would make more sense, esp. if we care about first-gen Core2 where 64b pshufb is fast, but 128b pshufb is not single cycle. Anyway, compilers did a bad job with MMX intrinsics. Also, EMMS is slow.

// same as orbits_ssse3_interleave, but doing some of the math in integer regs. (non-vectorized)
// esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports

// VEX-encoded SSE saves mov instructions, so full vector is preferable if building with VEX-encoding

// Use MMX for Silvermont/Atom/Merom(Core2): pshufb is slow for xmm, but fast for MMX.  Only 64b shuffle unit?
uint32_t orbits_ssse3_interleave_scalar(uint64_t scalar_a)
{
    uint64_t lshift = (scalar_a | scalar_a << 1);
    lshift &= HIBITS64;

    uint64_t rshift = lshift >> 5;
    // rshift =              s 0 r 0 q 0 p 0  (the last zero ORs with the top bit of the low nibble in the next byte over)
    uint64_t nibbles_scalar = (rshift | lshift) & 0x0f0f0f0f0f0f0f0fULL;
    // have to zero the high nibbles: the sign bit affects pshufb
    __m128i nibbles = _mm_cvtsi64_si128(nibbles_scalar);
   
    // nibbles =   0 0 0 0 q s p r
    // pshufb ->   0 0 0 0 s r q p

    const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8( // setr: first arg goes in the low byte, indexed by 0b0000
	0b0000,
	0b0100,
	0b0001,
	0b0101,
	0b1000,
	0b1100,
	0b1001,
	0b1101,
	0b0010,
	0b0110,
	0b0011,
	0b0111,
	0b1010,
	0b1110,
	0b1011,
	0b1111 );
    __m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);

    // want            00 00 00 00 AB CD EF GH from:

    // ord_nibbles   = 0A0B0C0D0E0F0G0H
    //                  0A0B0C0D0E0F0G0 H(shifted out)
    __m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
    // merged_nibbles= 0A AB BC CD DE EF FG GH.  We want every other byte of this.
    //                 7  6  5  4  3  2  1  0
    // pshufb is the most efficient way.  Mask and then packuswb would work, but uses the shuffle port just like pshufb
    __m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(0,0,0,0, 0,0,0,0, 0,0,0,0, 6,4,2,0));
    return _mm_cvtsi128_si32(ord_bytes); // movd the low32 of the vector
}

Sorry for the mostly code-dump answer. At this point I didn't feel it was worth spending a huge amount of time discussing things more than the comments already do. See http://agner.org/optimize/ for guides to optimizing for specific microarchitectures, and the x86 tag wiki on Stack Overflow for other resources.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type      Original Author      Original Content on Stackoverflow
Question          Jack Aidley          View Question on Stackoverflow
Solution 1 - C++  Blastfurnace         View Answer on Stackoverflow
Solution 2 - C++  Nils Pipenbrinck     View Answer on Stackoverflow
Solution 3 - C++  harold               View Answer on Stackoverflow
Solution 4 - C++  Bartek Banachewicz   View Answer on Stackoverflow
Solution 5 - C++  Yves Daoust          View Answer on Stackoverflow
Solution 6 - C++  skyking              View Answer on Stackoverflow
Solution 7 - C++  Peter Cordes         View Answer on Stackoverflow