Efficient way to OR adjacent bits in 64-bit integer
Problem Overview
What I want to do is take a 64-bit unsigned integer consisting of 32 pairs of bits and create from it a 32-bit integer containing 0 if both bits in the corresponding pair are 0, and 1 otherwise. In other words, convert something that looks like:
01 00 10 11
into something that looks like this:
1 0 1 1
The two obvious solutions are the brute-force loop, or a lookup table for each byte followed by eight lookups combined into a final result with OR and bit shifting. But I'm sure there should be an efficient means of bit-twiddling this. I will be doing this for 64-bit integers in C++, but if anyone knows of efficient ways to do this for shorter integers I'm sure I can figure out how to scale it up.
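For reference, the brute-force loop mentioned above might look like this (a hypothetical baseline — the function name is mine — which is handy for checking the bit-twiddling versions against):

```cpp
#include <cstdint>
#include <cassert>

// Baseline: examine each of the 32 bit pairs in turn.
// Pair i (bits 2i and 2i+1 of n) produces bit i of the result.
uint32_t pairwise_or_naive(uint64_t n)
{
    uint32_t result = 0;
    for (int i = 0; i < 32; ++i) {
        if ((n >> (2 * i)) & 3)              // is either bit of pair i set?
            result |= uint32_t(1) << i;
    }
    return result;
}
```

The question's example pairs 01 00 10 11 pack into 0x4B, and should produce 0b1011 = 0xB.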
C++ Solutions
Solution 1 - C++
Here is a portable C++ implementation. It seems to work during my brief testing. The deinterleave code is based on this SO question.
uint64_t calc(uint64_t n)
{
// (odd | even)
uint64_t x = (n & 0x5555555555555555ull) | ((n & 0xAAAAAAAAAAAAAAAAull) >> 1);
// deinterleave
x = (x | (x >> 1)) & 0x3333333333333333ull;
x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;
x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;
x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;
x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;
return x;
}
gcc, clang, and msvc all compile this down to about 30 instructions.
From the comments, there is a modification that can be made.
- Change the first line to use a single bitmask operation to select only the "odd" bits.
The possibly (?) improved code is:
uint64_t calc(uint64_t n)
{
// (odd | even)
uint64_t x = (n | (n >> 1)) & 0x5555555555555555ull; // single bits
// deinterleave, same as before
x = (x | (x >> 1)) & 0x3333333333333333ull; // bit pairs
x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full; // nibbles
x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull; // octets
x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull; // halfwords
x = (x | (x >> 16)) & 0x00000000FFFFFFFFull; // words
return x;
}
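Both variants should agree. A self-contained sanity check (the functions are reproduced here under names of my choosing; the question's example pairs 01 00 10 11 pack into 0x4B and should give 0b1011 = 0xB):

```cpp
#include <cstdint>
#include <cassert>

// Original first line: mask odds and evens separately, then OR.
uint64_t calc_v1(uint64_t n)
{
    uint64_t x = (n & 0x5555555555555555ull) | ((n & 0xAAAAAAAAAAAAAAAAull) >> 1);
    x = (x | (x >> 1)) & 0x3333333333333333ull;
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;
    return x;
}

// Modified first line: OR first, then a single mask.
uint64_t calc_v2(uint64_t n)
{
    uint64_t x = (n | (n >> 1)) & 0x5555555555555555ull;
    x = (x | (x >> 1)) & 0x3333333333333333ull;
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;
    return x;
}
```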
Solution 2 - C++
Probably the fastest solution for the x86 architecture with the BMI2 instruction set:
#include <stdint.h>
#include <x86intrin.h>
uint32_t calc (uint64_t a)
{
return _pext_u64(a, 0x5555555555555555ull) |
_pext_u64(a, 0xaaaaaaaaaaaaaaaaull);
}
This compiles to 5 instructions total.
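If you want to verify the semantics without BMI2 hardware, _pext_u64 can be modeled in portable C++ (a slow reference model with names of my choosing, not a replacement for the instruction):

```cpp
#include <cstdint>
#include <cassert>

// Reference model of PEXT: gather the bits of a selected by mask
// into the low bits of the result, preserving their order.
uint64_t pext_model(uint64_t a, uint64_t mask)
{
    uint64_t result = 0;
    int out = 0;
    for (int i = 0; i < 64; ++i) {
        if ((mask >> i) & 1) {
            result |= ((a >> i) & 1) << out;
            ++out;
        }
    }
    return result;
}

// The two-PEXT solution from this answer, expressed with the model.
uint32_t calc_pext_model(uint64_t a)
{
    return (uint32_t)(pext_model(a, 0x5555555555555555ull) |
                      pext_model(a, 0xaaaaaaaaaaaaaaaaull));
}
```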
Solution 3 - C++
If you do not have pext and you still want to do this better than the trivial way, then this extraction can be expressed as a logarithmic number of bit moves (if you generalize it in terms of length):
// OR adjacent bits, destroys the odd bits but it doesn't matter
x = (x | (x >> 1)) & rep8(0x55);
// gather the even bits with delta swaps
x = bitmove(x, rep8(0x44), 1); // make pairs
x = bitmove(x, rep8(0x30), 2); // make nibbles
x = bitmove(x, rep4(0x0F00), 4); // make bytes
x = bitmove(x, rep2(0x00FF0000), 8); // make words
res = (uint32_t)(x | (x >> 16)); // final step is simpler
With:
uint64_t bitmove(uint64_t x, uint64_t mask, int step)
{
    // The moved bits must also be cleared from their source position;
    // a plain x | ((x & mask) >> step) would leave stale copies behind
    // that contaminate the later, larger moves.
    return (x & ~mask) | ((x & mask) >> step);
}
repk is just so I could write shorter constants: rep8(0x44) = 0x4444444444444444, etc.
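Assembled into a self-contained sketch (rep constants spelled out, helper names mine; note that bitmove here clears the moved bits from their source position, since stale copies would contaminate the later, larger moves):

```cpp
#include <cstdint>
#include <cassert>

const uint64_t REP8_55        = 0x5555555555555555ull; // rep8(0x55)
const uint64_t REP8_44        = 0x4444444444444444ull; // rep8(0x44)
const uint64_t REP8_30        = 0x3030303030303030ull; // rep8(0x30)
const uint64_t REP4_0F00      = 0x0F000F000F000F00ull; // rep4(0x0F00)
const uint64_t REP2_00FF0000  = 0x00FF000000FF0000ull; // rep2(0x00FF0000)

// Move the masked bits right by `step`, clearing them from their old position.
uint64_t bitmove(uint64_t x, uint64_t mask, int step)
{
    return (x & ~mask) | ((x & mask) >> step);
}

uint32_t pairwise_or_bitmove(uint64_t x)
{
    x = (x | (x >> 1)) & REP8_55;        // OR adjacent bits
    x = bitmove(x, REP8_44, 1);          // make pairs
    x = bitmove(x, REP8_30, 2);          // make nibbles
    x = bitmove(x, REP4_0F00, 4);        // make bytes
    x = bitmove(x, REP2_00FF0000, 8);    // make words
    return (uint32_t)(x | (x >> 16));    // final step is simpler
}
```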
Also, if you do have pext, you can do it with only one of them, which is probably faster and at least shorter:
_pext_u64(x | (x >> 1), rep8(0x55));
Solution 4 - C++
Okay, let's make this more hacky then (might be buggy):
uint64_t x;
uint64_t even_bits = x & 0xAAAAAAAAAAAAAAAAull;
uint64_t odd_bits = x & 0x5555555555555555ull;
Now, my original solution did this:
// wrong
even_bits >>= 1;
uint64_t solution = even_bits | odd_bits;
However, as JackAidley pointed out, while this aligns the bits together, it doesn't remove the spaces from the middle!
Thankfully, we can use the very helpful _pext instruction from the BMI2 instruction set.
> u64 _pext_u64(u64 a, u64 m)
- Extract bits from a at the corresponding bit locations specified by mask m to contiguous low bits in dst; the remaining upper bits in dst are set to zero.
solution = _pext_u64(solution, 0x5555555555555555ull); // constant mask: the even positions, where the OR results now live
Alternatively, instead of using & and >> to separate out the bits, you might just use _pext twice on the original number with the two masks (which would split it up into two contiguous 32-bit numbers), and then simply OR the results.
If you don't have access to BMI2, though, I'm pretty sure the gap removal would still involve a loop; a bit simpler one than your original idea, perhaps.
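That loop-based gap removal might look like this sketch (the name is mine), assuming the surviving bits already sit at the even positions, as solution does above after the shift-and-OR:

```cpp
#include <cstdint>
#include <cassert>

// Software gap removal: the bits of interest sit at positions 0, 2, 4, ...;
// compact them into consecutive low bits, one pair position at a time.
uint32_t remove_gaps(uint64_t spaced)
{
    uint32_t result = 0;
    for (int i = 0; i < 32; ++i)
        result |= (uint32_t)((spaced >> (2 * i)) & 1) << i;
    return result;
}
```

For the question's example, the spaced value 0x45 (bits at positions 0, 2, 6) compacts to 0xB.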
Solution 5 - C++
Slight improvement over the LUT approach (4 lookups instead of 8):
Compute the bitwise OR and clear every other bit. Then intertwine the bits of pairs of bytes to yield four bytes. Finally, reorder the bits in the four bytes (mapped on the quadword) by means of a 256-entry lookup table:
Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAAAAAull; // OR in pairs
Q|= Q >> 9; // Intertwine 4 words into 4 bytes
B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes (Bk = k-th byte of Q)
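A self-contained sketch of the 4-lookup scheme (helper names are mine, and the pair mask is widened to all 64 bits). After the OR and the >>9 intertwine, byte 2k of Q holds result bits r(8k)..r(8k+7) in the interleaved order r3 r7 r2 r6 r1 r5 r0 r4 (bit 7 down to bit 0); the LUT undoes that interleave:

```cpp
#include <cstdint>
#include <cassert>

// Un-interleave one byte: result bits 0..3 sit at the odd input bits,
// result bits 4..7 at the even input bits.
static uint8_t bit_reorder(uint8_t b)
{
    int out = 0;
    for (int i = 0; i < 4; ++i) {
        out |= ((b >> (2 * i + 1)) & 1) << i;
        out |= ((b >> (2 * i)) & 1) << (i + 4);
    }
    return (uint8_t)out;
}

uint32_t pairwise_or_lut(uint64_t Q)
{
    static uint8_t LUT[256];
    static bool ready = false;
    if (!ready) {                                    // build the table on first use
        for (int b = 0; b < 256; ++b) LUT[b] = bit_reorder((uint8_t)b);
        ready = true;
    }
    Q = (Q | (Q << 1)) & 0xAAAAAAAAAAAAAAAAull;      // OR in pairs
    Q |= Q >> 9;                                     // intertwine byte pairs
    return (uint32_t)LUT[Q & 0xFF]                   // B0
         | (uint32_t)LUT[(Q >> 16) & 0xFF] << 8      // B2
         | (uint32_t)LUT[(Q >> 32) & 0xFF] << 16     // B4
         | (uint32_t)LUT[(Q >> 48) & 0xFF] << 24;    // B6
}
```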
Solution 6 - C++
The hard part seems to be packing the bits after ORing. The ORing is done by:
ored = (x | (x>>1)) & 0x5555555555555555;
(assuming large enough ints so we don't have to use suffixes). Then we can pack them in steps, first two-by-two, then four-by-four, etc.:
pack2  = ((ored  * 3)      >> 1) & 0x3333333333333333;
pack4  = ((pack2 * 5)      >> 2) & 0x0F0F0F0F0F0F0F0F;
pack8  = ((pack4 * 17)     >> 4) & 0x00FF00FF00FF00FF;
pack16 = ((pack8 * 257)    >> 8) & 0x0000FFFF0000FFFF;
pack32 = ((pack16 * 65537) >> 16) & 0xFFFFFFFF;
// (or cast to uint32_t instead of the final & 0xFFFFFFFF)
The thing that happens in the packing is that by multiplying we combine the data with a shifted copy of itself. In your example we would have, in the first multiplication (I denote zeros that come from the masking in ored as o, and plain 0 for zeros from the original data):
   o1o0o1o1
 x       11
 ----------
   o1o0o1o1
  o1o0o1o1
 ----------
  o11001111
    ^^  ^^
   o10oo11o   <- these are the bits we want to keep
We could have done that by ORing as well:
ored = (ored | (ored>>1)) & 0x3333333333333333;
ored = (ored | (ored>>2)) & 0x0F0F0F0F0F0F0F0F;
ored = (ored | (ored>>4)) & 0x00FF00FF00FF00FF;
ored = (ored | (ored>>8)) & 0x0000FFFF0000FFFF;
ored = (ored | (ored>>16)) & 0xFFFFFFFF;
// ored = ((uint32_t)ored | (uint32_t)(ored>>16)); // helps some compilers make better code, esp. on x86
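Collected into one function, with each pack step feeding the next (a sketch; the function name is mine):

```cpp
#include <cstdint>
#include <cassert>

uint32_t pairwise_or_mul(uint64_t x)
{
    uint64_t ored   = (x | (x >> 1)) & 0x5555555555555555ull;
    // Each multiply combines the value with a shifted copy of itself;
    // the spacing of the set bits guarantees no carries are generated.
    uint64_t pack2  = ((ored   * 3)     >> 1)  & 0x3333333333333333ull;
    uint64_t pack4  = ((pack2  * 5)     >> 2)  & 0x0F0F0F0F0F0F0F0Full;
    uint64_t pack8  = ((pack4  * 17)    >> 4)  & 0x00FF00FF00FF00FFull;
    uint64_t pack16 = ((pack8  * 257)   >> 8)  & 0x0000FFFF0000FFFFull;
    return (uint32_t)((pack16 * 65537) >> 16);
}
```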
Solution 7 - C++
I made some vectorized versions (godbolt link still with some big design-notes comments) and did some benchmarks when this question was new. I was going to spend more time on it, but never got back to it. Posting what I have so I can close this browser tab. >.< Improvements welcome.
I don't have a Haswell I could test on, so I couldn't benchmark the pextr
version against this. I'm sure it's faster, though, since it's only 4 fast instructions.
*** Sandybridge (i5-2500k, so no hyperthreading)
*** 64bit, gcc 5.2 with -O3 -fno-tree-vectorize results:
TODO: update benchmarks for latest code changes
total cycles, and insn/clock, for the test-loop
This measures only throughput, not latency,
and a bottleneck on one execution port might make a function look worse in a microbench
than it will do when mixed with other code that can keep the other ports busy.
Lower numbers in the first column are better:
these are total cycle counts in Megacycles, and correspond to execution time
but they take frequency scaling / turbo out of the mix.
(We're not cache / memory bound at all, so low core clock = fewer cycles for cache miss doesn't matter).
AVX no AVX
887.519Mc 2.70Ipc 887.758Mc 2.70Ipc use_orbits_shift_right
1140.68Mc 2.45Ipc 1140.47Mc 2.46Ipc use_orbits_mul (old version that right-shifted after each)
718.038Mc 2.79Ipc 716.452Mc 2.79Ipc use_orbits_x86_lea
767.836Mc 2.74Ipc 1027.96Mc 2.53Ipc use_orbits_sse2_shift
619.466Mc 2.90Ipc 816.698Mc 2.69Ipc use_orbits_ssse3_shift
845.988Mc 2.72Ipc 845.537Mc 2.72Ipc use_orbits_ssse3_shift_scalar_mmx (gimped by stupid compiler)
583.239Mc 2.92Ipc 686.792Mc 2.91Ipc use_orbits_ssse3_interleave_scalar
547.386Mc 2.92Ipc 730.259Mc 2.88Ipc use_orbits_ssse3_interleave
The fastest (for throughput in a loop) with AVX is orbits_ssse3_interleave
The fastest (for throughput in a loop) without AVX is orbits_ssse3_interleave_scalar
but obits_x86_lea comes very close.
AVX for non-destructive 3-operand vector insns helps a lot
Maybe a bit less important on IvB and later, where mov-elimination handles mov uops at register-rename time
// Tables generated with the following commands:
// for i in avx.perf{{2..4},{6..10}};do awk '/cycles / {c=$1; gsub(",", "", c); } /insns per cy/ {print c / 1000000 "Mc " $4"Ipc"}' *"$i"*;done | column -c 50 -x
// Include 0 and 1 for hosts with pextr
// 5 is omitted because it's not written
The almost certainly best version (with BMI2) is:
#include <stdint.h>
#include <immintrin.h>  // _pext_u64 (BMI2)
#define LOBITS64 0x5555555555555555ull
#define HIBITS64 0xaaaaaaaaaaaaaaaaull
uint32_t orbits_1pext (uint64_t a) {
// a|a<<1 compiles more efficiently on x86 than a|a>>1, because of LEA for non-destructive left-shift
return _pext_u64( a | a<<1, HIBITS64);
}
This compiles to:
lea rax, [rdi+rdi]
or rdi, rax
movabs rax, -6148914691236517206
pext rax, rdi, rax
ret
So it's only 4 uops, and the critical path latency is 5c = 3(pext) + 1(or) + 1(lea) (Intel Haswell). The throughput should be one result per cycle (with no loop overhead or loading/storing). The mov imm for the constant can be hoisted out of a loop, though, since it isn't destroyed. This means throughput-wise that we only need 3 fused-domain uops per result.
The mov r, imm64 isn't ideal. (A 1-uop broadcast-immediate of 32 or 8 bits to a 64-bit reg would be ideal, but there's no such instruction.) Having the constant in data memory is an option, but inline in the instruction stream is nice. A 64b constant takes a lot of uop-cache space, which makes the version that does pext with two different masks even worse. Generating one mask from the other with a not could help with that, though: movabs / pext / not / pext / or, but that's still 5 insns compared to the 4 enabled by the lea trick.
The best version (with AVX) is:
#include <immintrin.h>
/* Yves Daoust's idea, operating on nibbles instead of bytes:
original:
Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs
Q|= Q >> 9; // Intertwine 4 words into 4 bytes
B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes
To operate on nibbles,
Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs, same as before
Q|= Q>>5 // Intertwine 8 nibbles into 8 bytes
// pshufb as a LUT to re-order the bits within each nibble (to undo the interleave)
// right-shift and OR to combine nibbles
// pshufb as a byte-shuffle to put the 4 bytes we want into the low 4
*/
uint32_t orbits_ssse3_interleave(uint64_t scalar_a)
{
// do some of this in GP regs if not doing two 64b elements in parallel.
// esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports
// but VEX-encoded SSE saves mov instructions
__m128i a = _mm_cvtsi64_si128(scalar_a);
// element size doesn't matter, any bits shifted out of element boundaries would have been masked off anyway.
__m128i lshift = _mm_slli_epi64(a, 1);
lshift = _mm_or_si128(lshift, a);
lshift = _mm_and_si128(lshift, _mm_set1_epi32(0xaaaaaaaaUL));
// a      = bits: h g f e d c b a  (same thing in other bytes)
// lshift = hg 0 fe 0 dc 0 ba 0
//        = s  0 r  0 q  0 p  0
__m128i rshift = _mm_srli_epi64(lshift, 5); // again, element size doesn't matter, we're keeping only the low nibbles
// rshift = s 0 r 0 q 0 p 0 (the last zero ORs with the top bit of the low nibble in the next byte over)
__m128i nibbles = _mm_or_si128(rshift, lshift);
nibbles = _mm_and_si128(nibbles, _mm_set1_epi8(0x0f) ); // have to zero the high nibbles: the sign bit affects pshufb
// nibbles = 0 0 0 0 q s p r
// pshufb -> 0 0 0 0 s r q p
const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8( // setr: first arg goes in the low byte, indexed by 0b0000
0b0000,
0b0100,
0b0001,
0b0101,
0b1000,
0b1100,
0b1001,
0b1101,
0b0010,
0b0110,
0b0011,
0b0111,
0b1010,
0b1110,
0b1011,
0b1111 );
__m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);
// want 00 00 00 00 AB CD EF GH from:
// ord_nibbles = 0A0B0C0D0E0F0G0H
// 0A0B0C0D0E0F0G0 H(shifted out)
__m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
// merged_nibbles= 0A AB BC CD DE EF FG GH. We want every other byte of this.
// 7 6 5 4 3 2 1 0
// pshufb is the most efficient way. Mask and then packuswb would work, but uses the shuffle port just like pshufb
__m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(-1,-1,-1,-1, 14,12,10,8,
-1,-1,-1,-1, 6, 4, 2,0) );
return _mm_cvtsi128_si32(ord_bytes); // movd the low32 of the vector
// _mm_extract_epi32(ord_bytes, 2); // If operating on two inputs in parallel: SSE4.1 PEXTRD the result from the upper half of the reg.
}
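A scalar model of the nibble-interleave idea (no SIMD; the pshufb nibble LUT becomes a plain array lookup, and the names are mine) can confirm the bit layout the comments describe:

```cpp
#include <cstdint>
#include <cassert>

// The pshufb nibble LUT from the code above, as a plain array.
static const uint8_t NIBBLE_LUT[16] = {
    0b0000, 0b0100, 0b0001, 0b0101, 0b1000, 0b1100, 0b1001, 0b1101,
    0b0010, 0b0110, 0b0011, 0b0111, 0b1010, 0b1110, 0b1011, 0b1111
};

uint32_t orbits_nibble_scalar(uint64_t a)
{
    uint64_t lshift  = (a | a << 1) & 0xaaaaaaaaaaaaaaaaull;   // OR of each pair at its high bit
    uint64_t nibbles = (lshift | lshift >> 5) & 0x0f0f0f0f0f0f0f0full;
    uint64_t ord = 0;
    for (int b = 0; b < 8; ++b)        // LUT pass: un-interleave each low nibble
        ord |= (uint64_t)NIBBLE_LUT[(nibbles >> (8 * b)) & 0x0f] << (8 * b);
    uint64_t merged = ord | ord >> 4;  // pair up nibbles within bytes
    uint32_t r = 0;
    for (int b = 0; b < 4; ++b)        // keep every other byte (0, 2, 4, 6)
        r |= (uint32_t)((merged >> (16 * b)) & 0xff) << (8 * b);
    return r;
}
```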
The best version without AVX is a slight modification that only works with one input at a time, only using SIMD for the shuffling. In theory using MMX instead of SSE would make more sense, esp. if we care about first-gen Core2 where 64b pshufb is fast, but 128b pshufb is not single cycle. Anyway, compilers did a bad job with MMX intrinsics. Also, EMMS is slow.
// same as orbits_ssse3_interleave, but doing some of the math in integer regs. (non-vectorized)
// esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports
// VEX-encoded SSE saves mov instructions, so full vector is preferable if building with VEX-encoding
// Use MMX for Silvermont/Atom/Merom(Core2): pshufb is slow for xmm, but fast for MMX. Only 64b shuffle unit?
uint32_t orbits_ssse3_interleave_scalar(uint64_t scalar_a)
{
uint64_t lshift = (scalar_a | scalar_a << 1);
lshift &= HIBITS64;
uint64_t rshift = lshift >> 5;
// rshift = s 0 r 0 q 0 p 0 (the last zero ORs with the top bit of the low nibble in the next byte over)
uint64_t nibbles_scalar = (rshift | lshift) & 0x0f0f0f0f0f0f0f0fULL;
// have to zero the high nibbles: the sign bit affects pshufb
__m128i nibbles = _mm_cvtsi64_si128(nibbles_scalar);
// nibbles = 0 0 0 0 q s p r
// pshufb -> 0 0 0 0 s r q p
const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8( // setr: first arg goes in the low byte, indexed by 0b0000
0b0000,
0b0100,
0b0001,
0b0101,
0b1000,
0b1100,
0b1001,
0b1101,
0b0010,
0b0110,
0b0011,
0b0111,
0b1010,
0b1110,
0b1011,
0b1111 );
__m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);
// want 00 00 00 00 AB CD EF GH from:
// ord_nibbles = 0A0B0C0D0E0F0G0H
// 0A0B0C0D0E0F0G0 H(shifted out)
__m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
// merged_nibbles= 0A AB BC CD DE EF FG GH. We want every other byte of this.
// 7 6 5 4 3 2 1 0
// pshufb is the most efficient way. Mask and then packuswb would work, but uses the shuffle port just like pshufb
__m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(0,0,0,0, 0,0,0,0, 0,0,0,0, 6,4,2,0));
return _mm_cvtsi128_si32(ord_bytes); // movd the low32 of the vector
}
Sorry for the mostly code-dump answer. At this point I didn't feel like it was worth spending a huge amount of time on discussing things more than the comments already do. See http://agner.org/optimize/ for guides to optimizing for specific microarchitectures, and the x86 tag wiki on Stack Overflow for other resources.