Why is Clang optimizing this code out?

C++ClangCompiler Optimization

C++ Problem Overview


The purpose of the code is to find the total number of 32-bit floating point bit patterns which represent values between 0 and 1. It seems to me this should work, but for some reason the assembly output from Clang is basically the equivalent of return 0;.

I compiled this with Clang 3.3 and Clang 3.4.1, using -std=c++1y -Wall -Wextra -pedantic -O2 and -std=c++1y -Wall -Wextra -pedantic -O3

Clang 3.4 optimizes everything away with -O2 and -O3.

Clang 3.3 only optimizes everything away with -O3.

By "optimizes everything away" I mean that this is the assembly output of the program:

main:                                   # @main
	xorl	%eax, %eax
	ret

#include <limits>
#include <cstring>
#include <cstdint>

template <class TO, class FROM>
inline TO punning_cast(const FROM &input)
{
	TO out;
	std::memcpy(&out, &input, sizeof(TO));
	return out;
}

int main()
{
    uint32_t i = std::numeric_limits<uint32_t>::min();
	uint32_t count = 0;
  
	while (1)
	{
		float n = punning_cast<float>(i);
		if(n >= 0.0f && n <= 1.0f)
			count++;
      	if (i == std::numeric_limits<uint32_t>::max())
            break;
      	i++;
	}

	return count;
}

C++ Solutions


Solution 1 - C++

Here's a simpler test case which points out that it's a compiler bug:

http://coliru.stacked-crooked.com/a/58b3f9b4edd8e373

#include <cstdint>

int main()
{
	uint32_t i = 0;
	uint32_t count = 1;

	while (1)
	{
		if( i < 5 )
			count+=1;
    
		if (i == 0xFFFFFFFF)
			break;
		i++;
	}

	return count; // should return 6
}

The assembly shows that it outputs 1, instead of 6. It doesn't think it's an infinite loop either, in which case the assembly doesn't return from main.

Solution 2 - C++

This is not an answer but a datapoint that's too big for a comment.

Interestingly, if you print count right before the return then clang will still optimize everything out and print 0 with -O3 and 1065353218 with -O0. (Note that echo $? reports that the return value is always 2, no matter what the actual return was). To me, this makes it look like a compiler bug.

If you turn your while into a for:

for (uint32_t i = std::numeric_limits<uint32_t>::min(); i != std::numeric_limits<uint32_t>::max(); ++i)
{
    float n = punning_cast<float>(i);
    if(n >= 0.0f && n <= 1.0f)
        count++;
}

Then the same answer comes out for both optimization levels. Definitely true if you print, and though I haven't looked at the assembly it's likely also true for the unprinted case because it does take time before it finishes. (clang 3.4)

I've found bugs in LLVM before (funny template business that made clang segfault) and they've been responsive in fixing it if you give a nice and clear example of the fault. I suggest you submit this as a bug report.

Solution 3 - C++

Using mukunda's example above, in clang 3.4 with -O2 the issue seems to be in the vectorization phase. The vectorized code jumps on entry to past the vectorized code:

br i1 true, label %middle.block, label %vector.ph

so count's value remains unchanged from its initialization.

*** IR Dump Before Combine redundant instructions ***
; Function Attrs: nounwind readnone ssp uwtable
define i32 @main() #0 {
entry:
  br i1 true, label %middle.block, label %vector.ph

vector.ph:                                        ; preds = %entry
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x i32> [ <i32 1, i32 0, i32 0, i32 0>, %vector.ph ], [ %4, %vector.body ]
  %vec.phi8 = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %induction7 = add <4 x i32> %broadcast.splat, <i32 4, i32 5, i32 6, i32 7>
  %0 = icmp ult <4 x i32> %induction, <i32 5, i32 5, i32 5, i32 5>
  %1 = icmp ult <4 x i32> %induction7, <i32 5, i32 5, i32 5, i32 5>
  %2 = zext <4 x i1> %0 to <4 x i32>
  %3 = zext <4 x i1> %1 to <4 x i32>
  %4 = add <4 x i32> %2, %vec.phi
  %5 = add <4 x i32> %3, %vec.phi8
  %6 = icmp eq <4 x i32> %induction, <i32 -1, i32 -1, i32 -1, i32 -1>
  %7 = icmp eq <4 x i32> %induction7, <i32 -1, i32 -1, i32 -1, i32 -1>
  %8 = add <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1>
  %9 = add <4 x i32> %induction7, <i32 1, i32 1, i32 1, i32 1>
  %index.next = add i32 %index, 8
  %10 = icmp eq i32 %index.next, 0
  br i1 %10, label %middle.block, label %vector.body, !llvm.loop !1

middle.block:                                     ; preds = %vector.body, %entry
  %resume.val = phi i32 [ 0, %entry ], [ 0, %vector.body ]
  %trunc.resume.val = phi i32 [ 0, %entry ], [ 0, %vector.body ]
  %rdx.vec.exit.phi = phi <4 x i32> [ <i32 1, i32 0, i32 0, i32 0>, %entry ], [ %4, %vector.body ]
  %rdx.vec.exit.phi9 = phi <4 x i32> [ zeroinitializer, %entry ], [ %5, %vector.body ]
  %bin.rdx = add <4 x i32> %rdx.vec.exit.phi9, %rdx.vec.exit.phi
  %rdx.shuf = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx10 = add <4 x i32> %bin.rdx, %rdx.shuf
  %rdx.shuf11 = shufflevector <4 x i32> %bin.rdx10, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx12 = add <4 x i32> %bin.rdx10, %rdx.shuf11
  %11 = extractelement <4 x i32> %bin.rdx12, i32 0
  %cmp.n = icmp eq i32 0, %resume.val
  br i1 %cmp.n, label %while.end, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block
  br label %while.body
while.body:                                       ; preds = %while.body, %scalar.ph
  %i.0 = phi i32 [ %trunc.resume.val, %scalar.ph ], [ %inc, %while.body ]
  %count.0 = phi i32 [ %11, %scalar.ph ], [ %add.count.0, %while.body ]
  %cmp = icmp ult i32 %i.0, 5
  %add = zext i1 %cmp to i32
  %add.count.0 = add i32 %add, %count.0
  %cmp1 = icmp eq i32 %i.0, -1
  %inc = add i32 %i.0, 1
  br i1 %cmp1, label %while.end, label %while.body, !llvm.loop !4

while.end:                                        ; preds = %middle.block, %while.body
  %add.count.0.lcssa = phi i32 [ %add.count.0, %while.body ], [ %11, %middle.block ]
  ret i32 %add.count.0.lcssa
}

The optimizer later erases unreachable and non-effective code - which is almost the entire function body.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionChris_FView Question on Stackoverflow
Solution 1 - C++mukundaView Answer on Stackoverflow
Solution 2 - C++AdamView Answer on Stackoverflow
Solution 3 - C++DanraView Answer on Stackoverflow