Nonintuitive result of the assignment of a double precision number to an int variable in C

CFloating PointType ConversionImplicit Conversion

C Problem Overview


Could someone give me an explanation why I get two different numbers, resp. 14 and 15, as an output from the following code?

#include <stdio.h>  

int main()
{
    double Vmax = 2.9; 
    double Vmin = 1.4; 
    double step = 0.1; 

    double a =(Vmax-Vmin)/step;
    int b = (Vmax-Vmin)/step;
    int c = a;

    printf("%d  %d",b,c);  // 14 15, why?
    return 0;
}

I expect to get 15 in both cases but it seems I'm missing some fundamentals of the language.

I am not sure if it's relevant but I was doing the test in CodeBlocks. However, if I type the same lines of code in some on-line compiler ( this one for example) I get an answer of 15 for the two printed variables.

C Solutions


Solution 1 - C

> ... why I get two different numbers ...

Aside from the usual float-point issues, the computation paths to b and c are arrived in different ways. c is calculated by first saving the value as double a.

double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;

C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.

> Except for assignment and cast (which remove all extra range and precision), ...

> -1 indeterminable;

> 0 evaluate all operations and constants just to the range and precision of the type;

> 1 evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;

> 2 evaluate all operations and constants to the range and precision of the long double type.

> C11dr §5.2.4.2.2 9

OP reported 2

By saving the quotient in double a = (Vmax-Vmin)/step;, precision is forced to double whereas int b = (Vmax-Vmin)/step; could compute as long double.

This subtle difference results from (Vmax-Vmin)/step (computed perhaps as long double) being saved as a double versus remaining a long double. One as 15 (or just above), and the other just under 15. int truncation amplifies this difference to 15 and 14.

On another compiler, the results may both have been the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.


Conversion to int from a floating-point number is severe with numbers near a whole number. Often better to round() or lround(). The best solution is situation dependent.

Solution 2 - C

This is indeed an interesting question, here is what happens precisely in your hardware. This answer gives the exact calculations with the precision of IEEE double precision floats, i.e. 52 bits mantissa plus one implicit bit. For details on the representation, see the wikipedia article.

Ok, so you first define some variables:

double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;

The respective values in binary will be

Vmax =    10.111001100110011001100110011001100110011001100110011
Vmin =    1.0110011001100110011001100110011001100110011001100110
step = .00011001100110011001100110011001100110011001100110011010

If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.

Now you do some math on these numbers. The first operation, the subtraction, results in the precise result:

 10.111001100110011001100110011001100110011001100110011
- 1.0110011001100110011001100110011001100110011001100110
--------------------------------------------------------
  1.1000000000000000000000000000000000000000000000000000

Then you divide by step, which has been rounded up by your compiler:

   1.1000000000000000000000000000000000000000000000000000
 /  .00011001100110011001100110011001100110011001100110011010
--------------------------------------------------------
1110.1111111111111111111111111111111111111111111111111100001111111111111

Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: Your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.

So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):

               1110.1111111111111111111111111111111111111111111111111100001111111111111
cutoff to int: 1110

However, if you first store the result in a variable of type double, rounding takes place:

               1110.1111111111111111111111111111111111111111111111111100001111111111111
rounded:       1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111

And this is precisely the result you got.

Solution 3 - C

The "simple" answer is that those seemingly-simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely-repeating binary fraction 0.00011001100110011...[2] . (This is analogous to the way 1/3 in decimal ends up being 0.333333333... .) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.

And then the additional problem is that the path by which you compute something -- whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double -- can end up making a significant difference.

The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001 which truncated down to 15, while the other gave you 14.999999999 which truncated all the way down to 14.

See also question 14.4a in the C FAQ list.

Solution 4 - C

An equivalent problem is analyzed in analysis of C programs for FLT_EVAL_METHOD==2.

If FLT_EVAL_METHOD==2:

double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;

computes b by evaluating a long double expression then truncating it to a int, whereas for c it's evaluating from long double, truncating it to double and then to int.

So both values are not obtained with the same process, and this may lead to different results because floating types does not provides usual exact arithmetic.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionGeorgiDView Question on Stackoverflow
Solution 1 - Cchux - Reinstate MonicaView Answer on Stackoverflow
Solution 2 - Ccmaster - reinstate monicaView Answer on Stackoverflow
Solution 3 - CSteve SummitView Answer on Stackoverflow
Solution 4 - CJean-Baptiste YunèsView Answer on Stackoverflow