## Properties

IEEE 754 floating point numbers may be either finite or non-finite values.
Immediately, we see that the finite values are boring numbers like `3.5`. The
non-finite values seem much more interesting.

The non-finite values are not-a-number, positive infinity, and negative
infinity. They’re often printed as `NaN`, `Inf`, and `-Inf` respectively, and
they have some unusual properties. For example, `NaN != NaN`.

But, even finite values can seem somewhat weird until you get used to them. We’ll focus on finite values here because as odd as the non-finite values may be, I usually only ever encountered them by mistake.

## Literals

There are two common types of floating point values: single-precision and
double-precision. In C, they’re declared with the keywords `float` and
`double`, respectively.

Floating point literals are distinguished from integers by the inclusion of a decimal place, or are written in scientific notation. Single-precision values are further distinguished by a suffix.

- Integer: `1`
- Single-precision: `1.f`, `1e0f`
- Double-precision: `1.0`, `1e0`
For more details, check out cppreference.

## Printing

When you consider printing floating point numbers, the real question is “what do you want to know?” Printing the exact value of a floating point number can take a lot of digits. Up to 112 significant digits for a single-precision floating point value! Typically, you don’t actually need that.

There are two main cases for printing values that I’ve encountered in practice. One is that the value to print is the final result of some calculation. In that case, there’s usually some number of significant digits which are relevant, and the value can be printed rounded to that precision.

The other case is that you’re debugging or printing a value for future consumption and you must be able to round-trip the value from binary to decimal and back without ambiguity. This turns out to be fairly simple. You simply need 9 digits for single-precision floating point values, and 17 for double-precision. Here we print our values in the exponential format:

- Single-precision: `printf("%1.8e", value);`
- Double-precision: `printf("%1.16e", value);`

If you want to save a little space at the expense of consistency, you can let printf decide whether to use exponential format or regular decimal format based on whichever is shorter:

- Single-precision: `printf("%.9g", value);`
- Double-precision: `printf("%.17g", value);`

By the way, this information comes from Bruce Dawson. His blog contains quite a few insights into floating point numbers and he discusses this topic in more detail in his post Float Precision-From Zero to 100+ Digits.

## Math Weirdness

Consider the following:

```
#include <stdio.h>

int main() {
    if (0.1 + 0.2 == 0.3) {
        printf("0.1 + 0.2 == 0.3\n");
    } else {
        printf("0.1 + 0.2 != 0.3\n");
    }
    if (1.0 + 2.0 == 3.0) {
        printf("1.0 + 2.0 == 3.0\n");
    } else {
        printf("1.0 + 2.0 != 3.0\n");
    }
    return 0;
}
```

```
-bash-3.00$ gcc float-ex1.c && ./a.out
0.1 + 0.2 != 0.3
1.0 + 2.0 == 3.0
```

When you add together `0.1` and `0.2` you don’t get `0.3`, but when you add
together `1.0` and `2.0` you do get `3.0`? At this point, many people give up
on floating point and decide that it’s inherently imprecise and
incomprehensible. However, it’s worth digging deeper. It takes some time, but
you can come to understand and predict those sorts of results.

## Exact Representations

There are many numbers that cannot be exactly represented in floating point.
Take the value `0.1`. As a fraction, it’s `1/10`. Notice that the prime factors
of its denominator are `2` and `5`. Unfortunately, the only factor we can use
in binary is `2`. Because we lack a necessary factor, the representation ends up
as a repeating sequence of digits. Thus, `0.1` cannot be represented with a
finite number of binary digits.

When you write a literal like `0.1` in your C code, the compiler rounds your
value to the nearest value it can exactly represent. In this case, that’s
roughly `0.10000000000000001`. Let’s print out a few of these numbers to make
the problem a little more clear:

```
#include <stdio.h>

int main() {
    if (0.1 + 0.2 == 0.3) {
        printf("%.17g + %.17g == %.17g\n", 0.1, 0.2, 0.3);
    } else {
        printf("%.17g + %.17g != %.17g\n", 0.1, 0.2, 0.3);
    }
    return 0;
}
```
```

```
-bash-3.00$ gcc float-ex2.c && ./a.out
0.10000000000000001 + 0.20000000000000001 != 0.29999999999999999
```

It seems that for the case of `0.1` and `0.2`, our values were rounded up to the
nearest representable number, while for `0.3` the value was rounded down. Thus,
adding `0.1` and `0.2` results in a value slightly greater than `0.3`, while
the literal `0.3` is slightly less.

## Integers

Why was it, though, that the floating point math on integers worked out exactly right? Well, the simple answer is that all integers below a certain value can be exactly represented. These are the largest integers that are exactly representable for each type:

- Single-precision: 2^24
- Double-precision: 2^53

This stems from the number of bits in the fractional component (the significand) of the floating point representation: 23 explicit bits for single precision and 52 for double precision, plus one implicit leading bit in each case.