The Finer Points of Floating Point

Properties

IEEE 754 floating point numbers may be either finite values and non-finite values. Immediately, we see that the finite values are boring numbers like 3.5. The non-finite values seem much more interesting.

The non-finite values are not-a-number, positive infinity and negative infinity. They’re often printed as NaN, Inf, and -Inf respectively, and they have some unusual properties. For example, NaN != NaN.

But, even finite values can seem somewhat weird until you get used to them. We’ll focus on finite values here because as odd as the non-finite values may be, I usually only ever encountered them by mistake.

Literals

There are two common types of floating point values: single-precision and double-precision. In C, they’re declared with the keywords float and double, respectively.

Floating point literals are distinguished from integers by the inclusion of a decimal place, or are written in scientific notation. Single-precision values are further distinguished by a suffix.

  • Integer: 1
  • Single-precision: 1.f, 1e0f
  • Double-precision: 1.0, 1e0

For more details, check out cppreference.

Printing

When you consider printing floating point numbers, the real question is “what do you want to know?” Printing the exact value of a floating point number can take a lot of digits. Up to 112 decimal places for a single-precision floating point value! Typically, you don’t actually need that.

There’s two main cases for printing values that I’ve encountered in practice. One is that the value to print is the final result of some calculation. In that case, there’s usually some number of significant digits which are relevant and the value can be printed rounded to that precision.

The other case is that you’re debugging or printing a value for future consumption and you must be able to round-trip the value from binary to decimal and back without ambiguity. This turns out to be fairly simple. You simply need 9 digits for single-precision floating point values, and 17 for double-precision. Here we print our values in the exponential format:

  • Single-precision: printf("%1.8e", value);
  • Double-precision: printf("%1.16e", value);

If you want to save a little space at the expense of consistency, you can let printf decide whether to use exponential format or regular decimal format based on whichever is shorter:

  • Single-precision: printf("%.9g", value);
  • Double-precision: printf("%.17g", value);

By the way, this information comes from Bruce Dawson. His blog contains quite a few insights into floating point numbers and he discusses this topic in more detail in his post Float Precision-From Zero to 100+ Digits.

Math Weirdness

Consider the following:

#include <stdio.h>
int main() {
  if (0.1 + 0.2 == 0.3) {
    printf("0.1 + 0.2 == 0.3\n");
  } else {
    printf("0.1 + 0.2 != 0.3\n");
  }
  if (1.0 + 2.0 == 3.0) {
    printf("1.0 + 2.0 == 3.0\n");
  } else {
    printf("1.0 + 2.0 != 3.0\n");
  }
  return 0;
}
-bash-3.00$ gcc float-ex1.c && ./a.out
0.1 + 0.2 != 0.3
1.0 + 2.0 == 3.0

When you add together 0.1 and 0.2 you don’t get 0.3, but when you add together 1.0 and 2.0 you do get 3.0? At this point, many people give up on floating point and decide that it’s inherently imprecise and incomprehensible. However, it’s worth digging deeper. It takes some time, but you can come to understand and predict those sorts of results.

Exact Representations

There are many numbers that cannot be exactly represented in floating point. Take the value 0.1. As a fraction, it’s 1/10. Notice that the prime factors of its denominator are 2 and 5. Unfortunately, the only factor we can use in binary is 2. Because we lack a necessary factor, the representation ends up as a repeating sequence of digits. Thus, 0.1 cannot be represented with a finite number of binary digits.

When you write a literal like 0.1 in your C code, the compiler rounds your value to the nearest value it can exactly represent. In this case, that’s roughly 0.10000000000000001. Let’s print out a few of these numbers to make the problem a little more clear:

#include <stdio.h>
int main() {
  if (0.1 + 0.2 == 0.3) {
    printf("%.17g + %.17g == %.17g\n", 0.1, 0.2, 0.3);
  } else {
    printf("%.17g + %.17g != %.17g\n", 0.1, 0.2, 0.3);
  }
  return 0;
}
-bash-3.00$ gcc float-ex2.c && ./a.out
0.10000000000000001 + 0.20000000000000001 != 0.29999999999999999

It seems that for the case of 0.1 and 0.2, our value was rounded up to the nearest representable number, while for 0.3 the value was rounded down. Thus, adding 0.1 and 0.2 results in a value slightly greater than 0.3 while the literal 0.3 is slightly less.

Integers

Why was it, though, that the floating point math on integers worked out exactly right? Well, the simple answer is that all integers below a certain value can be exactly represented. These are the largest integers that are exactly representable for each type:

  • Single-precision: 224
  • Double-precision: 253

This stems from the number of bits in fractional component of the floating point representation, which are 23 bits for single and 52 bits for double precision.

Further Reading

SPARC Floating Point Basics

Registers & Instructions

There are 32 floating point registers you can use, labelled %f0 to %f31. They are completely separate from the regular integer registers you’ve been using up to this point. Instructions such as save or restore don’t affect floating point registers at all.

Floating point also gets its own set of instructions, with equivalents for most of the integer instructions you’re used to. The floating point equivalent to add is fadds, sub is fsubs, mov is fmovs, and so on.

As you may have guessed, that f that’s prepended in front of the instruction stands for floating point. But what does the s on the end mean? Well, that refers to the size of floating point value you’re operating on. There are actually a few different sizes of floating point types, but there are are two common ones I’ll be focusing on. The first size is single-precision. Single precision values are 32 bits long and fit in one register. The next is double-precision. Double-precision values are 64 bits long and require two adjacent registers for storage.

The instructions for single-precision values end in s, while the instructions for double-precision values end in d. So, for double-precision it’s really faddd and fsubd. Note, however, that there is no fmovd—you just have to use two fmovs instructions.

Comparisons & Branching

Comparison and branch instructions are more or less the same story. There are comparisons like fcmps and fcmpd as you would expect as well as branching instructions like fbge,fbl and fbne. The key difference here is that both branch and compare have a delay slot.

.section ".data"
x: .double 0r0.0
y: .double 0r1.0
.section ".text"

set	x, %o0
ldd	[%o0], %f0
set	y, %o0
ldd	[%o0], %f2
fcmpd	%f0, %f2
nop
fbne	isurehopethisgetscalled
nop

Functions

Passing floating point values to a function is a bit of a pain. They go in the output registers just like any other argument, but as you may have guessed from the existence of both mov and fmovs, you cannot move from an integer register to a floating point register or vice versa.

You can, however, load or store to and from floating point registers. As such, you’ll probably find yourself having to do something like:

.section ".data"
tmp: .single 0r0.0
.section ".text"

set	tmp, %l0
st	%f0, [%l0]
ld	[%l0], %o0
call	sin
nop

When returning a floating point value from a function, you can just leave the value in %f0 (and %f1 for double-precision values). That, at least, is quite simple.

Printing

The printf format specifier for both single and double-precision floating point values is %f. In GDB, you can use p/f $f0 to print a single-precision floating point value in a register, or x/wf <address> to print a single-precision floating point value in memory. Double-precision floating point values can be printed from memory with x/gf <address>, but there is no simple command to print them from registers.

Conversions

Storing and loading a value to move it between integer and floating point registers is fine, but it’s important to note that what is actually preserved is the bit pattern, not the value.

Floating point values are represented with completely different bit patterns from integers. While the integer 21 is represented by 0000 0000 0000 0000 0000 0000 0001 0101, the floating point value 21 is 0 10000011 01010000000000000000000. We’ll justify that result in the next section.

There’s a set of floating point instructions for converting integers to floating point and vice versa. They are aptly named. fitos and fitod convert integer to floating point, while fstoi and fdtoi convert floating point to integer. Both input and output registers must be floating point registers.

.section ".data"
tmp: .single 0r0.0
.section ".text"

mov	21, %l1
set	tmp, %l0
st	%l1, [%l0]
ld	[%l0], %f0
fitos	%f0, %f0

Manual Conversions

You can do the conversion by hand, but you’ll likely want to have a reference open as you do it, and it’s probably only useful as an exercise for better understanding the format.

So, remember our result for 21 is 0 10000011 01010000000000000000000. That first bit is the sign bit. The second set of bits are the exponent bits, and the third set of bits are the fractional bits. They’re interpreted like so: (-1)sign * 1.fraction * 2(exponent - 127)

First, since our value is positive, we know our sign bit is 0.

For the next steps, we’ll need to break our value down a little. Round down to the next smaller power of two and note the ratio: 21 = 2^4 * 1.3125.

We calculate our exponent by undoing the -127 included in the formula. We take the exponent from our power of two and add. So, 4 + 127 = 131 = 0b10000011. Those are our exponent bits.

Next is the fraction. Here we break our ratio down into one plus negative powers of two. This is how we convert it to a binary fraction. 1.3125 = 1 + 1/4 + 1/16 = 0b1.01010000000000000000000. In the floating point representation, the leading one is implied, so we just keep the component after the radix point as our fraction bits.

Leaf Functions

Optimizations

Leaf functions are simply functions that do not call other functions. This allows us to make a few optimizations and make them more efficient than standard functions. We can reduce both the number of instructions we execute and the amount of stack memory we use.

Since we’re not calling other functions, we don’t need to reserve any stack space for the return structure pointer or extra arguments with save. If we’re careful about how we handle our registers, we don’t even need to use save to keep the caller’s local and input registers safe.

There are a few consequences to not using save:

  1. We also need to drop the matching restore instruction. Its purpose was to undo the save instruction, so it’s no longer needed. In fact, it would be a serious error to have a restore without a matching save.

  2. We can’t use input or local registers, as their original values aren’t being saved at the start of our function and restored at the end. We can only use the output and global registers, as they’re expected to change when you call a function. Specifically, you can use %o0, %o1, %o2, %o3, %o4, %o5, %g0, and %g1. The rest are reserved, as usual.

  3. Because you don’t call save, the output registers are never renamed to be input registers. But where did ret get the return address from? %i6, which doesn’t contain the right value because it’s still in %o6. Instead, use retl, which will get the return address from %o6.

! max-leaf - illustrates leaf function optimizations
fmt: .asciz "Max: %d\n"

.global main
.align 4
main:
	save	%sp, -96, %sp

	! initialize with test data
	mov	23, %o0
	mov	12, %o1

	! find max value
	call	max
	nop

	! print result
	mov	%o0, %o1
	set	fmt, %o0
	call	printf
	nop

	! exit
	mov	0, %o0
	mov	1, %g1
	ta	0

! int max(int,int)
! returns the larger of the two values
! optimization note: "Look ma, no saves!"
max:
	cmp	%o0, %o1
	bl,a	1f
	mov	%o1, %o0
1:
	retl
	nop
Functions

Terminology

You may have noticed that I’ve been using the term function for something referred to in the textbook as a subroutine. While there are arguably some different implications to each term, they are frequently used interchangeably.

A Few Arguments

If we have six arguments or fewer, we can construct our function quite simply. Let’s consider a function int max(int a, int b) that simply returns the larger of its two arguments. Take a look at max.s below:

! max - finds the max of two integers
fmt:	.asciz	"Max: %d\n"
	.align 	4
	.global main
main:
	save	%sp, -96, %sp

	! call max
	mov	20, %o0
	mov	28, %o1
	call	max
	nop

	! print the result
	mov	%o0, %o1
	set	fmt, %o0
	call	printf
	nop

	! exit
	mov	0, %o0	! exit with 0
	mov	1, %g1	! exit trap
	ta	0	! trap to system

! int max(int,int)
! returns the larger of the two given values
max:	save	%sp, -64, %sp
	cmp	%i0, %i1
	ble,a	maxret
	mov	%i1, %i0
maxret:
	ret
	restore

You can see before the call to max that we’re putting our arguments into %o0 and %o1. This is exactly the same as how we called other functions we’ve used previously, like printf.

Within max, we immediately save. This saves the local and input registers, turns the output registers into input registers, and adjusts the stack and frame pointers. You must supply at least 64 bytes to store those local and input registers.

The parameters we set in the %o0 and %o1 registers are now available in the %i0 and %i1 registers, respectively. We can do a comparison between, them and if necessary move the value from %i1 into %i0 to ensure that the largest value is stored in %i0.

Finally, we reach the ret instruction, and we branch back to the instruction immediately following the delay slot of the call. But, our ret instruction has its own delay slot, which we’ve filled with restore to undo the save from the start of the function.

Having returned back to main, the larger of the two values is guaranteed to be stored in %o0.

Many Arguments

If we have more than six arguments, we run out of input registers and have to do things differently. Naively, it seems like you have enough registers, with %o0 through %o7 at your disposal, but the registers %o6 and %o7 are reserved, so they cannot be used. Instead, we can place our remaining two arguments on the stack. By convention, there’s 24 bytes per frame left on the stack for function arguments. That’s like having space for six more registers!

Given that we’re going to be relying on these standard conventions in our program, we should probably just use the standard 96 byte save in our functions. In general, it’s simpler than trying to optimize our memory usage within each function.

Let’s consider a function int max8(int a, int b, int c, int d, int e, int f, int g, int h) that simply returns the largest argument it is given. We can store the last two values on the stack using the right offsets, as shown in max8.s below:

! max8 - finds the max of eight integers
fmt:	.asciz	"Max: %d\n"
	.align 	4
	.global main
main:
	save	%sp, -96, %sp

	! call max
	mov	20, %o0
	mov	28, %o1
	mov	83, %o2
	mov	12, %o3
	mov	11, %o4
	mov	43, %o5
	mov	8, %l0
	st	%l0, [%sp + 4 + 64]
	mov	13, %l0
	st	%l0, [%sp + 8 + 64]
	call	max8
	nop

	! print the result
	mov	%o0, %o1
	set	fmt, %o0
	call	printf
	nop

	! exit
	mov	0, %o0	! exit with 0
	mov	1, %g1	! exit trap
	ta	0	! trap to system

! int max8(int,int,int,int,int,int,int,int)
! returns the largest of the eight given values
max8:	save	%sp, -64, %sp
	ld	[%fp + 4 + 64], %o0
	ld	[%fp + 8 + 64], %o1

! values are now in %i0, %i1, %i2, %i3, %i4, %i5, %o0, %o1

	cmp	%i0, %i1
	ble,a	1f
	mov	%i1, %i0
1:
	cmp	%i0, %i2
	ble,a	1f
	mov	%i2, %i0
1:
	cmp	%i0, %i3
	ble,a	1f
	mov	%i3, %i0
1:
	cmp	%i0, %i4
	ble,a	1f
	mov	%i4, %i0
1:
	cmp	%i0, %i5
	ble,a	1f
	mov	%i5, %i0
1:
	cmp	%i0, %o0
	ble,a	1f
	mov	%o0, %i0
1:
	cmp	%i0, %o1
	ble,a	1f
	mov	%o1, %i0
1:	ret
	restore

As you can see, it’s only a few extra lines of code to put additional parameters on the stack. By convention, the space from %sp + 68 to %sp + 92 is dedicated to input arguments. Note that we’re using positive offsets from the stack pointer to store our arguments. This is different from how we addressed local stack variables, which used negative offsets from the frame pointer.

Calling save in max8 changes the stack. The stack pointer grows more negative, and the frame pointer takes on the stack pointer’s old value. Consequentially, when accessing our extra arguments from within max8, we now use positive offsets from the frame pointer.

Short Labels

As you use more functions and your programs grow in size, it may be annoying to specify unique, meaningful labels for every branch. If a named label would be more of a hindrance than a help, you can use a single-digit numeric label instead. These labels can be duplicated, and are branched to by specifying the digit followed by f or b. The suffix f indicates that you wish to branch to the next matching label, while the suffix b indicates that you wish to branch to the previous matching label.

I suggest only using these short labels when you’ll be branching from some place very nearby. The code above is easily understandable in part because there’s a clear pattern, and the label is only ever a couple lines away from the branch statement. The further the branch, the more important a descriptive label is.

Inline Functions

Macros

Inline functions can be created using macros. They are very simple, basically just using the macro compiler to copy and paste code from the macro definition to each location where the macro is used.

It’s slightly fancier than that, though, because within your macro code, you get the special variables $1, $2, $3, etc. When the macro compiler does its copy and pasting, it replaces $1, $2, $3, and so forth, with the 1st, 2nd, 3rd and subsequent arguments given to the macro.

The # character is the macro comment character, which can use when documenting the behaviour and parameters of our function. The # prefixed lines are not discarded; they do show up in the .s file, but gcc ignores them. Alternatively, we could define a comment macro that replaces all inputs with nothing.

Check out the program below, shift.m.

# rashift(hir, lor, tmp)
#
# rashift implements an arithmatic right shift by 1 bit,
# acting on a pair of 32bit registers as if they were a
# single 64bit register.
#
# hir - the high register
# lor - the low register
# tmp - a temporary register for use for calculation
define(rashift, `! `rashift' $1, $2, $3
	sll	$1, 31, $3
	sra	$1,  1, $1
	srl	$2,  1, $2
	or	$3, $2, $2')

fmt:	.asciz	"value = 0x%x%x\n"
	.align	4
	.global	main
main:	save	%sp, -96, %sp

	! initialize with test data
	set	0x84218421, %l0
	set	0x82411248, %l1

	! print value before `rashift'
	set	fmt, %o0
	mov	%l0, %o1
	call	printf
	mov	%l1, %o2

	! do `rashift'
	rashift(%l0, %l1, %l2)

	! print value after `rashift'
	set	fmt, %o0
	mov	%l0, %o1
	call	printf
	mov	%l1, %o2

	! exit program
	mov	1, %o0  !return zero on exit
	mov	1, %g1  !exit request
	ta	0       !trap to system

Escaping Macros

Notice that I quoted rashift every time I mentioned it in a comment? This is very important to prevent m4 from trying to expand the macro. If you forget, m4 will insert broken code into your program. Or worse, if you forget within the macro definition of rashift itself, you’ll cause an endless loop resulting in this error message:

m4:shift.m:26 pushed back more than 4096 chars

So, be sure to remember to quote any comments referring to the function, or you’re going to have problems.

Comments

I started the macro definition with ! `rashift' $1 $2 $3. This is an assembly comment, so that we can easily find our usage of rashift within the .s file and see how it was called.

Inspecting the .s generated by m4 is sometimes quite helpful in debugging problems that may not be obvious in the .m file, so its worth leaving hints to help track problematic code in the .s back to its origin in the .m.

In fact, let’s take a look at the .s file right now.

# rashift(hir, lor, tmp)
#
# rashift implements an arithmatic right shift by 1 bit,
# acting on a pair of 32bit registers as if they were a
# single 64bit register.
#
# hir - the high register
# lor - the low register
# tmp - a temporary register for use for calculation


fmt:	.asciz	"value = 0x%x%x\n"
	.align	4
	.global	main
main:	save	%sp, -96, %sp

	! initialize with test data
	set	0x84218421, %l0
	set	0x82411248, %l1

	! print value before rashift
	set	fmt, %o0
	mov	%l0, %o1
	call	printf
	mov	%l1, %o2

	! do rashift
	! rashift %l0, %l1, %l2
	sll	%l0, 31, %l2
	sra	%l0,  1, %l0
	srl	%l1,  1, %l1
	or	%l2, %l1, %l1

	! print value after rashift
	set	fmt, %o0
	mov	%l0, %o1
	call	printf
	mov	%l1, %o2

	! exit program
	mov	1, %o0  !return zero on exit
	mov	1, %g1  !exit request
	ta	0       !trap to system

As you can see, not much changed. Our define at the top is gone. Our # comments have stayed, and our rashift(%l0, %l1, %l2) has been replaced with code from our define.

It’s worth noting, though, that the additions and removals mean that the line numbers don’t always match between the .s and .m. That’s something to remember when gcc tells you that there’s an error in the .s file on some particular line.