The Finer Points of Floating Point


IEEE 754 floating point numbers may be either finite or non-finite values. Immediately, we see that the finite values are boring numbers like 3.5. The non-finite values seem much more interesting.

The non-finite values are not-a-number, positive infinity and negative infinity. They’re often printed as NaN, Inf, and -Inf respectively, and they have some unusual properties. For example, NaN != NaN.

But even finite values can seem somewhat weird until you get used to them. We’ll focus on finite values here because, as odd as the non-finite values may be, I’ve usually only encountered them by mistake.


There are two common types of floating point values: single-precision and double-precision. In C, they’re declared with the keywords float and double, respectively.

Floating point literals are distinguished from integers by the inclusion of a decimal point, or by being written in scientific notation. Single-precision values are further distinguished by an f suffix.

  • Integer: 1
  • Single-precision: 1.f, 1e0f
  • Double-precision: 1.0, 1e0

For more details, check out cppreference.


When you consider printing floating point numbers, the real question is “what do you want to know?” Printing the exact value of a floating point number can take a lot of digits: up to 112 significant digits for a single-precision floating point value! Typically, you don’t actually need that.

There are two main cases for printing values that I’ve encountered in practice. One is that the value to print is the final result of some calculation. In that case, there’s usually some number of significant digits which are relevant and the value can be printed rounded to that precision.

The other case is that you’re debugging or printing a value for future consumption and you must be able to round-trip the value from binary to decimal and back without ambiguity. This turns out to be fairly simple. You simply need 9 digits for single-precision floating point values, and 17 for double-precision. Here we print our values in the exponential format:

  • Single-precision: printf("%1.8e", value);
  • Double-precision: printf("%1.16e", value);

If you want to save a little space at the expense of consistency, you can let printf decide whether to use exponential format or regular decimal format based on whichever is shorter:

  • Single-precision: printf("%.9g", value);
  • Double-precision: printf("%.17g", value);

By the way, this information comes from Bruce Dawson. His blog contains quite a few insights into floating point numbers and he discusses this topic in more detail in his post Float Precision-From Zero to 100+ Digits.

Math Weirdness

Consider the following:

#include <stdio.h>
int main() {
  if (0.1 + 0.2 == 0.3) {
    printf("0.1 + 0.2 == 0.3\n");
  } else {
    printf("0.1 + 0.2 != 0.3\n");
  }
  if (1.0 + 2.0 == 3.0) {
    printf("1.0 + 2.0 == 3.0\n");
  } else {
    printf("1.0 + 2.0 != 3.0\n");
  }
  return 0;
}

-bash-3.00$ gcc float-ex1.c && ./a.out
0.1 + 0.2 != 0.3
1.0 + 2.0 == 3.0

When you add together 0.1 and 0.2 you don’t get 0.3, but when you add together 1.0 and 2.0 you do get 3.0? At this point, many people give up on floating point and decide that it’s inherently imprecise and incomprehensible. However, it’s worth digging deeper. It takes some time, but you can come to understand and predict those sorts of results.

Exact Representations

There are many numbers that cannot be exactly represented in floating point. Take the value 0.1. As a fraction, it’s 1/10. Notice that the prime factors of its denominator are 2 and 5. Unfortunately, the only factor we can use in binary is 2. Because we lack a necessary factor, the representation ends up as a repeating sequence of digits. Thus, 0.1 cannot be represented with a finite number of binary digits.

When you write a literal like 0.1 in your C code, the compiler rounds your value to the nearest value it can exactly represent. In this case, that’s roughly 0.10000000000000001. Let’s print out a few of these numbers to make the problem a little more clear:

#include <stdio.h>
int main() {
  if (0.1 + 0.2 == 0.3) {
    printf("%.17g + %.17g == %.17g\n", 0.1, 0.2, 0.3);
  } else {
    printf("%.17g + %.17g != %.17g\n", 0.1, 0.2, 0.3);
  }
  return 0;
}

-bash-3.00$ gcc float-ex2.c && ./a.out
0.10000000000000001 + 0.20000000000000001 != 0.29999999999999999

It seems that for the case of 0.1 and 0.2, our value was rounded up to the nearest representable number, while for 0.3 the value was rounded down. Thus, adding 0.1 and 0.2 results in a value slightly greater than 0.3 while the literal 0.3 is slightly less.


Why was it, though, that the floating point math on integers worked out exactly right? Well, the simple answer is that all integers up to a certain magnitude can be exactly represented. These are the limits up to which every integer is exactly representable for each type:

  • Single-precision: 2^24 (16,777,216)
  • Double-precision: 2^53 (9,007,199,254,740,992)

This stems from the number of bits in the fractional component of the floating point representation, which is 23 bits for single and 52 bits for double precision. An implicit leading bit adds one more bit of precision, giving 24 and 53 significant bits respectively.

Further Reading

Intro to GDB

Debugging Programs

You can use GDB to inspect your program as it’s executing. That’s a great way to identify things that might be going wrong. You can solve a lot of problems just by figuring out what you think should happen, and stepping through your code until you encounter something you don’t expect.

When you start GDB on your program, you will be given a command prompt that starts with (gdb) on each line. There, you can enter commands for GDB. Here’s a listing of some of the things you can do:

gdb a.out              # open a program named a.out in gdb
(gdb) break *main      # stop program upon reaching main label
(gdb) run              # execute until you hit a break point
(gdb) x/5i main        # print the first 5 instructions in main
(gdb) display/i $pc    # print the current instruction at each stop
(gdb) ni               # execute the next instruction
(gdb) print $x20       # print the contents of register x20
(gdb) break *main+8    # stop two instructions after main
(gdb) delete break 1   # remove the first breakpoint created
(gdb) delete display 1 # remove the first display created
(gdb) continue         # resume execution until the next break point
(gdb) quit             # exit gdb


Now, let’s use those commands to take a look through our expr.asm program from the last post and try to find out what the result was. We stored that value in x20 on one of the last few lines of the program. So, let’s check it out.

[cgbloor@csa2 f16]$ m4 expr.asm > expr.s && gcc expr.s && gdb ./a.out
GNU gdb (GDB) Fedora 7.11.1-75.fc24
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
Find the GDB manual and other documentation resources online at:
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...(no debugging symbols found)...done.
(gdb) break main      # I want to stop the program at the start of main
Breakpoint 1 at 0x400574
(gdb) display/i $pc   # I want to see where I am as I step through
1: x/i $pc
<error: No registers.>
(gdb) run             # ok, now I'm ready to start the program. Go!
Starting program: /home/grads/cgbloor/ws/355/f16/a.out 

Breakpoint 1, 0x0000000000400574 in main ()
1: x/i $pc
=> 0x400574 <main+20>:  mul x24, x21, x22
(gdb) x/4i $pc-4  # oops, should have used 'break *main'. Where am I?
   0x400570 <main+16>:  sub x22, x19, #0x7
=> 0x400574 <main+20>:  mul x24, x21, x22
   0x400578 <main+24>:  sub x23, x19, #0xb
   0x40057c <main+28>:  sdiv  x20, x24, x23
(gdb) ni          # ok. I see. Well, let's continue.
0x0000000000400578 in main ()
1: x/i $pc
=> 0x400578 <main+24>:  sub x23, x19, #0xb
(gdb) ni          # not yet...
0x000000000040057c in main ()
1: x/i $pc
=> 0x40057c <main+28>:  sdiv  x20, x24, x23
(gdb) ni          # this instruction calculates the final result, so one more step!
0x0000000000400580 in main ()
1: x/i $pc
=> 0x400580 <main+32>:  mov w0, #0x0                    // #0
(gdb) print $x20  # let's see that final value
$1 = -8
(gdb) continue    # excellent. we can let the program finish
[Inferior 1 (process 31146) exited normally]
(gdb) quit        # done

More Information

As limited as GDB looks, you can actually do a lot with it. We’ve only introduced a few of the possibilities here, and there are many more commands. The video below, Give me 15 minutes & I’ll change your view of GDB, covers some of the most surprising features I’ve heard of. Some of them are even useful! Like the visual mode you get when you hit ctrl+x, 2 twice. Just don’t try using the visual mode with script.

Basic ARM

Compiling and Running

Like C programs, assembly code must also be compiled to create an executable. Assembly source code files use the extension .s and can be passed to GCC in exactly the same way as C. That is, a source file called hello.s could be compiled into an executable named hello with the command gcc hello.s -o hello.

Give it a try with the following code in hello.s. Remember that to run your program, you need to type ./hello, because you have to specify which directory your program resides in!

fmt:	.string "Hello World!\n"  // our format string

	.balign 4                 // ensure aligned to a 4-byte boundary
	.global main              // make main visible to linker
main:
	stp	x29, x30, [sp, -16]! // allocate stack space
	mov	x29, sp              // update fp

	// print
	adrp	x0, fmt             // set format string high bits
	add	x0, x0, :lo12:fmt    // set format string low bits
	bl	printf                // call printf

	// exit
	mov	w0, 0                // set return value
	ldp	x29, x30, [sp], 16   // restore stack
	ret	                     // return to OS


It’s helpful to be able to use macros in our programs. Rather than editing a .s file directly, we’ll put our code in a different file and then generate the raw assembly. The file name doesn’t matter for this, but let’s use the extension .asm as our convention. The macro file can be compiled into an assembly file like so: m4 hello.asm > hello.s

Putting those two steps together, we get this process to transform a .asm file into an executable file:

m4 hello.asm > hello.s
gcc hello.s -o hello

Doing both of these steps each time you want to compile your program gets a little tedious. You can combine them into a single command using &&. It ends up looking like m4 hello.asm > hello.s && gcc hello.s -o hello. The second part of the command will only execute if the first part succeeds.

Here’s a macro program named expr.asm that you could try building. It’s a handy example for debugging, too.

// This program computes the expression:
//   y = (x - 1) * (x - 7) / (x - 11) for x = 9
// The polynomial coefficients are:
  define(a2, 1)
  define(a1, 7)
  define(a0, 11)

// The variables x, y and temporary values are:
  define(x_r, x19)
  define(y_r, x20)
  define(t1_r, x21)
  define(t2_r, x22)
  define(t3_r, x23)
  define(num_r, x24)

  .balign       4
  .global       main
main:
  stp   x29, x30, [sp, -16]! // allocate stack space
  mov   x29, sp              // update fp

  mov   x_r, 9           // initialize x
  sub   t1_r, x_r, a2    // (x - a2) into t1
  sub   t2_r, x_r, a1    // (x - a1) into t2
  mul   num_r, t1_r, t2_r  // calculate the numerator
  sub   t3_r, x_r, a0    // (x - a0) into t3, the divisor
  sdiv  y_r, num_r, t3_r // calculate the result

  // exit
  mov   w0, 0               // set return value
  ldp   x29, x30, [sp], 16  // restore stack
  ret                       // return to OS

Intro to C

Compiling and Running

C programs must be compiled to be run. The GNU Compiler Collection (GCC) can be used to turn your C source code file into an executable program. The command gcc hello.c -o hello takes a source code file named hello.c and outputs an executable file named hello.

The .c extension on the source file is important because GCC can compile other languages, like C++ or Go. The extension acts as a hint as to what language the program was written in. Aside from that, either file can be named whatever you want. [1]

When running a program found in the system’s standard search paths, all that is required is to enter the name of the program. That’s how we can type vim to run vim, for instance.

To run a program not found in a standard search path, like the hello program we just created, we have to be specific about where the program is located. In this case, it’s in the current directory. The current directory is represented by . so we can run the program with ./hello

In summary:

gcc hello.c -o hello
./hello

Hello World!

A timeless classic, the hello world program is one of the simplest you can write. Here we include the standard input/output header stdio.h in order to use printf. We define our main function as returning an integer and taking no arguments, then call printf with a string to output. Note that within a string literal, the sequence \n represents the newline character.

#include <stdio.h>

int main() {
  printf("Hello World!\n");
  return 0;
}

Input and Print Values

For both input and output, variables can be marked in the format string using escape codes. The exact code used depends on the type of variable you have. For example, integers use the code %d, while strings use %s. There’s lots more options, too, so you may wish to consult a reference guide or some examples if you’re looking to do something fancy.

In the case of printf, the escape codes will be replaced with the given variables. In the case of scanf, the string entered by the user will be parsed and the parts entered in place of the escape code will be stored in the given variable.

Also worth noting is that scanf expects to be passed a pointer to the variable that will be filled with the input value. This is unlike printf, which can simply be passed the variable itself. The reason for this difference is that scanf must modify the variable, while printf does not. We use the address of operator & to get a pointer to our variables for scanf.

#include <stdio.h>

int main() {
  int x, y;
  printf("Enter a number: ");
  scanf("%d", &x);
  printf("Enter another number: ");
  scanf("%d", &y);
  printf("The sum of %d and %d is %d\n", x, y, x + y);
  return 0;
}


There is no fundamental string structure, class or keyword in C. Strings are represented as character arrays terminated by a null character. The length of the string is not stored anywhere. Either you know the string length and keep track of it yourself, or you determine the string length by stepping through the string until you reach the null character.

Running the example below, you can see that strings are essentially arrays of numeric values, which are interpreted as characters using a character encoding such as ASCII.

#include <stdio.h>

int main() {
  char hello[] = "Hello World!";
  int i;
  for (i = 0; i < 13; ++i) {
    printf("%c = %d\n", hello[i], (int)hello[i]);
  }
  return 0;
}

String Manipulation

String manipulation functions can be found in string.h and can be used for tasks such as copying or comparing strings. You can consult a reference to check out what functions are available. In the example below, we copy a string from the literal into a buffer, modify the string, and then compare it with the original.

#include <stdio.h>
#include <string.h>

int main() {
  char buffer[6];
  char* h = "hello";

  // copy the string from 'h' into 'buffer'
  strcpy(buffer, h);
  // now change the first letter in the buffer
  buffer[0] = 'j';

  int result = strcmp(buffer, h);
  if (result == 0) {
    printf("They match!\n");
  } else {
    printf("They do not match!\n");
    printf("Expected: %s\nActual: %s\n",
      h, buffer);
  }
  return 0;
}

Arrays Vs. Pointers

Arrays and pointers in C both can use the operator [] to access values at offsets. This makes a pointer to the first member of an array seem very similar to the actual array variable itself. Using the operator on either the pointer or the array would return the same thing.

However, there are some major differences if you know where to look. For example, consider the sizeof operator. The sizeof operator returns the number of bytes of memory that a variable takes up. For a pointer, this is usually either 4 bytes (32 bits) or 8 bytes (64 bits) depending on whether it’s a 32-bit or 64-bit program. For arrays, the value is the total size of the array data, which is the number of elements times the size of an individual element.

When an array is passed to a function, that function just gets a pointer to the first element of the array. It’s said that the array has ‘decayed’ into a pointer. You cannot pass an actual array into a function. Just to confuse you, there’s no compiler error for specifying that a function takes an array. Instead, the compiler will treat the array argument as if it were a pointer argument.

My recommendation is to never use the array type for function arguments. If you’re going to just end up with a pointer, you might as well be explicit about it. The program below illustrates the differences between arrays and pointers in relation to sizeof and functions.

#include <stdio.h>

void print_sizeof(char* ptr) {
  printf("The passed variable is %u bytes large\n", (unsigned int)sizeof(ptr));
}

int main() {
  char arr[] = "Hello World!";
  char* ptr = "Hello World!";

  printf("arr is %u bytes large\n", (unsigned int)sizeof(arr));
  printf("ptr is %u bytes large\n", (unsigned int)sizeof(ptr));

  printf("when passed to a function, an array decays into a pointer.\n");
  print_sizeof(arr);

  return 0;
}


  1. Well, you can actually give the source file any extension you want, but then you’d have to pass -x c to gcc to explicitly say the language is C. See the docs.

Intro to SSH

Remote Access

Most desktop machines you’ll encounter are x86 architecture machines. To properly learn to work with the ARM architecture, you’ll need an ARM machine. If you don’t have one, don’t worry! You can connect to one of the university ARM servers to do your work in this course.

The names of the various CPSC servers are listed on the Remote Access page. You’ll need to use SSH to access these servers. On Linux or OSX, you can use a terminal command like so: ssh

Of course, cgbloor is my CPSC username. You’ll need to replace that with your own. If connecting from Windows, you’ll want to use a program like PuTTY and specify the server in the hostname section.


When working remotely on a server, you’ll need a way to write code. For most beginners I suggest using nano. It’s easy to get started with because the most common keyboard shortcuts are all listed at the bottom.

You’ll find that most programmers working with editors on the command line eventually move to vim or emacs. Many developers will argue vehemently that one or the other is better. It’s a complex, multi-faceted debate, with countless nuanced and dissenting opinions. However, all opinions which favour emacs are objectively wrong.

“Real programmers set the universal constants at the start such that the universe evolves to contain the disk with the data they want.” (xkcd 378)

Vim requires a bit of hand-holding for new users because it’s quite unlike most other text editors. If you are interested in learning, you should find a guide, because without help, people tend to struggle even just typing out a single word. Not that vim is hard to learn. It’s easy. You just have to take some time to do it. There are plenty of good materials for learning. My favourite introduction is this one. You can also run the command vimtutor to open up a document that will guide you.