IEEE-754 Floating Point

Learning Objectives

Understand the sections of a 32-bit and 64-bit IEEE-754 number.
Memorize the exponent bias for a 32-bit and 64-bit IEEE-754 exponent.
Be able to take an IEEE-754 number and express it in base 10.
Be able to take a base 10 number and express it in IEEE-754.

IEEE-754 Format

IEEE-754 is just a fancy name for a standard that tells you (and the computer) how to arrange 32 bits or 64 bits. In this case, the bits are arranged to represent scientific notation, which is how we "float" the decimal point. We can move the decimal point left by decreasing the exponent, or we can move it right by increasing the exponent. The following equations show how the exponent affects the decimal point's location.

$$1.23\times~10^3=1230$$

$$1.23\times~10^{-3}=.00123$$

IEEE-754 defines three (3) fields: (1) a sign (0 = positive, 1 = negative), (2) an exponent, and (3) a fraction. The sign is a single bit, either 0 or 1, and it DOES NOT follow two's complement, so there is both a -0 and a +0. The exponent is the power that the base, 2, is raised to. The fraction is everything to the right of the decimal point. So, $1.234\times~10^7$ would have a sign of 0 (positive), an exponent of 7 (not exactly, but stay with me so far), and a fraction of .234. This is all the information we need. Keep in mind that .234 is base 10, whereas the IEEE-754 formats covered here describe base 2. Just like multiplying by 10 to some power floats the decimal in base 10, multiplying by 2 to some power floats the decimal in base 2. All IEEE-754 defines is scientific notation!

The exponent field itself is stored as an unsigned number. A bias is added to the actual exponent before it is stored, which lets the unsigned field represent negative exponents as well as positive ones. The 32-bit and 64-bit formats have different biases.

Take a number $1.101_2\times2^1$. Just like base 10, in base 2, multiplying by a power of 2 moves the decimal point right or left. So, the number becomes $11.01_2=3.25_{10}$. In IEEE-754, the number must be 1.something. This is called normalized. So, we can't store 11.01 with an exponent of 0; we must store 1.101 with an exponent of 1. This is fairly easy to do by just adding to or subtracting from the exponent until the decimal point is directly after the leading 1. Because that leading 1 is always there, it doesn't need to be stored, which gives us one more bit for the fraction portion.

32-bit IEEE-754

In C++, the float data type typically uses the 32-bit IEEE-754 format. The 32 bits are arranged as follows.

IEEE-754, 32-bit format. 1-bit sign, 8-bit exponent, 23-bit fraction. Bias is 127.

When we talk about a bias of 127, we mean that if we look at the 8-bit exponent in the 32-bit format, it's going to be 127 bigger than the actual exponent. So, if we want to figure out the actual exponent, we'd subtract 127. Say we had an exponent of $1001\_1100_2=128+16+8+4=156_{10}$. So, we have a biased exponent of 156. We then subtract 127 from 156 to get the actual power of 2: $156-127=29$. This means our number is $1.\text{fraction}\times 2^{29}$. The fraction is copied directly from the last 23 bits of the number. Recall that this is everything to the RIGHT of the decimal. The 1. portion is implied and is not explicitly in the 32-bits.

32-bit Example

Convert the following 32-bit number into decimal: 0x418e_0000.

First, we need to convert this into binary so we can parse out the 1-bit sign, 8-bit exponent, and 23-bit fraction:

$$418e\_0000_{16}=0100\_0001\_1000\_1110\_0000\_0000\_0000\_0000_2$$

The first bit is 0, which means the number is positive. The next 8 bits are $1000\_0011_2=128+2+1=131_{10}$, and the last 23 bits are $.000111_2$ followed by zeroes. Since everything to the right of the last 1 is 0, we can ignore it. Just like 1.230 and 1.23 are identical in base 10, the same holds true for base 2.

Our exponent is 131, so we subtract the bias 127 to get an exponent of $131-127=4$. The fractional portion was $.000111$, and we have an implied 1., so we have $1.000111\times 2^4$. Recall that a positive exponent moves the decimal right. In this case, it moves it to the right by four places. To remind yourself which way the decimal goes, remember: a bigger exponent means a bigger number.

$$1.000111_2\times 2^4=10001.11_2$$

Now, we convert to base 10. Look to the left of the decimal first, $10001_2=16+1=17_{10}$. Then, look to the right of the decimal: $.11_2=0.5+0.25=0.75_{10}$. Putting the numbers together gives us $17+0.75=17.75$.

So, 0x418e_0000 is the decimal value 17.75F.

Another 32-bit Example

Convert -81.0625 into 32-bit, IEEE-754 in hexadecimal.

Going the opposite direction still requires us to find three things: sign, exponent, and fraction. So, as always, first step: convert to binary.

$-81.0625_{10}=-1010001.0001_2=-1010001.0001_2\times 2^0=-1.0100010001_2\times 2^6$.

So, now we have it in binary and in normalized notation, that is 1.something. Now all we have to do is grab the sign, bias the exponent, and write down the fraction.

Sign = $\text{negative}=1$
Exponent = $6+127=133_{10}=1000\_0101_2$
Fraction = $0100\_0100\_0100_2$

So, the sign comes first, followed by the exponent, and finally the fraction--remember to keep adding zeroes until you get 23 total bits for the fraction.

Sign   Exponent      Fraction
[1]    [1000 0101]   [0100 0100 0100 0000 0000 000]

1100 0010 1010 0010 0010 0000 0000 0000
C    2    A    2    2    0    0    0

So, -81.0625 is 0xc2a2_2000 in 32-bit, IEEE-754 format.

64-bit IEEE-754 Format

The 64-bit IEEE-754 format can be used in C++ through the double data type. A double doesn't double every field. We still have the same three fields (sign, exponent, and fraction), but the exponent only grows from 8 bits in the 32-bit format to 11 bits in the 64-bit format, while the fraction gets a MUCH larger space, growing from 23 bits to 52 bits.

IEEE-754, 64-bit format. 1-bit sign, 11-bit exponent, 52-bit fraction. Bias is 1023.

64-bit IEEE-754 Examples

Convert 0xc054_4400_0000_0000 into decimal.

We need to parse out the three fields, so as always, convert to binary:

C    0    5    4    4    4    0    0    0    0    0    0    0    0    0    0
1100 0000 0101 0100 0100 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

We then pick out our 1-bit sign, 11-bit exponent, and 52-bit fraction:

Sign = 1 = negative
Exponent = $100\_0000\_0101_2=1029_{10}$, and $1029-1023=6$
Fraction = $1.0100\_0100\_01_2$

You can see that I already subtracted the 64-bit bias (1023) to get an exponent of 6, and I already added the implied 1. to the fraction. So, now our number is as follows.

$$-1.0100\_0100\_01_2\times 2^6=-1010001.0001_2=-81.0625_{10}$$

So, 0xc054_4400_0000_0000 is the value -81.0625.

Another 64-bit IEEE-754 Example

Convert 17.75 into 64-bit, IEEE-754 format in hexadecimal.

First, convert to binary: $17.75_{10}=10001.11_2$. Then, normalize the value: $10001.11_2\times 2^0=1.000111_2\times 2^4$. So, now we have our sign (0=positive), our exponent (4), and our fraction (.000111). Now, bias our exponent by adding $4+1023=1027_{10}=100\_0000\_0011_2$.

Sign  Exponent        Fraction
[0]   [100 0000 0011] [0001 1100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000]

0100 0000 0011 0001 1100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
4    0    3    1    C    0    0    0    0    0    0    0    0    0    0    0

So, 17.75 is 0x4031_c000_0000_0000 in 64-bit IEEE-754 format.

C++ Float/Double

C++ literals are different for float and double. The literal 1.0 is a double, whereas 1.0F is a float. This allows a programmer to tell C++ whether they want a 64-bit or 32-bit floating point data type.