CSC 133 - Discrete Mathematical Structures
Most languages use IEEE (Institute of Electrical and Electronics Engineers) Standard 754 to store real numbers. Knowledge of the standard may no longer be crucial for programmers, because languages abstract this detail away, but it is placed here for completeness and as a reference in case raw data must be parsed and the result is not as expected.
IEEE 754:
The IEEE 754 Standard uses the 1-plus form of the binary normalized fraction (rounded). The fraction part is called the mantissa. 1-plus normalized scientific notation in base two is then
N = (1.b_{1}b_{2}b_{3}b_{4}...)_{2} x 2^{±E}
The 1 is understood to be there and is not recorded.
The primitive data type float is 4 bytes, or 32 bits:
1 bit sign | 8 bit exponent | 23 bit mantissa
The double is 8 bytes, or 64 bits, formatted as follows:
1 bit sign | 11 bit exponent | 52 bit mantissa
Sign: 0 positive, 1 negative.
Exponent: excess-127 format for float, excess-1023 format for double.
Mantissa: normalized 1-plus fraction with the 1 to the left of the radix point not recorded, float: b_{1}b_{2}b_{3}b_{4}…b_{23}, double: b_{1}b_{2}b_{3}b_{4}…b_{52}. This value is rounded to the nearest representable value based on the bits that are not recorded: as a first approximation, if there is a 1 in b_{24} (float) or b_{53} (double), the least significant recorded bit is incremented. In an exact tie, the default IEEE 754 rounding mode picks the neighbor whose least significant bit is 0.
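The three fields can be pulled out of a float with shifts and masks. The following is a minimal sketch, not part of the standard; the class name FloatFields and its method names are made up for illustration, while Float.floatToIntBits is the standard library call that exposes the raw 32 bits.

```java
final class FloatFields {
    // Sign bit: 0 for positive, 1 for negative.
    static int sign(float x) {
        return Float.floatToIntBits(x) >>> 31;
    }

    // Raw 8-bit exponent field, still in excess-127 form.
    static int exponentField(float x) {
        return (Float.floatToIntBits(x) >>> 23) & 0xFF;
    }

    // The 23 recorded mantissa bits b1..b23 (the leading 1 is not stored).
    static int mantissaField(float x) {
        return Float.floatToIntBits(x) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float x = -6.5f; // -(1.101) base 2, times 2^2
        System.out.println("sign     = " + sign(x));
        System.out.println("exponent = " + exponentField(x)
                + " (E = " + (exponentField(x) - 127) + ")");
        // Integer.toBinaryString drops leading zeros, so pad by hand if needed.
        System.out.println("mantissa = " + Integer.toBinaryString(mantissaField(x)));
    }
}
```

For -6.5 the sign is 1, the exponent field is 129 (E = 2), and the mantissa field begins 101 followed by zeros.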
The Float:
The largest positive finite float literal is 3.40282347e+38f.
0111 1111 0111 1111 1111 1111 1111 1111
The smallest positive finite nonzero literal of type float is 2^{-149}.
0000 0000 0000 0000 0000 0000 0000 0001
The largest positive finite double literal is 1.79769313486231570e+308.
The smallest positive finite nonzero literal of type double is 4.94065645841246544e-324.
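These limits can be checked against the constants in the Float and Double classes. The sketch below is only an illustration; the class FloatLimits and its hex helper are hypothetical names, and Math.scalb is used to build exact powers of two.

```java
final class FloatLimits {
    // Raw bit pattern of a float as 8 hex digits.
    static String hex(float x) {
        return String.format("%08X", Float.floatToIntBits(x));
    }

    public static void main(String[] args) {
        System.out.println(hex(Float.MAX_VALUE)); // 7F7FFFFF
        System.out.println(hex(Float.MIN_VALUE)); // 00000001
        // The literals quoted above are exactly the library constants.
        System.out.println(Float.MAX_VALUE == 3.40282347e+38f);
        System.out.println(Double.MAX_VALUE == 1.79769313486231570e+308);
        System.out.println(Double.MIN_VALUE == 4.94065645841246544e-324);
        // Smallest positive values are the powers of two 2^-149 and 2^-1074.
        System.out.println(Float.MIN_VALUE == Math.scalb(1.0f, -149));
        System.out.println(Double.MIN_VALUE == Math.scalb(1.0, -1074));
    }
}
```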
A compile-time error occurs if a nonzero floating-point literal is too large, so that on rounded conversion to its internal representation it becomes an IEEE 754 infinity. A program can represent infinities without producing a compile-time error by using constant expressions such as 1f/0f or -1d/0d or by using the predefined constants POSITIVE_INFINITY and NEGATIVE_INFINITY of the classes Float and Double.
A compile-time error occurs if a nonzero floating-point literal is too small, so that, on rounded conversion to its internal representation, it becomes a zero. A compile-time error does not occur if a nonzero floating-point literal has a small value that, on rounded conversion to its internal representation, becomes a nonzero denormalized number.
When the exponent field is all zeros, the mantissa is interpreted to be denormalized.
Not-a-Number, NaN, results from 0/0 or ∞/∞.
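The special values described above can be produced at run time with ordinary division. This is a small sketch under the assumption that the code is compiled as standard Java; SpecialValues and divide are illustrative names only.

```java
final class SpecialValues {
    // Plain float division; no exception is thrown for a zero divisor.
    static float divide(float a, float b) {
        return a / b;
    }

    public static void main(String[] args) {
        float posInf = divide(1f, 0f);
        float negInf = divide(-1f, 0f);
        float nan = divide(0f, 0f);
        System.out.println(posInf == Float.POSITIVE_INFINITY); // true
        System.out.println(negInf == Float.NEGATIVE_INFINITY); // true
        System.out.println(Float.isNaN(nan));                    // true
        System.out.println(Float.isNaN(divide(posInf, posInf))); // true
        // NaN compares unequal to everything, including itself.
        System.out.println(nan == nan);                          // false
        // A very small literal that rounds to a nonzero denormalized
        // value compiles without error.
        float tiny = 1e-45f;
        System.out.println(tiny == Float.MIN_VALUE);             // true
    }
}
```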
Possible Float Representations | Exponent (E) | Mantissa (fraction part, f) | Evaluation
00000000000000000000000000000000 | E_{min} - 1 (-127) | f = 0 | +0
10000000000000000000000000000000 | E_{min} - 1 (-127) | f = 0 | -0
*00000000f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i} | E_{min} - 1 (-127) | f ≠ 0, f_{i} = 0 or 1 but not all 0 | ±0.f x 2^{E_{min}} (denormalized)
*eeeeeeee*********************** | -126 ≤ E ≤ 127 | any | ±1.f x 2^{E}
*11111111f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i} | E_{max} + 1 (128) | f ≠ 0, f_{i} = 0 or 1 but not all 0 | NaN
01111111100000000000000000000000 | E_{max} + 1 (128) | f = 0 | +∞
11111111100000000000000000000000 | E_{max} + 1 (128) | f = 0 | -∞
(* stands for either 0 or 1; E_{min} = -126 and E_{max} = 127.)
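The rows of the table translate directly into a test on the exponent and mantissa fields. The following sketch assumes the class and method names FloatClassifier and classify, which are invented here for illustration.

```java
final class FloatClassifier {
    // Classifies a float by its exponent field (e) and mantissa field (f).
    static String classify(float x) {
        int bits = Float.floatToRawIntBits(x);
        int e = (bits >>> 23) & 0xFF; // 8-bit exponent field
        int f = bits & 0x7FFFFF;      // 23-bit mantissa field
        if (e == 0) {
            return f == 0 ? "zero" : "denormalized";
        }
        if (e == 255) {
            return f == 0 ? "infinity" : "NaN";
        }
        return "normalized";
    }

    public static void main(String[] args) {
        float[] samples = {0f, -0f, Float.MIN_VALUE, 1f,
                Float.MAX_VALUE, Float.NEGATIVE_INFINITY, Float.NaN};
        for (float s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```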
Where do all the floating point numbers fall?
The maximum number of distinct values that can be represented with 32 bits is 2^{32}, whether the format is unsigned integer, two's complement integer, or IEEE 754 single precision rounded. Interestingly, almost half of all finite floats (IEEE 754 single precision rounded) lie in [-1 to +1]; the other half is spread over the entire rest of the number line, i.e. (-∞ to -1) union (1 to +∞). The exponent fields 0 through 126 cover magnitudes below 1, and the fields 127 through 254 cover magnitudes from 1 upward.
[-1 to +1] | 2^{31} - 2^{24} + 2 bit patterns (+0 and -0 counted separately)
10111111100000000000000000000000 (-1.0) |
... |
10111111000000000000000000000001 (-5.0000006e-1) |
10111111000000000000000000000000 (-0.5) |
... |
10000000000000000000000000000000 (-0) |
00000000000000000000000000000000 (+0) |
... |
00111111000000000000000000000000 (0.5) |
... |
00111111100000000000000000000000 (1.0) |
(-∞ to -1) union (1 to +∞) | 2^{31} - 2 finite values, plus the two infinities
NaN: *11111111f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i}f_{i} | 2^{24} - 2 bit patterns, none of which is a distinct numeric value
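The distribution of floats can be counted directly from bit patterns, because for non-negative floats the ordering by value matches the ordering of the 32-bit patterns read as integers. The sketch below relies on that property; FloatCensus and patternsUpTo are hypothetical names, and +0 and -0 are counted as separate patterns.

```java
final class FloatCensus {
    // Number of bit patterns for floats in [0, x], for finite x >= 0.
    static long patternsUpTo(float x) {
        return (long) Float.floatToIntBits(x) + 1;
    }

    public static void main(String[] args) {
        long zeroToOne = patternsUpTo(1.0f);
        long finitePerSign = patternsUpTo(Float.MAX_VALUE);

        long inside = 2 * zeroToOne;                    // [-1, +1]
        long outside = 2 * (finitePerSign - zeroToOne); // finite, beyond -1 or +1
        long nan = (1L << 32) - 2 * finitePerSign - 2;  // rest, minus two infinities

        System.out.println("in [-1, +1]    : " + inside);  // 2130706434
        System.out.println("finite outside : " + outside); // 2147483646
        System.out.println("NaN patterns   : " + nan);     // 16777214
    }
}
```

Under this count the interval [-1, +1] holds 2^{31} - 2^{24} + 2 patterns, slightly fewer than the 2^{31} - 2 finite values outside it.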
The spacing between distinct values (ε) doubles with each increment of the exponent away from zero. For the number 2^{20} this spacing is 0.0625 to the left and 0.125 to the right. To see this, consider the interval to the right of 2^{20}, which contains (2^{21} - 2^{20}) / 0.125 = 8388608 = 2^{23} distinct values. For the number 2^{40} this spacing is 65,536 to the left and 131,072 to the right.
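The spacing figures can be reproduced by asking for the neighboring floats on each side of a value. This is a sketch assuming Java 8 or later (for Math.nextDown); FloatSpacing, gapLeft and gapRight are illustrative names.

```java
final class FloatSpacing {
    // Distance from x to the next float toward positive infinity.
    static float gapRight(float x) {
        return Math.nextUp(x) - x;
    }

    // Distance from x to the next float toward negative infinity.
    static float gapLeft(float x) {
        return x - Math.nextDown(x);
    }

    public static void main(String[] args) {
        float a = Math.scalb(1.0f, 20); // 2^20
        float b = Math.scalb(1.0f, 40); // 2^40
        System.out.println(gapLeft(a) + " | " + gapRight(a)); // 0.0625 | 0.125
        System.out.println(gapLeft(b) + " | " + gapRight(b)); // 65536.0 | 131072.0
        // Number of floats in [2^20, 2^21): interval width divided by spacing.
        System.out.println((long) ((Math.scalb(1.0f, 21) - a) / gapRight(a))); // 8388608
    }
}
```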
Real numbers are assigned the distinct representation that is closest to that real number. At run time, values too large to round to the largest distinct finite value are assigned positive infinity. Similarly, values too far from zero in the negative direction to round to the furthest distinct negative number are assigned negative infinity.
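A narrowing cast from double to float shows both behaviors: rounding to the nearest float, and overflow to infinity. The sketch below uses invented names (RoundAndOverflow, toFloat); the two test values were chosen to sit just below and just above the midpoint between Float.MAX_VALUE and 2^{128}, which is roughly 3.40282357e38.

```java
final class RoundAndOverflow {
    // Narrowing conversion: rounds to the nearest float, or to an infinity.
    static float toFloat(double d) {
        return (float) d;
    }

    public static void main(String[] args) {
        // Slightly above Float.MAX_VALUE but still closest to it.
        System.out.println(toFloat(3.4028235e38) == Float.MAX_VALUE); // true
        // Past the midpoint, so it is assigned positive infinity.
        System.out.println(toFloat(3.4028236e38)); // Infinity
        System.out.println(toFloat(-1e39));        // -Infinity
        // Arithmetic that overflows behaves the same way.
        System.out.println(Float.MAX_VALUE * 2f);  // Infinity
    }
}
```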
Consider 25431.1234, 25431.1230 and 25431.1239: they all have the same representation as a float. Here 25431.1234 falls between 16,384 (2^{14}) and 32,768 (2^{15}). With 23 bits of precision, 14 of which are used for the integer part at the range 2^{14}, there remain 9 bits for the fractional part, so the least significant bit increments at 2^{-9} ≈ .002 = ε. All values between 25431.1221 and 25431.1240 have the same floating point representation, 46C6AE3F.
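This collapse of nearby decimals onto one machine number can be observed by printing the bit patterns. A minimal sketch follows; SameFloat and hex are hypothetical names, and Math.ulp reports the spacing ε at the given value.

```java
final class SameFloat {
    // Raw bit pattern of a float as 8 hex digits.
    static String hex(float x) {
        return String.format("%08X", Float.floatToIntBits(x));
    }

    public static void main(String[] args) {
        System.out.println(hex(25431.1234f)); // 46C6AE3F
        System.out.println(hex(25431.1230f)); // 46C6AE3F
        System.out.println(hex(25431.1239f)); // 46C6AE3F
        // Just outside the interval the representation changes.
        System.out.println(hex(25431.1220f));
        // Spacing at this magnitude: 2^-9.
        System.out.println(Math.ulp(25431.1234f)); // 0.001953125
    }
}
```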
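Conversion Example 1: Write the number 1234.0 as a float.
1234 = (10011010010)_{2} = (1.0011010010)_{2} x 2^{10}, so the sign bit is 0, the exponent field is 10 + 127 = 137 = (10001001)_{2}, and the mantissa is 0011010010 padded with zeros to 23 bits.
1 bit sign | 8 bit exponent | 23 bit mantissa
0 | 100 0100 1 | 001 1010 0100 0000 0000 0000
Here we have too many 0's and 1's to read comfortably, so group this word into 4 digit binary segments and convert to Hex by "lookup".
0100 0100 1001 1010 0100 0000 0000 0000
Hex: 4 4 9 A 4 0 0 0
Decimal: 1234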
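The hand conversion in Example 1 can be written out as a few integer operations and compared with what the library reports. This sketch handles only positive integers below 2^{24} (which need no rounding); ManualEncode and encode are names made up for this illustration.

```java
final class ManualEncode {
    // Encodes a positive integer n < 2^24 as IEEE 754 single-precision bits.
    static int encode(int n) {
        int e = 31 - Integer.numberOfLeadingZeros(n); // position of the leading 1
        int mantissa = (n - (1 << e)) << (23 - e);    // drop the leading 1, left-align
        int exponentField = e + 127;                  // excess-127
        return (exponentField << 23) | mantissa;      // sign bit stays 0
    }

    public static void main(String[] args) {
        int byHand = encode(1234);
        int byLibrary = Float.floatToIntBits(1234.0f);
        System.out.println(String.format("%08X", byHand));    // 449A4000
        System.out.println(String.format("%08X", byLibrary)); // 449A4000
    }
}
```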
Conversion Example 2: Write the number 25,431.1234 as a float.
1 bit sign | 8 bit exponent | 23 bit mantissa
0 | 10001101 | 1000 1101 0101 1100 0111 111
0100 0110 1100 0110 1010 1110 0011 1111
Hex: 4 6 C 6 A E 3 F
Decimal: 25,431.123 (note the loss of precision).
Conversion Example 3: 25431.1234 - 0.01234 = 25431.11106. Carry out this same calculation using IEEE 754 floating point single precision machine numbers.
1.10001101010111000111111 x 2^{14} - 1.10010100010110110110110 x 2^{-7}
= 1.10001101010111000111111 x 2^{14} - 0.00000000000000000000110 x 2^{14}
= 1.10001101010111000111001 x 2^{14}
0100 0110 1100 0110 1010 1110 0011 1001 = 25431.111
Hex: 4 6 C 6 A E 3 9
Decimal: 25431.111 (note the further loss of precision).
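The same subtraction can be run in single precision and compared with the double precision result. This is only a sketch with illustrative names (FloatSubtract, subtract); the expected bit pattern is the one derived by hand above.

```java
final class FloatSubtract {
    // Single-precision subtraction; the result is rounded to a float.
    static float subtract(float a, float b) {
        return a - b;
    }

    public static void main(String[] args) {
        float single = subtract(25431.1234f, 0.01234f);
        double wide = 25431.1234 - 0.01234;
        System.out.println(String.format("%08X", Float.floatToIntBits(single))); // 46C6AE39
        System.out.println(single);
        System.out.println(wide);
        // Error of the single-precision result relative to the double result.
        System.out.println(Math.abs(single - wide));
    }
}
```

The float result is exactly 25431 + 57/512 = 25431.111328125, about 0.0003 away from the true difference.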