Java Primitive Data Types - Reals - IEEE754

Java uses IEEE (Institute of Electrical and Electronics Engineers) Standard 754 to store real numbers. Knowledge of the standard may no longer be crucial for programmers, since the language abstracts this detail away, but it is presented here for completeness and as a reference in case raw data must be parsed and the result is not as expected. See also the IEEE 754 applet.

Related Articles:
From: "The Java Virtual Machine Specification" at
http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.html#33377
From: "The Java Language Specification" at
http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#230798
From: "The Java API"
http://java.sun.com/j2se/1.4/docs/api/java/lang/Float.html

IEEE 754:
The IEEE 754 Standard uses the 1-plus form of the normalized binary fraction (rounded). The fraction part is called the mantissa.
The 1-plus normalized scientific notation in base two is then

$N = \pm (1.b_1 b_2 b_3 b_4 \ldots)_2 \times 2^{E}$

The leading 1 is understood to be there and is not recorded.
The Java primitive data type float is 4 bytes, or 32 bits:

1-bit sign | 8-bit exponent | 23-bit mantissa

While the double is 8 bytes, or 64 bits, formatted as follows:

1-bit sign | 11-bit exponent | 52-bit mantissa

Sign: 0 → positive, 1 → negative.

Exponent: excess-127 format for float, excess-1023 format for double.

• Float: Emin = -126, Emax = 127
• Double: Emin = -1022, Emax = 1023
• Consider an 8-bit number, which has a range of 0 to 255. The stored value, called the excess, is related to the exponent by the formula excess - 127 = E, so excess = 127 represents E = 0.
• In this manner, excess = 120 is an exponent of -7 since 120 - 127 = -7, and excess = 155 is the exponent +28 since 155 - 127 = +28. The excess values 0 (Emin - 1) and 255 (Emax + 1) have special meanings, which are discussed below.

Mantissa: normalized 1-plus fraction with the 1 to the left of the radix point not recorded; float: $b_1 b_2 b_3 \ldots b_{23}$, double: $b_1 b_2 b_3 \ldots b_{52}$. This value is rounded based on the next less significant bit that is not recorded: if there is a 1 in $b_{24}$ (float) or $b_{53}$ (double), the least significant recorded bit is incremented. In the exact halfway case, IEEE 754 rounds to the neighbor whose last recorded bit is 0.
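The three fields can be pulled out of a float with a few shifts and masks. The following is a minimal sketch, not part of the original applet: the class name `FloatFields` and its helper methods are hypothetical, and only the standard `Float.floatToIntBits` call is assumed.

```java
// Sketch: split a float into the sign, exponent, and mantissa fields.
// FloatFields and its method names are illustrative, not a library API.
class FloatFields {
    // Bit 31: 0 for positive, 1 for negative.
    static int sign(float x) {
        return Float.floatToIntBits(x) >>> 31;
    }

    // Bits 30..23: the stored exponent in excess-127 form (0..255).
    static int excess(float x) {
        return (Float.floatToIntBits(x) >>> 23) & 0xFF;
    }

    // E = excess - 127; meaningful for normalized values only.
    static int exponent(float x) {
        return excess(x) - 127;
    }

    // Bits 22..0: the fraction bits b1..b23 (the leading 1 is not stored).
    static int mantissa(float x) {
        return Float.floatToIntBits(x) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float x = -0.15625f; // -1.01 (binary) x 2^-3
        String fraction = String.format("%23s",
                Integer.toBinaryString(mantissa(x))).replace(' ', '0');
        System.out.println("sign     = " + sign(x));
        System.out.println("excess   = " + excess(x) + " (E = " + exponent(x) + ")");
        System.out.println("mantissa = " + fraction);
    }
}
```

For -0.15625 the sketch reports sign 1, excess 124 (E = -3), and a mantissa whose only set bit is $b_2$.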

The Java Float:

The largest positive finite `float` literal is `3.40282347e+38f`

0111 1111 0111 1111 1111 1111 1111 1111

The smallest positive finite nonzero literal of type `float` is `1.40129846e-45f`

0000 0000 0000 0000 0000 0000 0000 0001

The largest positive finite `double` literal is `1.79769313486231570e+308`
The smallest positive finite nonzero literal of type `double` is `4.94065645841246544e-324`.
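These limits are also available as the constants `MAX_VALUE` and `MIN_VALUE` of the classes `Float` and `Double`, so their bit patterns can be inspected directly. A small sketch follows; the class `FloatLimits` and its two helpers are hypothetical names chosen for illustration.

```java
// Sketch: print the raw bit patterns of the largest and smallest
// positive finite float and double values.
class FloatLimits {
    // 32-bit pattern of a float as 8 hex digits.
    static String hexFloat(float x) {
        return String.format("%08X", Float.floatToIntBits(x));
    }

    // 64-bit pattern of a double as 16 hex digits.
    static String hexDouble(double x) {
        return String.format("%016X", Double.doubleToLongBits(x));
    }

    public static void main(String[] args) {
        System.out.println(hexFloat(Float.MAX_VALUE));   // 7F7FFFFF
        System.out.println(hexFloat(Float.MIN_VALUE));   // 00000001
        System.out.println(hexDouble(Double.MAX_VALUE)); // 7FEFFFFFFFFFFFFF
        System.out.println(hexDouble(Double.MIN_VALUE)); // 0000000000000001
    }
}
```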

A compile-time error occurs if a nonzero floating-point literal is too large, so that on rounded conversion to its internal representation it becomes an IEEE 754 infinity. A program can represent infinities without producing a compile-time error by using constant expressions such as `1f/0f` or `-1d/0d` or by using the predefined constants `POSITIVE_INFINITY` and `NEGATIVE_INFINITY` of the classes `Float` and `Double`.

A compile-time error occurs if a nonzero floating-point literal is too small, so that, on rounded conversion to its internal representation, it becomes a zero. A compile-time error does not occur if a nonzero floating-point literal has a small value that, on rounded conversion to its internal representation, becomes a nonzero denormalized number.

When the exponent field is all zeros, the mantissa is interpreted as denormalized: the digit to the left of the radix point is taken to be 0 rather than 1.

Not-a-Number, NaN, results from 0/0 or ∞/∞.

| Possible float representations | Exponent (E) | Mantissa (fraction part, f) | Evaluation |
|---|---|---|---|
| `0 00000000 00000000000000000000000` | Emin - 1 (-127) | f = 0 | +0 |
| `1 00000000 00000000000000000000000` | Emin - 1 (-127) | f = 0 | -0 |
| `* 00000000 f1f2...f23` | Emin - 1 (-127) | f ≠ 0 (each fi = 0 or 1, but not all 0) | ±0.f × $2^{Emin}$ |
| `* eeeeeeee ***********************` | -126 ≤ E ≤ 127 | any | ±1.f × $2^{E}$ |
| `* 11111111 f1f2...f23` | Emax + 1 (128) | f ≠ 0 (each fi = 0 or 1, but not all 0) | NaN |
| `0 11111111 00000000000000000000000` | Emax + 1 (128) | f = 0 | +∞ |
| `1 11111111 00000000000000000000000` | Emax + 1 (128) | f = 0 | -∞ |

Here `*` stands for a bit that may be 0 or 1, and `e` for an exponent bit (not all 0 and not all 1).
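The table translates almost line for line into a classifier that looks only at the exponent and fraction fields. This is an illustrative sketch; `FloatKinds.classify` is a hypothetical helper, and the category strings are chosen here for readability.

```java
// Sketch: classify a float by its exponent field and fraction field.
class FloatKinds {
    static String classify(float x) {
        int bits = Float.floatToRawIntBits(x); // raw form keeps NaN payloads
        int excess = (bits >>> 23) & 0xFF;     // stored exponent, 0..255
        int f = bits & 0x7FFFFF;               // 23 fraction bits

        if (excess == 0) {                     // Emin - 1
            return f == 0 ? "zero" : "denormalized";
        }
        if (excess == 255) {                   // Emax + 1
            return f == 0 ? "infinity" : "NaN";
        }
        return "normalized";
    }

    public static void main(String[] args) {
        System.out.println(classify(0f / 0f));         // NaN
        System.out.println(classify(1f / 0f));         // infinity
        System.out.println(classify(-0.0f));           // zero
        System.out.println(classify(Float.MIN_VALUE)); // denormalized
        System.out.println(classify(1234.0f));         // normalized
    }
}
```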

Where do all the floating point numbers fall?

The maximum number of distinct values that can be represented with 32 bits is $2^{32}$, whether the format is unsigned integer, two's complement integer, or IEEE 754 single precision (rounded). Interestingly, nearly half of those bit patterns are Java floats in [-1 to +1]; the rest of the number line, (-∞ to -1) ∪ (1 to +∞), holds only slightly more.

[-1 to +1]: $2^{31} - 2^{24} + 2$ distinct bit patterns (+0 and -0 counted separately)

10111111100000000000000000000000    (-1.0)
...
10111111000000000000000000000001    (-5.0000006e-1)
10111111000000000000000000000000    (-0.5)
...
10000000000000000000000000000000    (-0)
00000000000000000000000000000000    (+0)
...
00111111000000000000000000000000    (0.5)
...
00111111100000000000000000000000    (1.0)

(-∞ to -1) ∪ (1 to +∞): $2^{31} - 2$ distinct finite values, plus the two infinities. The remaining $2^{24} - 2$ bit patterns, those with exponent field 11111111 and f ≠ 0, are non-distinct representations of NaN.
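One way to check counts like these is to use the fact that non-negative floats are ordered the same way as their bit patterns, so counting values in a range reduces to integer subtraction. The sketch below assumes that property; `FloatCensus` and its method names are hypothetical.

```java
// Sketch: count float bit patterns per region of the number line.
class FloatCensus {
    // Patterns in [-1, +1]: 0x00000000..0x3F800000 for each sign.
    static long inUnitInterval() {
        long perSign = Float.floatToIntBits(1.0f) + 1L;
        return 2 * perSign;
    }

    // Finite patterns with magnitude greater than 1, both signs.
    static long finiteBeyondOne() {
        long perSign = (long) Float.floatToIntBits(Float.MAX_VALUE)
                     - Float.floatToIntBits(1.0f);
        return 2 * perSign;
    }

    // Exponent field all ones and a nonzero fraction, both signs.
    static long nanPatterns() {
        return 2 * ((1L << 23) - 1);
    }

    public static void main(String[] args) {
        long infinities = 2;
        long total = inUnitInterval() + finiteBeyondOne()
                   + nanPatterns() + infinities;
        System.out.println("[-1, +1]        : " + inUnitInterval());
        System.out.println("finite, |x| > 1 : " + finiteBeyondOne());
        System.out.println("NaN patterns    : " + nanPatterns());
        System.out.println("total == 2^32   : " + (total == (1L << 32)));
    }
}
```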

The spacing between adjacent distinct values (ε) doubles with each increment of the exponent away from zero. For the number $2^{20}$ this spacing is 0.0625 to the left and 0.125 to the right. To see this, consider the right side of $2^{20}$, which contains $(2^{21} - 2^{20}) / 0.125 = 8388608 = 2^{23}$ distinct values. For the number $2^{40}$ this spacing is 65,536 to the left and 131,072 to the right. Real numbers are assigned the distinct representation that is closest to that real number. At run time, values that round beyond the largest finite value are assigned positive infinity. Similarly, values that round beyond the most negative finite value are assigned negative infinity.
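The spacing on either side of a value can be measured with the standard `Math.ulp` and `Math.nextDown` methods (the latter requires Java 8 or later). A brief sketch, with the hypothetical helper names `gapRight` and `gapLeft`, intended for positive finite inputs:

```java
// Sketch: measure the gap to the neighboring floats on each side.
class FloatSpacing {
    // Distance from x to the next float larger in magnitude.
    static float gapRight(float x) {
        return Math.ulp(x);
    }

    // Distance from x down to the next smaller float.
    static float gapLeft(float x) {
        return x - Math.nextDown(x);
    }

    public static void main(String[] args) {
        float p20 = Math.scalb(1.0f, 20); // 2^20
        float p40 = Math.scalb(1.0f, 40); // 2^40
        System.out.println(gapLeft(p20) + " | " + gapRight(p20));
        System.out.println(gapLeft(p40) + " | " + gapRight(p40));
        // Overflow past the largest finite value yields infinity.
        System.out.println(Float.isInfinite(Float.MAX_VALUE * 2f));
    }
}
```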
Try it! In the IEEE 754 applet, type in 25431.1234 and click on the Parse Input button. Then try 25431.1230 or 25431.1239; they all have the same representation as a Java float. Here 25431.1234 falls between 16,384 ($2^{14}$) and 32,768 ($2^{15}$). With 23 bits of precision and an exponent of 14, the first 14 mantissa bits hold the integer part and 9 bits remain for the fraction, so the least significant bit increments at $2^{-9} \approx 0.002 = ε$. All values between 25431.1221 and 25431.1240 have the same floating point representation, 46C6AE3F.
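The same experiment can be run without the applet. The sketch below is an assumption about how one might do it, not the applet's own code; `SameFloat.hex` is a hypothetical helper that prints the stored word in hexadecimal.

```java
// Sketch: several nearby decimal inputs collapse to one float.
class SameFloat {
    static String hex(float x) {
        return String.format("%08X", Float.floatToIntBits(x));
    }

    public static void main(String[] args) {
        String[] inputs = {"25431.1234", "25431.1230", "25431.1239",
                           "25431.1221", "25431.1240", "25431.1250"};
        for (String s : inputs) {
            // parseFloat rounds the decimal text to the nearest float.
            System.out.println(s + " -> " + hex(Float.parseFloat(s)));
        }
    }
}
```

The first five inputs print 46C6AE3F; the last one lies in the next interval and prints 46C6AE40.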


Conversion Example 1: Write the number 1234.0 as a Java float.

1. Use repeated division by 16 to convert the integer part to hexadecimal followed by a four digit binary "lookup"
(recall that the first remainder out of this algorithm is adjacent to the radix point):
1234 ÷ 16 = 77 remainder 2, so a0 = 2
77 ÷ 16 = 4 remainder 13, so a1 = D
4 ÷ 16 = 0 remainder 4, so a2 = 4
$4D2_{16}$ = 0100 1101 0010.
2. Determine the radix point shift for this number to achieve 1-plus normalized binary scientific notation:
0100 1101 0010. = 1.0011010010 × $2^{10}$
Write the exponent in excess-127 format: 137 - 127 = 10.
Convert the excess value to binary form: 137 = 10001001.
3. Record the mantissa as
$b_1 b_2 b_3 \ldots b_{23}$ = 0011 0100 1000 0000 0000 000.

1-bit sign | 8-bit exponent | 23-bit mantissa
0 | 10001001 | 0011 0100 1000 0000 0000 000

Here we have too many 0's and 1's to read comfortably, so group this Java word into 4-digit binary segments and convert to hex by "lookup".
0100 0100 1001 1010 0100 0000 0000 0000
Hex: 4 4 9 A 4 0 0 0
Decimal: 1234
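The hand conversion can be checked by assembling the three fields into a 32-bit word and asking Java to interpret it as a float. This is a sketch only; `Encode1234.assemble` is a hypothetical helper, and binary literals require Java 7 or later.

```java
// Sketch: rebuild the float word for 1234.0 from its fields.
class Encode1234 {
    // Pack sign (1 bit), excess (8 bits), and mantissa (23 bits).
    static int assemble(int sign, int excess, int mantissa) {
        return (sign << 31) | (excess << 23) | mantissa;
    }

    public static void main(String[] args) {
        int sign = 0;                                    // positive
        int excess = 10 + 127;                           // E = 10 -> 137
        int mantissa = 0b0011_0100_1000_0000_0000_000;   // b1..b23
        int word = assemble(sign, excess, mantissa);

        System.out.println(String.format("%08X", word)); // 449A4000
        System.out.println(Float.intBitsToFloat(word));  // 1234.0
    }
}
```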

Conversion Example 2: Write the number 25,431.1234 as a Java float.

1. Use repeated division/multiplication by 16 to convert the integer/fraction part to hexadecimal followed by a four digit binary "lookup":
$6357.1F97_{16}$ = 0110 0011 0101 0111.0001 1111 1001 0111.
2. Determine the radix shift for this number to achieve 1-plus normalized binary scientific notation:
0110 0011 0101 0111.0001 1111 1001 0111 = 1.100011010101110001111110010111 × $2^{14}$
Write the exponent in excess-127 format: 141 - 127 = 14.
Convert the excess value to binary form: 141 = 10001101.
3. Record the mantissa as (the 24th bit is a 0, so no round up on the 23rd bit)
$b_1 b_2 b_3 \ldots b_{23}$ = 1000 1101 0101 1100 0111 111.

1-bit sign | 8-bit exponent | 23-bit mantissa
0 | 10001101 | 1000 1101 0101 1100 0111 111

0100 0110 1100 0110 1010 1110 0011 1111
Hex: 4 6 C 6 A E 3 F
Decimal: 25,431.123 (note the loss of precision).
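The exact value that was stored, and therefore the size of the error, can be recovered with `java.math.BigDecimal`, whose `double` constructor converts a binary value exactly. A short sketch, with the hypothetical class and method names `StoredValue`, `exact`, and `error`:

```java
import java.math.BigDecimal;

// Sketch: show the exact decimal value held by a float.
class StoredValue {
    // A float widens to double without change, and BigDecimal(double)
    // is exact, so this is the true stored value.
    static BigDecimal exact(float x) {
        return new BigDecimal(x);
    }

    // Difference between the decimal text and the stored float.
    static BigDecimal error(String decimalText) {
        float stored = Float.parseFloat(decimalText);
        return new BigDecimal(decimalText).subtract(exact(stored));
    }

    public static void main(String[] args) {
        System.out.println(exact(25431.1234f));  // stored value
        System.out.println(error("25431.1234")); // lost in conversion
    }
}
```

The stored value is 25431 + 63/512 = 25431.123046875, so about 0.00035 of the original number is lost.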

Conversion Example 3: 25431.1234 - 0.01234 = 25431.11106. Carry out this same calculation using IEEE 754 single precision floating point machine numbers.
1.10001101010111000111111 × $2^{14}$ - 1.10010100010110110110110 × $2^{-7}$ =

1.10001101010111000111111 × $2^{14}$ - 0.00000000000000000000110 × $2^{14}$ =
1.10001101010111000111001 × $2^{14}$

0100 0110 1100 0110 1010 1110 0011 1001 = 25431.111
Hex: 4 6 C 6 A E 3 9
Decimal: 25431.111 (note the further loss of precision).
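The subtraction can be reproduced in Java to compare the machine result with the hand calculation. This sketch uses the hypothetical names `FloatSubtract`, `subtract`, and `hex`; the double precision line is included only for comparison.

```java
// Sketch: repeat Example 3 with float arithmetic and inspect the bits.
class FloatSubtract {
    static float subtract(float a, float b) {
        return a - b; // single precision, rounded to nearest
    }

    static String hex(float x) {
        return String.format("%08X", Float.floatToIntBits(x));
    }

    public static void main(String[] args) {
        float a = 25431.1234f;
        float b = 0.01234f;
        float diff = subtract(a, b);

        System.out.println(hex(a));    // 46C6AE3F
        System.out.println(hex(b));    // 3C4A2DB6
        System.out.println(hex(diff)); // 46C6AE39
        // The same subtraction in double keeps far more digits.
        System.out.println(25431.1234 - 0.01234);
    }
}
```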


by J. A. Tompkins tompkinsj@uncw.edu