Java Primitive Data Types - Floating Point

Java uses IEEE (Institute of Electrical and Electronics Engineers) Standard 754 to store real numbers.

Background: Consider the number 1,234.0. In scientific notation this is $1.2340 \times 10^{3}$. There is always a single nonzero units digit to the left of the decimal point (except when the number is itself 0). The fraction part is called the mantissa.

The IEEE 754 Standard uses the 1-plus form of the normalized fraction.
In base two, 1-plus normalized scientific notation is then

$N = \pm (1.b_1 b_2 b_3 b_4 \ldots)_2 \times 2^{E}$

The 1 is understood to be there and is not recorded.
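
The normalized form can be checked directly in Java. The following is a minimal sketch, assuming a normal (nonzero, finite) float; the class and method names are hypothetical and chosen only for illustration. It uses the standard library calls `Math.getExponent` and `Math.scalb` to split a value into its exponent E and its 1-plus significand.

```java
class NormalizedForm {
    /** Returns the binary exponent E such that 1 <= |x| / 2^E < 2 (normal floats only). */
    static int exponentOf(float x) {
        return Math.getExponent(x);
    }

    /** Returns the significand (1.b1b2b3...) as a value in [1, 2) (normal floats only). */
    static float significandOf(float x) {
        // scalb multiplies by a power of two, which is exact for normal values
        return Math.scalb(Math.abs(x), -Math.getExponent(x));
    }

    public static void main(String[] args) {
        float x = 1234.0f;
        System.out.println(x + " = " + significandOf(x) + " x 2^" + exponentOf(x));
    }
}
```

For 1234.0 the exponent is 10, since 1024 <= 1234 < 2048, and the significand is 1234 / 1024.
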
The Java primitive data type float is 4 bytes, or 32 bits:

1-bit sign | 8-bit exponent | 23-bit mantissa

The double is 8 bytes, or 64 bits, formatted as follows:

1-bit sign | 11-bit exponent | 52-bit mantissa

Sign: 0 → positive, 1 → negative.

Exponent: excess-127 format for float, excess-1023 format for double (the stored value is the true exponent plus 127 or 1023, respectively).

Mantissa: normalized 1-plus fraction with the leading 1 not recorded; float: $b_1 b_2 b_3 b_4 \ldots b_{23}$, double: $b_1 b_2 b_3 b_4 \ldots b_{52}$.
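
The three fields can be pulled out of a value with shifts and masks. This is a sketch only: the class and method names are hypothetical, and the sample value -2.5 is an arbitrary choice. The calls `Float.floatToIntBits` and `Double.doubleToLongBits` are the standard Java methods that expose the stored bit pattern.

```java
class Ieee754Fields {
    // float: 1 sign bit, 8 exponent bits (excess-127), 23 mantissa bits
    static int floatSign(float x)           { return Float.floatToIntBits(x) >>> 31; }
    static int floatStoredExponent(float x) { return (Float.floatToIntBits(x) >>> 23) & 0xFF; }
    static int floatMantissa(float x)       { return Float.floatToIntBits(x) & 0x7FFFFF; }

    // double: 1 sign bit, 11 exponent bits (excess-1023), 52 mantissa bits
    static int doubleSign(double x)           { return (int) (Double.doubleToLongBits(x) >>> 63); }
    static int doubleStoredExponent(double x) { return (int) ((Double.doubleToLongBits(x) >>> 52) & 0x7FF); }
    static long doubleMantissa(double x)      { return Double.doubleToLongBits(x) & ((1L << 52) - 1); }

    public static void main(String[] args) {
        float f = -2.5f;   // -2.5 = -(1.01) base 2, times 2^1
        System.out.println("float:  sign=" + floatSign(f)
                + ", stored exponent=" + floatStoredExponent(f)
                + ", true exponent=" + (floatStoredExponent(f) - 127)
                + ", mantissa=0x" + Integer.toHexString(floatMantissa(f)));

        double d = -2.5;
        System.out.println("double: sign=" + doubleSign(d)
                + ", stored exponent=" + doubleStoredExponent(d)
                + ", true exponent=" + (doubleStoredExponent(d) - 1023)
                + ", mantissa=0x" + Long.toHexString(doubleMantissa(d)));
    }
}
```

Subtracting the bias (127 or 1023) from the stored exponent recovers the true exponent in both formats.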

Example: Write the number 1234.0 as a Java float.

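  1. Use repeated division by 8 to convert the integer part to octal, followed by a three-digit binary "lookup" for each octal digit:
    $1234 \div 8 = 154$, remainder $a_0 = 2$,
    $154 \div 8 = 19$, remainder $a_1 = 2$,
    $19 \div 8 = 2$, remainder $a_2 = 3$,
    $2 \div 8 = 0$, remainder $a_3 = 2$.
    $2322_8 = 010\,011\,010\,010_2$.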
  2. Determine the binary point shift for this number to achieve 1-plus normalized binary scientific notation:
    $010\,011\,010\,010._2 = 1.0011010010_2 \times 2^{10}$
    Write the exponent in excess-127 format: the stored exponent is 10 + 127 = 137 (check: 137 - 127 = 10).
    Convert the stored exponent to binary form: $137 = 10001001_2$.
  3. Record the mantissa as
    $b_1 b_2 b_3 b_4 \ldots b_{23}$ = 0011 0100 1000 0000 0000 000.

1-bit sign | 8-bit exponent | 23-bit mantissa
0          | 100 0100 1     | 001 1010 0100 0000 0000 0000

There are too many 0's and 1's here to read comfortably, so group this Java word into 4-digit binary segments and convert each segment to hex by "lookup".
0100 0100 1001 1010 0100 0000 0000 0000
Hex: 4 4 9 A 4 0 0 0
Decimal: 1234
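
The hand calculation above can be cross-checked in Java. The sketch below packs the three fields into a 32-bit word and compares it with the bit pattern Java itself stores for 1234.0f. The class name and the `pack` helper are hypothetical; `Float.floatToIntBits` and `Float.intBitsToFloat` are standard library methods.

```java
class EncodeExample {
    /** Packs a sign bit, an 8-bit stored exponent, and a 23-bit mantissa into one word. */
    static int pack(int sign, int storedExponent, int mantissa) {
        return (sign << 31) | (storedExponent << 23) | mantissa;
    }

    public static void main(String[] args) {
        int sign = 0;                                      // positive
        int storedExponent = 10 + 127;                     // excess-127
        int mantissa = 0b0011_0100_1000_0000_0000_000;     // 23 bits, leading 1 omitted

        int word = pack(sign, storedExponent, mantissa);
        System.out.println("packed word: " + String.format("%08X", word));
        System.out.println("as a float:  " + Float.intBitsToFloat(word));
        System.out.println("Java's bits: " + String.format("%08X", Float.floatToIntBits(1234.0f)));
    }
}
```

If the hand-derived fields are right, the packed word and Java's own bit pattern agree.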

Example: Write the number 25,431.1234 as a Java float.

  1. Use repeated division by 8 for the integer part and repeated multiplication by 8 for the fraction part to convert to octal (fraction truncated to five octal digits), followed by a three-digit binary "lookup":
    $61527.07713_8 = 110\,001\,101\,010\,111.000\,111\,111\,001\,011_2$.
  2. Determine the binary point shift for this number to achieve 1-plus normalized binary scientific notation:
    $110\,001\,101\,010\,111.000\,111\,111\,001\,011_2 = 1.10001101010111000111111001011_2 \times 2^{14}$
    Write the exponent in excess-127 format: the stored exponent is 14 + 127 = 141 (check: 141 - 127 = 14).
    Convert the stored exponent to binary form: $141 = 10001101_2$.
  3. Record the first 23 fraction bits as the mantissa (the remaining bits are dropped):
    $b_1 b_2 b_3 b_4 \ldots b_{23}$ = 1000 1101 0101 1100 0111 111.

1-bit sign | 8-bit exponent | 23-bit mantissa
0          | 10001101       | 1000 1101 0101 1100 0111 111

0100 0110 1100 0110 1010 1110 0011 1111
Hex: 4 6 C 6 A E 3 F
Decimal: 25,431.123 (note the loss of precision).
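
The octal conversion in step 1 can be sketched in Java as follows. The class and method names are hypothetical. The fraction is passed as a decimal string and handled with `java.math.BigDecimal`, an assumption made here so that the repeated multiplication by 8 works on the exact decimal value rather than on an already rounded binary one.

```java
import java.math.BigDecimal;

class OctalConversion {
    /** Integer part: repeated division by 8; the remainders are the octal digits, lowest first. */
    static String integerToOctal(int n) {
        if (n == 0) return "0";
        StringBuilder digits = new StringBuilder();
        while (n > 0) {
            digits.insert(0, n % 8);   // remainder a0, a1, a2, ...
            n /= 8;
        }
        return digits.toString();
    }

    /** Fraction part: repeated multiplication by 8; the integer parts are the octal digits. */
    static String fractionToOctal(String decimalFraction, int digitCount) {
        BigDecimal f = new BigDecimal(decimalFraction);
        BigDecimal eight = BigDecimal.valueOf(8);
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < digitCount; i++) {
            f = f.multiply(eight);
            int d = f.intValue();                     // next octal digit
            digits.append(d);
            f = f.subtract(BigDecimal.valueOf(d));    // keep only the fraction
        }
        return digits.toString();
    }

    public static void main(String[] args) {
        System.out.println(integerToOctal(25431) + "." + fractionToOctal("0.1234", 5) + " (octal)");
        System.out.println(String.format("%08X", Float.floatToIntBits(25431.1234f)));
    }
}
```

The second line printed by `main` is the bit pattern Java stores for 25431.1234f, which can be compared with the hex word derived by hand.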

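Example: In exact decimal arithmetic, 25431.1234 - 0.01234 = 25431.11106. Carry out this same calculation using IEEE 754 single-precision machine numbers.
$1.10001101010111000111111_2 \times 2^{14} - 1.10010100010110110110110_2 \times 2^{-7} =$

Shift the smaller operand right by 14 - (-7) = 21 places so that both operands have the exponent 14; the bits shifted past the 23rd fraction place are lost:

$1.10001101010111000111111_2 \times 2^{14} - 0.00000000000000000000110_2 \times 2^{14} =$
$1.10001101010111000111001_2 \times 2^{14}$

0100 0110 1100 0110 1010 1110 0011 1001
Hex: 4 6 C 6 A E 3 9
Decimal: 25431.111 (exactly 25431.111328125; note the further loss of precision).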
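
The same subtraction can be run in Java to see what the machine actually produces. This is a minimal sketch with hypothetical class and method names. It prints the stored bit patterns of both operands and of the float result, together with the double result for comparison.

```java
class FloatSubtraction {
    /** Formats the 32-bit pattern of a float as eight hex digits. */
    static String hex(float x) {
        return String.format("%08X", Float.floatToIntBits(x));
    }

    /** Performs the subtraction in single precision. */
    static float singlePrecisionDifference() {
        float a = 25431.1234f;
        float b = 0.01234f;
        return a - b;
    }

    public static void main(String[] args) {
        System.out.println("a      = " + hex(25431.1234f));
        System.out.println("b      = " + hex(0.01234f));
        float diff = singlePrecisionDifference();
        System.out.println("a - b  = " + hex(diff) + " = " + diff);
        System.out.println("double = " + (25431.1234 - 0.01234));
    }
}
```

Comparing the float and double lines shows how many decimal digits survive in each format.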
