IEEE 754 Machine Numbers and Machine Arithmetic

In order to make numerical programs portable between different machines, the IEEE 754 standard defines machine numbers and how arithmetic operations should be performed.

Machine Numbers

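Machine numbers consist of normalized numbers, subnormal numbers, ±0, ±Inf and NaN. The most important values for single and double precision are listed in the following table.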

Single and Double Precision Machine Numbers

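                  bits  n   $e_{\max}$  $e_{\min}$  $\varepsilon_M$                    $x_{\max}$                                  $x_{\min}$
Single Precision  32    23  127         -126        $2^{-24} \approx 6\cdot 10^{-8}$   $\approx 2^{128} \approx 3\cdot 10^{38}$    $2^{-126} \approx 10^{-38}$
Double Precision  64    52  1023        -1022       $2^{-53} \approx 10^{-16}$         $\approx 2^{1024} \approx 2\cdot 10^{308}$  $2^{-1022} \approx 2\cdot 10^{-308}$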
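
The last three columns follow from $n$, $e_{\max}$ and $e_{\min}$ alone. The following is a minimal Python sketch of that calculation, not part of the standard. The helper name `derived_values` is made up for illustration. It assumes that $x_{\max}$ is the largest normalized number $(2 - 2^{-n}) \cdot 2^{e_{\max}}$, that $x_{\min}$ is the smallest normalized number $2^{e_{\min}}$, and that Python's `float` is an IEEE 754 double.

```python
import sys
from fractions import Fraction

def derived_values(n, emax, emin):
    """Return (eps_M, xmax, xmin) as exact fractions (hypothetical helper)."""
    eps_m = Fraction(1, 2 ** (n + 1))              # 2^(-n-1)
    xmax = (2 - Fraction(1, 2 ** n)) * 2 ** emax   # mantissa 1.11...1, exponent emax
    xmin = Fraction(1, 2 ** (-emin))               # mantissa 1.00...0, exponent emin
    return eps_m, xmax, xmin

single = derived_values(23, 127, -126)
double = derived_values(52, 1023, -1022)

for name, (eps_m, xmax, xmin) in (("single", single), ("double", double)):
    print(f"{name}: eps_M = {float(eps_m):.2e}, "
          f"xmax = {float(xmax):.2e}, xmin = {float(xmin):.2e}")

# Compare the double precision values with what the Python runtime reports.
print(float(double[1]) == sys.float_info.max)
print(float(double[2]) == sys.float_info.min)
```

Note that `sys.float_info.epsilon` is $2^{-n}$, the spacing between 1 and the next larger machine number, which is twice the $\varepsilon_M$ of the table.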

Internal Representation of Machine Numbers

Machine numbers use 1 bit for the sign, k bits for the exponent and n bits for the mantissa. They are represented as a sequence of bits (each of which is 0 or 1) as follows:

$s \;\; e_1 \dots e_k \;\; d_1 \dots d_n$

For single precision numbers we have n=23, k=8.

For double precision numbers we have n=52, k=11.

The sign is "+" for s=0 and "-" for s=1.

The exponent is obtained as $e = (e_1 \dots e_k)_2 - b$ where $b = 2^{k-1} - 1$. The largest and smallest values of $e$ are used to represent special values. Hence the smallest remaining value is $e_{\min} = 1 - b = 2 - 2^{k-1}$, and the largest remaining value is $e_{\max} = 2^k - 2 - b = 2^{k-1} - 1$.
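
As a quick check of these formulas, the sketch below computes the bias and the exponent range from $k$. The function name `exponent_range` is an illustrative assumption, not something defined by the standard.

```python
def exponent_range(k):
    """Return (b, emin, emax) for a k-bit exponent field (hypothetical helper)."""
    b = 2 ** (k - 1) - 1          # bias
    emin = 1 - b                  # stored field value 0 is reserved
    emax = (2 ** k - 2) - b       # stored field value 2^k - 1 is reserved
    return b, emin, emax

print(exponent_range(8))    # single precision
print(exponent_range(11))   # double precision
```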

For $e_{\min} \le e \le e_{\max}$ we have
$x = \pm(1.d_1 \dots d_n)_2 \cdot 2^{e}$, representing normalized numbers.
For $e = e_{\min} - 1$ we have
$x = \pm(0.d_1 \dots d_n)_2 \cdot 2^{e_{\min}}$, representing ±0 and subnormal numbers (aka denormalized numbers).
For $e = e_{\max} + 1$ we have
x = ±Inf if all $d_j = 0$
x = NaN otherwise
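
The case distinction above can be tried out on real bit patterns. The following is a minimal sketch for double precision, assuming that Python's `float` is an IEEE 754 double. The helpers `decode` and `value` are hypothetical names introduced only for this illustration.

```python
import struct
from fractions import Fraction

N, K = 52, 11                 # mantissa and exponent bits of a double
B = 2 ** (K - 1) - 1          # bias b
EMIN, EMAX = 1 - B, B

def decode(x):
    """Split a Python float into (s, e, d) and classify it (hypothetical helper)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))   # 64-bit pattern as integer
    s = bits >> (N + K)                        # sign bit
    e = ((bits >> N) & (2 ** K - 1)) - B       # e = (e1...ek)_2 - b
    d = bits & (2 ** N - 1)                    # (d1...dn)_2 as an integer
    if EMIN <= e <= EMAX:
        kind = "normalized"
    elif e == EMIN - 1:
        kind = "zero" if d == 0 else "subnormal"
    else:
        kind = "Inf" if d == 0 else "NaN"
    return s, e, d, kind

def value(s, e, d):
    """Exact value of a finite machine number, given its three fields."""
    sign = -1 if s else 1
    if e == EMIN - 1:
        # (0.d1...dn)_2 * 2^emin
        return sign * Fraction(d, 2 ** N) * Fraction(2) ** EMIN
    # (1.d1...dn)_2 * 2^e
    return sign * (1 + Fraction(d, 2 ** N)) * Fraction(2) ** e

for x in (1.0, -0.75, 5e-324, 0.0, float("inf"), float("nan")):
    print(x, decode(x))
```

Working with `Fraction` keeps the reconstructed value exact, so `value(*decode(x)[:3])` can be compared with `Fraction(x)` without any further rounding.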

Note: All numbers with sign "+", arranged by size from +0 up to +Inf, correspond to the bit sequences (0 0...0 0...0) up to (0 1...1 0...0), arranged as binary integers. It is therefore easy to compare two machine numbers or to find the next smaller or larger machine number.
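
This ordering property can be illustrated with a short sketch for double precision, again assuming that Python's `float` is an IEEE 754 double. The helper names `to_bits`, `from_bits` and `next_larger` are made up for this example, and `next_larger` is only meant for arguments with sign "+" that are smaller than +Inf.

```python
import struct

def to_bits(x):
    """Bit pattern of a double, read as a 64-bit unsigned integer."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits

def from_bits(bits):
    """Double whose bit pattern is the given 64-bit unsigned integer."""
    (x,) = struct.unpack(">d", struct.pack(">Q", bits))
    return x

def next_larger(x):
    """Next larger machine number for +0 <= x < +Inf (hypothetical helper)."""
    return from_bits(to_bits(x) + 1)

# Sorting nonnegative numbers by bit pattern gives the same order as sorting by value.
xs = [0.0, 5e-324, 2.2250738585072014e-308, 1.0, 1.5, 1e308, float("inf")]
print(sorted(xs, key=to_bits) == xs)

print(next_larger(0.0))   # smallest positive subnormal number
print(next_larger(1.0))   # 1 + 2^-52
```

In practice one would normally use a library routine for this, such as `math.nextafter` in Python 3.9 and later. The sketch only shows why the operation is cheap.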

Rounding

Normally rounding "to nearest" is enabled. Let x be an arbitrary real number.
For $|x| \ge (2 - 2^{-n-1}) \cdot 2^{e_{\max}}$
fl(x) = ±Inf
otherwise
fl(x) is the nearest machine number. In the case of a tie the number with $d_n = 0$ is chosen.
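
Both rules can be observed in double precision ($n = 52$, $e_{\max} = 1023$). The following is a small sketch under stated assumptions, not a reference implementation. The helper `fl` is a hypothetical name, and it relies on CPython's integer true division returning the correctly rounded double, which holds only as long as the result does not overflow.

```python
import sys
from fractions import Fraction

def fl(x):
    """Round an exact Fraction to the nearest double (hypothetical helper).

    Assumes CPython's int / int division is correctly rounded;
    not usable for values that overflow.
    """
    return x.numerator / x.denominator

u = Fraction(1, 2 ** 53)      # half the spacing of machine numbers in [1, 2)

below_tie = fl(1 + u)         # tie between 1 and 1 + 2^-52
above_tie = fl(1 + 3 * u)     # tie between 1 + 2^-52 and 1 + 2^-51
print(below_tie == 1.0)
print(above_tie == 1.0 + 2.0 ** -51)

# Overflow threshold: xmax + 2^970 equals (2 - 2^-53) * 2^1023 exactly.
xmax = sys.float_info.max
print(xmax + 2.0 ** 970)             # at the threshold: rounds to Inf
print(xmax + 2.0 ** 969 == xmax)     # below the threshold: rounds back to xmax
```

In both ties the result is the neighbor whose last mantissa bit $d_n$ is 0: the number 1 in the first case and $1 + 2^{-51}$ in the second.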

If $x_{\min} \le |x| \le x_{\max}$, the rounding error is bounded by the machine epsilon $\varepsilon_M = 2^{-n-1}$:
      $\left| \dfrac{\mathrm{fl}(x) - x}{x} \right| \le \varepsilon_M = 2^{-n-1}$
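
The bound can be checked exactly for a few sample values in double precision. This is only a sketch: `relative_error` is a made-up helper, the samples are arbitrary, and the code assumes that CPython's integer true division returns the correctly rounded double.

```python
from fractions import Fraction

EPS_M = Fraction(1, 2 ** 53)      # machine epsilon 2^(-n-1) for n = 52

def relative_error(x):
    """Exact value of |(fl(x) - x) / x| for a nonzero Fraction x."""
    fl_x = Fraction(x.numerator / x.denominator)   # round to double, convert back exactly
    return abs((fl_x - x) / x)

samples = [Fraction(1, 10), Fraction(1, 3), Fraction(22, 7), 1 + Fraction(1, 2 ** 53)]
errors = [relative_error(x) for x in samples]

for x, err in zip(samples, errors):
    print(x, float(err), err <= EPS_M)
```

The last sample lies exactly halfway between two machine numbers, so its relative error comes close to the bound without reaching it.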

Machine Arithmetic