IEEE 754 Machine Numbers and Machine Arithmetic

To make numerical programs portable between different machines, the IEEE 754 standard defines machine numbers and how arithmetic operations on them should be performed. Virtually all current computers comply with this standard. (The standard has an interesting history, known as the "battle over gradual underflow".)

Machine Numbers

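Machine numbers consist of normalized numbers, subnormal numbers, ±0, ±Infinity and NaN; each kind is described under "Internal Representation of Machine Numbers". The most important values are: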

Single and Double Precision Machine Numbers

                  bits  n   emax  emin   εM                    xmax                      xmin
Single Precision  32    24  128   -125   2^{-24} ≈ 6·10^{-8}   ≈ 2^{128} ≈ 3·10^{38}     2^{-126} ≈ 10^{-38}
Double Precision  64    53  1024  -1021  2^{-53} ≈ 10^{-16}    ≈ 2^{1024} ≈ 2·10^{308}   2^{-1022} ≈ 2·10^{-308}
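
The double precision row can be checked on a given machine. The following is a minimal sketch, assuming Python's float is an IEEE 754 double (true on virtually all platforms, but not guaranteed by the language). The function name double_parameters is made up for this illustration. Note that sys.float_info.epsilon is the gap between 1 and the next larger machine number, 2^{-52}, which is twice the εM used in the table.

```python
import sys

def double_parameters():
    """Collect the table's double precision values from sys.float_info."""
    fi = sys.float_info
    return {
        "n": fi.mant_dig,          # number of mantissa bits, including d1
        "emax": fi.max_exp,        # same (.1d2...dn)_2 convention as the table
        "emin": fi.min_exp,
        "eps_M": fi.epsilon / 2,   # table uses 2^{-n}; Python reports 2^{-(n-1)}
        "xmax": fi.max,
        "xmin": fi.min,            # smallest normalized number
    }

params = double_parameters()
for name, value in params.items():
    print(name, value)
```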

Internal Representation of Machine Numbers

Machine numbers use 1 bit for the sign, k bits for the exponent and n-1 bits for the mantissa. They are represented as a sequence of bits (each of which is 0 or 1) as follows:

s e1 ... ek d2 ... dn

For single precision numbers we have n=24, k=8.

For double precision numbers we have n=53, k=11.

The sign is "+" for s=0 and "-" for s=1.

The exponent is obtained as e = (e1 ... ek)_2 - b where b = 2^{k-1} - 2. The largest and smallest values of e are used to represent special values. Hence the smallest remaining value is emin = 1 - b = 3 - 2^{k-1}, the largest remaining value is emax = 2^k - 2 - b = 2^{k-1}.
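
As a quick check of these formulas, the sketch below evaluates b, emin and emax for a given number k of exponent bits. The function name exponent_range is hypothetical and only serves this illustration.

```python
def exponent_range(k):
    """Return (b, emin, emax) for an exponent field of k bits."""
    b = 2**(k - 1) - 2       # bias
    emin = 1 - b             # exponent field 0...01
    emax = 2**k - 2 - b      # exponent field 1...10
    return b, emin, emax

print(exponent_range(8))     # single precision
print(exponent_range(11))    # double precision
```

For k=8 this gives b=126, emin=-125, emax=128, and for k=11 it gives b=1022, emin=-1021, emax=1024, in agreement with the table above.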

For emin ≤ e ≤ emax we have
    x = ±(.1d2...dn)_2 · 2^e, representing normalized numbers.
For e = emin - 1 we have
    x = ±(.0d2...dn)_2 · 2^{emin}, representing ±0 and subnormal numbers (aka denormalized numbers).
For e = emax + 1 we have
    x = ±Infinity if all dj=0
    x = NaN otherwise
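
The three cases can be traced on actual bit patterns. The following is a sketch under the assumption that Python's float is an IEEE 754 double (n=53, k=11). The names decode and value are invented for this example, and exact rational arithmetic (fractions.Fraction) is used so that the reconstructed value can be compared with the original number without further rounding.

```python
import struct
from fractions import Fraction

N, K = 53, 11                       # double precision
B = 2**(K - 1) - 2                  # bias b = 1022
EMIN, EMAX = 1 - B, 2**K - 2 - B    # -1021 and 1024

def decode(x):
    """Split a double into (s, e, d), where d is the integer (d2...dn)_2."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63                              # sign bit
    field = (bits >> (N - 1)) & (2**K - 1)      # (e1...ek)_2
    d = bits & (2**(N - 1) - 1)                 # (d2...dn)_2
    return s, field - B, d

def value(s, e, d):
    """Evaluate (s, e, d) following the three cases for e."""
    sign = -1 if s else 1
    if e == EMAX + 1:                           # special values
        return float("nan") if d else sign * float("inf")
    if e == EMIN - 1:                           # zero and subnormal numbers
        return sign * Fraction(d, 2**N) * Fraction(2)**EMIN
    # normalized numbers: mantissa (.1d2...dn)_2
    return sign * Fraction(2**(N - 1) + d, 2**N) * Fraction(2)**e

for x in (1.0, 0.1, 5e-324, float("-inf")):
    print(x, decode(x), value(*decode(x)))
```

For x = 1.0 the decoded triple is (0, 1, 0), that is, 1 = +(.10...0)_2 · 2^1.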

Note: All numbers with sign "+", arranged by size from +0 up to +Infinity correspond to all the bit sequences (0 0...0 0...0) up to (0 1...1 0...0), arranged as binary integers. Therefore it is easy to compare two machine numbers, or to find the next smaller or larger machine number.
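
The ordering property in the note can be tried out as follows. This is a sketch assuming Python's float is an IEEE 754 double; to_bits, from_bits and next_up are hypothetical helper names. It only handles numbers with sign bit 0, as in the note.

```python
import struct

def to_bits(x):
    """Bit pattern of a double, read as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def from_bits(b):
    """Double whose bit pattern is the unsigned 64-bit integer b."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]

INF_BITS = to_bits(float("inf"))    # pattern (0 1...1 0...0)

def next_up(x):
    """Next larger machine number for a finite double with sign bit 0."""
    b = to_bits(x)
    if b >= INF_BITS:               # +Infinity, NaN, or sign bit 1
        raise ValueError("expects a finite double with sign bit 0")
    return from_bits(b + 1)         # increment the pattern as an integer

xs = [3.5, 0.0, 1e-320, 1.0, 2.0**-1022, 1e300, float("inf")]
same_order = sorted(xs) == sorted(xs, key=to_bits)
print(same_order)
print(next_up(0.0), next_up(1.0))
```

Starting from +0 the increment yields the smallest subnormal number, and starting from the largest finite number it yields +Infinity.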

Rounding

Normally rounding "to nearest" is enabled. Let x be an arbitrary real number.
For |x| ≥ (1 - 2^{-n-1})·2^{emax}
    fl(x) = ±Infinity
otherwise
    fl(x) is the nearest machine number. In the case of a tie the number with dn=0 is chosen.
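
The tie rule and the overflow threshold can be observed in double precision. The sketch below rests on two assumptions: that CPython converts a Fraction to a float with correct rounding to nearest, and that the hardware addition returns fl of the exact sum, as IEEE 754 requires. The helper name fl is made up here.

```python
import sys
from fractions import Fraction

def fl(x):
    """Round an exact rational x to a double (x must be below the overflow threshold)."""
    return float(x)

ulp = Fraction(1, 2**52)            # spacing of doubles in [1, 2)

# Exactly halfway between 1 and 1 + ulp: the neighbor 1 has dn=0.
tie_low = fl(1 + ulp / 2)
# Exactly halfway between 1 + ulp and 1 + 2*ulp: the neighbor 1 + 2*ulp has dn=0.
tie_high = fl(1 + 3 * ulp / 2)

xmax = sys.float_info.max           # (1 - 2^{-53}) * 2^{1024}
just_below = xmax + 2.0**969        # exact sum lies below the threshold
at_threshold = xmax + 2.0**970      # exact sum equals (1 - 2^{-54}) * 2^{1024}

print(tie_low == 1.0, tie_high == 1.0 + 2.0**-51)
print(just_below == xmax, at_threshold)
```

Both ties go to the neighbor whose last mantissa bit is 0, and the sum that reaches the threshold becomes +Infinity while the slightly smaller one rounds back to xmax.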

If xmin ≤ |x| ≤ xmax the rounding error is bounded by the machine epsilon εM = 2^{-n}:
      |(fl(x)-x)/x| ≤ εM = 2^{-n}
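
This bound can be tested empirically for double precision (n=53). The following is a sketch, assuming that CPython rounds a Fraction to the nearest double when converting it to float. The function name relative_rounding_error and the sampling range are arbitrary choices for the illustration, and all samples lie well inside xmin ≤ |x| ≤ xmax.

```python
import random
from fractions import Fraction

EPS_M = Fraction(1, 2**53)          # machine epsilon for n = 53

def relative_rounding_error(x):
    """Exact value of |(fl(x) - x) / x| for a nonzero rational x."""
    fl_x = Fraction(float(x))       # float(x) rounds, Fraction(...) is exact
    return abs((fl_x - x) / x)

random.seed(0)                      # fixed seed so that the run is repeatable
samples = [
    Fraction(random.randint(1, 10**30), random.randint(1, 10**30))
    for _ in range(1000)
]
worst = max(relative_rounding_error(x) for x in samples)

print(float(worst), float(EPS_M))
print(worst <= EPS_M)
```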

Machine Arithmetic