To make numerical programs portable between different machines, the IEEE 754 standard defines machine numbers and how arithmetic operations should be performed. Virtually all current computers comply with this standard. (The history behind the standard is interesting, in particular the "battle over gradual underflow".)
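Format | bits | n | $e_{\max}$ | $e_{\min}$ | $\varepsilon_M$ | $x_{\max}$ | $x_{\min}$ |
---|---|---|---|---|---|---|---|
Single Precision | 32 | 24 | 128 | -125 | $2^{-24} \approx 6\cdot 10^{-8}$ | $\approx 2^{128} \approx 3\cdot 10^{38}$ | $2^{-126} \approx 10^{-38}$ |
Double Precision | 64 | 53 | 1024 | -1021 | $2^{-53} \approx 10^{-16}$ | $\approx 2^{1024} \approx 2\cdot 10^{308}$ | $2^{-1022} \approx 2\cdot 10^{-308}$ |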
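The table entries can be compared with MATLAB's built-in constants. This is a minimal sketch, not part of the original notes; the variable names are illustrative. It assumes that `eps` returns the spacing $2^{1-n}$ between 1 and the next larger machine number, which is twice the machine epsilon $2^{-n}$ used in the table.

```matlab
% Compare the table values with MATLAB's built-in constants.
% Assumption: eps is the spacing 2^(1-n) at 1, i.e. twice the
% machine epsilon 2^(-n) as defined in the table.
epsM_single = double(eps('single')) / 2;   % 2^-24, about 6e-8
epsM_double = eps / 2;                     % 2^-53, about 1.1e-16

xmax_single = realmax('single');           % slightly below 2^128
xmin_single = realmin('single');           % 2^-126
xmax_double = realmax;                     % slightly below 2^1024
xmin_double = realmin;                     % 2^-1022

fprintf('single: epsM = %g, xmax = %g, xmin = %g\n', ...
        epsM_single, xmax_single, xmin_single);
fprintf('double: epsM = %g, xmax = %g, xmin = %g\n', ...
        epsM_double, xmax_double, xmin_double);
```

Here `realmax` and `realmin` return the largest and the smallest normalized machine number, which correspond to $x_{\max}$ and $x_{\min}$.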
Machine numbers use 1 bit for the sign, k bits for the exponent and n-1 bits for the mantissa. They are represented as a sequence of bits (each of which is 0 or 1) as follows:
$s \;\; e_1 \ldots e_k \;\; d_2 \ldots d_n$
For single precision numbers we have n=24, k=8.
For double precision numbers we have n=53, k=11.
The sign is "+" for s=0 and "-" for s=1.
The exponent is obtained as $e = (e_1 \ldots e_k)_2 - b$ where $b = 2^{k-1} - 2$. The largest and smallest values of e are used to represent special values. Hence the smallest remaining value is $e_{\min} = 1 - b = 3 - 2^{k-1}$, and the largest remaining value is $e_{\max} = 2^k - 2 - b = 2^{k-1}$.
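The layout above can be checked by taking a double precision number apart. The following is a sketch under assumptions: the example value `-6.5` and all variable names are illustrative, and the mantissa is assumed to be $(0.1\,d_2 \ldots d_n)_2$ with an implicit leading digit $d_1 = 1$, which is valid only for normalized numbers.

```matlab
x = -6.5;                          % example value (arbitrary choice)
k = 11;  n = 53;                   % double precision
bits = typecast(x, 'uint64');      % reinterpret the 64 bits as an integer

s = bitshift(bits, -63);                               % sign bit
E = bitand(bitshift(bits, -(n-1)), uint64(2^k - 1));   % exponent field (e1...ek)_2
d = bitand(bits, uint64(2^(n-1) - 1));                 % stored digits d2...dn

b = 2^(k-1) - 2;                   % bias as defined in the text
e = double(E) - b;                 % exponent e
% Assumption: implicit leading digit d1 = 1 (normalized numbers only)
m = (2^(n-1) + double(d)) / 2^n;   % mantissa (0.1 d2 ... dn)_2
value = (-1)^double(s) * m * 2^e;  % should reproduce x

fprintf('s = %d, e = %d, m = %g, value = %g\n', s, e, m, value);
```

For this input the decomposition is $-6.5 = -0.8125 \cdot 2^3$, so the sketch should report `s = 1`, `e = 3` and `m = 0.8125`.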
Note: The numbers with sign "+", arranged by size from +0 up to +Infinity, correspond to the bit sequences (0 0...0 0...0) up to (0 1...1 0...0), arranged in increasing order as binary integers. Therefore it is easy to compare two machine numbers or to find the next smaller or larger machine number.
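The ordering property can be tried out by treating the bit pattern of a positive double as an unsigned integer. This is a sketch with illustrative variable names and arbitrarily chosen example values; it covers only numbers with sign "+".

```matlab
x = 1;                                          % example value (arbitrary choice)
bits = typecast(x, 'uint64');                   % bit pattern as unsigned integer
x_next = typecast(bits + uint64(1), 'double');  % next larger machine number
x_prev = typecast(bits - uint64(1), 'double');  % next smaller machine number

% Comparing two positive numbers gives the same result as
% comparing their bit patterns as integers.
a = 0.75;  c = 1e10;
same_order = (a < c) == (typecast(a, 'uint64') < typecast(c, 'uint64'));

% Bit patterns of +0 and +Infinity in hexadecimal
hex_zero = num2hex(0);      % all bits zero
hex_inf  = num2hex(Inf);    % sign 0, exponent bits all 1, mantissa bits all 0

fprintf('x_next - x = %g, x - x_prev = %g\n', x_next - x, x - x_prev);
```

For `x = 1` the gap to the next larger number is $2^{-52}$ and the gap to the next smaller number is $2^{-53}$, because the spacing halves below a power of 2.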
If $x_{\min} \le |x| \le x_{\max}$, the relative rounding error is bounded by the machine epsilon $\varepsilon_M = 2^{-n}$:
$$\left| \frac{fl(x) - x}{x} \right| \le \varepsilon_M = 2^{-n}$$
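The bound can be observed numerically by rounding double precision values to single precision (n = 24). In this sketch the double values play the role of the exact x, and `single(x)` plays the role of fl(x); the test values and variable names are arbitrary choices for illustration.

```matlab
xs = [0.1, pi, 1/3, 1e10, 123.456];   % "exact" values (doubles, arbitrary choice)
fx = single(xs);                      % fl(x): rounded to nearest single
relerr = abs(double(fx) - xs) ./ abs(xs);
epsM = 2^-24;                         % machine epsilon for single precision

for i = 1:numel(xs)
    fprintf('x = %-10g relative error = %.3g (bound %.3g)\n', ...
            xs(i), relerr(i), epsM);
end
```

All values lie between $x_{\min}$ and $x_{\max}$ for single precision, so each relative error should stay below $2^{-24} \approx 6\cdot 10^{-8}$.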
    x = 1e-300; y = -x*x   % underflow to -0, but is displayed as 0
    z = 1/y                % gives -Inf

However, IEEE 754 defines the comparison operator "==" such that +0==-0 is true. Note that NaN==NaN is defined as false.
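The comparison rules can be checked directly. This is a small sketch with illustrative variable names; it uses `isnan` as the usual way to test for NaN, since `==` cannot detect it, and it distinguishes the two zeros by the sign of the reciprocal.

```matlab
pz = 0;  nz = -0;            % positive and negative zero
zeros_equal = (pz == nz);    % true: "==" does not distinguish +0 and -0
is_neg_zero = (1/nz < 0);    % true: 1/(-0) is -Inf, so the sign is recoverable

q = NaN;
nan_equal = (q == q);        % false: NaN is not equal to anything, even itself
nan_found = isnan(q);        % true: isnan is the reliable test

fprintf('+0 == -0: %d, 1/(-0) < 0: %d\n', zeros_equal, is_neg_zero);
fprintf('NaN == NaN: %d, isnan(NaN): %d\n', nan_equal, nan_found);
```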