IEEE 754 Machine Numbers and Machine Arithmetic

To make numerical programs portable between different machines, the IEEE 754 standard defines machine numbers and how arithmetic operations on them should be performed. Virtually all current computers comply with this standard. (There is an interesting history behind the standard: the "battle over gradual underflow".)

Machine Numbers

Machine numbers consist of the following:

normalized machine numbers x = ±(.1d₂...d_n)₂ 2^e with e_min≤e≤e_max
These are the standard machine numbers discussed in class. They are used to represent nonzero values.
subnormal numbers x = ±(.0d₂...d_n)₂ 2^e_min (this includes +0, -0)
This allows zero to be represented with a sign (in case of underflow the sign is preserved, see example below).
Subnormal numbers fix the problem that the distance from zero to x_min is much larger than the distance from x_min to the next larger machine number.
special values +Inf, -Inf, NaN
+Inf and -Inf are used to represent ±∞ in case of overflow; NaN ("Not a Number") is used to represent indeterminate results like 0/0 or Inf-Inf.
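
The three kinds of machine numbers can be inspected directly. The following is a minimal sketch, assuming MATLAB/Octave and double precision; the variable names are illustrative, and `realmin` and `realmax` are the built-in names for the smallest normalized and the largest double.

```matlab
% Sketch (assumes MATLAB/Octave, double precision; variable names are illustrative)
x_norm   = 0.75;          % normalized number: (.11)_2 * 2^0
x_sub    = realmin / 4;   % subnormal number: nonzero, but smaller than realmin
pos_zero = 0;             % +0
neg_zero = -0;            % -0 (displayed as 0)
big      = realmax * 2;   % overflow gives +Inf
indet    = 0/0;           % indeterminate result gives NaN

% The subnormal lies strictly between 0 and the smallest normalized number
is_subnormal = (x_sub > 0) && (x_sub < realmin);               % true
% The two zeros are distinguishable through the sign of 1/0
signs_differ = (1/pos_zero == Inf) && (1/neg_zero == -Inf);    % true
```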

The most important values are

machine epsilon ε_M = 2^-n
This is an upper bound for the relative rounding error if x_min ≤ |x| ≤ x_max.
largest machine number x_max = (1-2^-n)·2^e_max
If |x| ≥ (1-2^(-n-1))·2^e_max we have overflow and obtain ±Inf.
smallest normalized machine number x_min = 2^(e_min-1)
If |x| < x_min subnormal numbers with lower precision are used ("gradual underflow").
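
As a check of these formulas, the sketch below evaluates them for double precision (n = 53, e_max = 1024, e_min = -1021) and compares them with the built-in constants. It assumes MATLAB/Octave; the names `eps_M`, `x_max`, `x_min` simply mirror the symbols above. Note that the built-in `eps` is the spacing of the machine numbers at 1, which is 2^(1-n) = 2ε_M, not ε_M itself.

```matlab
% Sketch (assumes MATLAB/Octave): key values for double precision
n = 53;  e_max = 1024;  e_min = -1021;

eps_M = 2^(-n);                               % machine epsilon as defined above
% 2^e_max alone would overflow to Inf, so the power of two is split
x_max = (1 - 2^(-n)) * 2^(e_max - 1) * 2;     % largest machine number
x_min = 2^(e_min - 1);                        % smallest normalized machine number

same_eps = (eps_M == eps/2);      % true: built-in eps equals 2*eps_M
same_max = (x_max == realmax);    % true
same_min = (x_min == realmin);    % true
```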

Single and Double Precision Machine Numbers

	bits	n	e_max	e_min	ε_M	x_max	x_min
Single Precision	32	24	128	-125	2^-24≈6·10^-8	≈2¹²⁸≈3·10³⁸	2^-126≈10^-38
Double Precision	64	53	1024	-1021	2^-53≈10^-16	≈2¹⁰²⁴≈2·10³⁰⁸	2^-1022≈2·10^-308
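
The single precision row can be checked in the same way. This is a sketch assuming MATLAB/Octave, where `eps('single')`, `realmax('single')` and `realmin('single')` return the built-in single precision constants; the variable names are illustrative.

```matlab
% Sketch (assumes MATLAB/Octave): single precision row of the table
n_s = 24;  e_max_s = 128;  e_min_s = -125;

% All three values are computed exactly in double, then converted to single
eps_s   = single(2^(-n_s));
x_max_s = single((1 - 2^(-n_s)) * 2^e_max_s);
x_min_s = single(2^(e_min_s - 1));

same_eps_s = (eps_s   == eps('single')/2);     % true: built-in eps is 2*eps_M
same_max_s = (x_max_s == realmax('single'));   % true
same_min_s = (x_min_s == realmin('single'));   % true
```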

Internal Representation of Machine Numbers

Machine numbers use 1 bit for the sign, k bits for the exponent and n-1 bits for the mantissa. They are represented as a sequence of bits (each of which is 0 or 1) as follows:

s e₁ ... e_k d₂ ... d_n

For single precision numbers we have n=24, k=8.

For double precision numbers we have n=53, k=11.

The sign is "+" for s=0 and "-" for s=1.

The exponent is obtained as e = (e₁ ... e_k)₂ - b where b = 2^(k-1) - 2. The largest and smallest values of the exponent field (e₁ ... e_k)₂, namely 2^k - 1 and 0, are reserved for special values. Hence the smallest remaining value is e_min = 1 - b = 3 - 2^(k-1), and the largest remaining value is e_max = 2^k - 2 - b = 2^(k-1).

For e_min ≤ e ≤ e_max we have: x = ±(.1d₂...d_n)₂ 2^e, representing normalized numbers.
For e = e_min - 1 we have: x = ±(.0d₂...d_n)₂ 2^e_min, representing ±0 and subnormal numbers (aka denormalized numbers).
For e = e_max + 1 we have: x = ±Infinity if all d_j = 0, and x = NaN otherwise.
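
To illustrate the layout, the following sketch splits a double (k = 11, n = 53) into its three fields and rebuilds the value from them. It assumes MATLAB/Octave, where `typecast` reinterprets the 64 bits as an unsigned integer; the names `s`, `E`, `F`, `m` are illustrative, and the reconstruction shown applies to normalized numbers only.

```matlab
% Sketch (assumes MATLAB/Octave): decode the bit fields of a normalized double
x    = -6.5;
bits = typecast(x, 'uint64');                             % the 64 bits as an integer

s = double(bitshift(bits, -63));                          % sign bit
E = double(bitand(bitshift(bits, -52), uint64(2047)));    % exponent field (e_1 ... e_k)_2
F = double(bitand(bits, uint64(2^52 - 1)));               % bits d_2 ... d_n as an integer

b = 2^(11 - 1) - 2;             % b = 2^(k-1) - 2 = 1022
e = E - b;                      % exponent in the (.1d_2...d_n)_2 convention
m = (2^52 + F) / 2^53;          % mantissa (.1d_2...d_n)_2, a value in [1/2, 1)

x_rebuilt = (-1)^s * m * 2^e;   % equals x for normalized numbers
```

For x = -6.5 = -(.1101)₂·2³ this gives s = 1, e = 3 and m = 0.8125.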

Note: All numbers with sign "+", arranged by size from +0 up to +Infinity correspond to all the bit sequences (0 0...0 0...0) up to (0 1...1 0...0), arranged as binary integers. Therefore it is easy to compare two machine numbers, or to find the next smaller or larger machine number.
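
A sketch of this ordering property, assuming MATLAB/Octave: the hypothetical helper `next_up` adds 1 to the bit pattern read as an integer. It is only meant for finite x ≥ +0; negative numbers, Inf and NaN would need separate handling.

```matlab
% Sketch (assumes MATLAB/Octave): next larger machine number via the bit pattern.
% next_up is a hypothetical helper, valid only for finite x >= +0.
next_up = @(x) typecast(typecast(x, 'uint64') + uint64(1), 'double');

after_one  = next_up(1);         % 1 + 2^-52, i.e. 1 + eps
after_zero = next_up(0);         % smallest subnormal number, 2^-1074
after_max  = next_up(realmax);   % +Inf

% Comparing positive machine numbers is the same as comparing their integers
ordered = typecast(1.5, 'uint64') < typecast(2, 'uint64');   % true
```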

Rounding

Normally rounding "to nearest" is enabled. Let x be an arbitrary real number.

For |x| ≥ (1-2^(-n-1))·2^e_max: fl(x) = ±Infinity
otherwise: fl(x) is the nearest machine number. In the case of a tie the number with d_n=0 is chosen.

If x_min ≤ |x| ≤ x_max the rounding error is bounded by the machine epsilon ε_M=2^-n:
|(fl(x)-x)/x| ≤ ε_M = 2^-n
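
The rounding rules can be observed in double precision (n = 53). This is a sketch assuming MATLAB/Octave with the default rounding mode. For the error bound it rounds a double to single precision (n = 24) and treats the double value as the "exact" real number, which is an approximation made only for illustration.

```matlab
% Sketch (assumes MATLAB/Octave, default round-to-nearest)
eps_M = 2^(-53);

% Ties go to the neighbor whose last mantissa bit d_n is 0
tie_down = (1 + eps_M   == 1);             % tie between 1 and 1 + 2*eps_M
tie_up   = (1 + 3*eps_M == 1 + 4*eps_M);   % tie between 1 + 2*eps_M and 1 + 4*eps_M

% Overflow threshold: (1 - 2^(-n-1)) * 2^e_max = realmax + 2^970
stays_finite = (realmax + 2^969 == realmax);   % below the threshold
overflows    = (realmax + 2^970 == Inf);       % exactly at the threshold

% Relative error of rounding pi to single precision, bounded by 2^-24
x       = pi;                          % double value, treated as exact here
rel_err = abs(double(single(x)) - x) / abs(x);
within_bound = (rel_err <= 2^(-24));   % true
```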

Machine Arithmetic

For addition, subtraction, multiplication, division and square roots of machine numbers the rounded exact result must be returned. E.g., adding two machine numbers x, y returns the machine number fl(x+y).
Transcendental functions like sin(x), exp(x), log(x) are not required to return the rounded exact result; they typically give an approximation with a relative error of less than 2ε_M. Finding the closest machine number may require a large amount of extra accuracy ("table-maker's dilemma").
Operations involving +0, -0, Inf, -Inf, NaN: The IEEE 754 standard defines a result for all of these operations, e.g.,
1/+0 = +Inf , 1/-0 = -Inf , Inf + Inf = Inf , Inf - Inf = NaN , 0/0 = NaN , 0*Inf = NaN.
Arithmetic operations involving NaN return NaN (with a few exceptions: e.g. min(x,NaN)=x, max(x,NaN)=x)
Note that there are two distinct machine numbers +0 and -0 which behave differently in expressions such as 1/0:
```
  x = 1e-300; 
  y = -x*x      % underflow to -0, but is displayed as 0
  z = 1/y       % gives -Inf
```
However, IEEE 754 defines the comparison operator "==" such that +0==-0 is true. Note that NaN==NaN is defined as false.
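
The rules for special values listed in this section can be tried out as follows. This is a sketch assuming MATLAB/Octave; the variable names are illustrative.

```matlab
% Sketch (assumes MATLAB/Octave): operations on special values
r1 = 1/0;            % +Inf
r2 = 1/(-0);         % -Inf
r3 = Inf + Inf;      % Inf
r4 = Inf - Inf;      % NaN
r5 = 0/0;            % NaN
r6 = 0*Inf;          % NaN
r7 = min(3, NaN);    % 3: min ignores the NaN argument

nan_equal  = (r5 == r5);    % false: NaN is not equal to itself
zero_equal = (0 == -0);     % true: +0 and -0 compare as equal

% Since == cannot detect NaN, use isnan instead
all_nan = isnan(r4) && isnan(r5) && isnan(r6);   % true
```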