IEEE 754 Machine Numbers and Machine Arithmetic

In order to make numerical programs portable between different machines, the IEEE 754 standard defines machine numbers and how arithmetic operations should be performed.

The IEEE 754 standard was officially published in 1985, with revisions in 2008 and 2019.
There is an interesting history behind the standard: the "battle over gradual underflow".
Virtually all current computers comply with this standard.
However, many languages and compilers offer a 'fast-math' compiler option that produces faster code by disabling some features of standard machine arithmetic (e.g. subnormal numbers).
Note that this can lead to inaccurate results and may cause algorithms to fail.

Machine Numbers

Machine numbers consist of the following:

normalized machine numbers x = ±(1.d₁...d_n)₂ 2^e with e_min≤e≤e_max
This lets us approximate a number x with the full machine accuracy ε_M if x_min ≤ |x| ≤ x_max
subnormal numbers x = ±(0.d₁...d_n)₂ 2^e_min (this includes +0, -0)
This lets us approximate numbers x with |x| < x_min. Note that this uses fewer significant digits, depending on how many leading zeros we have in d₁...d_n.
Choosing all d_j=0 gives the two machine numbers +0 and -0. This makes it possible to retain the sign when an underflow occurs. See below for details on how operations on +0 and -0 work.
Subnormal numbers fix the problem that the distance from zero to x_min is much larger than the distance from x_min to the next larger machine number. Without subnormal numbers it can happen that x>y is true but x-y>0 is false on the computer, because the difference x-y underflows to 0.
special values +Inf, -Inf, NaN
+Inf and -Inf represent ±∞ in case of overflow. NaN ("Not a Number") represents indeterminate results like 0/0 or Inf-Inf.
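The effect of subnormal numbers and signed zeros can be checked directly. The following is a minimal sketch in Python, assuming that Python's float is an IEEE 754 double (true on common platforms); the variable names are chosen for illustration only.

```python
import math
import sys

x_min = sys.float_info.min      # smallest normalized double, 2^-1022
x = 1.5 * x_min
y = x_min

# x > y holds, and thanks to subnormal numbers the difference is a
# nonzero machine number (exactly 0.5 * x_min = 2^-1023).
diff = x - y
print(x > y, diff > 0, diff)

# The smallest positive subnormal number is 2^(e_min - n) = 2^-1074.
# Halving it underflows to +0.
tiny = math.ldexp(1.0, -1074)
print(tiny, tiny / 2)

# Underflow keeps the sign: the result is -0.0, which compares equal to 0.0.
neg_zero = -1e-300 * 1e-300
print(neg_zero, neg_zero == 0.0, math.copysign(1.0, neg_zero))
```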

The most important values are

machine epsilon ε_M = 2^(-n-1)
This is an upper bound of the relative rounding error if x_min ≤ |x| ≤ x_max
largest machine number x_max = (2-2^-n)·2^e_max
If |x| ≥ (2-2^(-n-1))·2^e_max we have overflow and obtain ±Inf.
smallest normalized machine number x_min = 2^e_min
If |x| < x_min subnormal numbers with lower precision are used ("gradual underflow").
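As a sanity check, the three formulas can be evaluated for double precision (n=52, e_min=-1022, e_max=1023) and compared with the constants Python reports. This is a sketch that assumes Python's float is an IEEE 754 double. Note that `sys.float_info.epsilon` is the gap between 1 and the next larger machine number, which is 2·ε_M in the notation used here.

```python
import math
import sys

n, e_min, e_max = 52, -1022, 1023   # double precision parameters

eps_M = math.ldexp(1.0, -n - 1)                        # 2^(-n-1)
x_max = math.ldexp(2.0 - math.ldexp(1.0, -n), e_max)   # (2 - 2^-n) * 2^e_max
x_min = math.ldexp(1.0, e_min)                         # 2^e_min

print(eps_M, sys.float_info.epsilon / 2)
print(x_max, sys.float_info.max)
print(x_min, sys.float_info.min)
```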

Single and Double Precision Machine Numbers

	bits	n	e_max	e_min	ε_M	x_max	x_min
Single Precision	32	23	127	-126	2^-24≈6·10^-8	≈2¹²⁸≈3·10³⁸	2^-126≈10^-38
Double Precision	64	52	1023	-1022	2^-53≈10^-16	≈2¹⁰²⁴≈2·10³⁰⁸	2^-1022≈2·10^-308
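The single precision row can be explored from Python by rounding a double to single precision with the `struct` module. The helper `to_single` below is a hypothetical name introduced for this sketch, and it assumes the platform stores floats in IEEE 754 format.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single precision number."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

eps_M = 2.0 ** -24                          # single precision machine epsilon

# 1 + eps_M lies exactly halfway between 1 and the next single precision
# number 1 + 2^-23, so it is rounded back to 1.
print(to_single(1.0 + eps_M) == 1.0)
print(to_single(1.0 + 2 * eps_M) > 1.0)

# x_max and x_min from the table are single precision machine numbers,
# so rounding leaves them unchanged.
x_max = (2.0 - 2.0 ** -23) * 2.0 ** 127
x_min = 2.0 ** -126
print(to_single(x_max) == x_max, to_single(x_min) == x_min)
```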

Internal Representation of Machine Numbers

Machine numbers use 1 bit for the sign, k bits for the exponent and n bits for the mantissa. They are represented as a sequence of bits (each of which is 0 or 1) as follows:

s e₁ ... e_k d₁ ... d_n

For single precision numbers we have n=23, k=8.

For double precision numbers we have n=52, k=11.

The sign is "+" for s=0 and "-" for s=1.

The exponent is obtained as e = (e₁ ... e_k)₂ - b where b = 2^(k-1) - 1. The largest and smallest values of e are used to represent special values. Hence the smallest remaining value is e_min = 1 - b = 2 - 2^(k-1), the largest remaining value is e_max = 2^k - 2 - b = 2^(k-1) - 1.

For e_min ≤ e ≤ e_max we have: x = ±(1.d₁...d_n)₂ 2^e, representing normalized numbers
For e = e_min - 1 we have: x = ±(0.d₁...d_n)₂ 2^e_min, representing ±0 and subnormal numbers (aka denormalized numbers).
For e = e_max + 1 we have: x = ±Inf if all d_j=0, and x = NaN otherwise
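The bit layout can be inspected by reinterpreting the 64 bits of a double as an integer. The following sketch uses k=11, n=52 and assumes Python's float is an IEEE 754 double; `decode_double` and `classify` are hypothetical helper names.

```python
import struct

def decode_double(x):
    """Return (s, biased exponent, mantissa bits) of a double (k=11, n=52)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    s = bits >> 63                      # sign bit
    biased = (bits >> 52) & 0x7FF       # (e_1 ... e_k)_2
    d = bits & ((1 << 52) - 1)          # (d_1 ... d_n)_2
    return s, biased, d

def classify(x):
    """Name the case from the list above that x falls into."""
    s, biased, d = decode_double(x)
    if biased == 0:                     # e = e_min - 1
        return "zero" if d == 0 else "subnormal"
    if biased == 0x7FF:                 # e = e_max + 1
        return "inf" if d == 0 else "nan"
    return "normalized"                 # e = biased - 1023

for x in (1.0, -2.0, 5e-324, -0.0, float("inf"), float("nan")):
    print(x, decode_double(x), classify(x))
```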

Note: All numbers with sign "+", arranged by size from +0 up to +Inf correspond to all the bit sequences (0 0...0 0...0) up to (0 1...1 0...0), arranged as binary integers. Therefore it is easy to compare two machine numbers, or to find the next smaller or larger machine number.
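Because of this ordering, the next larger machine number can be found by adding 1 to the bit pattern read as an integer. The sketch below does this for doubles with sign "+"; `next_up` is a hypothetical helper name, and the code assumes Python's float is an IEEE 754 double.

```python
import math
import struct
import sys

def next_up(x):
    """Next larger double for +0 <= x < +Inf, found by integer increment."""
    if math.isnan(x) or math.isinf(x) or math.copysign(1.0, x) < 0:
        raise ValueError("x must be +0 or a positive finite number")
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits + 1))[0]

print(next_up(0.0))                     # smallest subnormal number
print(next_up(1.0) - 1.0)               # spacing of doubles just above 1
print(next_up(sys.float_info.max))      # the step after x_max is +Inf
```

The same increment crosses from subnormal to normalized numbers and from x_max to +Inf without any special handling, which is what makes the layout convenient.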

Rounding

Normally rounding "to nearest" is enabled. Let x be an arbitrary real number.

For |x| ≥ (2-2^(-n-1))·2^e_max: fl(x) = ±Inf
otherwise: fl(x) is the nearest machine number. In the case of a tie the number with d_n=0 is chosen.

If x_min ≤ |x| ≤ x_max the relative rounding error is bounded by the machine epsilon ε_M = 2^(-n-1):
|(fl(x)-x)/x| ≤ ε_M = 2^(-n-1)
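Both the tie rule and the error bound can be checked in double precision (n=52) using exact rational arithmetic. This sketch assumes that Python's float is an IEEE 754 double and that converting a `Fraction` to `float` rounds to nearest, as CPython does; `rel_error` is a hypothetical helper name.

```python
from fractions import Fraction

eps_M = Fraction(1, 2 ** 53)

# Ties go to the neighbor whose last mantissa bit d_n is 0.
print(1.0 + 2.0 ** -53 == 1.0)                            # True: rounds down to 1
print((1.0 + 2.0 ** -52) + 2.0 ** -53 == 1.0 + 2.0 ** -51)  # True: rounds up

def rel_error(x):
    """Exact relative rounding error |(fl(x) - x)/x| for a nonzero rational x."""
    fl_x = float(x)                     # fl(x), rounded to nearest
    return abs((Fraction(fl_x) - x) / x)

for x in (Fraction(1, 10), Fraction(1, 3), Fraction(22, 7)):
    print(x, float(rel_error(x)), rel_error(x) <= eps_M)
```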

Machine Arithmetic

For addition, subtraction, multiplication, division and square roots of machine numbers the rounded exact result must be returned. E.g., adding two machine numbers x, y returns the machine number fl(x+y).
Functions like sin(x), exp(x), log(x) give an approximation with a relative error which may be slightly larger than ε_M. Finding the closest machine number may require computing with much higher accuracy internally ("table-maker's dilemma").
Operations involving +0, -0, Inf, -Inf, NaN: The IEEE 754 standard defines a result for all of these operations, e.g.,
1/+0 = +Inf , 1/-0 = -Inf , Inf + Inf = Inf , Inf - Inf = NaN , 0/0 = NaN , 0*Inf = NaN.
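Arithmetic operations involving NaN return NaN. However, revision IEEE 754-2008 defined min(x,NaN)=x, max(x,NaN)=x (which Matlab uses). IEEE 754-2019 replaced these operations with minimum/maximum, which return NaN if either argument is NaN, and minimumNumber/maximumNumber, which return x.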
Note that NaN==NaN is defined as false.
The power 0^0 is another tricky case: here the standard specifies three different power functions pown(x,y), powr(x,y), pow(x,y):
- pown(x,y) accepts real x, integer y and gives pown(0,0)=1.
- powr(x,y) accepts real x, real y and gives powr(0,0)=NaN.
- pow(x,y) accepts real x, real y and gives pow(0,0)=1.
Virtually all programming languages, including Matlab, C, C++, Java and Python, use the third option and return 0⁰=1. This is the most reasonable definition since it allows the evaluation of polynomials a₀x⁰+a₁x¹+...+a_nxⁿ at x=0 without giving NaN.
Note that there are two distinct machine numbers +0 and -0. But IEEE 754 defines the comparison operators such that +0==-0 is true and +0>-0 is false.
In Matlab both +0 and -0 are displayed as 0 when omitting the semicolon, or using disp.
fprintf with format %g or %.15g prints the two zeros as 0 and -0.
fprintf with format %+g or %+.15g prints the two zeros as +0 and -0.
The numbers +0 and -0 behave differently in expressions such as 1/0. Here is a Matlab example:
```
>> x = 1e-300; y = -x*x      % underflow to -0, but this is displayed as 0
y = 
    0  
>>  z = 1/y                  % 1/-0 gives -Inf
z =
  -Inf
```
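Most of these rules can also be observed in Python, whose float is assumed here to be an IEEE 754 double. One difference from the Matlab session above: Python raises `ZeroDivisionError` for `1/0.0` instead of returning Inf, so this sketch detects the sign of a zero with `math.copysign` rather than by dividing.

```python
import math

inf, nan = math.inf, math.nan

print(inf + inf, inf - inf, 0.0 * inf)    # inf nan nan
print(nan == nan, math.isnan(nan + 1.0))  # False True
print(0.0 == -0.0, 0.0 > -0.0)            # True False
print(math.pow(0.0, 0.0), 0.0 ** 0)       # 1.0 1.0

# Evaluating a0*x^0 + a1*x^1 + a2*x^2 at x = 0 relies on 0^0 = 1.
coeffs = [3.0, 2.0, 1.0]
p0 = sum(a * 0.0 ** j for j, a in enumerate(coeffs))
print(p0)                                 # 3.0

# Underflow to -0: equal to 0.0, but the sign is still there.
y = -1e-300 * 1e-300
print(y == 0.0, math.copysign(1.0, y))    # True -1.0
print("%g %+g" % (y, 0.0))                # -0 +0
```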