IEEE 754 Machine Numbers and Machine Arithmetic
In order to make numerical programs portable between different
machines, the IEEE 754 standard defines machine numbers and how arithmetic
operations should be performed.
Machine Numbers
Machine numbers consist of the following:
- normalized machine numbers x = ±(1.d1...dn)_2 · 2^e with emin ≤ e ≤ emax
  This lets us approximate a number x with the full machine accuracy εM if xmin ≤ |x| ≤ xmax.
- subnormal numbers x = ±(0.d1...dn)_2 · 2^emin (this includes +0, -0)
  This lets us approximate numbers x with |x| < xmin. Note that this uses fewer significant digits, depending on how many leading zeros there are in d1...dn.
  Choosing all dj=0 gives the two machine numbers +0 and -0. This makes it possible to retain the sign when an underflow occurs. See below for details on how operations on +0 and -0 work.
  Subnormal numbers fix the problem that the distance from zero to xmin is much larger than the distance from xmin to the next larger machine number. Without subnormal numbers, x>y can be true while x-y>0 is false on the computer, because the difference x-y underflows to zero.
- special values +Inf, -Inf, NaN
  +Inf and -Inf represent ±∞ in case of overflow; NaN ("Not a Number") represents indeterminate results such as 0/0 or Inf-Inf.
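The three kinds of machine numbers can be observed directly in double precision. The following is a minimal Python sketch (Python is chosen here only for illustration; the variable names are hypothetical). It assumes that Python floats are IEEE 754 doubles, which holds on all common platforms.

```python
import math
import sys

xmin = sys.float_info.min          # smallest normalized double, 2^(-1022)
tiny = math.ldexp(1.0, -1074)      # smallest positive subnormal, 2^(-1074)

x = 1.5 * xmin
y = xmin
diff = x - y                       # exactly 0.5 * xmin, a subnormal number

# With gradual underflow, x > y and x - y > 0 agree.
print(x > y, diff > 0)
print(diff == math.ldexp(1.0, -1023))

# Halving the smallest subnormal is a tie between 0 and tiny; it rounds to 0.
print(tiny / 2 == 0.0)

# Inf - Inf is indeterminate and yields NaN.
print(math.isnan(float("inf") - float("inf")))
```

Without subnormal numbers, diff would have to be flushed to zero, which is exactly the situation described above where x>y holds but x-y>0 does not.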
The most important values are
- machine epsilon εM = 2^(-n-1)
  This is an upper bound for the relative rounding error if xmin ≤ |x| ≤ xmax.
- largest machine number xmax = (2 - 2^(-n))·2^emax
  If |x| ≥ (2 - 2^(-n-1))·2^emax we have overflow and obtain ±Inf.
- smallest normalized machine number xmin = 2^emin
  If |x| < xmin, subnormal numbers with lower precision are used ("gradual underflow").
Single and Double Precision Machine Numbers
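|                  | bits | n  | emax | emin  | εM                  | xmax                | xmin                    |
|------------------|------|----|------|-------|---------------------|---------------------|-------------------------|
| Single Precision | 32   | 23 | 127  | -126  | 2^(-24) ≈ 6·10^(-8) | ≈ 2^128 ≈ 3·10^38   | 2^(-126) ≈ 10^(-38)     |
| Double Precision | 64   | 52 | 1023 | -1022 | 2^(-53) ≈ 10^(-16)  | ≈ 2^1024 ≈ 2·10^308 | 2^(-1022) ≈ 2·10^(-308) |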
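The table entries follow from the formulas for εM, xmax and xmin. As a rough check, the sketch below evaluates them exactly with Python's fractions module and compares the double precision values with sys.float_info. The helper name machine_constants is hypothetical. Note that sys.float_info.epsilon is the distance from 1 to the next larger machine number, 2^(-n), which is twice the εM used in this text.

```python
import sys
from fractions import Fraction

def machine_constants(n, emax, emin):
    """Return (eps_M, xmax, xmin) as exact fractions for a binary format."""
    eps_m = Fraction(1, 2 ** (n + 1))                        # 2^(-n-1)
    xmax = (2 - Fraction(1, 2 ** n)) * Fraction(2) ** emax   # (2 - 2^(-n)) * 2^emax
    xmin = Fraction(2) ** emin                               # 2^emin
    return eps_m, xmax, xmin

for name, n, emax, emin in [("single", 23, 127, -126), ("double", 52, 1023, -1022)]:
    eps_m, xmax, xmin = machine_constants(n, emax, emin)
    print(f"{name}: eps_M={float(eps_m):.3g} xmax={float(xmax):.3g} xmin={float(xmin):.3g}")

# Compare the double precision values with what Python reports.
eps_m, xmax, xmin = machine_constants(52, 1023, -1022)
print(float(xmax) == sys.float_info.max)
print(float(xmin) == sys.float_info.min)
print(2 * float(eps_m) == sys.float_info.epsilon)
```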
Internal Representation of Machine Numbers
Machine numbers use 1 bit for the sign, k bits for the exponent and n bits for the mantissa. They are represented as a sequence of bits (each of which is 0 or 1) as follows:
s | e1 ... ek | d1 ... dn
For single precision numbers we have n=23, k=8.
For double precision numbers we have n=52, k=11.
The sign is "+" for s=0 and "-" for s=1.
The exponent is obtained as e = (e1 ... ek)_2 - b where b = 2^(k-1) - 1. The largest and smallest values of e are used to represent special values. Hence the smallest remaining value is emin = 1 - b = 2 - 2^(k-1), the largest remaining value is emax = 2^k - 2 - b = 2^(k-1) - 1.
- For emin ≤ e ≤ emax we have
  x = ±(1.d1...dn)_2 · 2^e, representing normalized numbers.
- For e = emin - 1 we have
  x = ±(0.d1...dn)_2 · 2^emin, representing ±0 and subnormal numbers (also known as denormalized numbers).
- For e = emax + 1 we have
  x = ±Inf if all dj=0,
  x = NaN otherwise.
Note: All numbers with sign "+", arranged by size from +0
up to +Inf correspond to all the bit sequences (0 0...0 0...0) up to
(0 1...1 0...0), arranged as binary integers. Therefore it is easy to compare
two machine numbers, or to find the next smaller or larger machine number.
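The bit layout can be inspected with Python's struct module. The following is a minimal sketch for double precision (k=11, n=52), assuming Python floats are IEEE 754 doubles. The helper names decode_double, classify and next_up_nonnegative are hypothetical.

```python
import struct

def decode_double(x):
    """Split a double into sign bit s, stored exponent (e1...ek)_2 and mantissa bits d1...dn."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63
    stored = (bits >> 52) & 0x7FF       # k = 11 exponent bits
    d = bits & ((1 << 52) - 1)          # n = 52 mantissa bits
    return s, stored, d

def classify(x):
    """Classify x using e = stored - b with b = 2^(k-1) - 1 = 1023."""
    _, stored, d = decode_double(x)
    e = stored - 1023
    if e == -1023:                      # e = emin - 1
        return "zero" if d == 0 else "subnormal"
    if e == 1024:                       # e = emax + 1
        return "inf" if d == 0 else "nan"
    return "normalized"

def next_up_nonnegative(x):
    """Next larger machine number for finite x >= +0: add 1 to the bit pattern."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return struct.unpack(">d", struct.pack(">Q", bits + 1))[0]

print(decode_double(1.0))               # sign 0, stored exponent 1023, mantissa 0
print(classify(5e-324), classify(1.0), classify(float("inf")))
print(next_up_nonnegative(1.0) == 1.0 + 2.0 ** -52)
```

Because the bit patterns of the nonnegative numbers are ordered like binary integers, next_up_nonnegative also steps from +0 to the smallest subnormal number and from xmax to +Inf.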
Rounding
Normally rounding "to nearest" is enabled. Let x be an arbitrary real number.
- For |x| ≥ (2 - 2^(-n-1))·2^emax:
  fl(x) = ±Inf
- Otherwise:
  fl(x) is the nearest machine number. In the case of a tie the number with dn=0 is chosen.
If xmin ≤ |x| ≤ xmax the relative rounding error is bounded by the machine epsilon εM = 2^(-n-1):
|(fl(x)-x)/x| ≤ εM = 2^(-n-1)
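The error bound and the tie rule can be checked in double precision (n=52) with exact rational arithmetic. This is a sketch under the assumption that Python's division of two integers returns the correctly rounded double, as CPython does. The helper name relative_rounding_error is hypothetical.

```python
from fractions import Fraction

eps_M = Fraction(1, 2 ** 53)            # 2^(-n-1) for n = 52

def relative_rounding_error(p, q):
    """Exact |fl(x) - x| / |x| for the real number x = p/q with integers p, q."""
    x = Fraction(p, q)
    fl_x = Fraction(p / q)              # p / q rounds to a double; Fraction(float) is exact
    return abs(fl_x - x) / abs(x)

for p, q in [(1, 10), (1, 3), (2, 7), (22, 7)]:
    err = relative_rounding_error(p, q)
    print(f"{p}/{q}: relative error {float(err):.3g}, within bound: {err <= eps_M}")

# Ties: 1 + u lies halfway between 1 and 1 + 2u; the neighbour with dn=0 is 1.
u = 2.0 ** -53
print(1.0 + u == 1.0)
# 1 + 3u lies halfway between 1 + 2u (dn=1) and 1 + 4u (dn=0); it rounds to 1 + 4u.
print(1.0 + 3 * u == 1.0 + 4 * u)
```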
Machine Arithmetic
- For addition, subtraction, multiplication, division and square roots of machine numbers, the rounded exact result must be returned. E.g., adding two machine numbers x, y returns the machine number fl(x+y).
- Functions like sin(x), exp(x), log(x) give an approximation with a relative error which may be slightly larger than εM. Finding the closest machine number may require a very large amount of extra accuracy ("table-maker's dilemma").
- Operations involving +0, -0, Inf, -Inf, NaN: The IEEE 754 standard defines a result for all of these operations, e.g.,
  1/+0 = +Inf, 1/-0 = -Inf, Inf + Inf = Inf,
  Inf - Inf = NaN, 0/0 = NaN, 0*Inf = NaN.
  Arithmetic operations involving NaN return NaN.
  However, the revision IEEE 754-2008 defined min(x,NaN)=x and max(x,NaN)=x (which Matlab uses). IEEE 754-2019 replaced these operations with minimum and maximum, which return NaN, alongside minimumNumber and maximumNumber, which return x.
  Note that NaN==NaN is defined as false.
The power 0^0 is another tricky case:
here the standard allows three different power functions pown(x,y), powr(x,y), pow(x,y):
- pown(x,y) accepts real x, integer y and gives pown(0,0)=1.
- powr(x,y) accepts real x, real y and gives powr(0,0)=NaN.
- pow(x,y) accepts real x, real y and gives pow(0,0)=1.
Virtually all programming languages like Matlab, C, C++, Java, Python use the third option and return 0^0 = 1.
This is the most reasonable definition since it allows the evaluation of polynomials
a0·x^0 + a1·x^1 + ... + an·x^n
at x=0 without giving NaN.
- Note that there are two distinct machine numbers +0 and -0. But IEEE 754 defines the comparison operators such that +0==-0 is true and +0>-0 is false.
  In Matlab both +0 and -0 are displayed as 0 when omitting the semicolon or using disp.
  fprintf with format %g or %.15g prints the two zeros as 0 and -0.
  fprintf with format %+g or %+.15g prints the two zeros as +0 and -0.
  The numbers +0 and -0 behave differently in expressions such as 1/0. Here is a Matlab example:
>> x = 1e-300; y = -x*x % underflow to -0, but this is displayed as 0
y =
0
>> z = 1/y % 1/-0 gives -Inf
z =
-Inf
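For comparison, several of the rules above can be tried in Python, again assuming that Python floats are IEEE 754 doubles. This is only a sketch. Note that Python deviates from the IEEE 754 default in one place: dividing a float by zero raises ZeroDivisionError instead of returning ±Inf or NaN, so the sign of a zero is inspected with math.copysign here.

```python
import math
from fractions import Fraction

inf = float("inf")
nan = float("nan")

# Correctly rounded addition: x + y equals the exact sum rounded once, fl(x+y).
x, y = 0.1, 0.2
exact_sum = Fraction(x) + Fraction(y)   # exact sum of the two machine numbers
print(x + y == float(exact_sum))

# Special values.
print(inf + inf == inf)
print(math.isnan(inf - inf), math.isnan(0.0 * inf), math.isnan(nan + 1.0))
print(nan == nan)                       # False by definition

# 0^0 follows the pow(x,y) convention.
print(0.0 ** 0.0, math.pow(0.0, 0.0))

# Signed zero: the product underflows to -0, which compares equal to +0.
t = 1e-300
neg_zero = -t * t
print(neg_zero == 0.0, neg_zero > 0.0, 0.0 > neg_zero)
print(math.copysign(1.0, neg_zero))     # the sign survives the underflow
```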