Applied Numerical Methods Using MATLAB - Won Y. Yang
1.2.1 IEEE 64‐bit Floating‐Point Number Representation
MATLAB uses the IEEE 64‐bit floating‐point number system to represent numbers by default. It has a word structure consisting of the sign bit, the exponent field, and the mantissa field as follows:
63 | 62 ⋯ 52 | 51 ⋯ 0 |
S | Exponent | Mantissa |
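Each of these fields expresses S, E, and M of a number f as follows:
Sign bit: S = b63 (0 for a positive number, 1 for a negative number)
Exponent field (b62b61b60 ⋯ b52), adopting the excess 1023 code: E = Exp − 1023, where Exp is the unsigned integer stored in the field, Exp ∈ {0, 1, ⋯, 2^11 − 1 = 2047}. The code Exp = 0 is reserved for zero and the un‐normalized range, where E = −1022 is used in place of −1023.
Mantissa field (b51b50 ⋯ b1b0): In the un‐normalized range, where the numbers are so small that they can be represented only with the value of hidden bit bh = 0, the number represented by the mantissa is
M = 0.b51b50 ⋯ b1b0 = [b51b50 ⋯ b1b0] × 2^−52 (1.2.1)
In this range you might think of the value of the hidden bit as being added to the exponent instead of to the mantissa, which is why E = 0 − 1023 + 1 = −1022.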
In the normalized range, the number represented by the mantissa together with the value of hidden bit bh = 1 is
M = 1.b51b50 ⋯ b1b0 = 1 + [b51b50 ⋯ b1b0] × 2^−52 (1.2.2)
The set of numbers S, E, and M, each represented by the sign bit S, the exponent field Exp and the mantissa field M, represents a number as a whole
f = ±M × 2^E = (−1)^S × M × 2^E (1.2.3)
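As a rough illustration of how S, E, and M combine into f, the following sketch splits a double into its three fields and rebuilds it. The variable names (bits, frac) and the sample value −14 are chosen here for illustration and are not from the book; the sketch assumes a value in a normalized range, so the hidden bit is 1.

```matlab
% Sketch: extract S, Exp, and the mantissa field of a double, then rebuild it.
% Assumes x lies in a normalized range (hidden bit bh = 1).
x    = -14;                                  % sample value: -14 = -1.75 * 2^3
bits = typecast(x, 'uint64');                % reinterpret the 64 bits as an integer
S    = double(bitshift(bits, -63));          % bit 63
Exp  = double(bitand(bitshift(bits, -52), uint64(2047)));  % bits 62..52
frac = double(bitand(bits, uint64(2^52 - 1)));             % bits 51..0
E    = Exp - 1023;                           % excess 1023 code
M    = 1 + frac * 2^-52;                     % mantissa with hidden bit 1
f    = (-1)^S * M * 2^E;                     % should reproduce x
fprintf('S = %d, Exp = %d, E = %d, M = %.4f, f = %g\n', S, Exp, E, M, f);
% prints: S = 1, Exp = 1026, E = 3, M = 1.7500, f = -14
```

For a value in the un‐normalized range, the same sketch would need M = frac * 2^-52 and E = −1022 instead.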
We classify the range of numbers depending on the value (E) of the exponent and denote it as
R_E = [2^E, 2^(E+1)) with −1022 ≤ E ≤ +1023 (1.2.4)
In each range, the least unit, i.e. the value of least significant bit (LSB) or the difference between two consecutive numbers represented by the mantissa of 52 bits is
Δ_E = 2^−52 × 2^E = 2^(E−52) (1.2.5)
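The LSB value of each range can be observed with MATLAB's built‐in eps(x), which returns the distance from |x| to the next larger double. The sketch below, using a few sample exponents chosen only for illustration, compares that distance with 2^(E−52).

```matlab
% Sketch: spacing of doubles in the range starting at 2^E versus 2^(E-52).
for E = [-1022 0 30 1023]
    x = 2^E;                                 % lower end of the range
    fprintf('E = %5d: eps(x) == 2^(E-52) is %d\n', E, eps(x) == 2^(E-52));
end
% The spacing is the same anywhere inside one range:
disp(eps(1.5 * 2^30) == eps(2^30))           % 1 (true)
```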
Let us take a closer look at the bit‐wise representation of numbers belonging to each range:
(0) 0 (zero)
63 | 62 ⋯ 52 | 51 ⋯ 0 |
S | 000 ⋯ 0000 | 0000 0000 ⋯ 0000 0000 |
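(1) Un‐normalized range (with the value of hidden bit bh = 0): Exp = 0, E = −1022; magnitudes from (0 + 2^−52) × 2^−1022 = 2^−1074 up to (1 − 2^−52) × 2^−1022; LSB value 2^−1074 (1.2.6.1a), (1.2.6.1b)
(2) The smallest normalized range (with the value of hidden bit bh = 1): Exp = 1, E = −1022; magnitudes from 2^−1022 up to (2 − 2^−52) × 2^−1022; LSB value 2^−1074 (1.2.6.2a), (1.2.6.2b)
(3) Basic normalized range (with the value of hidden bit bh = 1): Exp = 1023, E = 0; magnitudes from 1 up to 2 − 2^−52; LSB value 2^−52 (1.2.6.3a), (1.2.6.3b)
(4) The largest normalized range (with the value of hidden bit bh = 1): Exp = 2046, E = 1023; magnitudes from 2^1023 up to (2 − 2^−52) × 2^1023; LSB value 2^971 (1.2.6.4a), (1.2.6.4b)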
(5) ±∞ (inf) with Exp = 2^11 − 1 = 2047, E = Exp − 1023 = 1024 (meaningless)
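The bit patterns at the boundaries of these ranges can be inspected with MATLAB's num2hex, which prints the 64 bits of a double as 16 hexadecimal digits. In the sketch below the first three hex digits hold the sign bit and the 11‐bit exponent field, and the remaining 13 digits hold the 52‐bit mantissa field. The labels are descriptive names chosen here, not the book's.

```matlab
% Sketch: hexadecimal bit patterns of the boundary values of each range.
vals  = [0, eps(0), realmin, 1, realmax, Inf];
names = {'zero', 'smallest un-normalized', 'smallest normalized', ...
         'one (E = 0)', 'largest normalized', 'inf'};
for k = 1:numel(vals)
    fprintf('%-24s %s\n', names{k}, num2hex(vals(k)));
end
% zero                     0000000000000000
% smallest un-normalized   0000000000000001
% smallest normalized      0010000000000000
% one (E = 0)              3ff0000000000000
% largest normalized       7fefffffffffffff
% inf                      7ff0000000000000
```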
From what has been mentioned earlier, we know that the minimum and maximum positive numbers are, respectively,
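f_min = (0 + 2^−52) × 2^−1022 = eps × realmin = 2^−1074 ≈ 4.9407 × 10^−324 (1.2.7a)
f_max = (2 − 2^−52) × 2^1023 = realmax ≈ 1.7977 × 10^308 (1.2.7b)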
where the three MATLAB constants, i.e. eps, realmin, and realmax, represent 2^−52, 2^−1022, and (2 − 2^−52) × 2^1023, respectively. This can be checked by running the script “nm109.m” in Section 1.I.
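Independently of the book's script "nm109.m" (which is not reproduced here), a quick check of these constants might look like the following sketch. It relies only on the built‐in constants eps, realmin, and realmax.

```matlab
% Sketch (not the book's nm109.m): check the three constants against powers of 2.
disp(eps     == 2^-52)                       % 1 (true)
disp(realmin == 2^-1022)                     % 1 (true)
disp(realmax == (2 - 2^-52) * 2^1023)        % 1 (true)
fmin = realmin * eps;                        % smallest positive (un-normalized) number
disp(fmin == 2^-1074)                        % 1 (true)
disp(fmin / 2)                               % 0   (underflow)
disp(realmax * 2)                            % Inf (overflow)
```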
Now, in order to gain some idea about the arithmetic computational mechanism, let us see how the addition of two numbers, 3 and 14, represented in the IEEE 64‐bit floating number system, is performed.
In the process of adding the two numbers illustrated in Figure 1.6, the mantissa of the smaller number is shifted so that the two exponents in their 64‐bit representations become equal. This alignment discards the part of the smaller number that lies more than 52 bits below the leading bit of the larger one, causing some numerical error. For example, adding 2^−23 to 2^30 does not make any difference, while adding 2^−22 to 2^30 does, as we can see by typing the following statements into the MATLAB Command Window.
Figure 1.6 Process of adding two numbers, 3 and 14, in MATLAB.
>x=2^30; x+2^-22==x, x+2^-23==x
ans= 0(false)
ans= 1(true)
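The lost part is rounded rather than simply cut off. The sketch below assumes MATLAB's double arithmetic follows the IEEE default rounding mode (round to nearest, ties to even); under that assumption, adding exactly half of the LSB value leaves 2^30 unchanged but does change its odd‐mantissa neighbor. The variable names d and y are chosen here for illustration.

```matlab
% Sketch: behavior around 2^30, assuming round-to-nearest, ties-to-even.
x = 2^30;
d = eps(x);                                  % LSB value of this range, 2^-22
y = x + d;                                   % next double above x
disp(x + d/2    == x)                        % 1: exact tie, rounds to x (even mantissa)
disp(x + 0.75*d == y)                        % 1: more than half an LSB rounds up
disp(y + d/2    == y + d)                    % 1: exact tie, rounds to the even neighbor
```

So whether a small addend "makes a difference" depends on how it compares with half of the LSB value of the larger operand, not only on the 52‐bit limit itself.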
(cf) Each range has a different minimum unit (LSB value) described by Eq. (1.2.5). This implies that the numbers are uniformly distributed within each range, and that the closer a range is to 0, the denser its numbers are. Such a number representation makes the absolute quantization error large for large numbers and small for small numbers, so that the relative quantization error stays small throughout the normalized ranges.
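A small numerical check of this remark: in the normalized ranges the spacing relative to the number itself, eps(x)/x, should stay between 2^−53 and 2^−52 regardless of magnitude. The sample values below are arbitrary choices for illustration.

```matlab
% Sketch: absolute spacing grows with |x|, relative spacing stays nearly constant.
x   = [realmin, 1e-10, 1, 3, 14, 2^30, 1e100, realmax];
rel = eps(x) ./ x;                           % spacing relative to the value
disp(all(rel >= 2^-53 & rel <= 2^-52))       % 1 (true)
disp(eps(1e100) > eps(1))                    % 1: absolute spacing is larger for larger x
```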