IEEE 754

The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably.

[1] The current version, IEEE 754-2019, is a minor revision of the previous version, incorporating mainly clarifications, defect fixes and new recommended operations.

The need for a floating-point standard arose from chaos in the business and scientific computing industry in the 1960s and 1970s.

IBM used a hexadecimal floating-point format with a longer significand and a shorter exponent.

CDC 60-bit computers did not have full 60-bit adders, so integer arithmetic was limited to 48 bits of precision from the floating-point unit.

A new version, IEEE 754-2008, was published in August 2008, following a seven-year revision process, chaired by Dan Zuras and edited by Mike Cowlishaw.

The international standard ISO/IEC/IEEE 60559:2011 (with content identical to IEEE 754-2008) has been approved for adoption through ISO/IEC JTC 1/SC 25 under the ISO/IEEE PSDO Agreement[2][3] and published.

IEEE 754-2019 incorporates mainly clarifications (e.g. totalOrder) and defect fixes (e.g. minNum), but also includes some new recommended operations.

[5][6] The international standard ISO/IEC 60559:2020 (with content identical to IEEE 754-2019) has been approved for adoption through ISO/IEC JTC 1/SC 25 and published.

Due to the possibility of multiple encodings (at least in formats called interchange formats), a NaN may carry other information: a sign bit (which has no meaning, but may be used by some operations) and a payload, which is intended for diagnostic information indicating the source of the NaN (but the payload may have other uses, such as NaN-boxing[10][11][12]).
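The sign bit and payload of a NaN can be inspected by working on the raw bit pattern. The following is a minimal sketch for binary64 only. It assumes the common convention that the most significant bit of the significand field marks a quiet NaN, leaving 51 bits of payload. The helper names (make_nan_bits, decode_nan_bits, bits_to_float) are hypothetical and not defined by the standard.

```python
import math
import struct

SIGN_MASK = 1 << 63
EXP_MASK = 0x7FF << 52          # all exponent bits set: infinity or NaN
QUIET_BIT = 1 << 51             # assumed quiet-NaN flag (top significand bit)
PAYLOAD_MASK = QUIET_BIT - 1    # remaining 51 significand bits


def make_nan_bits(payload, negative=False, quiet=True):
    """Build a binary64 NaN bit pattern carrying `payload` (hypothetical helper)."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in 51 bits")
    if not quiet and payload == 0:
        # An all-zero significand field would encode infinity, not a NaN.
        raise ValueError("a signaling NaN needs a nonzero payload")
    bits = EXP_MASK | payload
    if quiet:
        bits |= QUIET_BIT
    if negative:
        bits |= SIGN_MASK
    return bits


def decode_nan_bits(bits):
    """Return (negative, quiet, payload) for a binary64 NaN bit pattern."""
    fraction = bits & ((1 << 52) - 1)
    if bits & EXP_MASK != EXP_MASK or fraction == 0:
        raise ValueError("not a NaN encoding")
    return bool(bits & SIGN_MASK), bool(bits & QUIET_BIT), bits & PAYLOAD_MASK


def bits_to_float(bits):
    """Reinterpret a 64-bit integer as a binary64 value."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


bits = make_nan_bits(0xBEEF, negative=True)
print(math.isnan(bits_to_float(bits)))   # True
print(decode_nan_bits(bits))             # (True, True, 48879)
```

The decoding is done on the integer rather than on the float because whether a payload survives arithmetic or copying through floating-point registers depends on the platform.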

The standard defines five basic formats that are named for their numeric base and the number of bits used in their interchange encoding.

For example, the smallest positive number that can be represented in binary64 is $2^{-1074}$; contributions to the −1074 figure include the emin value −1022 and all but one of the 53 significand bits ($2^{-1022 - (53 - 1)} = 2^{-1074}$).
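The arithmetic behind this figure can be checked directly. This is a small sketch that assumes Python's float is binary64 and that subnormal numbers are not flushed to zero on the host platform.

```python
import math
import struct
import sys

P = 53          # binary64 precision in bits
EMIN = -1022    # minimum exponent of a normal binary64 number

smallest_normal = math.ldexp(1.0, EMIN)                # 2**-1022
smallest_subnormal = math.ldexp(1.0, EMIN - (P - 1))   # 2**-1074

print(smallest_normal == sys.float_info.min)   # True
print(smallest_subnormal)                      # 5e-324

# The same value is the encoding with only the lowest significand bit set.
print(struct.unpack("<d", struct.pack("<Q", 1))[0] == smallest_subnormal)  # True

# Halving it is a tie between 0 and 2**-1074; round-half-to-even gives 0.
print(smallest_subnormal / 2 == 0.0)           # True
```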

An implementation may use whatever internal representation it chooses for such formats; all that needs to be defined are its parameters (b, p, and emax).

These parameters uniquely describe the set of finite numbers (combinations of sign, significand, and exponent for the given radix) that it can represent.
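As an illustration of how much follows from the three parameters alone, the sketch below derives the extreme finite values of a format from (b, p, emax). It assumes the relation emin = 1 − emax used by the IEEE 754-2008 formats. The function name format_limits is hypothetical, and exact rational arithmetic is used so that no rounding enters the result.

```python
from fractions import Fraction
import sys


def format_limits(b, p, emax):
    """Derive (emin, largest, smallest normal, smallest subnormal) from (b, p, emax)."""
    emin = 1 - emax                          # assumed relation (IEEE 754-2008 formats)
    b = Fraction(b)
    largest = b**emax * (b - b**(1 - p))     # every significand digit equal to b - 1
    smallest_normal = b**emin
    smallest_subnormal = b**(emin - (p - 1))
    return emin, largest, smallest_normal, smallest_subnormal


# binary64: b = 2, p = 53, emax = 1023
emin, largest, normal, subnormal = format_limits(2, 53, 1023)
print(emin)                                       # -1022
print(largest == Fraction(sys.float_info.max))    # True
print(normal == Fraction(sys.float_info.min))     # True
print(subnormal == Fraction(5e-324))              # True
```

The same function can be called with decimal parameters, for example format_limits(10, 16, 384) for decimal64.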

The original IEEE 754-1985 standard also had the concept of extended formats, but without any mandatory relation between emin and emax.

For the exchange of decimal floating-point numbers, interchange formats of any multiple of 32 bits are defined.
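The precision and exponent range of such a format follow from its width. The sketch below uses the formulas p = 9k/32 − 2 and emax = 3 × 2^(k/16 + 3) for a width of k bits, written from memory of the standard's parameter table and best treated as an assumption to check against the standard itself. The function name is hypothetical.

```python
def decimal_interchange_parameters(k):
    """Return (p, emax) for a decimal interchange format that is k bits wide."""
    if k < 32 or k % 32 != 0:
        raise ValueError("width must be a positive multiple of 32 bits")
    p = 9 * k // 32 - 2              # precision in decimal digits
    emax = 3 * 2 ** (k // 16 + 3)    # maximum exponent
    return p, emax


for k in (32, 64, 128):
    print(k, decimal_interchange_parameters(k))
# 32 (7, 96)
# 64 (16, 384)
# 128 (34, 6144)
```

The three printed rows agree with the well-known decimal32, decimal64 and decimal128 parameters.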

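Encoding the decimal significand as a compressed sequence of decimal digits (densely packed decimal) is more convenient for direct hardware implementation of the standard, while encoding it as a binary integer is more suited to software emulation on a binary computer.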

In either case, the set of numbers (combinations of sign, significand, and exponent) that may be encoded is identical, and special values (±zero with the minimum exponent, ±infinity, quiet NaNs, and signaling NaNs) have identical encodings.

The standard provides a predicate totalOrder, which defines a total ordering on canonical members of the supported arithmetic format.

The main differences are:[34] The totalOrder predicate does not impose a total ordering on all encodings in a format.
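For binary64, where every encoding is canonical, an ordering with the properties of totalOrder can be sketched by mapping each bit pattern to an integer key. This is an illustrative approach rather than the standard's definition, and the helper names are hypothetical. It does not cover decimal formats, whose cohorts and noncanonical encodings need separate handling.

```python
import math
import struct


def total_order_key(x):
    """Map a binary64 value to an integer whose ordering mimics totalOrder."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits >> 63:
        # Negative sign: flip every bit so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Non-negative sign: set the top bit so these sort above all negatives.
    return bits | (1 << 63)


def total_order(x, y):
    """True when x is ordered at or before y."""
    return total_order_key(x) <= total_order_key(y)


print(total_order(-0.0, 0.0))   # True: -0 is ordered before +0
print(total_order(0.0, -0.0))   # False
values = [math.inf, 0.0, -0.0, -1.5, math.nan, -math.inf, 2.0]
print(sorted(values, key=total_order_key))
```

With this key, a NaN whose sign bit is clear sorts after +infinity and a NaN whose sign bit is set sorts before −infinity, so the sorted list is well defined even though NaN is unordered under the usual comparisons.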

Some decimal floating-point implementations define additional exceptions,[36][37] which are not part of IEEE 754. Additionally, operations like quantize also signal the invalid operation exception when either operand is infinite or when the result does not fit the destination format.

The two values behave as equal in numerical comparisons, but some operations return different results for +0 and −0.

Other common functions with a discontinuity at x=0 which might treat +0 and −0 differently include Γ(x) and the principal square root of y + xi for any negative number y.
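A few of these cases can be observed from Python, assuming its float is binary64 and that the math and cmath modules honour the sign of zero, as they are documented to do on branch cuts. The square-root lines use y = −4 as an arbitrary negative number.

```python
import cmath
import math

print(0.0 == -0.0)                 # True: the two zeros compare equal
print(math.copysign(1.0, 0.0))     # 1.0
print(math.copysign(1.0, -0.0))    # -1.0: the sign is still observable

# atan2 is discontinuous across the negative real axis.
print(math.atan2(0.0, -1.0))       # 3.141592653589793
print(math.atan2(-0.0, -1.0))      # -3.141592653589793

# Principal square root of y + xi with y = -4 and x = +0 or -0:
# the sign of the zero imaginary part selects the side of the branch cut.
print(cmath.sqrt(complex(-4.0, 0.0)).imag)    # 2.0
print(cmath.sqrt(complex(-4.0, -0.0)).imag)   # -2.0
```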

[41] A property of the single- and double-precision formats is that their encoding allows one to easily sort them without using floating-point hardware, as if the bits represented sign-magnitude integers. It is unclear whether this was a design consideration, although the earlier IBM hexadecimal floating-point representation also had this property for normalized numbers.

The recommended operations also include setting and accessing dynamic mode rounding direction,[50] and implementation-defined vector reduction operations such as sum, scaled product, and dot product, whose accuracy is unspecified by the standard.

As of 2019, the minNum, maxNum, minNumMag, and maxNumMag operations that IEEE 754-2008 required are deprecated due to their non-associativity.
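The non-associativity shows up when a signaling NaN is involved. The sketch below uses the min operation of Python's decimal module as a stand-in. That module follows the General Decimal Arithmetic specification, and treating its NaN handling as equivalent to the 2008 minNum is an assumption made here for illustration.

```python
from decimal import Context, Decimal

# With all traps disabled, an invalid operation returns a quiet NaN
# instead of raising an exception.
ctx = Context(traps=[])
snan, two, one = Decimal("sNaN"), Decimal(2), Decimal(1)

# min(sNaN, 2) gives a quiet NaN, and min(NaN, 1) then returns the number 1.
left = ctx.min(ctx.min(snan, two), one)

# min(2, 1) gives 1, and min(sNaN, 1) then gives a quiet NaN.
right = ctx.min(snan, ctx.min(two, one))

print(left)    # 1
print(right)   # NaN
```

The two groupings of the same three operands give different results, so a reduction over a sequence would depend on evaluation order.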

Programming languages should allow a user to specify a minimum precision for intermediate calculations of expressions for each radix.

Thus, for instance, a compiler targeting x87 floating-point hardware should have a means of specifying that intermediate calculations must use the double-extended format.

The IEEE 754-1985 version of the standard allowed many variations in implementations (such as the encoding of some values and the detection of certain exceptions).

[60] The standard recommends providing conversions to and from external hexadecimal-significand character sequences, based on C99's hexadecimal floating point literals.
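Python exposes conversions of this kind through float.hex and float.fromhex, which use the C99-style notation. The short sketch below assumes a binary64 float and shows that the round trip is exact, since every hexadecimal digit maps onto four significand bits.

```python
x = 0.1
text = x.hex()
print(text)                         # 0x1.999999999999ap-4
print(float.fromhex(text) == x)     # True: no rounding in either direction
print(float.fromhex("0x1.8p+1"))    # 3.0, i.e. 1.5 * 2**1
print((1.0).hex())                  # 0x1.0000000000000p+0
```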