Fixed-point arithmetic

Since most modern processors have a fast floating-point unit (FPU), fixed-point representations in processor-based implementations are now used only in special situations, such as in low-cost embedded microprocessors and microcontrollers; in applications that demand high speed, low power consumption, or small chip area, like image, video, and digital signal processing; or when their use is more natural for the problem.

Negative values are usually represented in binary fixed-point format as a signed integer in two's complement representation with an implicit scaling factor as above.

Alternatively, negative values can be represented by an integer in the sign-magnitude format, in which case the sign is never included in the number of implied fraction bits.

For greater efficiency, scaling factors are often chosen to be powers (positive or negative) of the base b used to represent the integers internally.

Thus one often uses scaling factors that are powers of 10 (e.g. 1/100 for dollar values), for human convenience, even when the integers are represented internally in binary.

If the range of the values to be represented is known in advance and is sufficiently limited, fixed point can make better use of the available bits.

Specifically, comparing 32-bit fixed-point to floating-point audio, a recording requiring less than 40 dB of headroom has a higher signal-to-noise ratio using 32-bit fixed point.

Avoidance of overflow requires much tighter estimates for the ranges of variables and all intermediate values in the computation, and often also extra code to adjust their scaling factors.

A common use of decimal fixed-point is for storing monetary values, for which the complicated rounding rules of floating-point numbers are often a liability.

For example, the open-source money management application GnuCash, written in C, switched from floating-point to fixed-point as of version 1.6, for this reason.

Binary fixed point is used in the STM32G4 series CORDIC co-processors and in the discrete cosine transform algorithms used to compress JPEG images.

Electronic instruments such as electricity meters and digital clocks often use polynomials to compensate for introduced errors, e.g. from temperature or power supply voltage.

Binary fixed-point polynomials can utilize more bits of precision than floating-point and do so in fast code using inexpensive CPUs.

If the result is not exact, the error introduced by the rounding can be reduced or even eliminated by converting the dividend to a smaller scaling factor.

In order to return to the original scaling factor 1/100, the integer 3075 then must be multiplied by 1/100, that is, divided by 100, to yield either 31 (0.31) or 30 (0.30), depending on the rounding policy used.

Similarly, the operation r ā† r/s will require dividing the integers and explicitly multiplying the quotient by S. Rounding and/or overflow may occur here too.

Depending on the scaling factor and storage size, and on the range of the input numbers, the conversion may not entail any rounding.

In such machines, conversion of decimal scaling factors can be performed by bit shifts and/or by memory address manipulation.

Some DSP architectures offer native support for specific fixed-point formats, for example, signed n-bit numbers with n−1 fraction bits (whose values may range between −1 and almost +1).

If the CPU does not provide that feature, the programmer must save the product in a large enough register or temporary variable, and code the renormalization explicitly.

In case of overflow, the high-order bits are usually lost, as the un-scaled integer gets reduced modulo 2^n, where n is the size of the storage area.

Some processors may instead provide saturation arithmetic: if the result of an addition or subtraction were to overflow, they store instead the value with the largest magnitude that can fit in the receiving area and has the correct sign.

However, these features are not very useful in practice; it is generally easier and safer to select scaling factors and word sizes so as to exclude the possibility of overflow, or to check the operands for excessive values before executing the operation.

Explicit support for fixed-point numbers is provided by a few programming languages, notably PL/I, COBOL, Ada, JOVIAL, and Coral 66.

More modern languages usually do not offer any fixed-point data types or support for scaling factor conversion.

The wide availability of fast floating-point processors, with strictly standardized behavior, has greatly reduced the demand for binary fixed-point support.

In the few situations that call for fixed-point operations, they can be implemented by the programmer, with explicit scaling conversion, in any programming language.

On the other hand, virtually all relational databases and the SQL standard support fixed-point decimal arithmetic and storage of numbers.

For a more complicated example, suppose that the two numbers 1.2 and 5.6 are represented in 32-bit fixed-point format with 30 and 20 fraction bits, respectively. Multiplying their integer representations yields a product with 30 + 20 = 50 fraction bits; note that storing this value directly into a 32-bit integer variable would result in overflow and loss of the most significant bits.