Fermi is the oldest microarchitecture from Nvidia that receives support for Microsoft's rendering API Direct3D 12 feature_level 11.
Fermi Graphics Processing Units (GPUs) feature 3.0 billion transistors; a schematic is sketched in Fig.
Each SM features 32 single-precision CUDA cores, 16 load/store units, four Special Function Units (SFUs), a 64 KB block of high-speed on-chip memory (see L1+Shared Memory subsection) and an interface to the L2 cache (see L2 Cache subsection).
It is also optimized to efficiently support 64-bit operations in workstation and server models, but this capability is artificially limited in consumer versions.
Fermi implements the IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single- and double-precision arithmetic.
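The practical difference between an FMA and a separate multiply followed by an add is that the FMA rounds only once, after the addition. The following is a minimal sketch of that behaviour, emulated on the CPU in double precision with exact rational arithmetic; it does not run on Fermi hardware, and the function names and input values are illustrative assumptions chosen to make the rounding difference visible.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a*b + c with a single final rounding (FMA-style)."""
    # Fraction(float) is exact, so the product and sum carry no rounding error.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    # Converting back to float rounds once, to the nearest double.
    return float(exact)


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Ordinary a*b + c: the product is rounded before the addition."""
    return a * b + c


# Illustrative inputs: the exact product is 1 - 2**-60, which is not
# representable as a double and rounds to 1.0 when computed on its own.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

fused = fused_multiply_add(a, b, c)
unfused = separate_multiply_add(a, b, c)

print("fused:  ", fused)
print("unfused:", unfused)
```

With these inputs the unfused result loses the small term entirely, while the single-rounding result keeps it.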
Each SM can issue instructions consuming any two of the four green execution columns shown in the schematic Fig.
The theoretical single-precision processing power of a Fermi GPU in GFLOPS is computed as 2 (operations per FMA instruction per CUDA core per cycle) × number of CUDA cores × shader clock speed (in GHz).
On GF100/110, the theoretical double-precision processing power of a Fermi GPU is 1/2 of its single-precision performance.
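The peak-throughput formula can be sketched as a short calculation. The core count below follows from the figures in this article (16 SMs with 32 CUDA cores each); the 1.5 GHz shader clock is an assumed round value for illustration only and does not describe a specific product, and the function and constant names are hypothetical.

```python
def peak_single_precision_gflops(cuda_cores: int, shader_clock_ghz: float) -> float:
    """Theoretical single-precision peak in GFLOPS."""
    # An FMA counts as 2 floating-point operations per CUDA core per cycle.
    return 2 * cuda_cores * shader_clock_ghz


def peak_double_precision_gflops(
    cuda_cores: int, shader_clock_ghz: float, dp_ratio: float = 0.5
) -> float:
    """Theoretical double-precision peak; dp_ratio is 1/2 on GF100/110."""
    return peak_single_precision_gflops(cuda_cores, shader_clock_ghz) * dp_ratio


SM_COUNT = 16                   # from the article
CORES_PER_SM = 32               # from the article
ASSUMED_SHADER_CLOCK_GHZ = 1.5  # assumption for illustration only

cores = SM_COUNT * CORES_PER_SM
sp_gflops = peak_single_precision_gflops(cores, ASSUMED_SHADER_CLOCK_GHZ)
dp_gflops = peak_double_precision_gflops(cores, ASSUMED_SHADER_CLOCK_GHZ)

print(f"{cores} cores at {ASSUMED_SHADER_CLOCK_GHZ} GHz: "
      f"{sp_gflops} GFLOPS single precision, {dp_gflops} GFLOPS double precision")
```

These are theoretical ceilings; `dp_ratio` is left as a parameter because the 1/2 ratio is stated only for GF100/110.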
Fermi has a 768 KB unified L2 cache, shared among the 16 SMs, that services all loads and stores from/to global memory, including copies to/from the CPU host, as well as texture requests.
The L2 cache subsystem also implements atomic operations, used for managing access to data that must be shared across thread blocks or even kernels.
Global memory (VRAM) is directly accessible by all threads, and by the host system over the PCIe bus.