Graphics processing units designed by companies such as Nvidia and AMD often include AI-specific hardware, and are commonly used as AI accelerators, both for training and inference.
As deep learning and artificial intelligence workloads rose in prominence in the 2010s, specialized hardware units were developed or adapted from existing products to accelerate these tasks.
Later, the successors of DianNao (DaDianNao,[22] ShiDianNao,[23] PuDianNao[24]) were proposed by the same group, forming the DianNao Family.[25] Smartphones began incorporating AI accelerators starting with the Qualcomm Snapdragon 820 in 2015.
[32][33][34] In the 2000s, CPUs also gained increasingly wide SIMD units, driven by video and gaming workloads, as well as support for packed low-precision data types.
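As a concrete illustration of packed low-precision SIMD, the following minimal sketch (not taken from the cited sources; the function name and data layout are illustrative) computes an INT8 dot product with x86 AVX2 intrinsics, the kind of operation these CPU extensions accelerate:

```cpp
// Minimal sketch: packed INT8 dot product with AVX2 intrinsics.
// Compile with -mavx2. The function name and data layout are illustrative.
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

// Dot product of two 32-element vectors of 8-bit values.
// _mm256_maddubs_epi16 multiplies unsigned bytes from `a` with signed
// bytes from `b` and adds adjacent pairs into packed 16-bit sums.
int dot_int8_avx2(const uint8_t* a, const int8_t* b) {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    __m256i prod16 = _mm256_maddubs_epi16(va, vb);
    // Widen to 32-bit partial sums: multiply by 1 and add adjacent pairs.
    __m256i prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));
    // Horizontally reduce the eight 32-bit lanes.
    alignas(32) int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), prod32);
    int sum = 0;
    for (int i = 0; i < 8; ++i) sum += lanes[i];
    return sum;
}

int main() {
    uint8_t a[32];
    int8_t b[32];
    for (int i = 0; i < 32; ++i) { a[i] = 2; b[i] = 3; }
    std::printf("%d\n", dot_int8_avx2(a, b));  // prints 192 (32 * 2 * 3)
    return 0;
}
```

A single such instruction processes 32 byte-wide multiplications at once, which is why packed low-precision data types matter for throughput.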
During the 2010s, GPU manufacturers such as Nvidia added deep learning-related features in both hardware (e.g., INT8 operators) and software (e.g., the cuDNN library).
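The software side of these low-precision paths ultimately reduces to quantized integer arithmetic. The sketch below is illustrative only: it is a plain host-side reference rather than vendor code, and the helper names and symmetric per-tensor scale are assumptions. It shows how float vectors can be mapped to INT8, multiplied with INT32 accumulation, and rescaled:

```cpp
// Minimal sketch of symmetric INT8 quantization with INT32 accumulation,
// the arithmetic pattern that hardware INT8 operators are built around.
// Helper names and the per-tensor scale choice are illustrative.
// Compile with -std=c++17 (std::clamp).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Pick a per-tensor scale so the largest magnitude maps to 127.
float choose_scale(const std::vector<float>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    return amax > 0.0f ? amax / 127.0f : 1.0f;
}

int8_t quantize(float v, float scale) {
    int q = static_cast<int>(std::lround(v / scale));
    return static_cast<int8_t>(std::clamp(q, -127, 127));
}

// INT8 dot product accumulated in INT32, then rescaled to float.
float int8_dot(const std::vector<float>& a, const std::vector<float>& b) {
    float sa = choose_scale(a), sb = choose_scale(b);
    int32_t acc = 0;
    for (size_t i = 0; i < a.size(); ++i)
        acc += static_cast<int32_t>(quantize(a[i], sa)) *
               static_cast<int32_t>(quantize(b[i], sb));
    return acc * sa * sb;  // undo both scales
}

int main() {
    std::vector<float> a = {0.5f, -1.0f, 2.0f};
    std::vector<float> b = {1.5f, 0.25f, -0.5f};
    std::printf("int8 approx: %f, exact: %f\n",
                int8_dot(a, b), 0.5f * 1.5f - 1.0f * 0.25f + 2.0f * -0.5f);
    return 0;
}
```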
For example, Summit, a supercomputer from IBM for Oak Ridge National Laboratory,[43] contains 27,648 Nvidia Tesla V100 cards, which can be used to accelerate deep learning algorithms.
Reconfigurable devices such as field-programmable gate arrays (FPGA) make it easier to evolve hardware, frameworks, and software alongside each other.
While GPUs and FPGAs perform far better than CPUs for AI-related tasks, a factor of up to 10 in efficiency[47][48] may be gained with a more specific design, via an application-specific integrated circuit (ASIC).
[52][53] Cerebras Systems has built a dedicated AI accelerator based on the largest processor in the industry, the second-generation Wafer Scale Engine (WSE-2), to support deep learning workloads.
[59] In 2019, researchers from Politecnico di Milano found a way to solve systems of linear equations in a few tens of nanoseconds via a single operation.
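The broad principle behind such one-step solvers, sketched here in notation not taken from the paper, is that a crossbar of programmable conductances G performs an analogue matrix-vector multiplication via Ohm's and Kirchhoff's laws, and placing the array in a feedback loop makes the circuit settle at the solution of the corresponding linear system:

$$ I = G V \quad \text{(open loop: multiplication)}, \qquad G x = b \;\Longrightarrow\; x = G^{-1} b \quad \text{(closed loop: solution in one settling step)} $$

On this reading, the few-tens-of-nanoseconds figure corresponds to the settling time of an analogue circuit rather than to a sequence of digital operations.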
[60] In 2020, Marega et al. published experiments with a large-area active channel material for developing logic-in-memory devices and circuits based on floating-gate field-effect transistors (FGFETs).
[61] Such atomically thin semiconductors are considered promising for energy-efficient machine learning applications, where the same basic device structure is used for both logic operations and data storage.
The authors used two-dimensional materials such as semiconducting molybdenum disulphide to precisely tune FGFETs as building blocks in which logic operations can be performed with the memory elements.
There is no consensus on the boundary between these devices, nor on the exact form they will take; however, several examples clearly aim to fill this new space, with a fair amount of overlap in capabilities.
[65] Inspired by the pioneering work of the DianNao Family, many deep learning processors (DLPs) have been proposed in both academia and industry, with designs optimized to leverage the features of deep neural networks for high efficiency.
Such efforts include Eyeriss (MIT),[66] EIE (Stanford),[67] Minerva (Harvard),[68] and Stripes (University of Toronto)[69] in academia, and the TPU (Google)[70] and MLU (Cambricon) in industry.
[73][77][78] Such architectures significantly shorten data paths and leverage much higher internal bandwidth, resulting in attractive performance improvements.