In parallel computer architectures, a systolic array is a homogeneous network of tightly coupled data processing units (DPUs) called cells or nodes.
Systolic arrays were first used in Colossus, an early computer used to break German Lorenz ciphers during World War II.[1] Due to the classified nature of Colossus, they were independently invented or rediscovered by H. T. Kung and Charles Leiserson, who described arrays for many dense linear algebra computations (matrix product, solving systems of linear equations, LU decomposition, etc.).
Early applications include computing greatest common divisors of integers and polynomials.[2] Systolic arrays are sometimes classified as multiple-instruction single-data (MISD) architectures under Flynn's taxonomy, but this classification is questionable: a strong argument can be made that systolic arrays differ from each of Flynn's four categories (SISD, SIMD, MISD, MIMD), as discussed later in this article.
Systolic arrays are often hard-wired for specific operations, such as "multiply and accumulate", to perform massively parallel integration, convolution, correlation, matrix multiplication or data sorting tasks.
A systolic array typically consists of a large monolithic network of primitive computing nodes which can be hardwired or software configured for a specific application.
The more general wavefront processors, by contrast, employ sophisticated and individually programmable nodes which may or may not be monolithic, depending on the array size and design parameters.
Another distinction is that systolic arrays rely on synchronous data transfers, while wavefront processors tend to work asynchronously.
Unlike the more common Von Neumann architecture, where program execution follows a script of instructions stored in common memory, addressed and sequenced under the control of the CPU's program counter (PC), the individual nodes within a systolic array are triggered by the arrival of new data and always process the data in exactly the same way.
There is no need to access external buses, main memory or internal caches during each operation as is the case with Von Neumann or Harvard sequential machines.
The sequential limits on parallel performance dictated by Amdahl's Law also do not apply in the same way, because data dependencies are implicitly handled by the programmable node interconnect and there are no sequential steps in managing the highly parallel data flow.
Systolic arrays are therefore well suited to artificial intelligence, image processing, pattern recognition, computer vision, and other tasks that animal brains perform particularly well.
Wavefront processors in general can also be very effective at machine learning by implementing self-configuring neural networks in hardware.
Because the input is typically a vector of independent values, the systolic array is definitely not SISD.
Since these input values are merged and combined into the result(s) and do not maintain their independence as they would in a SIMD vector processing unit, the array cannot be classified as SIMD either.
In spite of all of the above, systolic arrays are often offered as a classic example of MISD architecture in textbooks on parallel computing and in engineering classes.
If the array is viewed from the outside as atomic, it should perhaps be classified as SFMuDMeR: single function, multiple data, merged result(s).
Systolic arrays use a pre-defined computational flow graph that connects their nodes.
A systolic array is composed of matrix-like rows of data processing units called cells.
Data processing units (DPUs) are similar to central processing units (CPUs), except for the usual lack of a program counter,[3] since operation is transport-triggered, i.e., initiated by the arrival of a data object.
The data streams entering and leaving the ports of the array are generated by auto-sequencing memory units (ASMs).
In embedded systems a data stream may also be input from and/or output to an external source.
The consequence is that only applications with regular data dependencies can be implemented on classical systolic arrays.
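As a minimal sketch of this cell model (the MACCell class, its fire method, and the sample values below are illustrative assumptions, not taken from any cited design), a transport-triggered multiply-accumulate cell and a linear chain of such cells computing a dot product might look like this in Python:

```python
from dataclasses import dataclass

@dataclass
class MACCell:
    """A data processing unit with no program counter: it acts only
    when operands arrive, and always applies the same fixed
    multiply-accumulate operation."""
    weight: float   # value held locally in the cell

    def fire(self, x_in: float, acc_in: float) -> tuple[float, float]:
        # Transport-triggered: the arrival of (x_in, acc_in) is the
        # only "instruction". The input is forwarded unchanged and
        # the running sum is updated.
        return x_in, acc_in + self.weight * x_in

# A linear chain of cells computing a dot product: each beat, the
# running sum advances one cell to the right while the next input
# element is presented to that cell's x port.
weights = [2.0, -1.0, 0.5]
xs = [1.0, 2.0, 3.0]
cells = [MACCell(w) for w in weights]

acc = 0.0
for cell, x in zip(cells, xs):
    _, acc = cell.fire(x, acc)

print(acc)   # 2*1 + (-1)*2 + 0.5*3 = 1.5
```

Each call to fire corresponds to one beat of the systolic clock: the cell acts only because data arrived and always applies the same fixed operation, which is what the absence of a program counter means in practice.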
One well-known systolic array is Carnegie Mellon University's iWarp processor, which was manufactured by Intel.
An iWarp system has a linear array processor connected by data buses going in both directions.
Horner's rule for evaluating a polynomial is:

p(x) = (...((a_n x + a_{n-1}) x + a_{n-2}) x + ... + a_1) x + a_0

A linear systolic array in which the processors are arranged in pairs (one multiplies its input by x, the other adds the next coefficient a_j and passes the result on) can evaluate a polynomial in this manner.
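The following Python sketch is a hypothetical model of such an array (the function name and staging are assumptions made for illustration): each processor pair becomes a fixed multiply stage and a fixed add stage, and the running value pulses through them one beat per cell.

```python
def horner_systolic(coeffs, x):
    """Evaluate a polynomial with coefficients [a_n, ..., a_1, a_0]
    at x on a linear array of processor pairs: a multiply cell
    (value * x) followed by an add cell (value + a_j)."""
    cells = []
    for a_j in coeffs[1:]:
        cells.append(lambda v: v * x)          # multiply cell
        cells.append(lambda v, a=a_j: v + a)   # add cell
    value = coeffs[0]        # a_n enters the array first
    for cell in cells:       # one beat of the pipeline per cell
        value = cell(value)
    return value

# p(x) = 2x^2 + 3x + 1 at x = 4  ->  2*16 + 3*4 + 1 = 45
print(horner_systolic([2, 3, 1], 4))
```

In hardware, each stage would be a separate cell, so successive inputs can be pipelined through the array rather than processed one at a time as in this sequential model.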
The weights remain stationary within each processing element (PE), while the input data and partial sums move through the array systolically, meaning data flows through the array in a regular, rhythmic manner.[5]
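A cycle-level simulation can make this dataflow concrete. The sketch below is a toy model under simplifying assumptions (unit-latency hops, an ideal one-cycle skew per row, and made-up function and variable names), not a description of any particular accelerator: it pre-loads W into a grid of PEs, streams X in from the left edge, and collects X @ W at the bottom edge.

```python
import numpy as np

def systolic_matmul_ws(X, W):
    """Cycle-level sketch of a weight-stationary systolic array
    computing X @ W. PE (k, n) holds W[k, n]; activations stream
    rightward with a one-cycle skew per row, and partial sums pulse
    downward, emerging at the bottom edge."""
    M, K = X.shape
    _, N = W.shape
    x_reg = np.zeros((K, N))   # activation registers between PEs
    s_reg = np.zeros((K, N))   # partial-sum registers between PEs
    out = np.zeros((M, N))
    for t in range(M + K + N - 2):   # cycles until the array drains
        new_x = np.empty_like(x_reg)
        new_s = np.empty_like(s_reg)
        for k in range(K):
            for n in range(N):
                if n == 0:                  # skewed edge injection
                    m = t - k
                    xi = X[m, k] if 0 <= m < M else 0.0
                else:                       # from the left neighbour
                    xi = x_reg[k, n - 1]
                # partial sum from the PE above, or zero at the top
                si = s_reg[k - 1, n] if k > 0 else 0.0
                new_x[k, n] = xi
                new_s[k, n] = si + W[k, n] * xi   # fixed MAC per PE
        x_reg, s_reg = new_x, new_s
        for n in range(N):                  # bottom-edge collection
            m = t - (K - 1) - n
            if 0 <= m < M:
                out[m, n] = s_reg[K - 1, n]
    return out

X = np.arange(6, dtype=float).reshape(2, 3)
W = np.arange(12, dtype=float).reshape(3, 4)
print(np.allclose(systolic_matmul_ws(X, W), X @ W))   # True
```

The skewed injection at the left edge and the staggered collection at the bottom edge reproduce the rhythmic property described above: each partial sum meets exactly the operand it needs, one hop per cycle.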
See [5], Figure 12, for an algorithm that performs on-the-fly least-squares using one- and two-dimensional systolic arrays.