Stream processing

The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks.

Stream processing hardware can use scoreboarding, for example, to initiate a direct memory access (DMA) when dependencies become known.

Considerable effort has been spent on finding alternative ways to perform massive amounts of computation, but the only practical solution has been to exploit some level of parallel execution.

The result of those efforts was SIMD, a programming paradigm that allows applying one instruction to multiple instances of (different) data.
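As a minimal sketch of the idea, consider adding two arrays of floats, first element by element and then four at a time with SSE intrinsics (the function names below are illustrative, and the intrinsics assume an x86 compiler with SSE support):

    #include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

    /* Scalar version: one addition per loop iteration. */
    void add_scalar(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }

    /* SIMD version: one SSE instruction adds four floats at once.
       For brevity, n is assumed to be a multiple of 4. */
    void add_simd(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* 4 additions in one instruction */
        }
    }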

Although this is what happens with instruction intrinsics, much information is not taken into account here, such as the number of vector components and their data format.

The speed-up is somewhat limited by the assumption of performing four parallel operations (note that this width is common to both AltiVec and SSE).[5]

By contrast, dedicated stream processors easily reach over 10x performance, mainly attributable to more efficient memory access and higher levels of parallel processing.[9]

AT&T also researched stream-enhanced processors as graphics processing units rapidly evolved in both speed and functionality.

The most immediate challenge in the realm of parallel processing lies not so much in the type of hardware architecture used as in how easily the system can be programmed in a real-world environment with acceptable performance.

Machines like Imagine use a straightforward single-threaded model with automated dependencies, memory allocation and DMA scheduling.

This is itself a result of research at MIT and Stanford into finding an optimal layering of tasks between programmer, tools and hardware.

Programmers often create representations of entities in memory, for example the location of a particle in 3D space, the colour of the ball and its size. When many of these structures exist in memory they are placed end to end, creating an array-of-structures (AoS) topology, as in the sketch below.
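A minimal sketch of such a record and its array-of-structures layout (the particle_t type and its field names are illustrative, not taken from any particular codebase):

    /* One record per entity: position in 3D space, colour, and size. */
    typedef struct {
        float x, y, z;   /* location of the particle in 3D space */
        float r, g, b;   /* colour of the ball */
        float size;      /* its size */
    } particle_t;

    /* Array of structures (AoS): whole records are placed end to end in memory. */
    particle_t particles[1024];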

Additionally, a SIMD instruction will typically expect the data it operates on to be contiguous in memory; the elements may also need to be aligned.
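One common way to satisfy these requirements, sketched below under the assumption of a C11 compiler (for alignas), is to store each field in its own contiguous, aligned array, a structure-of-arrays layout:

    #include <stdalign.h>   /* C11 alignas */

    /* Structure of arrays (SoA): each field is contiguous in memory and
       16-byte aligned, so one SIMD load can fetch four consecutive x
       (or y, or z) values in a single aligned access. */
    typedef struct {
        alignas(16) float x[1024];
        alignas(16) float y[1024];
        alignas(16) float z[1024];
        alignas(16) float size[1024];
    } particles_soa_t;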

The stream abstraction provides a means to specify data dependencies implicitly while enabling the runtime/hardware to take full advantage of that knowledge for efficient computation, as in the sketch below.
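A rough illustration of this programming style, assuming a hypothetical run_kernel driver and expressing the kernel as a pure per-element function (this is a sketch of the style, not the API of any particular stream framework):

    /* A kernel: a pure function applied independently to every stream element.
       Because each output depends only on the corresponding input, the
       runtime/hardware is free to batch, reorder and parallelise the work. */
    typedef float (*kernel_fn)(float);

    static float scale_kernel(float v) { return v * 2.0f; }

    /* Hypothetical driver: applies one kernel over an entire input stream.
       A real stream system would schedule this over many ALU clusters and
       manage the DMA transfers itself; here it is just a loop. */
    static void run_kernel(kernel_fn k, const float *in, float *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = k(in[i]);
    }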

Since fetching data and opcodes to a conventional CPU's few ALUs is expensive, very little die area is dedicated to actual mathematical machinery (as a rough estimate, less than 10%).

A similar architecture exists on stream processors, but thanks to the new programming model the number of transistors dedicated to management is actually very small.

The stream processor is usually equipped with a fast, efficient, proprietary memory bus (crossbar switches are now common; multi-buses were employed in the past).

By contrast, standard processors from the Intel Pentium to some Athlon 64 models have only a single 64-bit-wide data bus.

Because of the SIMD nature of the stream processor's execution units (ALU clusters), read/write operations are expected to happen in bulk, so memories are optimized for high bandwidth rather than low latency (this is a difference from Rambus and DDR SDRAM, for example).

The stream register file (SRF) is conceptually a large cache in which stream data is stored to be transferred to external memory in bulk.

The key innovation, demonstrated with Stanford's Imagine chip, is that the compiler is able to automate memory allocation in an optimal way, fully transparent to the programmer.

The dependencies between kernel functions and data are known through the programming model, which enables the compiler to perform flow analysis and optimally pack the SRFs.

Commonly, this cache and DMA management can take up the majority of a project's schedule, something the stream processor (or at least Imagine) fully automates.

Tests done at Stanford showed that the compiler did as good a job or better at scheduling memory than hand tuning with considerable effort.

Although an order of magnitude speedup can be reasonably expected (even from mainstream GPUs when computing in a streaming manner), not all applications benefit from this.

Non-commercial examples of stream programming languages include:

Commercial implementations are either general purpose or tied to specific hardware by a vendor.