BrookGPU

Other improvements in the v0.5 series include multi-backend usage, whereby different threads can run different Brook programs concurrently (thus maximising use of a multi-GPU setup), and SSE and OpenMP support for the CPU backend, which allows near-maximal usage of modern CPUs.

For example, a 2.66 GHz Intel Core 2 Duo can perform a maximum of 25 GFLOPS (25 billion single-precision floating-point operations per second) when it uses SSE optimally and accesses memory in a streaming pattern, so that the prefetcher works perfectly.

However, because of shader program length limits, most GPGPU kernels have traditionally performed relatively small amounts of work on large amounts of data in parallel. The big problem with directly executing GPGPU algorithms on desktop CPUs is therefore their vastly lower memory bandwidth: generally speaking, the CPU spends most of its time waiting on RAM.

As a result, when a kernel is constrained by memory bandwidth, Brook's CPU backend won't exceed 2 GFLOPS.
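The gap between the peak figure and the bandwidth-bound figure can be illustrated with a simple roofline-style estimate, in which attainable throughput is the smaller of the compute peak and the product of memory bandwidth and arithmetic intensity. The sketch below is not part of Brook. The 25 GFLOPS peak comes from the text above, while the 8 GB/s bandwidth and the kernel intensities are assumptions chosen only for illustration.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style upper bound on throughput, in GFLOPS.

    peak_gflops    -- compute peak of the processor
    bandwidth_gb_s -- sustained memory bandwidth in GB/s
    flops_per_byte -- arithmetic intensity of the kernel
    """
    memory_bound = bandwidth_gb_s * flops_per_byte
    return min(peak_gflops, memory_bound)


PEAK_GFLOPS = 25.0     # peak figure quoted in the text
BANDWIDTH_GB_S = 8.0   # ASSUMPTION: illustrative sustained bandwidth

# ASSUMPTION: a light kernel doing 1 flop per 4-byte float it touches.
light_kernel = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, 0.25)

# ASSUMPTION: a compute-heavy kernel doing 16 flops per byte.
heavy_kernel = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, 16.0)

print(f"light kernel bound: {light_kernel} GFLOPS")
print(f"heavy kernel bound: {heavy_kernel} GFLOPS")
```

Under these assumed numbers, the light kernel is limited by memory traffic to a small fraction of the peak, whereas the heavy kernel is limited by the compute peak itself. Real figures depend on the memory system and on the kernel being run.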

For large datasets, this can greatly diminish the speed increase of using a GPU over a well-tuned CPU implementation.