OpenCL

Programs in the OpenCL language are intended to be compiled at run-time, so that OpenCL-using applications are portable between implementations for various host devices.
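For instance, a host program typically hands the kernel source to the OpenCL implementation as a string and has it compiled on the spot. The following is a minimal hypothetical sketch (the function name build_noop and the trivial kernel are illustrative, not from any specification), using the standard clCreateProgramWithSource and clBuildProgram API calls:

#include <CL/cl.h>

/* Hypothetical sketch: OpenCL C source is compiled at run time for
 * whatever devices back the given context, so the same application
 * works across vendor implementations. Error checking omitted. */
cl_kernel build_noop(cl_context ctx)
{
    const char *src = "__kernel void noop(void) { }";
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 0, NULL, "", NULL, NULL);  /* run-time (JIT) compilation */
    return clCreateKernel(prog, "noop", &err);
}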

In order to open the OpenCL programming model to other languages or to protect the kernel source from inspection, the Standard Portable Intermediate Representation (SPIR)[17] can be used as a target-independent way to ship kernels between a front-end compiler and the OpenCL back-end.

Consistency between the various levels in the hierarchy is relaxed, and only enforced by explicit synchronization constructs, notably barriers.
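For example, within a work-group, stores to local memory by one work-item are only guaranteed to be visible to the other work-items after a barrier. The following sketch is hypothetical (the kernel name group_sum and its arguments are illustrative) and assumes a power-of-two work-group size:

// Hypothetical sketch: a work-group sum reduction in OpenCL C.
// Each barrier guarantees that every work-item's store to local
// memory is visible group-wide before any work-item reads it.
__kernel void group_sum(__global const float *in,
                        __global float *out,
                        __local float *scratch)
{
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);       // synchronize within the work-group

    // Tree reduction over local memory; assumes power-of-two group size.
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);   // all work-items must reach this
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}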

OpenCL adopts C/C++-based languages to specify the kernel computations performed on the device with some restrictions and additions to facilitate efficient mapping to the heterogeneous hardware resources of accelerators.

Function pointers, bit fields and variable-length arrays are omitted, and recursion is forbidden.

OpenCL C is extended to facilitate use of parallelism with vector types and operations, synchronization, and functions to work with work-items and work-groups.[20]: § 6.1.2

Vectorized operations on these types are intended to map onto SIMD instruction sets, e.g., SSE or VMX, when running OpenCL programs on CPUs.[20]: 10–11
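As an illustration (a hypothetical sketch, with the kernel name scale4 and its parameters invented for this example), a kernel operating on float4 values performs four scalar multiplications per operation, which an implementation can lower to a single packed SIMD instruction:

// Hypothetical sketch: OpenCL C vector types. One float4 multiply
// can map to a single SIMD instruction (e.g., an SSE packed multiply).
__kernel void scale4(__global const float4 *in,
                     __global float4 *out,
                     const float factor)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;   // component-wise: four multiplies at once
}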

The following is a matrix–vector multiplication algorithm in OpenCL C. The kernel function matvec computes, in each invocation, the dot product of a single row of a matrix A and a vector x:
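One possible form of this kernel (a sketch assuming A is stored in row-major order):

__kernel void matvec(__global const float *A, __global const float *x,
                     uint ncols, __global float *y)
{
    size_t i = get_global_id(0);             // this work-item's row index
    __global float const *a = &A[i * ncols]; // pointer to the i'th row (row-major)
    float sum = 0.f;
    for (size_t j = 0; j < ncols; j++)
        sum += a[j] * x[j];                  // dot product of row i with x
    y[i] = sum;
}

Each invocation (work-item) of the kernel takes a row of the matrix (A in the code), multiplies this row with the vector (x) and places the result in an entry of the result vector (y). The number of columns n is passed to the kernel as ncols; the number of rows is implicit in the number of work-items produced by the host program.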

To extend this into a full matrix–vector multiplication, the OpenCL runtime maps the kernel over the rows of the matrix.
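On the host side, this amounts to enqueueing the kernel with a one-dimensional global work size equal to the number of rows. The following sketch is hypothetical (the function run_matvec is invented for illustration; it assumes a context, command queue, and compiled kernel already exist, and omits error checking):

#include <CL/cl.h>

/* Hypothetical host-side sketch: launches the matvec kernel with one
 * work-item per matrix row. */
void run_matvec(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                const float *A, const float *x, float *y,
                cl_uint nrows, cl_uint ncols)
{
    cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               (size_t)nrows * ncols * sizeof(float),
                               (void *)A, NULL);
    cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               ncols * sizeof(float), (void *)x, NULL);
    cl_mem dy = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                               nrows * sizeof(float), NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &dA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &dx);
    clSetKernelArg(kernel, 2, sizeof(cl_uint), &ncols);
    clSetKernelArg(kernel, 3, sizeof(cl_mem), &dy);

    /* One work-item per row: the global size supplies the row count. */
    size_t global = nrows;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           0, NULL, NULL);
    clEnqueueReadBuffer(queue, dy, CL_TRUE, 0, nrows * sizeof(float), y,
                        0, NULL, NULL);

    clReleaseMemObject(dA);
    clReleaseMemObject(dx);
    clReleaseMemObject(dy);
}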

C++ for OpenCL allows developers to leverage a rich variety of language features from standard C++ while preserving backward compatibility with OpenCL C. This opens a smooth transition path to C++ functionality for OpenCL kernel code developers, as they can continue using a familiar programming flow and even tools, as well as leverage existing extensions and libraries available for OpenCL C. The language semantics are described in the documentation published in the releases of the OpenCL-Docs[27] repository hosted by the Khronos Group, but it is currently not ratified by the Khronos Group.[31]

A work-in-progress draft of the latest C++ for OpenCL documentation can be found on the Khronos website.

Most C++ features are not available for kernel functions, e.g., overloading, templating, or arbitrary class layout in parameter types.

Due to the rich variety of C++ language features, applications written in C++ for OpenCL can express complex functionality more conveniently than applications written in OpenCL C; in particular, the generic programming paradigm of C++ is very attractive to library developers.[32]
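For instance (a hypothetical sketch; the helper square and the kernel squares are invented for this example), a templated helper function can be instantiated from a kernel, even though kernel functions themselves cannot be templates:

// C++ for OpenCL sketch: a templated helper (generic programming)
// called from a kernel. Kernel functions themselves cannot be templated.
template <typename T>
T square(T v) { return v * v; }

__kernel void squares(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = square(in[i]);   // instantiates square<float>
}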

New contributions to the language semantics definition or open-source tooling support are accepted from anyone interested, provided they align with the main design philosophy, and they are reviewed and approved by the experienced contributors.

On June 16, 2008, the Khronos Compute Working Group was formed[41] with representatives from CPU, GPU, embedded-processor, and software companies.

This group worked for five months to finish the technical details of the specification for OpenCL 1.0 by November 18, 2008.[42]

This technical specification was reviewed by the Khronos members and approved for public release on December 8, 2008.

According to an Apple press release:[44] "Snow Leopard further extends support for modern hardware with Open Computing Language (OpenCL), which lets any application tap into the vast gigaflops of GPU computing power previously available only to graphics applications."

AMD decided to support OpenCL instead of the now deprecated Close to Metal in its Stream framework.[45][46]

RapidMind announced their adoption of OpenCL underneath their development platform to support GPUs from multiple vendors with one interface.[47]

On December 9, 2008, Nvidia announced its intention to add full support for the OpenCL 1.0 specification to its GPU Computing Toolkit.

On November 18, 2013, the Khronos Group announced the ratification and public release of the finalized OpenCL 2.0 specification. Its most notable features include shared virtual memory, nested parallelism, a generic address space, C11-style atomics, and pipes.

As of 2016, OpenCL runs on graphics processing units (GPUs), CPUs with SIMD instructions, FPGAs, Movidius Myriad 2, Adapteva Epiphany and DSPs.

Existing implementations have been shown to be competitive when kernel code is properly tuned, though, and auto-tuning has been suggested as a solution to the performance portability problem,[195] yielding "acceptable levels of performance" in experimental linear algebra kernels.[198]

The fact that OpenCL allows workloads to be shared by CPU and GPU, executing the same programs, means that programmers can exploit both by dividing work among the devices.[199]

This leads to the problem of deciding how to partition the work, because the relative speeds of operations differ among the devices.

Machine learning has been suggested to solve this problem: Grewe and O'Boyle describe a system of support-vector machines trained on compile-time features of the program that can decide the device-partitioning problem statically, without actually running the programs to measure their performance.[200]

In a comparison of actual graphics cards of the AMD RDNA 2 and Nvidia RTX series, the results of OpenCL tests were inconclusive.

The International Workshop on OpenCL (IWOCL) held by the Khronos Group
clinfo, a command-line tool to see OpenCL information