In computing, a compute kernel is a routine compiled for high-throughput accelerators (such as graphics processing units (GPUs), digital signal processors (DSPs) or field-programmable gate arrays (FPGAs)), separate from but used by a main program (typically running on a central processing unit).[1][2]
Compute kernels roughly correspond to inner loops when implementing algorithms in traditional languages (except that there is no implied sequential ordering of operations), or to code passed to internal iterators.
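As a minimal sketch of this correspondence (in CUDA, chosen only for illustration; the function names `scale_add_serial` and `scale_add_kernel` are hypothetical), the body of a serial inner loop becomes the kernel, each iteration becomes one independent invocation, and the main program on the CPU launches the kernel on the accelerator:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Serial version: the "inner loop" computing y[i] = a*x[i] + y[i],
// with an implied sequential order of iterations.
void scale_add_serial(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Kernel version: the loop body becomes the kernel; each invocation
// handles one index, with no implied ordering between invocations.
__global__ void scale_add_kernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this invocation's index
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory, for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // The main program (on the CPU) launches the kernel on the accelerator
    // as a batch of independent invocations.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_add_kernel<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```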
This programming paradigm maps well to vector processors: there is an assumption that each invocation of a kernel within a batch is independent, allowing for data-parallel execution.
Individual invocations are given indices (in one or more dimensions) from which arbitrary addressing of buffer data may be performed (including scatter-gather operations), so long as the non-overlapping assumption is respected, as in the sketch below.
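A hedged sketch of this indexing model, again in CUDA: each invocation derives its own two-dimensional index, may gather input from an arbitrary location given by an index map, and writes only its own output element, so that no two invocations' writes overlap. The kernel name `gather_2d` and the index-map scheme are illustrative assumptions, not taken from the source.

```cuda
#include <cuda_runtime.h>

// Hypothetical 2-D gather kernel. Each invocation computes its own
// (x, y) index, reads ("gathers") from an arbitrary input location
// named by index_map (assumed to hold valid indices in [0, width*height)),
// and writes only its own output element, so writes never overlap.
__global__ void gather_2d(const float* in, const int* index_map,
                          float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // index, dimension 0
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // index, dimension 1
    if (x >= width || y >= height) return;

    int self = y * width + x;         // this invocation's own element
    out[self] = in[index_map[self]];  // arbitrary (gather) read, private write
}
```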
Intermediate representations such as SPIR-V, used by the Vulkan API to describe both graphical shaders and compute kernels in a language-independent and machine-independent manner, are intended to facilitate language evolution and provide a more natural ability to leverage GPU compute capabilities, in line with hardware developments such as Unified Memory Architecture and Heterogeneous System Architecture.