The roofline model is an intuitive visual performance model that estimates the performance of a given compute kernel or application running on multi-core, many-core, or accelerator processor architectures. It does so by showing the inherent hardware limitations, together with the potential benefit and priority of optimizations.
The most basic roofline model can be visualized by plotting floating-point performance as a function of machine peak performance, machine peak bandwidth, and arithmetic intensity.
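The work $W$, the number of operations performed by a kernel or application and usually counted in floating-point operations (FLOPs), is a property of the given kernel or application and thus depends only partially on the platform characteristics.
The memory traffic $Q$ denotes the number of bytes of memory transfers incurred during the execution of the kernel or application.
In contrast to $W$, $Q$ is heavily dependent on the properties of the chosen platform, such as the structure of the cache hierarchy.
The arithmetic intensity $I = W/Q$ is the number of operations per byte of memory traffic; when $W$ is expressed in FLOPs, $I$ is the ratio of floating-point operations to total data movement (FLOPs/byte).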
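As a minimal sketch of how arithmetic intensity follows from operation and byte counts, the Python snippet below works through an AXPY-style loop (`y[i] = a * x[i] + y[i]`). The kernel choice, the function names, and the traffic count are illustrative assumptions: the count supposes that each array element moves to or from memory exactly once, which a real cache hierarchy may not honor.

```python
def arithmetic_intensity(work_flops, traffic_bytes):
    """Return the ratio of operations to bytes moved (FLOPs/byte)."""
    if traffic_bytes <= 0:
        raise ValueError("memory traffic must be positive")
    return work_flops / traffic_bytes


def axpy_counts(n, bytes_per_element=8):
    """Idealized operation and byte counts for y[i] = a * x[i] + y[i].

    Assumption: one multiply and one add per element (2n FLOPs), and
    x read once, y read once, y written once, with no cache reuse.
    """
    work = 2 * n
    traffic = 3 * bytes_per_element * n
    return work, traffic


work, traffic = axpy_counts(1_000_000)
intensity = arithmetic_intensity(work, traffic)
print(f"work = {work} FLOPs, traffic = {traffic} bytes, "
      f"intensity = {intensity:.4f} FLOPs/byte")
```

Under these assumptions the intensity is 2/24 FLOPs/byte regardless of the array length, which illustrates why streaming kernels of this kind sit far to the left on a roofline plot.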
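The naïve roofline[3] is obtained by applying simple bound and bottleneck analysis.[1][3]
The resulting plot, in general with both axes in logarithmic scale, is then derived by the following formula:[1]

$$P = \min(\pi,\ \beta \times I)$$

where $P$ is the attainable performance, $\pi$ is the peak performance, $\beta$ is the peak bandwidth, and $I$ is the arithmetic intensity.
The point at which performance saturates at the peak performance level $\pi$, that is, where the diagonal and horizontal roofs meet, is defined as the ridge point.
A given kernel or application is characterized by its arithmetic intensity $I$ on the x-axis; its attainable performance $P$ is then computed by drawing a vertical line at that intensity until it hits the roofline curve.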
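The bound can be sketched in a few lines of Python, assuming its usual form: attainable performance is the minimum of the peak performance and the product of peak bandwidth and arithmetic intensity. The machine figures (100 GFLOP/s, 25 GB/s) and the function names are hypothetical values chosen for illustration, not measurements of any real processor.

```python
def attainable_performance(intensity, peak_flops, peak_bandwidth):
    """Naive roofline bound: min(peak performance, bandwidth * intensity)."""
    return min(peak_flops, peak_bandwidth * intensity)


def ridge_point(peak_flops, peak_bandwidth):
    """Intensity at which the diagonal and horizontal roofs meet."""
    return peak_flops / peak_bandwidth


# Hypothetical machine parameters (assumptions, not real hardware data).
PEAK_GFLOPS = 100.0  # peak performance in GFLOP/s
PEAK_GBPS = 25.0     # peak memory bandwidth in GB/s

ridge = ridge_point(PEAK_GFLOPS, PEAK_GBPS)
print(f"ridge point: {ridge} FLOPs/byte")

for intensity in (0.25, 1.0, 4.0, 16.0):
    perf = attainable_performance(intensity, PEAK_GFLOPS, PEAK_GBPS)
    # Left of the ridge point the bandwidth diagonal is the limit;
    # at or right of it the horizontal compute roof is.
    limit = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"intensity {intensity:>5} FLOPs/byte -> {perf:>6} GFLOP/s ({limit})")
```

Evaluating the function at a kernel's intensity corresponds to drawing the vertical line on the plot: the returned value is the height at which that line meets the roofline.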
The naïve roofline provides only an upper bound (the theoretical maximum) on performance.[1]
Although it can still give useful insights on the attainable performance, it does not provide a complete picture of what is actually limiting it.
Bandwidth ceilings are bandwidth diagonals placed beneath the idealized peak-bandwidth diagonal. They arise from the lack of some memory-related architectural optimization, such as cache coherence, or software optimization, such as poor exposure of concurrency (which in turn limits bandwidth usage).[3][4]
The in-core ceilings are roofline-like curves beneath the actual roofline that may be present due to the lack of some form of parallelism.
Performance cannot exceed an in-core ceiling until the missing parallelism is expressed and exploited.[3][4]
If the ideal assumption that arithmetic intensity is solely a function of the kernel is removed, and the cache topology (and therefore cache misses) is taken into account, the arithmetic intensity becomes dependent on a combination of kernel and architecture.
Unlike "proper" ceilings, the resulting lines on the roofline plot, known as locality walls, are vertical barriers through which arithmetic intensity cannot pass without optimization.[3][4]
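One way to picture how ceilings and vertical barriers interact is to treat each ceiling as a lowered compute or bandwidth limit, and each barrier as a cap on the arithmetic intensity a kernel actually reaches. The Python sketch below follows that reading. All numbers and the names of the optimization stages are hypothetical assumptions chosen so that each stage changes the limiting factor; they do not describe a real machine.

```python
def attainable(intensity, compute_limit, bandwidth_limit):
    """Roofline-style bound under a given pair of ceilings."""
    return min(compute_limit, bandwidth_limit * intensity)


def effective_intensity(ideal_intensity, wall=None):
    """A vertical barrier caps the intensity the kernel can reach."""
    return ideal_intensity if wall is None else min(ideal_intensity, wall)


# Hypothetical figures (assumptions for illustration only).
IDEAL_INTENSITY = 8.0   # FLOPs/byte if only compulsory traffic occurred
WALL = 2.0              # FLOPs/byte reached once extra cache misses count
IN_CORE_CEILING = 12.5  # GFLOP/s without exploiting in-core parallelism
PEAK_GFLOPS = 100.0     # GFLOP/s, the actual horizontal roof
BW_CEILING = 10.0       # GB/s without memory-related optimizations
PEAK_GBPS = 25.0        # GB/s, the actual bandwidth diagonal

# (label, barrier in effect, compute limit, bandwidth limit)
stages = [
    ("unoptimized", WALL, IN_CORE_CEILING, BW_CEILING),
    ("bandwidth ceiling lifted", WALL, IN_CORE_CEILING, PEAK_GBPS),
    ("in-core ceiling lifted", WALL, PEAK_GFLOPS, PEAK_GBPS),
    ("barrier removed", None, PEAK_GFLOPS, PEAK_GBPS),
]

results = {}
for label, wall, compute_limit, bandwidth_limit in stages:
    intensity = effective_intensity(IDEAL_INTENSITY, wall)
    results[label] = attainable(intensity, compute_limit, bandwidth_limit)
    print(f"{label:<26} I = {intensity:>4} -> {results[label]:>6} GFLOP/s")
```

In this made-up scenario, lifting the bandwidth ceiling alone changes nothing because the in-core ceiling is the binding limit, which is the kind of prioritization question the ceilings are meant to answer.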
Since its introduction,[3][4] the model has been extended to account for a broader set of metrics and hardware-related bottlenecks.
Extensions in the literature account for the NUMA organization of memory,[6] out-of-order execution,[9] and memory latencies,[9][10] and model the cache hierarchy at a finer grain,[5][9] in order to better understand what is actually limiting performance and to drive the optimization process.
The model has also been extended to better suit specific architectures and their characteristics, such as FPGAs.