[2] This policy is a departure from the historical requirement of implementing the entire instruction block.
AVX-512 is not the first 512-bit SIMD instruction set that Intel has introduced in processors: the earlier 512-bit SIMD instructions used in the first generation Xeon Phi coprocessors, derived from Intel's Larrabee project, are similar but not binary compatible and only partially source compatible.
F, CD, ER, PF: introduced with Xeon Phi x200 (Knights Landing) and Xeon Gold/Platinum (Skylake SP "Purley"), with the last two (ER and PF) being specific to Knights Landing.
[8] The VEX prefix used by AVX and AVX2, while flexible, did not leave enough room for the features Intel wanted to add to AVX-512.
For 32-bit elements (single-precision floats or doublewords), 16 mask bits are used, one for each of the 16 elements in a 512-bit register.
The opmask registers are the reason why several bitwise instructions, which naturally have no element width, gained one in AVX-512. For instance, bitwise AND, bitwise OR and 128-bit shuffle now exist in both doubleword and quadword variants, the only difference being the granularity of the final masking.
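A minimal sketch of that difference in C, using the standard AVX-512F intrinsics from <immintrin.h> (the helper function names are invented for this example); the 512-bit AND itself is identical in both variants, only the per-element masking differs:

    #include <immintrin.h>

    /* Doubleword variant (VPANDD): 16 mask bits, one per 32-bit element. */
    __m512i and_dwords(__m512i a, __m512i b, __mmask16 k) {
        return _mm512_maskz_and_epi32(k, a, b);
    }

    /* Quadword variant (VPANDQ): 8 mask bits, one per 64-bit element. */
    __m512i and_qwords(__m512i a, __m512i b, __mmask8 k) {
        return _mm512_maskz_and_epi64(k, a, b);
    }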
Together with the general compare-into-mask instructions below, these bitwise instructions may be used to implement generic ternary operations or cmov, similar to XOP's VPCMOV.
Unlike their XOP inspiration, however, they save the result to a mask register and initially only support doubleword and quadword comparisons.
These instructions perform either AND or NAND, and then set the destination opmask based on the result values being zero or non-zero.
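A minimal sketch of a per-element cmov built this way, assuming AVX-512F and the standard intrinsics (helper names invented): the comparison writes its result to an opmask register, which then steers a blend; the test instruction shows the AND-into-mask form just described.

    #include <immintrin.h>

    /* Element i of the result = (a[i] > 0) ? b[i] : c[i]. */
    __m512i cmov_gt_zero(__m512i a, __m512i b, __m512i c) {
        __mmask16 k = _mm512_cmpgt_epi32_mask(a, _mm512_setzero_si512());
        return _mm512_mask_blend_epi32(k, c, b);  /* picks b where k is set */
    }

    /* VPTESTMD: k[i] = ((a[i] & b[i]) != 0). */
    __mmask16 test_nonzero(__m512i a, __m512i b) {
        return _mm512_test_epi32_mask(a, b);
    }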
Expand operates in the opposite way, by loading as many values as indicated in the mask and then spreading them to the selected positions.
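A minimal sketch of the compress/expand pair in C, assuming AVX-512F (the function names and the "keep only positive elements" use case are invented for illustration):

    #include <immintrin.h>

    /* Compress: store only the mask-selected lanes, contiguously. */
    __mmask16 keep_positive(const float *in, float *out) {
        __m512 v = _mm512_loadu_ps(in);
        __mmask16 k = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_GT_OQ);
        _mm512_mask_compressstoreu_ps(out, k, v);
        return k;
    }

    /* Expand: load popcount(k) contiguous values and spread them back to
       the positions set in k; unselected lanes are zeroed (maskz form). */
    __m512 re_expand(const float *packed, __mmask16 k) {
        return _mm512_maskz_expandloadu_ps(k, packed);
    }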
[7] The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.
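A minimal sketch of the core test, assuming AVX-512CD (the helper name is invented): VPCONFLICTD compares each element against all preceding elements, so an all-zero result means the indices are pairwise distinct and, for example, a gathered histogram update needs no scalar fallback.

    #include <immintrin.h>

    /* Returns nonzero if the 16 doubleword indices are pairwise distinct. */
    int conflict_free(__m512i idx) {
        __m512i c = _mm512_conflict_epi32(idx);    /* VPCONFLICTD */
        return _mm512_test_epi32_mask(c, c) == 0;  /* every lane zero? */
    }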
A later AVX-VNNI extension adds VEX encodings of these instructions which can only operate on 128- or 256-bit vectors.
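For instance, VPDPBUSD fuses a u8-by-s8 multiply with a 32-bit accumulation; a minimal sketch assuming AVX512-VNNI follows (recent compilers expose the VEX-encoded AVX-VNNI form of the same operation under a separate intrinsic name such as _mm256_dpbusd_avx_epi32):

    #include <immintrin.h>

    /* VPDPBUSD: for each 32-bit lane, acc += sum of the four products of
       unsigned bytes of a with signed bytes of b, replacing the older
       VPMADDUBSW/VPMADDWD/VPADDD sequence. */
    __m512i dot_accumulate(__m512i acc, __m512i a, __m512i b) {
        return _mm512_dpbusd_epi32(acc, a, b);
    }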
Galois field new instructions are useful for cryptography,[12] as they can be used to implement Rijndael-style S-boxes such as those used in AES, Camellia, and SM4.
[12] GFNI is a standalone instruction set extension and can be enabled separately from AVX or AVX-512.
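As a small example of the affine instruction, assuming only GFNI at SSE width (the helper name is invented): GF2P8AFFINEQB multiplies each byte, viewed as a bit vector, by an 8x8 bit matrix, and the matrix constant below reverses the bit order of every byte.

    #include <immintrin.h>

    __m128i reverse_bits_per_byte(__m128i x) {
        /* 8x8 bit matrix whose rows are 1, 2, 4, ..., 128: bit reversal. */
        const __m128i m = _mm_set1_epi64x((long long)0x8040201008040201ULL);
        return _mm_gf2p8affine_epi64_epi8(x, m, 0);  /* GF2P8AFFINEQB */
    }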
(Availability of the VEX-encoded 128-bit version is indicated by different CPUID bits: PCLMULQDQ and AVX.)
The wider-than-128-bit variations of the instruction perform the same operation on each 128-bit portion of the input registers, but do not extend it to select quadwords from different 128-bit fields; the meaning of the imm8 operand is unchanged, selecting either the low or the high quadword within each 128-bit field.
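A minimal sketch assuming the VPCLMULQDQ extension (the helper name is invented): with imm8 = 0x00 the instruction multiplies the low quadwords within each 128-bit lane of both operands, producing four independent 128-bit carry-less products.

    #include <immintrin.h>

    /* imm8 keeps its PCLMULQDQ meaning per 128-bit lane: 0x00 selects the
       low quadword of each lane of both a and b; quadwords are never
       taken from a different lane. */
    __m512i clmul_low(__m512i a, __m512i b) {
        return _mm512_clmulepi64_epi128(a, b, 0x00);
    }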
[14] Note 1: Intel does not officially support the AVX-512 family of instructions on Alder Lake microprocessors.
In early 2022, Intel began disabling AVX-512 in silicon (fusing it off) in Alder Lake microprocessors to prevent customers from enabling it.
[36][37][24] Intel Vectorization Advisor (starting from version 2017) supports native AVX-512 performance and vector code quality analysis (for "Core", Xeon and Intel Xeon Phi processors).
Along with the traditional hotspots profile, Advisor Recommendations, and "seamless" integration of Intel Compiler vectorization diagnostics, the Advisor Survey analysis also provides AVX-512 ISA metrics and new AVX-512-specific "traits", e.g. Scatter, Compress/Expand, and mask utilization.
As a result, gcc and clang default to preferring 256-bit vectors for Intel targets; this preference can be overridden with the -mprefer-vector-width=512 option.
[40][41][42] C/C++ compilers also automatically handle loop unrolling and the avoidance of pipeline stalls in order to use AVX-512 most effectively, which means that a programmer who uses language intrinsics to force the use of AVX-512 can sometimes get worse performance than the code the compiler generates for loops written plainly in the source.
[43] In other cases, using AVX-512 intrinsics in C/C++ code can result in a performance improvement relative to plainly written C/C++.
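As an illustration of the first case, here is the kind of plainly written loop that compilers vectorize well; at -O3 with an AVX-512 target (e.g. -march=skylake-avx512), gcc and clang can turn this into AVX-512 code, subject to the 256-bit width preference noted above, often matching or beating a hand-written intrinsic version:

    /* A plainly written SAXPY loop; no intrinsics needed. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }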
[44] There are many examples of AVX-512 applications, including media processing, cryptography, video games,[45] neural networks,[46] and even OpenJDK, which employs AVX-512 for sorting.
[47] In a much-cited quote from 2020, Linus Torvalds said "I hope AVX-512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on,"[48] stating that he would prefer the transistor budget be spent on additional cores and integer performance instead, and that he "detests" floating point benchmarks.
[49] Numenta touts its "highly sparse"[50] neural network technology, which it says obviates the need for GPUs, as its algorithms run on CPUs with AVX-512.
[51] They claim a tenfold speedup relative to Nvidia's A100, largely because their algorithms reduce the size of the neural network while maintaining accuracy, using techniques such as the Sparse Evolutionary Training (SET) algorithm[52] and Foresight Pruning.