XOP instruction set

It was changed to be similar to, but not overlapping with, AVX: parts that overlapped with AVX were removed or moved to separate standards such as FMA4 (floating-point vector multiply–accumulate) and CVT16 (half-precision floating-point conversion, implemented as F16C by Intel).[1]

The XOP instructions use the opcode byte 8F (hexadecimal), but otherwise follow almost the same encoding scheme as AVX with the 3-byte VEX prefix.

Commentators[4] have seen this as evidence that Intel has not allowed AMD to use any part of the large VEX coding space.

The use of the 8F byte requires that the m-bits (see VEX coding scheme) have a value larger than or equal to 8 in order to avoid overlap with existing instructions.
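The constraint on the m-bits can be checked with a short sketch. It assumes the usual layout of the byte that follows 8F: the m-bits occupy the low five bits, and the same byte, read as a legacy ModRM byte, carries its reg field in bits 5..3. The existing instruction encoded with 8F is POP, which requires that reg field to be 0. The function name `classify_after_8f` and the label strings are hypothetical, chosen only for illustration.

```python
def classify_after_8f(second_byte: int) -> str:
    """Classify the byte that follows an 0x8F opcode byte (illustrative model only)."""
    if not 0 <= second_byte <= 0xFF:
        raise ValueError("second_byte must fit in one byte")
    m_bits = second_byte & 0x1F            # low five bits: the m-bits (map select)
    modrm_reg = (second_byte >> 3) & 0x7   # bits 5..3 when read as a ModRM byte
    if m_bits >= 8:
        return "xop"      # bit 3 or bit 4 is set, so the ModRM reg field cannot be 0
    if modrm_reg == 0:
        return "pop"      # legacy 8F /0 encoding
    return "other"        # neither case; label is an assumption of this sketch


# Every byte with m-bits >= 8 has a non-zero reg field, so it can never be read as POP.
no_overlap = all(((b >> 3) & 0x7) != 0 for b in range(256) if (b & 0x1F) >= 8)
```

Because any m-bits value of 8 or more sets bit 3 or bit 4 of the byte, the reg field is never 0 in that case, which is why the two encodings cannot collide.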

After AMD adopted FMA4, Intel canceled FMA4 support and reverted to FMA3 in the AVX/FMA specification version 5 (see FMA history).[1][5][6]

In March 2015, AMD explicitly revealed, in the description of a patch for the GNU Binutils package, that Zen, its third-generation x86-64 architecture, would not support in its first iteration (znver1: Zen, version 1) the TBM, FMA4, XOP and LWP instructions developed specifically for the "Bulldozer" family of micro-architectures.[2]

Multiply-accumulate operations, where a and b are the multiplicands, c is the addend and r is the result:

r0 = a0 * b0 + c0, r1 = a1 * b1 + c1, ..
r0 = a0 * b0 + c0, r1 = a2 * b2 + c1
r0 = a1 * b1 + c0, r1 = a3 * b3 + c1
r0 = a0 * b0 + a1 * b1 + c0, r1 = a2 * b2 + a3 * b3 + c1, ..

Horizontal addition instructions add adjacent values in the input vector to each other.

The output size of each instruction below indicates how wide the horizontal addition is.

r0 = a0+a1, r1 = a2+a3, r2 = a4+a5, ...
r0 = a0+a1+a2+a3, r1 = a4+a5+a6+a7, ...
r0 = a0+a1+a2+a3+a4+a5+a6+a7, ...
r0 = a0+a1, r1 = a2+a3, r2 = a4+a5, ...
r0 = a0+a1+a2+a3, r1 = a4+a5+a6+a7
r0 = a0+a1, r1 = a2+a3
r0 = a0-a1, r1 = a2-a3, r2 = a4-a5, ...
r0 = a0-a1, r1 = a2-a3, r2 = a4-a5, ...
r0 = a0-a1, r1 = a2-a3

The vector compare instructions in this set all take an immediate as an extra argument.
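To make the element notation used for the multiply-accumulate and horizontal operations concrete, the following is a minimal scalar sketch in Python. It models only the index patterns (which a, b and c elements feed each r element). It deliberately ignores element widths, signedness, sign extension and saturation, and all function names are hypothetical rather than instruction mnemonics.

```python
def multiply_accumulate(a, b, c):
    # r0 = a0*b0 + c0, r1 = a1*b1 + c1, ..
    return [x * y + z for x, y, z in zip(a, b, c)]


def multiply_accumulate_strided(a, b, c, offset):
    # offset 0: r0 = a0*b0 + c0, r1 = a2*b2 + c1
    # offset 1: r0 = a1*b1 + c0, r1 = a3*b3 + c1
    return [a[2 * i + offset] * b[2 * i + offset] + c[i] for i in range(len(c))]


def multiply_add_pairs_accumulate(a, b, c):
    # r0 = a0*b0 + a1*b1 + c0, r1 = a2*b2 + a3*b3 + c1, ..
    return [a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1] + c[i]
            for i in range(len(c))]


def horizontal_add(a, group):
    # group=2: r0 = a0+a1, r1 = a2+a3, ...
    # group=4: r0 = a0+a1+a2+a3, ...
    # group=8: r0 = a0+...+a7, ...
    return [sum(a[i:i + group]) for i in range(0, len(a), group)]


def horizontal_sub(a):
    # r0 = a0-a1, r1 = a2-a3, ...
    return [a[i] - a[i + 1] for i in range(0, len(a), 2)]


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
low = multiply_accumulate_strided(a, b, [10, 20], offset=0)    # uses a0*b0 and a2*b2
high = multiply_accumulate_strided(a, b, [10, 20], offset=1)   # uses a1*b1 and a3*b3
quads = horizontal_add([1, 2, 3, 4, 5, 6, 7, 8], group=4)      # sums groups of four
```

In this model the `group` argument plays the role of the ratio between output and input element width: a wider output accumulates more adjacent input elements.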