The load in this loop executes with less overhead than an atomic operation (or, respectively, a load-exclusive instruction).
For example, on a system using the MESI cache coherency protocol, the load moves the cache line into the Shared state, whereas a test-and-set instruction or a load-exclusive instruction moves it into the Exclusive state.
This is particularly advantageous when multiple processors contend for the same lock. An atomic instruction or load-exclusive instruction requires a coherency-protocol transaction to give the issuing processor exclusive access to the cache line, which causes the line to ping-pong between the contending processors. Ordinary loads on a line in the Shared state require no protocol transactions at all, so processors spinning in the inner loop operate purely locally.
Cache-coherency protocol transactions are needed only in the outer loop, after the initial check has established that the lock attempt has a reasonable likelihood of success.
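The inner and outer loops described above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `TtasLock` and `run_counter_demo` are hypothetical, and the sketch assumes that `std::atomic<bool>::load` compiles to an ordinary load and `exchange` to an atomic read-modify-write, as is typical on mainstream hardware.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical test-and-test-and-set spinlock; all names are illustrative.
class TtasLock {
public:
    void lock() {
        for (;;) {
            // Inner loop: plain loads only, so the cache line can remain
            // in the Shared state while another thread holds the lock.
            while (locked_.load(std::memory_order_relaxed)) {
                // spin locally
            }
            // Outer loop: attempt the atomic exchange only after the lock
            // was observed free. It returns the previous value, so false
            // means this thread acquired the lock.
            if (!locked_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical demo: several threads increment a shared counter under the lock.
long run_counter_demo(int num_threads, int iterations) {
    TtasLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

If the lock provides mutual exclusion, `run_counter_demo(4, 10000)` returns 40000, since no increment is lost.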
When applications use operating-system-provided lock objects, such as mutexes, the OS can schedule exactly the threads that are not blocked.
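For comparison, the same kind of critical section can be protected with a library mutex instead of a spinlock. The sketch below assumes that `std::mutex` is backed by an operating-system lock object, as it is on common implementations, so that a thread waiting on it can be descheduled rather than left spinning. The function name `run_mutex_counter_demo` is hypothetical.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical demo: threads increment a shared counter under a std::mutex.
long run_mutex_counter_demo(int num_threads, int iterations) {
    std::mutex m;
    long counter = 0;  // protected by `m`
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                // A thread that cannot acquire `m` may block here, which
                // lets the scheduler run other threads in the meantime.
                std::lock_guard<std::mutex> guard(m);
                ++counter;
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```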