Just to clarify: the spinning on CMPXCHG is not also a bad idea, the YieldProcessor is correct (a pause), but inside the CMPXCHG loop it should be spinning on a pure unlocked load. Is that correct?
No you should spin on a read. Once you see the value you want you then try the CMPXCHG. If that succeeds you exit. If it fails you go back to spinning on the read.
Read and load mean the same thing. (I think GP just missed the end of your comment.)
You care about exchange vs read/load because of cache line ownership. Every time you try to do the exchange, the attempting CPU must take exclusive ownership of the cacheline (stealing it from the lock owner). To unlock, the lock owner must take it back.
If the attempting CPU instead only reads, the line ownership stays with the lock holder and unlock is cheaper. In general you want cache line ownership to change hands as few times as possible.
For what it is worth it seems the library in question does both, uses an exponential retry loop, busy read looping 2^i times for the first 7 attempts before then yielding. It seems like there must be some threshold where latency is improved by retrying before yielding, but I don’t test these things for a living.
I've tried to figure out where that goes wrong. Read the bug report linked. I've caught this situation in gdb with 20 threads in spinlocks inside realloc, performance in a game down to 1 frame every 2 seconds, and very little getting done.
But I don't understand how the three levels of locking involved cause that. Nor do the Wine people.
Works fine on Microsoft Windows. Only fails on Wine.
It seems like the loop around InterlockedCompareExchange is a bad idea since this is a bus lock in a tight loop. Rather the inner spinning loop that is yielding should just be reading the value surrounded by the cmpxchg. As for whether sched_yield should just be called in the inner loop or a short nop/pause loop should be attempted for microcontention reasons, the expert opinion here is don't bother with the nop loop. However while the nop loop might not be real world optimal I doubt that would be causing a catastrophic performance issue.
My data says that always yielding is better than ever busy spinning, except on microbenchmarks, where depending on the microbenchmarks you can get any answer your heart desires.