Pikus has a number of C++ parallelism performance talks that discuss this issue.
In particular, the "cost of synchronization" is roughly the price of an L3 read/write, or ~50 cycles. In contrast, a read/write to L1 cache has around 4 cycles of latency and can be done multiple times per clock tick, and you can go even faster (i.e., register space).
So an unsynchronized "counter++" will be done ~4-billion times a second.
But a "counter++; sync()" will slow you down 50x slower. That is to say, the "sync()" is the "expensive part" to think about.
------------
A lock is two sync() operations: one sync() when you lock, and a second sync() when you unlock. Really, each is a half-sync(), since one is an acquire half-sync and the other is a release half-sync.
If your lock-free data structure uses more than two sync() statements, you're _probably_ slower than a dumb lock-based method (!!!). This is because the vast majority of lock()/unlock() calls are uncontended in practice.
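For concreteness, a minimal (illustrative, not production-grade) spinlock shows that shape: one acquire RMW on lock() and one release store on unlock():

    #include <atomic>

    // Illustrative spinlock: lock() is a single acquire half-sync (an atomic RMW),
    // unlock() is a single release half-sync (an atomic store).
    class SpinLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            while (flag.test_and_set(std::memory_order_acquire)) {
                // spin; a real lock would back off or park the thread
            }
        }
        void unlock() {
            flag.clear(std::memory_order_release);
        }
    };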
So the TL;DR is, use locks because they're so much easier to think about. When you need more speed, measure first, because it turns out that locks are really damn fast in practice and kind of hard to beat.
That being said, there are other reasons to go lock-free. On some occasions, you will be forced to use lock-free code because blocking cannot be tolerated. (Ex: lock-free interrupts. It's fine if the interrupt takes a bit longer than expected during contention.) So in this case, even if lock-free is slower, the guarantee of forward progress is what you're going for, rather than performance.
So the study of lock-free data structures is still really useful. But don't reach for them purely for performance reasons, because these data structures may very well be slower than the std library + lock() in many cases.
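As an example of that forward-progress use case, here is a hypothetical single-producer/single-consumer ring buffer sketch where the producer could be an interrupt handler; neither side ever blocks, it just fails and the caller decides what to do (the name and sizing are made up for illustration):

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>

    // SPSC ring buffer: one producer thread (e.g. an interrupt handler),
    // one consumer thread, no locks, no blocking on either side.
    template <typename T, std::size_t N>
    class SpscRing {
        std::array<T, N> buf_{};
        std::atomic<std::size_t> head_{0};  // written only by the consumer
        std::atomic<std::size_t> tail_{0};  // written only by the producer
    public:
        bool push(const T& v) {             // producer / "interrupt" side
            auto t = tail_.load(std::memory_order_relaxed);
            auto next = (t + 1) % N;
            if (next == head_.load(std::memory_order_acquire))
                return false;               // full: drop rather than block
            buf_[t] = v;
            tail_.store(next, std::memory_order_release);
            return true;
        }
        std::optional<T> pop() {            // consumer side
            auto h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire))
                return std::nullopt;        // empty
            T v = buf_[h];
            head_.store((h + 1) % N, std::memory_order_release);
            return v;
        }
    };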
A sync, assuming it is your typical memory barrier, is not bound by the L3 latency. You pay (to a first approximation) the L3 cost when you touch a contended cache line, whether you are doing a plain write or a full atomic CAS.
Separately, fences and atomic RMWs are slower than plain reads/writes, but that's because of the (partially) serialising effects they have on the CPU pipeline, and has very little to do with L3 (or any memory) latency.
Case in point: a CAS on Intel is 20-ish cycles, while the L3 latency is 30-40 cycles or more. On the other hand, you can have multiple L3 misses outstanding, but CAS hardly pipelines.
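If you want to see the per-operation cost yourself, a rough single-threaded micro-benchmark sketch (numbers will vary by CPU and compiler; this is not a rigorous benchmark) compares a plain increment against a CAS loop on a cache-hot, uncontended line:

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    // Times a callable over many iterations and reports nanoseconds per call.
    template <typename F>
    static double ns_per_op(F&& f, std::uint64_t iters) {
        auto t0 = std::chrono::steady_clock::now();
        for (std::uint64_t i = 0; i < iters; ++i) f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
    }

    int main() {
        constexpr std::uint64_t iters = 100'000'000;
        volatile std::uint64_t plain = 0;        // volatile so the loop isn't optimized away
        std::atomic<std::uint64_t> atom{0};

        std::printf("plain ++ : %.2f ns/op\n",
                    ns_per_op([&] { plain = plain + 1; }, iters));
        std::printf("CAS loop : %.2f ns/op\n",
                    ns_per_op([&] {
                        auto cur = atom.load(std::memory_order_relaxed);
                        while (!atom.compare_exchange_weak(cur, cur + 1,
                                                           std::memory_order_seq_cst)) {}
                    }, iters));
    }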
So here's what I know. L1 / L2 caches are "inside the core". To talk to other cores, your data must leave L1/L2 cache and talk to a network. It is on _THIS_ network that the L3 cache exists.
Its not really "L3 cache", its just the memory-network that implements the MOESI protocol (or whatever proprietary variant: MOESIF or whatever) that sits between L2 cache, L3 cache, and core-to-core communications. So I don't want to make it sound like "You're waiting on L3 cache", because you're right. Its not really related to L3 cache.
So I don't really know how that network works (I know the high-level details of MOESI, but I also know that's the textbook explanation and the real world is way more complex), aside from "It'd be roughly on the same order-of-magnitude cost" as the L3 cache.
-------
What's really going on is that Core#1 is talking to all the other cores and basically performing "git pull" and "git push" on the data asynchronously. Core#1 prefers to perform "git commit" and "git checkout" (aka: talk to L1/L2 cache), because that's faster.
But when Core#1 does a full sync (git push/git pull), it is required to talk on the network that leads to other cores and/or L3 cache.
Those "git push / git pull" messages are MOESI (F and others), meaning Modified/Owned/Exclusive/Shared/Invalid (and Forwarding, and other proprietary states). Which line up to version-control way better than most people expect.
Indeed, L3 being shared and also often acting as the MOESI directory works out to the inter-thread latency being the same order of magnitude as the L3 latency.
My point is that sync has nothing to do with caches. Caches are coherent all the time and do not need barriers. In particular, I don't think the git pull/push analogy maps well to MOESI, as it is an optimistic protocol and only requires transferring, opportunistically and on demand, what is actually needed by a remote core, not conservatively everything that has changed as in git.
The explicit sync model is more representative of non-coherent caches, which are not really common as they are hard to use.
Memory barriers are, for typical CPUs, purely an internal matter of the core; they synchronize internal queues with L1. In a simple x86 model, where the only visible source of reordering is StoreLoad, a memory barrier simply stalls the pipeline until all preceding stores in program order are flushed out of the write buffer into L1.
In practice, things are more complex these days and a fence doesn't fully stall the pipeline; it potentially only needs to visibly prevent loads from completing.
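A minimal sketch of that, assuming C++11 atomics: the classic store-buffer (StoreLoad) litmus test, where without the seq_cst fences both loads can return 0 on x86, because each store is still sitting in its core's write buffer while the subsequent load executes:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread_a() {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // drain write buffer to L1
        r1 = y.load(std::memory_order_relaxed);
    }

    void thread_b() {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(thread_a), b(thread_b);
        a.join(); b.join();
        std::printf("r1=%d r2=%d\n", r1, r2);  // with the fences, r1 and r2 can't both be 0
    }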
Other, more relaxed CPUs also need to synchronise load queues, but still, for the most part, fences are a core-local matter.
Some architectures do have remote fences (even x86 is apparently getting some in the near future), but these are more exotic and, AFAIK, do not usually map to default C++11 atomics.
> The explicit sync model is more representative of non coherent caches, which are not really common as they are hard to use.
Not that I'm a professional GPU programmer. But I'm pretty certain that GPU caches are non-coherent.
But yeah, cache coherence is just assumed on modern CPUs. Your clarification on store queues and load queues is helpful (even if the caches are coherent, the store queue and load queue can still introduce an invalid reordering). So it sounds like your point is that the various sync() instructions are more about these queues?
> Not that I'm a professional GPU programmer. But I'm pretty certain that GPU caches are non-coherent.
Yes, I was specifically referring to general-purpose CPUs; I'm quite unfamiliar with GPUs, but I don't think anybody has ever accused them of being easy to program. Also, I understand that GPUs (and CPU-GPU links) are an area where remote atomics already exist.
> So it sounds like your point is that the various sync() instructions are more about these queues?
For the most part, yes. Specifically, fences enforce ordering on any operation that can execute out of order (even on in-order CPUs memory ops can be reordered), but only up to the coherence layer (i.e. L1). Ordering from the coherence layer on is enforced by the coherence protocol. You could of course have a model where fences are needed for global coherence, but it would be too slow (having to flush the whole cache), too hard to use (as you would need to specify which lines need to be sync'd), or both.
You could see something like the store buffer as a non-coherent cache (as reads can be fulfilled from it), with fences restoring the coherence, but I don't think it is a terribly useful model.
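As a small illustration of fences only ordering things up to L1 (with the coherence protocol handling everything beyond that), here is a message-passing sketch using standalone C++11 fences around otherwise relaxed operations; the variable names are mine:

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag published after the data

    void producer() {
        payload = 42;
        std::atomic_thread_fence(std::memory_order_release);  // order payload before flag
        ready.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
        std::atomic_thread_fence(std::memory_order_acquire);  // order flag before payload
        assert(payload == 42);  // guaranteed to observe the producer's write
    }

    int main() {
        std::thread c(consumer), p(producer);
        p.join(); c.join();
    }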