Here's one I had: I was trying to build a Bloom filter in parallel. Each thread had large-ish batches of hashes it wanted to insert into the filter. Naively, you'd just have each thread iterate through its batches and do a __sync_fetch_and_or for each hash (this was a register-blocked Bloom filter, so each insert was a single 8-byte OR).
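For concreteness, the naive strategy looks something like this; a minimal sketch with illustrative names, not the original code, assuming a power-of-two block count and a toy two-bit mask:

    #include <cstdint>
    #include <vector>

    // Naive concurrent insert: one atomic OR per hash. Every thread writes
    // to the shared array directly, so threads can contend on the same word.
    struct BlockedBloomFilter {
        std::vector<uint64_t> blocks;  // size must be a power of two

        explicit BlockedBloomFilter(size_t num_blocks) : blocks(num_blocks, 0) {}

        // Derive the per-block bit mask from the hash (two bits for brevity).
        static uint64_t mask_for(uint64_t h) {
            return (1ull << (h & 63)) | (1ull << ((h >> 6) & 63));
        }

        void insert(uint64_t h) {
            uint64_t* block = &blocks[(h >> 12) & (blocks.size() - 1)];
            __sync_fetch_and_or(block, mask_for(h));  // GCC/Clang builtin
        }
    };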
What ended up being MUCH faster was to partition the filter and have a lock per partition. Each thread would attempt to grab a random lock, perform its inserts into that partition, then release the lock and try to grab another random lock it hadn't taken yet. Granted, these locks were just atomic booleans, not std::mutex or anything like that. But I think this illustrates that partitioning + locking can be better for throughput. If you want predictable latency for single inserts, I'd imagine the __sync_fetch_and_or strategy would work better, which maybe brings up a broader point: this whole discussion depends a lot on exactly what "faster" means to you.
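The scheme was roughly this; again a hedged sketch with made-up names and layout, assuming hashes arrive pre-bucketed by partition and each thread owns its own RNG:

    #include <algorithm>
    #include <atomic>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <vector>

    class PartitionedBloomFilter {
    public:
        PartitionedBloomFilter(size_t partitions, size_t blocks_per_partition)
            : blocks_(partitions * blocks_per_partition, 0),
              locks_(partitions),
              blocks_per_partition_(blocks_per_partition) {}  // power of two

        // One thread's batch insert. buckets[p] holds the hashes destined
        // for partition p; rng is per-thread, so no shared state here.
        void insert_batches(const std::vector<std::vector<uint64_t>>& buckets,
                            std::mt19937& rng) {
            // Visit partitions in a random order so two threads rarely want
            // the same lock at the same time.
            std::vector<size_t> pending(locks_.size());
            std::iota(pending.begin(), pending.end(), 0);
            std::shuffle(pending.begin(), pending.end(), rng);

            while (!pending.empty()) {
                for (size_t i = 0; i < pending.size();) {
                    size_t p = pending[i];
                    // Try-lock: if another thread holds this partition,
                    // skip it and come back later.
                    if (locks_[p].flag.exchange(true, std::memory_order_acquire)) {
                        ++i;
                        continue;
                    }
                    for (uint64_t h : buckets[p]) {
                        // Plain OR is safe: the lock gives us exclusivity.
                        blocks_[p * blocks_per_partition_ +
                                ((h >> 12) & (blocks_per_partition_ - 1))]
                            |= mask_for(h);
                    }
                    locks_[p].flag.store(false, std::memory_order_release);
                    pending.erase(pending.begin() + i);
                }
            }
        }

    private:
        struct PaddedLock {  // pad to a cache line to avoid false sharing
            alignas(64) std::atomic<bool> flag{false};
        };
        static uint64_t mask_for(uint64_t h) {
            return (1ull << (h & 63)) | (1ull << ((h >> 6) & 63));
        }

        std::vector<uint64_t> blocks_;
        std::vector<PaddedLock> locks_;
        size_t blocks_per_partition_;
    };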
This seems to me like a parallel accumulation problem: why not have each thread accumulate a filter over a subset of the data (so no locking is involved), and then reduce the results (which is just an OR of all the local accumulations)?
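In sketch form (my names, not anyone's actual code), assuming the data is pre-sharded per thread and the filter size is a power of two:

    #include <cstdint>
    #include <thread>
    #include <vector>

    // Accumulate-then-reduce: each thread fills a private filter with no
    // synchronization, then the local filters are ORed into one result.
    std::vector<uint64_t> build_filter(
            const std::vector<std::vector<uint64_t>>& shards,
            size_t filter_words) {
        std::vector<std::vector<uint64_t>> locals(
            shards.size(), std::vector<uint64_t>(filter_words, 0));

        std::vector<std::thread> threads;
        for (size_t t = 0; t < shards.size(); ++t) {
            threads.emplace_back([&, t] {
                for (uint64_t h : shards[t]) {
                    // Same single-word insert as before, into a private copy.
                    locals[t][(h >> 12) & (filter_words - 1)] |=
                        (1ull << (h & 63)) | (1ull << ((h >> 6) & 63));
                }
            });
        }
        for (auto& th : threads) th.join();

        // Reduce: OR all local filters together (single-threaded here).
        std::vector<uint64_t> result(filter_words, 0);
        for (const auto& local : locals)
            for (size_t w = 0; w < filter_words; ++w) result[w] |= local[w];
        return result;
    }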
Parallel reductions are heavier-weight synchronization than these locks. Say we have 64 per-thread filters: we'd either need six levels of tree reduction (log2 of 64), each with its own synchronization point, or we'd give up parallelism entirely and perform the whole reduction on a single thread. Either way it was slower.
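To make the tree-reduction cost concrete, an illustrative sketch (not the original code): each level ORs pairs of filters in parallel and then joins, so 64 locals cost six join barriers:

    #include <cstdint>
    #include <thread>
    #include <vector>

    // Pairwise tree reduction over per-thread filters. Each level halves
    // the number of live filters; locals[0] ends up holding the full OR.
    void tree_reduce(std::vector<std::vector<uint64_t>>& locals) {
        for (size_t stride = 1; stride < locals.size(); stride *= 2) {
            std::vector<std::thread> level;
            for (size_t i = 0; i + stride < locals.size(); i += 2 * stride) {
                level.emplace_back([&, i, stride] {
                    auto& dst = locals[i];
                    auto& src = locals[i + stride];
                    for (size_t w = 0; w < dst.size(); ++w) dst[w] |= src[w];
                });
            }
            for (auto& t : level) t.join();  // one barrier per level
        }
    }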
The locking strategy very rarely lost any parallelism: thanks to the randomized lock-taking, two threads almost never wanted the same partition at the same time.
There were also other reasons, such as not wanting to replicate the filter per thread.