Here's one I had: I was trying to build a Bloom filter in parallel. Each thread had large-ish batches of hashes it wanted to insert into the filter. Naively, you'd just have each thread iterate through its batches and do a __sync_fetch_and_or for each hash (this was a register-blocked Bloom filter, so each insert was a single 8-byte OR).
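For concreteness, the naive strategy looks something like this; a minimal sketch with illustrative names, not the original code, assuming a power-of-two block count and a toy two-bit mask:

    #include <cstdint>
    #include <vector>

    // Naive concurrent insert: one atomic OR per hash. Every thread writes
    // to the shared array directly, so threads can contend on the same word.
    struct BlockedBloomFilter {
        std::vector<uint64_t> blocks;  // size must be a power of two

        explicit BlockedBloomFilter(size_t num_blocks) : blocks(num_blocks, 0) {}

        // Derive the per-block bit mask from the hash (two bits for brevity).
        static uint64_t mask_for(uint64_t h) {
            return (1ull << (h & 63)) | (1ull << ((h >> 6) & 63));
        }

        void insert(uint64_t h) {
            uint64_t* block = &blocks[(h >> 12) & (blocks.size() - 1)];
            __sync_fetch_and_or(block, mask_for(h));  // GCC/Clang builtin
        }
    };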
What ended up being MUCH faster was to partition the filter and have a lock per partition. Each thread would attempt to grab a random lock, perform its inserts into that partition, then release the lock and try to grab another random lock it hadn't taken yet. Granted, these locks were just atomic booleans, not std::mutex or anything like that. But I think this illustrates that partitioning + locking can be better for throughput. If you want predictable latency for single inserts, I'd imagine the __sync_fetch_and_or strategy would work better, which maybe brings up a broader point: this whole discussion depends a lot on exactly what "faster" means to you.
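The scheme was roughly this; again a hedged sketch with made-up names and layout, assuming hashes arrive pre-bucketed by partition and each thread owns its own RNG:

    #include <algorithm>
    #include <atomic>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <vector>

    class PartitionedBloomFilter {
    public:
        PartitionedBloomFilter(size_t partitions, size_t blocks_per_partition)
            : blocks_(partitions * blocks_per_partition, 0),
              locks_(partitions),
              blocks_per_partition_(blocks_per_partition) {}  // power of two

        // One thread's batch insert. buckets[p] holds the hashes destined
        // for partition p; rng is per-thread, so no shared state here.
        void insert_batches(const std::vector<std::vector<uint64_t>>& buckets,
                            std::mt19937& rng) {
            // Visit partitions in a random order so two threads rarely want
            // the same lock at the same time.
            std::vector<size_t> pending(locks_.size());
            std::iota(pending.begin(), pending.end(), 0);
            std::shuffle(pending.begin(), pending.end(), rng);

            while (!pending.empty()) {
                for (size_t i = 0; i < pending.size();) {
                    size_t p = pending[i];
                    // Try-lock: if another thread holds this partition,
                    // skip it and come back later.
                    if (locks_[p].flag.exchange(true, std::memory_order_acquire)) {
                        ++i;
                        continue;
                    }
                    for (uint64_t h : buckets[p]) {
                        // Plain OR is safe: the lock gives us exclusivity.
                        blocks_[p * blocks_per_partition_ +
                                ((h >> 12) & (blocks_per_partition_ - 1))]
                            |= mask_for(h);
                    }
                    locks_[p].flag.store(false, std::memory_order_release);
                    pending.erase(pending.begin() + i);
                }
            }
        }

    private:
        struct PaddedLock {  // pad to a cache line to avoid false sharing
            alignas(64) std::atomic<bool> flag{false};
        };
        static uint64_t mask_for(uint64_t h) {
            return (1ull << (h & 63)) | (1ull << ((h >> 6) & 63));
        }

        std::vector<uint64_t> blocks_;
        std::vector<PaddedLock> locks_;
        size_t blocks_per_partition_;
    };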
This seems to me like a parallel accumulation problem: why not have each thread accumulate a filter over a subset of the data (so no locking is involved), and then reduce the results (which is just an OR of all the local accumulations)?
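In sketch form (my names, not anyone's actual code), assuming the data is pre-sharded per thread and the filter size is a power of two:

    #include <cstdint>
    #include <thread>
    #include <vector>

    // Accumulate-then-reduce: each thread fills a private filter with no
    // synchronization, then the local filters are ORed into one result.
    std::vector<uint64_t> build_filter(
            const std::vector<std::vector<uint64_t>>& shards,
            size_t filter_words) {
        std::vector<std::vector<uint64_t>> locals(
            shards.size(), std::vector<uint64_t>(filter_words, 0));

        std::vector<std::thread> threads;
        for (size_t t = 0; t < shards.size(); ++t) {
            threads.emplace_back([&, t] {
                for (uint64_t h : shards[t]) {
                    // Same single-word insert as before, into a private copy.
                    locals[t][(h >> 12) & (filter_words - 1)] |=
                        (1ull << (h & 63)) | (1ull << ((h >> 6) & 63));
                }
            });
        }
        for (auto& th : threads) th.join();

        // Reduce: OR all local filters together (single-threaded here).
        std::vector<uint64_t> result(filter_words, 0);
        for (const auto& local : locals)
            for (size_t w = 0; w < filter_words; ++w) result[w] |= local[w];
        return result;
    }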
Parallel reductions are heavier-weight synchronization than these locks. Say we have 64 per-thread filters: we'd either need six levels of tree reduction (log2 of 64), each with its own synchronization point, or we'd give up parallelism entirely and perform the whole reduction on a single thread. Either way it was slower.
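To make the tree-reduction cost concrete, an illustrative sketch (not the original code): each level ORs pairs of filters in parallel and then joins, so 64 locals cost six join barriers:

    #include <cstdint>
    #include <thread>
    #include <vector>

    // Pairwise tree reduction over per-thread filters. Each level halves
    // the number of live filters; locals[0] ends up holding the full OR.
    void tree_reduce(std::vector<std::vector<uint64_t>>& locals) {
        for (size_t stride = 1; stride < locals.size(); stride *= 2) {
            std::vector<std::thread> level;
            for (size_t i = 0; i + stride < locals.size(); i += 2 * stride) {
                level.emplace_back([&, i, stride] {
                    auto& dst = locals[i];
                    auto& src = locals[i + stride];
                    for (size_t w = 0; w < dst.size(); ++w) dst[w] |= src[w];
                });
            }
            for (auto& t : level) t.join();  // one barrier per level
        }
    }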
The locking strategy very rarely lost any parallelism: thanks to the randomized lock-taking, two threads almost never wanted the same partition at the same time.
There were also other reasons, such as not wanting to replicate the filter per thread.