
I have an Actor framework which uses a vanilla std::deque for method pointers, and to add messages to the queue the locking technique is a Benaphore (the original Futex idea: an atomic counter in front of a locking primitive), with the twist that my locking primitive is a combo of spinlock and mutex chosen by retry count. Nothing special. Benchmarks show that the message push function very rarely blocks, and the chance of an OS context switch is also very low, so even though a locked thread occasionally gets swapped out, it is so infrequent that it doesn't justify the cost of a lock-free algorithm.

In a nutshell, non-lock-free queues are much faster than lock-free queues; however, you must be prepared to accept the very infrequent longer delay caused by a context switch (during which the lock is not available to anyone). My benchmarks show that the price is acceptable, since so much more performance is available with non-lock-free queues: 10 million messages per second per worker thread can be queued on modern hardware.
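For readers who haven't seen one, here is a minimal sketch of the basic Benaphore shape described above, assuming C++20. All names are illustrative and this is not the poster's code; in particular, the spin-then-block twist based on retry count is omitted.

    // Minimal Benaphore sketch: an atomic counter guards a kernel primitive
    // so the uncontended path never enters the kernel.
    #include <atomic>
    #include <semaphore>

    class Benaphore {
        std::atomic<int> count{0};          // how many threads are inside lock()
        std::counting_semaphore<> sem{0};   // slow path, only touched under contention

    public:
        void lock() {
            // Fast path: the first thread in pays one atomic add and no syscall.
            if (count.fetch_add(1, std::memory_order_acquire) > 0)
                sem.acquire();              // contended: sleep until handed the lock
        }

        void unlock() {
            // If another thread incremented the counter while we held the lock,
            // hand ownership to exactly one waiter via the semaphore.
            if (count.fetch_sub(1, std::memory_order_release) > 1)
                sem.release();
        }
    };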



> the original Futex

The "Benaphore" was a pretty old idea, but the Futex isn't just "Oh it's a Benaphore but Linux" the essential trick is that you don't actually need a kernel object - your "locking primitive" at all, and that idea is where this goes from "Yeah, everybody knows that" to OK, our OS should add this feature ASAP.

Instead of an OS synchronisation object which is used to handle conflicts, with the futex design the OS carries a list of address -> thread mappings. If a thread T is asleep on a futex at address X, the address X goes in the list pointing to thread T. When the OS is asked to wake the X futex, it walks the list and wakes T.
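To make that concrete, here is a rough sketch of a futex-backed lock on Linux, loosely following the well-known design from Ulrich Drepper's "Futexes Are Tricky". It is illustrative rather than anything from this thread; the point is that the only persistent state is a 32-bit word in user memory, and the kernel is entered only when there is contention.

    #include <atomic>
    #include <cstdint>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
    class FutexLock {
        std::atomic<uint32_t> state{0};

        static long futex(std::atomic<uint32_t>* addr, int op, uint32_t val) {
            return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
        }

    public:
        void lock() {
            uint32_t c = 0;
            if (state.compare_exchange_strong(c, 1, std::memory_order_acquire))
                return;                                // fast path: no syscall
            if (c != 2)
                c = state.exchange(2, std::memory_order_acquire);
            while (c != 0) {
                // Sleep only while the word is still 2; the kernel keys the
                // wait queue purely on the address of `state`.
                futex(&state, FUTEX_WAIT_PRIVATE, 2);
                c = state.exchange(2, std::memory_order_acquire);
            }
        }

        void unlock() {
            // Enter the kernel only if someone may be asleep on this address.
            if (state.exchange(0, std::memory_order_release) == 2)
                futex(&state, FUTEX_WAKE_PRIVATE, 1);
        }
    };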

The giveaway is the limits. With something like Benaphores you're constrained: each one still needs a kernel semaphore, a costly OS-wide resource, and I think BeOS only allowed 65536 per machine or something. But a futex is just memory, so there's no reason to have any limit at all.


With a post like this, however interesting, you really should link the code. Not only will it prove your claim, but others like me will find the idea intriguing and immediately want to see how it's done.



It really depends on the specifics. If there's a lot of contention then performance will drop off a cliff. Even atomic instructions can become a bottleneck ( https://stackoverflow.com/q/2538070 ).

I think what you're observing is true: in many cases you can just use a lock (or your variation) and not worry about it. But there are certain applications/situations where you can do better. Having the consumer pull everything out of the queue with one locking operation, and being careful about how you signal the consumer from the producer(s) (assuming there's a signal), can also make the queue more efficient and give it higher throughput; e.g. you shouldn't signal for every item you put in the queue, only when it becomes non-empty, as in the sketch below.
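A minimal sketch of that signalling discipline, assuming a single consumer and a plain mutex plus condition variable (all names here are hypothetical):

    #include <condition_variable>
    #include <deque>
    #include <mutex>
    #include <utility>

    template <typename T>
    class SignalledQueue {
        std::mutex m;
        std::condition_variable cv;
        std::deque<T> items;

    public:
        void push(T value) {
            bool was_empty;
            {
                std::lock_guard<std::mutex> lk(m);
                was_empty = items.empty();
                items.push_back(std::move(value));
            }
            if (was_empty)
                cv.notify_one();        // signal only on the empty -> non-empty edge
        }

        // Single consumer: block until something arrives, then take the whole
        // backlog in one locking operation.
        std::deque<T> drain() {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !items.empty(); });
            std::deque<T> out;
            out.swap(items);
            return out;
        }
    };

Because the consumer always drains everything, the next push is guaranteed to see an empty queue and raise the signal again, so no wakeups are lost.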


Aren't lock free data structures more about reducing the impact of contention than throughput under low contention?


Under a lock free algorithm, we promise that across our entire system, eventually some progress is made. I can't necessarily point at any specific thread of execution and say "This thread makes progress" but I can say that forward progress is made for the system as a whole.

Under a wait free algorithm, we promise that every individual thread eventually makes progress.

Suppose I have one really dumb thread which just goes to sleep for 10 seconds in a loop, as well as other threads which do some actual work. Under a lock free algorithm, it is OK if the 10-second-sleep thread is always the one chosen to make progress: its "progress" consists of sleeping for 10 seconds; too bad, try again next time.

With a wait free algorithm, the threads which actually do useful work will also make at least some progress eventually, despite the 10-second sleeper.

Generally we can't say for sure that we, e.g., "reduce the impact of contention", only that we definitely make some forward progress, on at least one thread, eventually.
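As a concrete, purely illustrative way to see the difference, compare a CAS retry loop with a single fetch_add on a shared counter (standard textbook examples, nothing from the linked project):

    #include <atomic>

    std::atomic<int> counter{0};

    // Lock-free: if the CAS fails, it is because some *other* thread's CAS
    // succeeded, so the system as a whole made progress -- but this particular
    // thread could, in principle, keep retrying under sustained contention.
    void increment_lock_free() {
        int old = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(old, old + 1,
                                              std::memory_order_relaxed)) {
            // compare_exchange_weak refreshes `old` on failure; just retry.
        }
    }

    // Wait-free (on mainstream hardware): fetch_add completes in a bounded
    // number of steps for the calling thread, whatever other threads do.
    void increment_wait_free() {
        counter.fetch_add(1, std::memory_order_relaxed);
    }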


Lock free is kind of an overloaded term that can mean a variety of things depending on what the user is thinking. Usually the goal is being able to make forward progress even if one thread is context-switched out by the OS scheduler, which is the non-blocking property (wait-free is the stronger per-thread guarantee). Using mutexes (unless you can prevent OS scheduling, interrupts, etc.) makes this property impossible to achieve. In general, MPSC queues are super fast and there's no real reason to prefer a locked queue.


> In general, MPSC queues are super fast and there's no real reason to prefer a locked queue.

There's one significant advantage that a locked vector or deque has over MPSC/MPMC queues: the consumer can dequeue all pending messages in a single operation by locking the vector, swapping it with an empty vector (typically just 3 words), and unlocking it again. That's such a simple operation that it will typically be as fast or even faster than a single pop-one-message operation in an MPSC/MPMC. Similarly, if the vector is empty, a producer can push any number of messages in a single, constant-time operation.
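A sketch of just that drain step, with hypothetical names and assuming producers push_back under the same mutex:

    #include <mutex>
    #include <vector>

    struct Message { int payload; };        // stand-in message type

    std::mutex mtx;                         // protects shared_queue
    std::vector<Message> shared_queue;      // producers push_back under mtx

    void handle(const Message&) { /* process one message */ }

    // Consumer: grab the entire backlog with one lock/swap/unlock.
    void drain_all() {
        std::vector<Message> local;         // an empty vector is just {ptr, size, capacity}
        {
            std::lock_guard<std::mutex> lk(mtx);
            local.swap(shared_queue);       // the swap is a few pointer-sized writes
        }
        for (const Message& msg : local)    // process outside the lock
            handle(msg);
    }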


All true, although I would quibble with:

> That's such a simple operation that it will typically be as fast or even faster than a single pop-one-message operation in an MPSC/MPMC.

Not if the lock is blocked because one of the writers context switched out! The typical case is good, but the worst case is pretty bad.


Depends what you mean by “the impact of contention”. It is possible, if you build a lock-free system, to end up with threads that keep churning and never make forward progress.


> My benchmarks show that the price is acceptable

Well, except to people who have a hard requirement that there never be this unpredictable, infrequent longer delay you mention.


It's probably fair to claim that zero* game developers have this hard requirement. At best they have a strong desire which might be partially backed up by benchmarks and squinting hard enough.

*: ignoring deliberately esoteric cases, like Doom on a hard real-time system.


It's a requirement for any game that needs a reliable frame rate. Unpredictable delays during frame rendering cause jank.

(This is not to say the OP has any such issues in its context.)


This is usually called "soft realtime" -- you want it to be fast but it's not wrong if it misses deadlines. It doesn't violate correctness of the system to miss a deadline. In a "hard realtime" system, it is a fatal error if the system misses a deadline because correctness is violated. I presume this difference is what OP is talking about. Games are never hard realtime; missing frame deadlines reduces user experience but it doesn't break the game. Hard realtime is things like industrial motion control, automotive and aviation electronics, pacemakers, etc.


I would say that an exception could be made for VR games where any frame rate 'jank' causes an uncomfortable experience and can lead to increased simulator sickness.


VR games almost all use Unity and Unreal. They drop frames left and right. VR platforms use extremely complex time warp algorithms to hide the jank.

So you're not wrong. VR games are much more susceptible to dropped frames causing problems. But it both happens and is hidden remarkably well.


> Games are never hard realtime

Depends how high your standards are! One of the things that makes playing old games on the original hardware really satisfying is how consistent they are.


Those old games could be so consistent because the software and hardware were both so incredibly simple compared to today.

Today, a PC game has to work on a large range of hardware that has come out over the past 5+ years. And there are GPU features that are only available on certain cards, like hardware ray tracing and things like DLSS and FSR for upscaling.

And the game engines are incredibly more complex today to handle modern expectations, with dynamic lighting and shadows, huge maps, etc.

It doesn’t matter what your standards are. Hard realtime just isn’t realistic or even possible any more, except maybe in a game that would be considered truly primitive by today’s standards.


I sure have memories of how hard it was to support a range of hardware when I had to write specifically for each piece: when I had to write separately for a Soundblaster card and an Adlib and a Disney SoundSource and a Roland and a Gravis.

And that was just the sound cards. Writing for different hardware became so much easier when OpenGL and DirectX came into being. Suddenly I just had to write to these APIs.

I think I'm disagreeing with you. Supporting multiple hardware configuration way back when was so much harder than doing it today.


Yeah, and Tandy graphics were different from EGA or VGA, just enough that you needed to write different display code for them. And you had to get the user to tell you which ports/IRQ numbers half of their hardware was configured at.

We have way better hardware abstractions in the OS today, which I would agree makes modern development easier overall.


Oh yes, that brings it all back. Having the user tell me the IRQ and also DMAs and... I'm sure there was more. Sometimes the user would end up having to open up the PC and reseat physical jumpers on hardware cards to have the cards use values that were both offered by the software and not clashing with anything else.

Different world. So much easier now.


Oh gosh, the many different variants of SVGA that existed back in the days...

/me shivers at the recollection


> Games are never hard realtime; missing frame deadlines reduces user experience but it doesn't break the game.

It depends on whether the game is multi-player and, if so, how it keeps the different players in sync with each other.

Games that rely on deterministic gameplay can desync (two players don't see the same world state) and abort if one player's simulation drops a frame while the other doesn't.


If you have multiplayer like this, you absolutely do not have a hard requirement on frame latency.

You don't control network latency spikes. Latency tolerance is a hard requirement, or you will have constant problems and likely be unplayable. Or custom networking and hardware stacks, which is deep into esoteric territory.


That's still just a degradation of user experience and not a fatal fault. Indeed, support for running in a desynced state is written in because they know deadlines can be missed. In hard realtime, deadlines can't be missed. There's no recovery; it's a critical fault and you have to halt or fail over to a backup.


> That's still just a degradation of user experience and not a fatal fault.

I don't know how you'd describe a game spontaneously aborting not a "fatal fault". Yes, it's not turning off someone's pacemaker, but within the scope of what a game is able to do, kicking the player back to the matchmaking screen in the middle of a game is about as fatal as it gets.


That's only going to happen if the client has such a massive network latency spike that the server thinks it's disconnected. At least half a second, possibly more. You're never going to get that kind of delay from synchronizing threads, unless the process totally deadlocks.

EDIT: Well, there is one situation where you might get delays like that: if the computer is so woefully inadequate to run the game that it consistently misses frame deadlines by several hundred milliseconds. Of course, in such a situation a different concurrent algorithm wouldn't have solved anything anyway.


Literally any multiplayer game will have a "fatal fault" if its connection to the other players and/or server is interrupted for long enough, but it seems disingenuous to describe a system tolerant of delays variable across several orders of magnitude, up to hundreds of milliseconds or even full seconds, as "hard real time" in the sense in which the term is generally understood.


What I don’t understand here is how missing a frame deadline in this situation (<1/60 of a second or 16ms) is intolerable when typical network latency varies constantly by much more than this. It seems to me that if the latency requirements were this strict you could only support LAN multiplayer.


The only scenario I can imagine is some interactive game installation at Disney/etc. But there'd still be no actual benefit over an architecture more lenient to timing deviations.


> Games that rely on deterministic gameplay can desync (two players don't see the same world state) and abort if one player's simulation drops a frame while the other doesn't.

Right. This is why lockstep deterministic (LSD) games are bound to the SLOWEST player's machine.

No LSD game in existence crashes the game if one player's machine falls behind. Instead you either pause the simulation for all players until they catch up or you slow down time itself.

Source: shipped RTS games that were lockstep deterministic.


If you want to support players with ping over 16ms, there's no way you're gonna be synchronizing input from one frame to the output of the same frame for another player; there'll necessarily be some latency hiding, at which point any freeze should be coverable via just treating it as a temporarily-high ping and thus temporarily more latency hiding.


Lockstep simulation can't be hard tied to frame rate. Two computers will never have precisely the same frame timing -- even if they are running identical video modes, clocks are only accurate to within a tolerance and will diverge. The simulation has to allow for simulation ticks running independently from rendering frames.


Since you can't control the latency of individual packets, let alone guarantee that hardware timing is identical, this isn't practical as a requirement.

Deterministic gameplay means that after the same x frames with the same external inputs you end up with the same state y on different hardware. Therefore we only need to make sure the clients stay within a reasonable margin of each other. We can do this by waiting for all players' inputs, i.e. moving at the speed of the slowest client, which is the lock-step method. Or we can change our measured frame time, so each tick (which still advances the same amount of sim time) is treated as faster or slower. That way we can track, from the inputs they send, which frames the other clients have completed, and slow ourselves down or speed ourselves up to maintain a decent margin. This is the scheme used in rollback networking, since the naïve implementation that doesn't do this forces all rollbacks onto the faster machine.


If your program's state tracking/management is so tightly coupled to the renderer that missing a frame can desync, then you have much bigger architectural problems anyway.

Network latency is a thing and your state management subsystem absolutely has to be fault tolerant.


"Unpredictable delays during frame rendering cause jank"

This usually isn't a serious problem in practice, and the presentation threads will run at higher priority than the simulation in order to minimize visual 'jank'.

Lockfree game code ain't fixing jank from the OS.


> Lockfree game code ain't fixing jank from the OS.

Latency is like a chain. Every link matters.

A game engine programmer can't do a thing about the OS, but can still keep latency bounded in the code under their control.


These people are probably not game developers, because frame dips are all over the place in console and PC gaming.


Well, do we have a benchmark where the price of locking is unacceptable? I think the linked project is a fun theoretical exercise, but I personally don't accept the assertion that locking is unacceptable without seeing evidence.


I agree. Lock free is cool and all, but oftentimes it is more complex, and not every case justifies the added complexity.


The problem with vanilla std::deque is that the only acceptable implementation is in clang libc++. On MSVC it's basically an overcomplicated linked list.


Thank you for sharing this.

Could you tell me whether your 10 million figure includes batching, or is it a loop that tries to enqueue as many items as possible?


I gather from your comment that lock-free is actually the best choice for videogames, where consistent frame timing should be paramount.


When you protect an std::deque with a mutex you would need at least two atomic operations: to lock the queue before pushing, and to unlock the queue after pushing. Because you're using an std::deque it may need to allocate memory during a push, which would happen under the lock, which makes it more likely for a thread to suspend with the lock taken. While the queue is locked other threads will have to wait, possibly even suspend on a futex, and then the unlocking thread would have to wake another thread up.

The most expensive part of any mutex/futex is not locking, it's waking other threads up when the lock is contended. I'm actually surprised you only get 10 million messages per second, is that for a contended or an uncontended case? I would expect more, but it probably depends on the hardware a lot, these numbers are hard to compare.

My actor framework currently uses a lockfree intrusive mailbox [1], which consists of exactly two atomic exchange operations, so pushing a node is probably cheaper than with a mutex. But the nicest part about it is how I found a way to make it "edge triggered". A currently unowned (empty) queue is locked by the first push (almost for free: compared to a classic intrusive mpsc queue [2], the second part of push uses an exchange instead of a store), which may start dequeueing nodes or schedule it to an executor. The mailbox will stay locked until it is drained completely, after which it is guaranteed that a concurrent (or some future) push will lock it. This enables very efficient wakeups (or even eliding them completely when performing symmetric transfer between actors).
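For readers who haven't seen the shape of [2], here is a rough sketch of the classic intrusive MPSC node queue it describes (a Vyukov-style design with a stub node). The mailbox above layers its "edge triggered" locking on top of this by turning the second store in push into an exchange; that refinement is not shown here, and all names below are illustrative.

    #include <atomic>

    struct Node {
        std::atomic<Node*> next{nullptr};
    };

    class IntrusiveMpscQueue {
        std::atomic<Node*> head;   // producers swap themselves in here
        Node* tail;                // consumer-only end
        Node stub;                 // keeps the list non-empty

    public:
        IntrusiveMpscQueue() : head(&stub), tail(&stub) {}

        // Multi-producer push: one exchange plus one store.
        void push(Node* n) {
            n->next.store(nullptr, std::memory_order_relaxed);
            Node* prev = head.exchange(n, std::memory_order_acq_rel); // serialization point
            prev->next.store(n, std::memory_order_release);           // link in (may lag briefly)
        }

        // Single-consumer pop: returns nullptr if empty, or if a producer is
        // in the brief window between the exchange and the store above.
        Node* pop() {
            Node* t = tail;
            Node* next = t->next.load(std::memory_order_acquire);
            if (t == &stub) {                       // skip over the stub node
                if (next == nullptr) return nullptr;
                tail = next;
                t = next;
                next = next->next.load(std::memory_order_acquire);
            }
            if (next != nullptr) { tail = next; return t; }
            if (t != head.load(std::memory_order_acquire))
                return nullptr;                     // producer mid-push, try again later
            push(&stub);                            // re-insert stub so t can be unlinked
            next = t->next.load(std::memory_order_acquire);
            if (next != nullptr) { tail = next; return t; }
            return nullptr;
        }
    };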

I actually get ~10 million requests/s in a single-threaded uncontended case (that's at least one allocation per request and two actor context switches: a push into the target mailbox, and a push into the requester mailbox on the way back, plus a couple of steady_clock::now() calls when measuring latency of each request and checking for soft preemption during context switches). Even when heavily contended (thousands of actors call the same actor from multiple threads) I still get ~3 million requests/s. These numbers may vary depending on hardware though, so like I said it's hard to compare.

In conclusion, it very much depends on how lockfree queues are actually used and how they are implemented; they can be faster and more scalable than a mutex (a mutex is a lockfree data structure underneath anyway).

I'd agree with you that mutexes are better when protecting complex logic or data structures, however, because using lockfree interactions to make those "scalable" often drives the base performance so low that you'd maybe need thousands of cores to justify the resulting overhead.

[1] https://github.com/snaury/coroactors/blob/a599cc061d754eefea...

[2] https://www.1024cores.net/home/lock-free-algorithms/queues/i...


How do you deal with MSVC's `std::deque`? (or maybe you're not using a literal std::deque)


What is special about msvc's `std::deque`?


The memory chunks it allocates to store entries are so small that it ends up allocating one chunk for every element for almost every type. The result is a lot of allocation overhead, pointer indirection and various other things that are bad for performance.


Its block size is something stupid like 16 bytes, so it effectively becomes a slower linked list.


See https://devblogs.microsoft.com/oldnewthing/20230810-00/?p=10... which includes confirmation and a brief commentary on the tiny block size.


Whatever the issue is, they support Clang on Linux, i.e. no MSVC standard library.



