
I wonder how it handles the stricter memory ordering from x86? For example, this code:

  void store_value_and_unlock(int *p, int *l, int v)
  {
    *p = v;
    __sync_lock_release(l, 0);
  }
on x86_64 compiles to:

   mov     dword ptr [rdi], edx
   mov     dword ptr [rsi], 0
   ret
vs on arm64:

   str     w2, [x0]
   stlr    wzr, [x1]
   ret
Notice how the second store on ARM is a store with release semantics to ensure correct memory ordering as it was intended in the original C code. This information is lost as it's not needed on x86 which guarantees (among other things) that stores are visible in-order from all other threads.
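For contrast, the same intent can be written with portable C11 atomics (a sketch; the atomic type on the lock is my addition, not the code from above), making the ordering requirement explicit in the source so the compiler can pick the right instruction on either target:

```c
#include <stdatomic.h>

/* C11 version of the snippet above: the release semantics are now
 * stated explicitly, so the compiler emits a plain mov on x86_64
 * (ordering is implicit under TSO) and an stlr on arm64. */
void store_value_and_unlock_c11(int *p, atomic_int *l, int v)
{
    *p = v;                                            /* plain data store */
    atomic_store_explicit(l, 0, memory_order_release); /* release store   */
}
```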


That's the big piece I've been wondering about too.

Three options as I see it (none of them great):

1) Pin all threads in an x86 process to a single core. You don't have memory model concerns on a single core.

2) Don't do anything? Just rely on apps to use the system provided mutex libraries, and they just break if they try to roll their own concurrency? Seems like exactly the applications you care about (games, pro apps), would be the ones most likely to break.

3) Some stricter memory model in hardware? Seems like that'd go against most of the stated reason for switching to ARM in the first place.
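To illustrate what "roll their own concurrency" means in option 2 (a hypothetical sketch, names mine): the classic flag-based publish below happens to work on x86 even with plain stores, because stores become visible in order, but on ARM the reader can observe the flag before the data unless the release/acquire ordering is stated explicitly.

```c
#include <stdatomic.h>

static int data;
static atomic_int flag;  /* 0 = not ready, 1 = ready */

void producer(void)
{
    data = 42;
    /* On x86 a plain store here works under TSO. On ARM a plain store
     * could be reordered before the data store, so release is needed. */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

int consumer(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;            /* spin until published */
    return data;     /* acquire guarantees this sees 42 */
}
```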


There's another option:

4) Translate x86 loads and stores to acquire loads and release stores, to align with x86 semantics. These already exist in the ARM ISA, so it's not much of a stretch at all.

This is the one I'm betting on.


I expect you'd need to do that with every single store, since the x86 instruction stream doesn't have those semantics embedded in it.

That'd kill your memory perf by at least an order of magnitude, and kill perf for other cores as well. It'd be cheaper to just say "you only get one core in x86 mode". Essentially you'd be marking every store as an L1 and store buffer flush and only operating out of L2.


ARMv8.1 adds a bunch of improved atomic instructions that basically implement the same functionality as x86. Because x86 has atomic read-modify-write instructions by default, you need to emulate those too.

The ARMv8.1 extensions look like they have been explicitly designed to allow emulation of x86.

The implication is that implementations of ARMv8.1 can (and perhaps should) implement cache/coherency subsystems with high-performance atomic operations.

And I'm willing to bet Apple has made sure their implementation is good at atomics.
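As a concrete example of the read-modify-write case (a sketch using standard C11 atomics, not any particular emulator's output): on x86 this compiles to a single lock-prefixed instruction, while on ARMv8.1 with LSE it can become a single ldadd instead of a load-exclusive/store-exclusive retry loop.

```c
#include <stdatomic.h>

/* x86: lock xadd.  ARMv8.0: ldaxr/stlxr retry loop.
 * ARMv8.1 LSE: a single ldaddal instruction. */
int fetch_and_increment(atomic_int *counter)
{
    return atomic_fetch_add_explicit(counter, 1, memory_order_seq_cst);
}
```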


So the ARMv8.1 extensions aren't for emulating x86; they're a reflection of how concurrency hardware has changed over the years.

It used to be that (and you can see this in the RISC ISAs from the 80s/90s, but AIUI this is what happened in x86 microcode as well):

* the CPU would read a line and lock it for modifications in cache.

* the CPU core would do the modification

* the CPU would write the new value down to L2 with an unlock, and L2 would now allow this cache line to be accessed

So you're looking at ~15 cycles of contention, from L2 up through the CPU core and back down to L2, while that line is locked.

If you think this all looks really close to a subset of hardware transactional memory, once you realize that the store can fail and the CPU has to retry if some resource limit is exceeded, you're not alone.

Then somebody figured out that you can just stick an ALU directly in L2, and send the ALU ops down to it in the memory requests instead of locking lines. That reduces the contention time to around a couple cycles. These ALU ops can also be easily included in the coherency protocol, allowing remote NUMA RMW atomics without thrashing cache lines like you'd normally need to.

This is why you see both in the RISC-V A extension as well. The underlying hardware has implementations in both models. I've heard rumors that earlier ARM tried to do macro op fusion to build atomic RMWs out of common sequences, but there were enough versions of that in the wild that it didn't give the benefits they were expecting.

However, all that being said, atomics are orthogonal to what I'm talking about. It's x86's TSO model, and how it _lacks_ barriers in places that ARM requires them, with no context just from the instruction stream about where they're necessary that's the problem here, not emulating the explicit atomic sequences.


On the other hand, the ARMv8.5 flag manipulation instructions almost certainly were added specifically for x86 emulation.


Yes, you need it for every store.

It absolutely does not imply that it would kill your memory perf by an order of magnitude! Remember that x86 does this as their normal store behavior and they don't take an order of magnitude hit.

The easiest approach that leads to fast release stores is just to have a stronger store pipeline that doesn't allow store-store or load-store reordering. The latter basically comes for free, and most weak pipelines already preserve it.

The store-store ordering, on the other hand, does have a cost: primarily in requiring stores to drain in order, and in handling misses in order. Nothing close to an order of magnitude.

A higher-performance design would allow store-store reordering in general, but not around release stores, giving most ARM code the full benefit while still allowing fast release stores for x86.

I think you are mixing up release stores with more expensive atomics like seq-cst stores or atomic RMWs. There is no need for a store buffer drain, ever, for release stores.


What I can say is that they aren't pinning to only a single core, so the answer is elsewhere.


That would be too expensive, unless the emulation overhead in general is so high that it doesn't matter.

If store-releases and load-acquires were already very cheap, why make them distinct from normal ones anyway?

I suspect that either the CPU is already TSO in practice or has some TSO mode.


Then why not expose that to ARM applications running on it?


Apple might not want to promote this to an architectural guarantee. It is only needed for the few years required to transition away from x86, but if applications start relying on it, Apple will need to maintain it forever.


If they run with it for long enough it will all but become an architectural guarantee as people unintentionally write incorrect programs that happen to still work right.


> If store-releases and load-acquires were already very cheap, why make them distinct from normal ones anyway?

Perhaps if the ARM ISA was designed today, they would, I'm not sure.

My impression is that ARM added them when (a) it became obvious which way the wind was blowing with respect to modern memory models, like Java (sort of), C and C++, where acquire and release are the dominant paradigm (and I doubt any other language will stray very far), and (b) they could see that these operations can be implemented at a relatively low complexity cost and with reasonable performance, compared to the existing barrier approach.

That said, there is a lot of room between "very cheap" and "too expensive", and it is probably hardware dependent. On some simple microcontroller-like design that doesn't want to do any extra work to support these, they might be very expensive, just relying on something like a full barrier which they have to support anyways.

However, on bigger designs, it is not necessarily very expensive to implement efficient release stores. They will still be slower than plain stores, but may not be all that much slower. It's nothing like sequentially consistent stores, or barrier types that require a full store buffer drain.

Mostly all you need is to ensure that the stores drain in order. Actually the requirement is even weaker: you can't let release stores pass older stores. So it depends on your store buffer design, and it prevents some optimizations like merging and out-of-order commit to L1D, but those are far from critical optimizations (and you can still do them for plain stores at the cost of a bit more complexity).

If you are designing a core where release stores only occur in concurrent code, e.g., as a result of a release store in C++, or as a lock release or whatever, you don't need to make them that fast. If they are 10x slower than regular stores, it's probably OK.

However, if you are designing a chip where you know you are going to be running a lot of emulated x86 code where every store is a store-release, then yeah, you are going to do a bit more work to make these fast.

What are the other options? (1) and (2) in the GP's list seem very unlikely to me. (1) would almost certainly be much worse than release stores for multithreaded apps and would destroy performance in popular creator apps.

(3) is certainly plausible and one variant of what I'm suggesting here: it means making plain stores and release stores the same thing (and I guess for loads). Definitely possible but seems less likely to me than (4).

Another possibility is a TSO mode, as you suggest. Perhaps this is somewhat easier than dynamically handling release stores, I'm not sure.


sorry for the confusion, I'm aware why ARM added those instructions (a very good decision compared to the bizarre barriers available on classic RISCs); what I meant is: if store-release can be implemented to be as fast as a normal store, why wouldn't Apple just give release semantics to normal stores? I guess this would allow them to be forward compatible with future, more relaxed architectures, but I don't think they care about forward compat of translated code. Still, it is certainly possible that you are right.

Re TSO mode, I thought I remembered that Power8/9 had a TSO mode (explicitly for compatibility with x86 code), but I can't find any reference to it right now.


I think I understood the first time, but my answer is:

"I never said release stores can be implemented as fast as regular stores, but they might well be fast enough. Maybe half the throughput (1/cycle) with some other ordering-related slowdowns.

In particular, maybe you can implement them as fast as an acq/rel-all-the-time CPU mode would be."

Or something like that.


Yes, I could see slightly slower but fast-enough stores and loads being good enough (and in fact that would be great in general, not just for translation).

The reason I'm thinking the CPU might actually be TSO is that I haven't seen significant evidence that, in a high-performance CPU, TSO is a significant performance bottleneck. For the last 15 years or so Intel has had the best-performing memory subsystem and didn't seem significantly hampered by reordering constraints compared to, say, POWER.


Yes, it's not a devastating impact, but my thinking on this has shifted a bit lately to "somewhat significant" impact. For example, I believe the strong store-store ordering requirement significantly hurts Intel chips when cache misses and hits are mixed and an ABA scenario occurs as described at [1].

Also, it seems that Apple ARM chips exhibit essentially unlimited memory-level parallelism, while until very recently Intel chips had a hard limit of 10 or 12 outstanding requests, and in a very hand-wavy way some have claimed that this may be related to the difficulty of maintaining ordering.

More recently: Ice Lake can execute two stores per cycle, but can only commit one store per cycle to the L1D, unless two consecutive stores are to the same cache line. The "consecutive" part of that requirement comes directly from the store-store ordering requirement, and is a significant blow for some high-store-throughput workloads.

Similarly, I believe the whole "memory ordering mis-speculation on exiting a spin-lock loop" thing, which was half the reason for the pause instruction, also comes from the strong memory model.

None of these are terrible restrictions on performance, but they aren't trivial either. Beyond that, it is hard to estimate the cost of the memory ordering buffer in terms of power use, etc.

I agree that Intel has made chips with powerful memory subsystems despite this, but it is really hard to compare across vendors anyway: R&D and process advantages can go a long way toward papering over many faults.

[1] https://www.realworldtech.com/forum/?threadid=173441&curpost...


Thanks for the RWT link, I had missed that discussion back then. I normally assume that store-store reordering is not a huge deal as the store buffer hides any latency and blocking, but I failed to appreciate that the store buffer filling up is an issue.

But which architectures do not actually drain the buffer in order in practice? Even very relaxed RISCs (including ARM) normally respect causality (i.e. memory_order_consume), and it seems to me that if, say, an object pointer is made visible before the pointed-to object, that would violate this guarantee, right?

You say that Apple CPUs show near unlimited MLP, do you have any pointers?


Never mind, of course memory order consume still requires a store-store fence between the two stores so reordering stores is still possible
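The pointer-publication case in question, sketched in C11 (names hypothetical): even when the reader relies only on the data dependency via memory_order_consume, the writer still needs the release store to keep the object's fields from being reordered past the pointer store.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { int payload; };

static struct node storage;             /* backing object for the example */
static _Atomic(struct node *) shared;   /* published pointer, starts NULL */

void publish(int v)
{
    storage.payload = v;                /* initialize the object first... */
    atomic_store_explicit(&shared, &storage,
                          memory_order_release); /* ...then publish it */
}

int read_published(void)
{
    struct node *n;
    /* consume is promoted to acquire by current compilers; either way,
     * the dependent load n->payload is ordered after the pointer load. */
    while ((n = atomic_load_explicit(&shared,
                                     memory_order_consume)) == NULL)
        ;
    return n->payload;
}
```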


> 3) Some stricter memory model in hardware? Seems like that'd go against most of the stated reason for switching to ARM in the first place.

I would assume that a stricter memory model would be enabled only for processes that need it (ie, Rosetta-translated ones). So a CPU flag is set/cleared when entering/exiting user mode for those processes. Does this require a separate/special cache coherency protocol? A complete L1d flush when entering/leaving these processes (across all CPUs)? I'm not an expert in this field, and it feels complicated for sure. Is it worth it just for emulating "legacy" applications during a transitional period? Perhaps Apple can pull it off though.


> I would assume that a stricter memory model would be enabled only for processes that need it (ie, Rosetta-translated ones). So a CPU flag is set/cleared when entering/exiting user mode for those processes. Does this require a separate/special cache coherency protocol?

The benefits you'd get from going to a weaker memory model are by not having that extra coherency in the critical path in the first place. Adding extra muxes in front of it to make it optional would be worse than just having it on all the time.

> A complete L1d flush when entering/leaving these processes (across all CPUs)?

That wouldn't help because two threads could be running at the same time on different cores against their respective L1 and store buffers.


> The benefits you'd get from going to a weaker memory model are by not having that extra coherency in the critical path in the first place. Adding extra muxes in front of it to make it optional would be worse than just having it on all the time.

Indeed true, good point.

> That wouldn't help because two threads could be running at the same time on different cores against their respective L1 and store buffers.

Of course, this was related to the cost of switching coherency protocol during context switch. But as you say the overhead of just making it switchable is prohibitive in itself.


It can be enabled per instruction.

Atomic instructions (and ARMv8.1 added a bunch of new atomic read-modify-write instructions that line up nicely with x86) use the new cache coherency protocol, while the older non-atomic instructions keep the relaxed memory model.

Though, I'm not sure if it's worth it to keep two coherency protocols around. I wouldn't be surprised if the non-atomic instructions get an undocumented improvement to their memory model.


Elsewhere in the documentation[0] Apple explicitly calls out that code that relies on x86 memory ordering will need to be modified to contain explicit barriers. All sensible code will do this already.

[0] https://developer.apple.com/documentation/apple_silicon/addr...


Sure, for source translation that's sorta fine (although I wouldn't want to be the one debugging it). The issue is binary translation: there is really no such thing as an acquire or release barrier on x86.



