> Rosetta can translate most Intel-based apps, including apps that contain just-in-time (JIT) compilers.
How on Earth does it do that? If executable code is being generated at runtime, it's still going to be x86_64 binary machine code (there are too many ways to generate valid machine code, and Rosetta won't know right away whether you're JITting, or cross compiling and actually want x86_64), so Rosetta would need to detect when the code's about to be run, or when it's marked as executable, and translate the machine code in that bit of memory just in time. The ARM code might be longer than the original, so it would have to go in a different block of memory, with the original x86_64 code replaced with a jump to the new code or something.
It's late at night here, so maybe I'm missing a simpler approach, but I'm a bit surprised they have it working reliably enough to make such a general statement (there being a great variety of JIT systems). From a quick search I can't tell if Microsoft's x86-on-ARM translation in Windows 10 ARM supports JITs in the program being run.
They might be using something like an NX bit on the generated x86_64 page, so that whenever the code attempts to jump into it, a page fault is generated, and the kernel is able to handle that, kicking in the JIT compilation and translating the code / address. This is essentially a "double JIT" so there will likely be a performance hit.
Since they control the silicon, Apple might also be leveraging a specialized instruction / feature on the CPUs (e.g. a "foreign" bit that's able to mark memory pages as being from another architecture, and some addressing scheme that links them to one or more native pages behind the scenes)
Maybe the A series chips even have "extra" registers / program counters / interrupts that aid in accelerating this emulation process.
Think of binary translation as just another kind of compiler. It parses machine code, generates an IR, does things to the IR, and then codegens machine code in a different ISA. (Heck, Rosetta 2 is probably built on LLVM. Why wouldn't it be? Apple already put so much work into it. They could even lean on similar work like https://github.com/avast/retdec .)
During the "do things to IR" phase of compilation, you can do static analysis, and use this to inform the rest of the compilation process.
The unique pattern of machine-code that must occur in any implementation of JIT, is a jump to a memory address that was computed entirely at runtime, i.e. with a https://en.wikipedia.org/wiki/Use-define_chain for that address value that leads back to a call to mmap(2) or malloc(2). Static analysis of the IR can find and label such instructions, and you can use this to replace them in the IR with more-abstract "enter JITed code" intrinsic ops.
Then, in the codegen phase of this compiler, you can have such intrinsics generate a shim of ARM instructions. Conveniently, since this intrinsic appears at the end of the JIT process, the memory at the address passed to the intrinsic will almost certainly contain finalized x86 code, ready to be re-compiled into ARM code. So the shim can just do a (hopefully memoized) call to the Rosetta JIT translator, passing the passed-in address, getting back the address of some ARM code, and jumping to that.
> The unique pattern of machine-code that must occur in any implementation of JIT, is a jump to a memory address that was computed entirely at runtime, i.e. with a https://en.wikipedia.org/wiki/Use-define_chain for that address value that leads back to a call to mmap(2) or malloc(2).
This is probably mostly true. But actually reliably finding such cases, in the most hostile environment imaginable (namely, on x86-64 binary code), would require heroic program analysis that I as a compiler engineer would judge unrealistic.
I would find it much more plausible for the system to watch for something like "jump to a target in a memory page that at some point had the W flag set, has since had W removed and X set, and has not been jumped to since that", and then trigger a compilation of that page starting at the given jump address, and a rewrite of the jump instruction. As proposed in https://news.ycombinator.com/item?id=23614894 this could be done by watching for page faults, though presumably it would mean intercepting places where the application tries to set the X bit on a page and setting NX instead so that the page fault can be caught.
> Rosetta 2 is probably built on LLVM. Why wouldn't it be?
LLVM is kind of slow; JavaScriptCore abandoned it years ago for their FTL backend.
> The unique pattern of machine-code that must occur in any implementation of JIT, is a jump to a memory address that was computed entirely at runtime, i.e. with a https://en.wikipedia.org/wiki/Use-define_chain for that address value that leads back to a call to mmap(2) or malloc(2).
Re: your rebuttal code — that’s still a use-define chain. You don’t need the same literal pointer; you just need to know that the value of the pointer ultimately depended on the output of mmap(2). Since the mmap(2) region address is passed into memcpy(2)—and memcpy(2) can fail, producing NULL—the output of memcpy(2) does then depend on the input. (Even if it didn’t, you could just lie to the compiler and tell it to pretend that memcpy(2) always depends on its input.)
Huh, this is news to me. The memcpy(3) man page on my Linux box (there is no memcpy(2) here) doesn't mention this either, is this some special MacOS or BSD feature of memcpy? Under what circumstances would it determine that it should fail?
Your reasoning is strange anyway, since the memcpy has nothing to do with anything: the implicit information flow from mmap to mprotect would exist even if the memcpy and the region variable were removed.
The original PowerPC on Intel Rosetta was pretty amazing.
First, most programs do much of their work inside the OS - rendering, network, interaction, whatever, so that's not emulated; Rosetta just calls the native OS functions after doing whatever input translation is necessary. So, nothing below a certain set of APIs is translated.
You have to keep a separate translated binary in memory, and be able to compile missing bits as you encounter them, while remembering all your offset adjustments. It worked amazingly well during the PowerPC transition. Due to so many things running natively on x86, the translated apps were frequently faster than running native on PowerPC macs!
> the translated apps were frequently faster than running native on PowerPC macs!
This gets repeated a lot and it's generally false. In fact, most of the time they were slower, and realistically you would expect this. On a clock-for-clock basis, an OG Mac Pro 2.66GHz was about 10-20% slower than a Quad G5 2.5GHz running the same PowerPC software. In some benchmarks, the Quad G5 was still faster at running PowerPC software than the 3.0GHz Mac Pro (see https://barefeats.com/quad06.html ). When the Core Duo had to pay the Rosetta tax, even upper-spec G4s could get past it (https://barefeats.com/rosetta.html) and it stood no chance against the G5.
Where I think this misconception comes from is that on native apps (and Universal apps have native code), these first Intel Macs wiped the floor with the G4 and most of the time nudged past even the mighty Quad, and by the second generation it wasn't a contest anymore. But for the existing PowerPC software that was available during the early part of the Intel transition, Power Macs still ran PowerPC software best overall. It wasn't Rosetta that made the Intel Macs powerful; it was just bridging things long enough to buy time for Universal apps to emerge. Rosetta/QuickTransit was marvelous technology, but it wasn't that marvelous.
I had a water-cooled quad 2.5GHz Power Mac G5 on my desk, as well as the OG Intel Mac Pro and Apple's dev system. I can tell you with 100% certainty that Google Earth ran faster on the Intel Mac Pro under Rosetta than natively on the Quad G5, as I'm the person who ported it. When I had the native version working on Intel, that was the fastest of all, by far.
History repeats itself. Even though Apple didn’t make the claim this time, they did claim that PPCs would run 68K apps faster because they would spend much of their time running native code. This wasn’t true then either - at least not for the first-gen PPC Macs.
Typically the way systems do this is by translating small sections of straight-line code, and patching the exits as they are translated. So you start by saying: translate the block at address 0x1234. That code may go until a jump to address 0x4567. When translating that jump, they instead make a call to the runtime system which says "where is the translated code starting at address 0x4567?" If the code doesn't exist, it goes ahead and translates that block and patches the original jump to skip the runtime system next time around.
This means early on in the program's run you spend a lot of time translating code, but it pretty quickly stabilizes and you spend most of your time in already translated code.
Of course, if your program is self modifying then the system needs to do some more work to invalidate the translation cache when the underlying code is modified.
Right, if you look at the example JIT project I posted, they start by creating the memory with mmap() and PROT_EXEC before generating any code. So yes, you'd need to trap subsequent writes to allow for retranslation [I assume they have hooks in the Darwin kernel for this].
Or they just enforce W^X in the rosetta runtime by intercepting client mmap calls and fix it up as far as the client code is concerned by catching SIGBUS first.
I don't see anything here that requires extra kernel hooks.
Pages containing x86 code are never actually being marked executable in the (ARM) pagetables, because that would be nonsensical: the CPU itself does not know how to run x86 code. mmap/mprotect from x86 with the exec bit therefore does something different than it would for native code.
Right, but the x86 code _thinks_ it does, and those semantics have to be maintained.
What that practically means is that x86 executable pages get mapped as RO by default as seen by the ARM side, even if they would normally be RWX. Modifications trap; the handler mprotects the pages as writable, flushes the translation cache for that region, then returns from the trap. Then on an x86 jump to that region, the JIT cache sees no traces for that region, marks the page as RO again so that modifications will trap, and starts recompiling (or maybe just interpreting; they may be using a tiered approach in the JIT).
This is some serious black magic, especially if it actually works. Even if it only works some of the time I give it an A for effort, since they could easily have just said "JITs will not work with Rosetta." Most applications do not contain JITs.
My guess is there are enough important Mac apps that contain JITs of some form to merit this work, probably high performance math and graphics kernels that JIT processing pipelines or something like that. I wonder if some pro studio editing applications do this.
True, but those are trivial to port as there are already mature JDKs and JavaScript VMs for ARM64. Porting an electron app should be a matter of checking it out and building it with ARM support. Applications with internal JITs custom-built for X64 are going to be a lot tougher.
Electron doesn't support aarch64 builds compiled from aarch64 hosts yet (so you have to cross compile), and doesn't support cross compiling if you have any native code addons.
Some apps out there would be between a rock and a hard place if it weren't for Rosetta.
There's still a non-trivial amount of time for that to propagate: first it's merged into Chromium, then merged into Electron, then actually used by the client programs.
If there were ever a time and a place for black magic, it would be this use case: swapping out an enormous layer (the hardware) at the bottom of the stack, while keeping everything at the top of the stack working seamlessly.
I'm not sure why they're making a big deal about this, couldn't the original Rosetta do this too? QEMU has been doing this since (I think) even before the original Rosetta, they call it user mode emulation. You run as if it was a normal emulator but also trap syscalls and forward them to the native kernel instead of emulating a kernel too.
I'm more interested in how they're doing the AOT conversion and (presumably) patching the result to still be able to emulate for JITs. That'd be (comparatively) simple if it was just for things from the iOS and Mac App Stores since Apple has the IR for them still but they made it sound like it was more generic than that.
The original Rosetta couldn’t do it, or couldn’t do some part of it. I remember because Sixtyforce couldn’t run. I believe that it could have to do with self-modifying code.
> I'm not sure why they're making a big deal about this
Because they're trying to get the message across to regular people (not HN types) that their software will continue working with the new chips, and they don't have to flee the Apple ecosystem.
> I'm not sure why they're making a big deal about this, couldn't the original Rosetta do this too?
That's par for the course with Apple. They never acknowledge competitors, including when it's themselves. Everything they do or describe is awesome and magical, right now! It could be an incremental improvement, it could be a half-decade-old established technology, it could be something completely unexpected and science fiction turned into reality. That's just how Apple works.
My gut feeling is that it will be about the same as the x86 emulation on Itanium or the HP Envy x2. Emulation of highly optimized hardware, where code is generated by highly optimized compilers, without an order of magnitude slowdown is just too good to be true.
Valgrind does something similar (x86->intermediate language->x86), and is only about 4x slower than native with all the analyses disabled. I’d guess they left some optimizations out to make it easier to implement the dynamic checks it supports.
Valgrind leaves most instructions as they were, doesn't it? If you're not touching the dynamic memory, it should be about as fast. You wouldn't be able to do that with complex MMX or SSE2 instructions when translating to ARM.
It lifts them to a simplified version of x86 so they can be instrumented / transformed more easily. I think that implies you get different instructions when it lowers back to x86, but I could be wrong. (I’ve written a specialized valgrind instrumentation tool or two, but didn’t look too carefully at the execution half of their codebase.)
> If executable code is being generated at runtime, it's going to be x86_64 binary machine code still
JITs have to explicitly make the memory they write the generated code to executable. The OS "just" needs to fail to actually make the page executable, and then handle the subsequent page fault by transpiling or interpreting the x86 code therein.
And it wouldn't make sense to actually mark any pages originating from an x86 process as executable at the page-table level anyway, as the CPU can never actually execute them as machine code.