Are Jump Tables Always Fastest? (cipht.net)
188 points by luu on Oct 3, 2017 | 34 comments


> Threaded code should have better branch prediction behavior than a jump table with a single dispatch point

This is not the case anymore, at least for modern Intel processors. Starting with the Haswell micro-architecture, the indirect branch predictor got much better and a plain switch statement is just as fast as the "computed goto" equivalent. Be wary of any references about this that are from before 2013.

For more info, I would recommend "Branch Prediction and the Performance of Interpreters - Don't Trust Folklore", by Rohou, Swamy, and Seznec. PDF link: https://hal.inria.fr/hal-01100647/document

> although indirect branch prediction should help even the field.

Indeed :)

-----

Fun story: I experimented with adding indirect threading to the Lua interpreter and was excited to find an improvement of up to 30% on some selected microbenchmarks (running on my IvyBridge workstation). But the improvement dropped to 0% when I tested it on a machine with a Haswell processor. Measuring with perf indicated that the improved branch predictor was indeed responsible for this.


That's really good to know. It's still annoying though that C semantics apparently require there to be a bounds-check for every iteration of the switch(): https://eli.thegreenplace.net/2012/07/12/computed-goto-for-e... (see "Doing less per iteration")

I've tried using "default: __builtin_unreachable()" but this doesn't seem to help.
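
For reference, here's roughly what that attempt looks like (a minimal sketch with invented opcodes; whether GCC actually drops the range check depends on the compiler version, as the sibling comments note):

  /* Hypothetical interpreter loop: the hope is that marking the
     default case unreachable lets the compiler drop the bounds
     check in front of the jump table. */
  void run(const unsigned char *pc)
  {
      for (;;) {
          switch (*pc++) {
          case 0: /* OP_NOP  */ break;
          case 1: /* OP_HALT */ return;
          case 2: /* OP_ADD  */ /* ... */ break;
          /* ... more opcodes ... */
          default: __builtin_unreachable();
          }
      }
  }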


IIRC GCC is capable of dropping the check if it knows the possible range of the switch variable. An example would be using a single byte and having all 256 possibilities exist in the switch. Another would be switch (x & 7) with 8 cases. I have not checked in a while, but I spent a lot of time on code with this issue. It's at the core of my ray tracer of all things.
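
For example, here's a sketch of the masking version (the handler return values are just placeholders): since x & 7 can only be 0-7 and every value has a case, GCC can prove the operand is always in range and emit an unchecked jump table.

  int dispatch(unsigned x)
  {
      switch (x & 7) {
      case 0: return 10;
      case 1: return 11;
      case 2: return 12;
      case 3: return 13;
      case 4: return 14;
      case 5: return 15;
      case 6: return 16;
      case 7: return 17;
      }
      return -1;  /* unreachable, keeps -Wreturn-type quiet */
  }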


There is no requirement for range checking in the C standard. That was a missed optimization that was only very recently fixed on GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51513


> Starting with the Haswell micro-architecture, the indirect branch predictor got much better

Looking at the results I get, I feel like there's also an inlined switch lookup so that the CPU can get ahead of it.

Basically, if I had to do the same, I would try to inline the top of the switch down into the break, jumping directly instead of jumping to a common location.

So now the operator case ends with something that looks like:

  op1_offset:
  <invariant=pc> (operator code)
  pc2 = pc+1
  jmp_off = ops_table+pc2
  jmp jmp_off
Because there's no jump between the operator code and pc2 = pc+1, the CPU can compute jmp_off at any point after entering op1_offset (and this is why you have an HLT opcode; also, 32-bit opcodes are much more compact than 64-bit direct-threaded CGOTO).
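
As a rough C-level version of the same idea, here's a minimal computed-goto sketch (GCC extension) where every handler ends with its own indirect jump instead of jumping back to a shared dispatch point; the opcode set is invented for illustration:

  #include <stdio.h>

  static int run(const unsigned char *code)
  {
      /* One dispatch site per handler, so each gets its own
         branch-predictor history. */
      static void *ops_table[] = { &&op_inc, &&op_dec, &&op_halt };
      const unsigned char *pc = code;
      int acc = 0;

      goto *ops_table[*pc];        /* initial dispatch */

  op_inc:
      acc++;
      goto *ops_table[*++pc];      /* dispatch replicated per handler */
  op_dec:
      acc--;
      goto *ops_table[*++pc];
  op_halt:
      return acc;
  }

  int main(void)
  {
      unsigned char prog[] = { 0, 0, 1, 0, 2 };  /* inc, inc, dec, inc, halt */
      printf("%d\n", run(prog));                 /* prints 2 */
      return 0;
  }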

I'm not sure if the inlining of the jmp to_top_of_loop can be done by the compiler itself or if it is done by the lower levels of the CPU.

Compilers used to ruin the immediate dispatch of computed goto by extracting the common jump code - I've had to force GCC to leave my CGOTO alone to make this work before [1].

[1] - http://notmysock.org/blog/php/optimising-ze2.html


> Because there's no jump between the operator code and pc2 = pc+1, the CPU can compute jmp_off at any point after entering op1_offset (and this is why you have an HLT opcode; also, 32-bit opcodes are much more compact than 64-bit direct-threaded CGOTO).

If you're relying heavily on the indirect branch predictor of a Haswell-class CPU for performance, the exact details of computing the address probably don't matter that much. The CPU predicts the branch and continues executing, validating the prediction later. While it's definitely possible to exhaust some internal CPU resource waiting for old predictions to be validated, that is unlikely to be affected much by minor scheduling changes of the address generation sequence.

> I'm not sure if the inlining of the jmp to_top_of_loop can be done by the compiler itself or if it is done by the lower levels of the CPU.

A C compiler can definitely do it if it so desires, but it would also have to inline the required bounds check in addition to the indirect branch. If the compiler properly scheduled this branch, the fallthrough would always be success, and it would just be an always-false branch away that any branch predictor can handle. Of course, you could hit a CPU resource limit for outstanding branches and eventually stall, even if the branch is always predicted.

I'm not aware of any currently available out-of-order CPU that does the sort of linear trace formation that might be considered "inlining" of the switch logic. Pentium 4 famously used such a mechanism.


Yeah, I should have finished this article when it was still fresh. Hardware always seems to catch up with codegen from the '70s.

(Thanks for the excellent paper reference.)


How about on ARM? Is its branch prediction keeping pace? I'm curious because you can't JIT to native on iOS, so finding the best way to build a VM there is interesting.


> Is its branch prediction keeping pace?

Well, most ARM cores are not designed by ARM Holdings, and even among their own cores there are many models.

The answer to this question depends entirely on the specific design.


How about the chip in any recent model iPad or iPhone?


Who knows, it would be trivial to check if Apple allowed customers to run an operating system of their choice. I don't have the patience to operate one.


Wouldn't the VM violate the rules even if it's just an interpreter?


You're allowed to bundle a custom VM for any language if it only executes code included with the app itself.


This myth is even in the Python source code [1]:

  However, since the indirect jump instruction is shared
  by all opcodes, the CPU will have a hard time making
  the right prediction for where to jump next (actually,
  it will be always wrong except in the uncommon case of
  a sequence of several identical opcodes).
[1] https://github.com/python/cpython/blob/ff8ad0a576c6cf375e682...


It seems likely that Python is targeting a wider field than just the latest few generations of desktop/server CPUs.


I think that this might just be because that code is a bit older now.

I actually have to thank the documentation there for explaining all the flags you need to turn on in GCC to get the computed goto to work. By default, GCC's optimizer will mess things up, and I would never have figured out the flags by myself to be able to run my benchmarks.


That's super interesting, thank you for that.

Although every time I write an interpreter, I'm frustrated that I can't simply pass this information directly to the CPU instead of having it guess haphazardly where the indirect branch will lead. Modern CPUs let us control the prefetcher; why not the branch predictor?

I'm currently working on a MIPS emulator where I can predict the next executed opcode with perfect accuracy, yet I have no way to give this information to the processor and have to rely on the branch predictor to guess right.


Thank you so much for sharing this paper, which answered all my questions about the value of using "computed goto" in interpreters like Python on recent CPU architectures.


Ah yes, the classic "I can outsmart the compiler".

I've been down the exact same rabbit hole before, but with Java. There are two different bytecodes for representing a switch statement: tableswitch, which is dense (has a case for every key from X to Y), and lookupswitch, which is sparse. Of course the dense one must be better, I thought: O(1) vs. O(log n)! Maybe if I added a few more cases manually to my switch statement to cover the missing holes, my lookupswitch would become a tableswitch and my hot loop would be faster.

Turns out, of course, that not only is O(1) not necessarily any faster than O(log n) when n is small and the constant factor is large (see this article), but in fact it's irrelevant, since HotSpot uses the same function to generate the IR for both bytecodes (http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/b756e7a2ec... ), and thus the decision about whether to use a jump table or a binary search is entirely unrelated to the bytecode that represents the switch statement :)


Not only that, but the O(log n) sparse table has multiple branch points that are more predictable, so the branch predictor works better. It can be faster overall than a single highly unpredictable branch point.


One nice thing about a classic jump table is that you can change the function pointers dynamically. That might seem like a questionable thing to do, but it's pretty handy to intercept or wrap functions. Calling different functions depending on state/context can be more efficient than calling one function that has to check that same state/context on every single call, and it's slightly easier to set breakpoints that way too. Forget about doing any of that with the switch-statement or threaded versions. Unless you're writing something that has to run billions of times per second, like an interpreter's inner loop, the extra flexibility is worth it.
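
A minimal C sketch of that kind of swappable table (the handler names and the tracing wrapper are invented for illustration):

  #include <stdio.h>

  typedef void (*handler_fn)(int arg);

  static void handle_ping(int arg) { printf("ping %d\n", arg); }
  static void handle_data(int arg) { printf("data %d\n", arg); }

  /* The table is plain mutable data, so entries can be swapped
     at runtime to intercept or wrap a handler. */
  static handler_fn handlers[] = { handle_ping, handle_data };

  static void traced_data(int arg)
  {
      fprintf(stderr, "about to handle data\n");
      handle_data(arg);
  }

  int main(void)
  {
      handlers[1](42);            /* normal dispatch */
      handlers[1] = traced_data;  /* swap the entry in place */
      handlers[1](43);            /* now goes through the wrapper */
      return 0;
  }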


On my first attempt at writing an IRC parser by following the specification in an RFC, I wrote the whole tokenizer as a single switch statement. But as I needed further branches, I resorted to nested switch statements, and it quickly became a mess.

Then I switched to calling separate functions for tokens from within the switch statement: if the char is a colon and the state is prefix, call parsePrefix; if the char is whitespace, advance the state, etc.

I didn't know about jump tables back then, but relying on enums for states and branching proved to be a good idea, e.g. if state < 3, parseParameters(); if state < 2, parseMessage(), etc.

Using nested conditionals and loops in the same function, on the other hand, proved to be a terrible idea.
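
A rough C sketch of that shape of tokenizer (the states, handler names, and simplified tokenization are invented; real IRC trailing-parameter handling is omitted):

  #include <stdio.h>
  #include <string.h>

  /* Hypothetical parser states, in the order they appear in a message. */
  enum state { S_PREFIX, S_COMMAND, S_PARAMS };

  static enum state parse_prefix(const char *tok)  { printf("prefix: %s\n", tok);  return S_COMMAND; }
  static enum state parse_command(const char *tok) { printf("command: %s\n", tok); return S_PARAMS; }
  static enum state parse_params(const char *tok)  { printf("param: %s\n", tok);   return S_PARAMS; }

  static void parse_line(char *line)
  {
      enum state st = S_PREFIX;
      for (char *tok = strtok(line, " "); tok; tok = strtok(NULL, " ")) {
          switch (st) {
          case S_PREFIX:
              /* The prefix is optional and starts with ':'. */
              st = (tok[0] == ':') ? parse_prefix(tok) : parse_command(tok);
              break;
          case S_COMMAND: st = parse_command(tok); break;
          default:        st = parse_params(tok);  break;
          }
      }
  }

  int main(void)
  {
      char line[] = ":nick!user@host PRIVMSG #chan hello";
      parse_line(line);
      return 0;
  }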


Am I reading this right? The performance difference seems to be unaccounted for in the data:

    Performance counter stats for './x86_64-binary 5000000' (5 runs):

        6,883,819,114      cycles                    #    2.090 GHz                      ( +-  0.43% )
          232,004,486      instructions              #    0.03  insns per cycle          ( +-  0.06% )
           56,828,213      branches                  #   17.257 M/sec                    ( +-  0.04% )
            1,262,892      branch-misses             #    2.22% of all branches          ( +-  0.05% )

          3.299025345 seconds time elapsed                                          ( +-  0.43% )

    Performance counter stats for './x86_64-vtable 5000000' (5 runs):

        7,709,225,443      cycles                    #    2.087 GHz                      ( +-  0.95% )
          217,283,422      instructions              #    0.03  insns per cycle          ( +-  0.03% )
           51,631,368      branches                  #   13.976 M/sec                    ( +-  0.03% )
              957,553      branch-misses             #    1.85% of all branches          ( +-  0.10% )

         3.706410106 seconds time elapsed                                          ( +-  1.04% )
One would assume with all else being equal that code which ran fewer instructions with fewer branches and a better branch prediction rate would be faster.

Given that this is not what we get, can we assume that 'all else was not equal'? Where did the cycles get used? Average cost of a branch miss? Cache misses? Loading the pointers from the jump table with rep movsb?


I wouldn't scrutinize it too much as the benchmarking approach is wildly inaccurate, but I'm curious about that too. I might investigate it later. (I am the author of the post.) This was also run on a pretty ancient AMD machine, which isn't terribly representative of modern branch prediction hardware.


Obviously, a faster way would be to place your functions at predictable addresses that could be computed with simple bit arithmetic. No double lookup or conditional branches required.


Computed goto is also not always fastest. Someone benched it (I think it was in the context of interpreter main loops); any of the 3 approaches can win depending on arch and usage patterns.


I'm not sure. It would create holes in the code, and that tends not to be a good thing for the cache. I also don't know how branch prediction would react to that. I suppose it should be tried and tested.


How would that be possible in position-independent code (which you need for ASLR and such)? You'd need to fetch some base address from memory, even if you can compute the offset from that.


In Java, large jump tables prevent the JIT from compiling methods.


dispatch(-1)


Yes, I thought the exact same thing. The value being passed in is an int, which can be negative. So, the statement "if (state > 4) abort();" isn't enough of a guard.
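
A minimal sketch of a tighter guard (only the check is shown; the dispatch table itself is the one from the article):

  #include <stdlib.h>

  void dispatch(int state)
  {
      /* Reject negative values as well as out-of-range positives. */
      if (state < 0 || state > 4)
          abort();
      /* or, folding both checks into a single compare:
         if ((unsigned)state > 4) abort(); */

      /* ... index into the handler table with state ... */
  }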

I managed a team of C/C++ programmers for many years in Silicon Valley. I always encouraged my engineers to write clean code without fancy tricks. Fancy tricks lead to bugs that take a lot of time to debug. If I had an engineer write that dispatch() function with the vtable, I'd have beaten them with a wet noodle until they promised never to do that again.


C++ already has virtual tables built-in.

You still need to select the correct table for every incoming packet, though. But again, C++ has idiomatic ways for that, e.g. std::unordered_map<uint8_t, IPacketHandler*> for sparse values, or std::array<IPacketHandler*, n> for dense zero-based values.

While this approach is slightly more complex (you need to register handlers somehow), debuggers are happy with that kind of dynamic dispatch because it's standard OO C++. IMO it's more maintainable over time, e.g. it's trivial to add another handling method, and the compiler will check that you implement it for each protocol you support.


The site is awful on mobile Chrome.


Same on mobile Firefox. The text only uses 50% of the width of the narrow screen, and quotes go down to 25%, so it's no more than two words (or a single long one) per line.



