It's really more like 6000, if you do the accounting right. And, as someone who ...

mhh__ · on April 21, 2021

Which one?

thechao · on April 21, 2021

_mm256_movemask_epi8, i.e., the "fast lexer" instruction. That instruction takes ~3c (depending on uarch), and the ARM equivalent (7-8 instructions) takes ~5-6c (depending on uarch). It's just annoying.

glangdale · on April 22, 2021

PMOVMSKB is a great instruction, and 3c understates how cheap it is - if you have a throughput problem (rather than a latency problem) it's even more efficient relative to the ARM equivalent.

I have a blog post about coping strategies for working around the absence of PMOVMSKB on NEON:

https://branchfree.org/2019/04/01/fitting-my-head-through-th...

We used these techniques in simdjson (which I presume still uses them; the code has changed considerably since I built this): https://github.com/simdjson/simdjson

The best techniques for mitigating the absence of PMOVMSKB require that you use LD4, which results in interleaved inputs. This can sometimes make things easier, sometimes harder for your underlying lexing algorithm - sadly, it's not a 1:1 transformation of the original x86 code.

thechao · on April 22, 2021

Yep. My use-case is from your work. Brilliant stuff!

brandmeyer · on April 22, 2021

Have you had a chance to experiment with the SVE and/or AVX512 mask systems, yet?

glangdale · on April 22, 2021

No and yes, respectively.

I'm somewhat curmudgeonly w.r.t. SVE, insisting that while the sole system in existence is a HPC machine from Fujitsu, that for practical purposes it doesn't really exist and isn't worth learning. I will likely revise this opinion when ARM vendors decide to ship something (likely soon, by most roadmaps). There's only so much space in my brain.

AVX-512's masks are OK. They're quite cheap. There are some infelicities. I was irate to discover that you can't do logic ops on 8b/16b lanes with masking; as usual the 32b/64b mafia strike again. This may be a symptom of AVX-512's origin with Knights*.

It would be nice if the explicit mask operations were cheaper. Unfortunately, they crowd out SIMD operations. I suppose this is inevitable given that they need to have physical proximity to their units - so explicit mask ops are on the same ports as the SIMD ops.

I also wish that there were 512b compares that produced zmm registers like the old compares used to; sometimes that's the behavior you want. However, you can reconstruct that in another cheap operation iirc.

brandmeyer · on April 23, 2021

> I'm somewhat curmudgeonly w.r.t. SVE, insisting that while the sole system in existence is a HPC machine from Fujitsu, that for practical purposes it doesn't really exist and isn't worth learning. I will likely revise this opinion when ARM vendors decide to ship something (likely soon, by most roadmaps).

Fair enough. I have high hopes for SVE, though. The first-faulting memory ops and predicate bisection features look like a vectorization godsend.

> There's only so much space in my brain.

I'm still going to attempt a nerd-sniping with the published architecture manual. Fujitsu includes a detailed pipeline description including instruction latencies. Granted its just one part, and its an HPC-focused part at that. But its not every day that this level of detail gets published in the ARM world.

https://github.com/fujitsu/A64FX/tree/master/doc

> I was irate to discover that you can't do logic ops on 8b/16b lanes with masking; as usual the 32b/64b mafia strike again.

SVE is blessedly uniform in this regard.

> It would be nice if the explicit mask operations were cheaper. Unfortunately, they crowd out SIMD operations.

This goes both ways, though. A64FX has two vector execution pipelines and one dedicated predicate execution pipeline. Since the vector pipelines cannot execute predicate ops, I expect it is not difficult to construct cases where code gets starved for predicate execution resources.

glangdale · on April 23, 2021

The Fujitsu manuals are really good. Like you say, it's not often that you see that level of detail in the ARM world - or, frankly, the non-x86 world in general. From my prehistory as a Hyperscan developer back in the days before the Intel acquisition and the x86-only open source port, I have a lot of experience chasing around vendors for latency/throughput/opt-guide material. Most of it was non-public and/or hopelessly incomplete.

I salute your dedication to nerd-sniping. I need my creature comforts these days too much to spend days out there in the nerd-ghillie-suit waiting for that one perfect nerd-shot. That may be stretching the analogy hopelessly, but working with just architecture manuals and simulators is tough.

I am more aiming for nerd-artillery ("flatten the entire battlefield") these days: as my powers wane, I'm hoping that my superoptimizer picks up the slack. Despite my skepticism about SVE, I will retarget the superoptimizer to generate SVE/SVE2.

ConcernedCoder · on April 21, 2021

This instruction does quite a bit of leg work, and it becomes obvious why a RISK architecture would need 7-8 instructions to do the same, see: https://software.intel.com/sites/landingpage/IntrinsicsGuide...

irdc · on April 21, 2021

This looks like a task for gorc: https://five-embeddev.com/riscv-bitmanip/draft/bext.html#gen...

(granted, that’s only 32/64 bits, but still…)