Do Low-Level Optimizations Matter? (2020)

orlp · on July 18, 2021

This article is missing the important reference to the prior work by Edelkamp and Weiss, on branchless quicksort [1].

I've long since implemented that in pdqsort [2], along with other techniques described by me in [3] which is a drop-in replacement for std::sort that's ~twice as fast as std::sort for small data with branchless comparisons. It's also available in Boost.sort [4].

[1] https://arxiv.org/abs/1604.06697 [2] https://github.com/orlp/pdqsort [3] https://arxiv.org/abs/2106.05123 [4] https://www.boost.org/doc/libs/1_76_0/libs/sort/doc/html/sor...

ncmncm · on July 19, 2021

> This article is missing the important reference

I don't understand why you would complain it is missing reference to Edelkamp and Weiss when its second and third references are to papers by Edelkamp and Weiss. My [3] is identically the paper labeled [1] in the complaint.

orlp · on July 19, 2021

Ah, I'm sorry. I looked over it because I didn't see it mentioned in the text. You did technically link to it as "active literature on sorting" without further details, but then the article goes on to describe exactly their idea without mentioning that it is in fact their idea. There really ought to be a contextual reference that gives proper credit.

Waterluvian · on July 18, 2021

An ignorant question:

Why don’t default sort implementations provide multiple approaches under the hood and switch based on collection size?

Someone · on July 18, 2021

I expect many do. Recursive ones in particular, switch when things get relatively easy. For example glibc’s qsort switches to insertion sort for small sizes. https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=stdli...:

“Only quicksorts TOTAL_ELEMS / MAX_THRESH partitions, leaving insertion sort to order the MAX_THRESH items within each partition. This is a big win, since insertion sort is faster for small, mostly sorted array segments.”

The comment on the definition of MAX_THRESH is very funny/something worth crying about.

“This particular magic number was chosen to work best on a Sun 4/260.”

That was a 16.67MHz machine with 128 megabytes of RAM, tops. If that magic constant has been checked since 1990, maybe, it’s worth updating that comment.

kragen · on July 18, 2021

In a lot of cases, parameters like this have a fairly broad plateau where they provide almost optimal performance, and then drop off sharply when you get too far away. MAX_THRESH is 4, but I'm guessing that you'd get less than a factor of 2 difference in performance for any value from 4 to 32 on modern hardware, and above that you'd start seeing a quadratic slowdown, and maybe you'd see a factor of 2 or 3 slowdown if you set it to 1. If you decide to do the experiment, I'm interested in hearing your results.

Someone · on July 19, 2021

I don’t think they archived their benchmarks. Even then, I think that would be a multi-month project, to verify whether the test cases still are representative of real-world usage.

A factor of 2 would IMO be worth it, though.

Cursory reading https://github.com/freebsd/freebsd-src/blob/de1aa3dab23c06fe..., I get the impression FreeBSD switches over to a ¿bubble sort? (I didn’t look close) at <7 elements.

That function also seems newer (it claims to implement https://cs.fit.edu/~pkc/classes/writing/papers/bentley93engi..., which is from 1993)

kragen · on July 19, 2021

You could certainly spend multiple months benchmarking it and trying to figure out what real-world usage looks like, but you could get preliminary results in a few hours. It's mostly a matter of writing a short shell script.

I agree that a factor of 2, or even 1.03, would be worthwhile. But the biggest problem facing standard library authors is not frequently performing 1.5× or 2× worse than optimal; it's occasionally performing 10000× worse than optimal, like the story Bentley & McIlroy start out with in that paper about the organ-pipe array. (And that's why the Golang stdlib uses introsort.)

BTW, just because someone was optimizing their library parameters on a machine from 01987 (like the Sun-4/260) doesn't necessarily mean they were doing it before 01993. In 01993 I was using a VAX-11/785 from 01984. Hardware lasted longer in those days (it was less common to drop your SPARC on the sidewalk and spiderweb the touchscreen) and its higher cost forced it into longer service. https://books.google.com.ar/books?id=JRgDwCkMX_cC&pg=PP121&l... says the VAX-11/785's list price in 01984 was US$200k, equivalent to US$513k in 02021 according to the BLS CPI. Not counting the terminals and printers! (About 2000 kids learned to program on that beast, so it was probably a good expenditure for the school district.) In 01988 Sun listed the Sun-4/260 at a base price of US$40k (http://www.bitsavers.org/pdf/sun/Sun_Price_List_Dec88.pdf) but probably more realistically US$80k if you wanted a disk and some way to back it up. US$80k in 01988 is US$180k in 02021. So you might find yourself running a big machine long after it was obsolete.

But surely that sort of penny-pinching was unique to high schools like mine, not serious professionals like those writing FreeBSD's libc, GNU libc, and Bentley's SP&E article, written from the lofty sanctuary of Bell Labs? Well, Bentley's 01993 paper reports benchmark results (from 01992) on a MIPS R3000 (introduced 01988) and a VAX 8550 (introduced 01986).

Why didn't he benchmark it on any machine from that decade? Probably he didn't have one. And the glibc folks were mostly volunteers at that point. CSRG was still DoD-funded, but they disbanded in 01995 after 4.4. (I think a bunch of them were at BSDI for a while.)

— ⁂ —

The FreeBSD code is doing an insertion sort. Insertion sort looks a lot like bubble sort, and it took me a long time to appreciate the difference. One difference that's sometimes relevant for this sort of thing is that, on nearly-sorted data, insertion sort is O(N) and bubble sort is O(N²). But in this case that's not relevant because it's not running a single insertion-sort pass over the whole array after nearly-sorting it via quicksort; it's insertion-sorting each tiny partition. (The sorted part of the partition is the range from a up to pm.)

After N iterations of the outer loop of either insertion sort or bubble sort, one end of your array has N sorted items in it. (The bottom of the partition from a to a+n, in this case.) The difference is that insertion sort grows that sorted region by inserting the nearest unsorted element into it, while bubble sort grows it by inserting the smallest unsorted element into it (or largest, if you're sorting toward the top). So bubble sort only makes as much progress per pass as insertion sort, but it does a bunch of unnecessary extra work to do it (basically, the same work selection sort does); when sorting M randomly ordered items, insertion sort does on average ¼M† comparisons and swaps on each of the M passes, while bubble sort does ½M comparisons (and still ¼M swaps).

Never use bubble sort.

______

† Actually the number is something like ¼√(M² - M), I think. I'd have to go calculate it to be exact.

kevincox · on July 18, 2021

Well the code hasn't been updated either. So either the comment is probably fine or the constant should be reconsidered. Unless of course someone has re-evaluated and found the same constant to work well since then.

ot · on July 18, 2021

They do, to a degree. Modern implementations generally use hardcoded routines for very small numbers (say, up to 5), then some quadratic sort like insertion sort, then quicksort. The dispatching can happen at all levels of the recursion, so quicksort's leaves are actually optimized sorts for small collections.

anaphor · on July 18, 2021

This is a thing, see https://www.youtube.com/watch?v=FJJTYQYB1JQ (talk by Andrei Alexandrescu at CppCon 2019 about this exact idea)

jiggawatts · on July 19, 2021

Upvoted because this is a fantastic talk all by itself, and also highly topical.

ncmncm · on July 19, 2021

As I recall, this was the talk that triggered me to begin the experiments that led to the article.

abatilo · on July 18, 2021

There's at least some precedent for conditional execution in standard libraries.

At least, in Python, the default sort is https://en.m.wikipedia.org/wiki/Timsort

ralfd · on July 19, 2021

Old, but gold:

https://ridiculousfish.com/blog/posts/array.html

gumby · on July 19, 2021

This is quite good, and includes source code.

amelius · on July 18, 2021

Hopefully not just collection size, but also based on storage medium. You might want to sort stuff which is stored on disk or tape. Why should a standard library only offer solutions for objects stored in memory?

Waterluvian · on July 18, 2021

That seems like something you’d want the developer to handle. Not interested in leaking hardware details into the sort interface.

amelius · on July 19, 2021

You want the sort interface to have call-backs for loading/writing stuff from/to disk. But most sort interfaces don't have this. They only support memory.

wpietri · on July 18, 2021

For those interested in the question, it's worth looking at how zstd was made. I have to archive hundreds of TB of JSON about tweets, so I did a compressor bake-off last year. I was surprised at how much better zstd did. Turns out designing for the low-level details of modern processors can make a big difference: https://engineering.fb.com/2016/08/31/core-data/smaller-and-...

userbinator · on July 18, 2021

Somehow, I suspect not using JSON in the first place might yield better absolute results since the compressor wouldn't have to deal with all the redundancy of JSON and instead focus on the content only.

wpietri · on July 18, 2021

For many applications, this one included, "the first place" is out of reach. What I've got is an awful lot of incoming JSON data. I'd like to keep a copy in case we need it later.

Is it possible for me to write and test a robust custom compressor that does better in both storage and compression costs than any of the modern compressors? Possibly.

Is that true even though one of FB's goals was specifically optimizing JSON compression? Maybe.

Can I build it for the few hundred bucks per month in S3 costs I might save? Definitely not.

kragen · on July 18, 2021

Also, surprisingly often, doing something reasonably workable but far from optimal, and then throwing generic compression algorithms at it, can work very nearly as well as carefully doing the precisely optimal thing. SNMP uses BER, which has lots of carefully thought out compact representations for data types, and COAP has CBOR, and in both cases you can usually do as well or better on space by running a simple ASCII syntax through LZW. Usually not that much slower, either.

rixed · on July 19, 2021

Not fair, unless you also compress the BER or CBOR (which are primarily intended to serialize/deserialize faster, not to compress)

kragen · on July 19, 2021

I don't know if you've ever benchmarked BER and CBOR implementations, but I have, and they're astoundingly shitty at serializing and deserializing faster, even when compared against things like JSON. They do usually beat LZW though.

dragontamer · on July 18, 2021

Sorting is a good 'random problem' but matrix multiplication is WOW when you get all the low level details right.

L1 blocking, prefetching, column vs row, etc etc. You can probably spend a whole semester just optimizing matrix multiplication and understanding each incremental low level optimization.

duped · on July 18, 2021

And it's incredibly frustrating that precious few languages bake in matrix arithmetic so compilers can do this optimization for you. There's Octave and MATLAB but they're slow for other reasons. It's very frustrating when you have to either use some crazy template library that takes forever to compile or do the optimizations by yourself.

Julia is definitely promising for that reason

jpeloquin · on July 18, 2021

I thought baking matrix arithmetic into the language was more about making matrix-related code look nice than making it faster. That is, most languages just use LAPACK for matrix arithmetic, even Matlab and Julia, and they are distinguished only in that they provide special syntax for it. "In Julia (as in much of scientific computation), dense linear-algebra operations are based on the LAPACK library, which in turn is built on top of basic linear-algebra building-blocks known as the BLAS" [0]. How does baking matrix arithmetic into the language itself (maybe I misunderstood what you meant by this) provide new opportunities for compilers to optimize the code? Is it that if every bit of a program was implemented in the same language (e.g., Fortran), the compiler can do whole-program optimizations it otherwise could not?

[0] https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#BLAS-...

duped · on July 18, 2021

With dedicated semantics it becomes possible for a compiler to optimize things explicitly rather than implicitly. It's sort of equivalent to how a language without generics or inheritance can't perform certain optimizations (like inlining method calls) because the compiler isn't aware of those semantics in the first place.

The overhead of BLAS is significant for small matrices. However, a compiler can tell if matrices are small at compile time (or JIT time) and inline the expressions which presents enormous speed gains. It can also make judgement calls about when to use sparse or dense arrays but that's a bit trickier. Compilers can do this without knowing that matrices or linear algebra exist, but it usually takes PGO which can be difficult to use.

Iirc there's a linear algebra library for Julia that can beat the LAPACK backends for small matrices, which is a pretty big deal.

There are also classes of optimizations like SIMD that don't just depend on matrix size but the target platform, which may not be the platform the software was compiled on. The way to do this is with function multiversioning but not all compilers/languages support it, this is kind of a classic usecase for a JIT compiler.

One of the reasons Eigen is great is because it can use template semantics to make these optimizations at compile time. It's just slightly annoying in how it harms compile times.

adgjlsfhk1 · on July 18, 2021

yeah. It's Octavian https://raw.githubusercontent.com/JuliaLinearAlgebra/Octavia... It can beat MKL convincingly up to 100x100 and remains slightly ahead up to 1000x1000 (and performance is still improving).

_hl_ · on July 18, 2021

I think the point they were making is that LAPACK et al are a pain to use directly (they really are!), and abstractions are limited by the capabilities of the language or heavy to compile (not sure I agree, Eigen is pretty okay).

ncmncm · on July 19, 2021

Matrix multiplication and sorting are at polar opposite ends of the spectrum from, on one side, utterly independent of input (except gross size) to utterly dependent on details of input. Optimizations that help one case tend to hurt the other, but we unfairly expect our computers to be good at both. What we get is extremely complex implementations whose performance on any particular problem is unpredictable.

adrian_b · on July 18, 2021

The history of the CMOV instructions mentioned in the article is incorrect.

While ARM had such instructions from the beginning and other computers had them even earlier, in the Intel/AMD x86 instruction set the CMOV instructions (including FCMOV for floating-point) have been introduced in the Pentium Pro, in November 1995 and all later Intel models have them. AMD has added the CMOV/FCMOV instructions in Athlon, in June 1999.

The 64-bit extension of x86 has just inherited them from the 32-bit ISA and Itanium had them later than Pentium Pro.

The CMOV instructions were an important new feature of Pentium Pro, besides the out-of-order execution and triple instruction decoding and execution.

ncmncm · on July 19, 2021

Thank you, I will fix this.

reader_mode · on July 18, 2021

>Why doesn’t G++ generate cmov instructions? Older releases did, in fact, often enough to generate bug reports11. It turned out that, any time a subsequent loop iteration depends on the result, and the branch would have been predicted correctly, cmov may be slower than a branch, sometimes much slower. cmov can stall speculation all by itself.

>The latest designs from Intel and AMD are said to avoid the cmov pipeline stall, often, but Gcc has not caught up with them yet.

This is why you don't optimize first - even things that you think you know might not be true in the next HW iteration, compiler version, etc. This pattern shows up all the time. The only sane way to do optimization is to measure, write tests to catch regressions, otherwise always write for readability first.

pjmlp · on July 18, 2021

I wish I got a penny for each C and C++ developer that fails to follow that advice.

alex_smart · on July 18, 2021

>This is why you don't optimize first

The internals of an optimizing compiler might very well be the worst place to apply that principle. Who else is supposed to implement these optimizations, if not the compiler?

setr · on July 19, 2021

GP is talking to application developers, not compiler devs.

alex_smart · on July 19, 2021

I don't know how many application developers resort to manually generating the assembly, so not sure how that lesson can be gleaned out from the quoted passage.

setr · on July 20, 2021

the set of over-optimizing app developers will do just that :)

But more generally, optimizations eventually stop being true in general, and start being true only in a specific context -- and the lower that optimization, the more specific that context gets. If you want to "teach" GCC how to optimize something properly (more commonly in the form of attribute notations e.g. inlining, but more extreme is poking around godbolt), you start falling into this trap

ot · on July 18, 2021

> let us begin with a file of a hundred-million totally random integers [...] This is objectively the worst case, containing the greatest possible entropy; but also the best case, hiding no surprises.

There is no single worst case for sorting, and in fact most benchmarks use a several different inputs, and even the best algorithms win some and lose some. It's important not to overfit to a specific distribution.

For example, fully random integers can benefit MSB-radix-type sorts, because it's enough to look at a log(n) prefix of the integer to have the data almost sorted, so the later passes have very predictable behavior.

Also the chance of duplicates is negligible, and duplicates are problematic for some quicksort implementations.

ot · on July 18, 2021

For example, the radix sort implementation is cheating here:

    for (auto& v : buckets) v.reserve((end - begin)/128);

It's assuming uniform distribution of the buckets, so it can make a very accurate guess and avoid extra allocations. If that line is removed for the general case, I would expect things to get a lot slower. I would be curious to see a comparison with a good implementation of in-place radix sort. For example there is one available in boost.

dragontamer · on July 18, 2021

The radix sort isn't really the crux of the article. They're trying to just see how fast radix sort is ("unfairly"), and then try to improve a different comparison-sort instead.

ncmncm · on July 19, 2021

Exactly. Quicksort, also, isn't the crux of the article. Both are ways to sample how the CPU behaves. That is why the quicksort is also not production-grade. If it were more complicated, it would teach us less.

ncmncm · on July 19, 2021

If you are designing a sorting algorithm component for production, it is critical to take into account all the blips and wrinkles that real components will face.

But when you are investigating how and why your CPU has the performance characteristics it has evolved, all those complications directly interfere with learning. The goal here was not to make a production-grade sorting tool; it was to understand what affects performance, using the sorting problem as a microscope.

The method is generally useful. Some years back I spent months refining a one-page program[1] to generate a list of word puzzles. After the first day, the list of puzzles was of no interest, but refining the means to produce it faster taught me a great deal.

[1] https://github.com/ncm/nytm-spelling-bee

inciampati · on July 18, 2021

For sorting, not really. The big development in compute power is parallelism. ips4o facto (https://github.com/SaschaWitt/ips4o), if you want to sort large vectors really fast it makes more sense to sort in-place and in parallel. Parallel in-place radix sort is also choice, but way less flexible than comparison sort.

MrYellowP · on July 18, 2021

Depends on the necessity. If you really need more performance, then they absolutely matter.

In general it makes sense to program for the platform/CPU, not the compiler. Respect the CPU first, the compiler second. This turns into second nature just like everything else.

From my experience, 99% of the people out there disagree, doing so for a good reason. They code for the compiler, not the system. When that's good enough, then that it is.

the_only_law · on July 19, 2021

A couple of dumb questions:

1) In practice, how much of low level optimization is writing code differently, vs compiler tweaking? How much is done by offloading to other (specialized) hardware?

2) How could someone learn to do something like this. I’ve never had any situations where micro optimization is necessary or even useful. Never had to work with extremely large amounts of data, worry about tight timing of concerns, etc. and it seems like outside of an industrial setting it seems like getting access to the types of problems requiring this is the most difficult part. I mean, I suppose simply doing sorts on massive quantities of numbers is one thing, but it feels limited.

ncmncm · on July 19, 2021

All those things are done, mostly at different times by different people.

Optimizations are very slow to move into compilers, but once there improve potentially millions of programs, so there is a lot of value in it. But there is practically no value to be extracted from a better compiler, so the money to pay for that is very limited.

Any optimization that can be pushed into a compiler can be done "by hand" instead. All are done that way at first, and just a few move into compilers, eventually, after enough people get tired of doing them by hand, or seeing them not being done by hand when they ought to be.

The only way anyone learns to do it is to first need to do it. Most people don't need to, and so never do. But the highest-paying jobs are the ones that either make somebody a lot of money or save them a lot of money. The former have more upside; there is no upper limit to how much money could be made. The latter is strictly limited: you can't save anybody more money than the difference between what they are paying and what they pay you. But the latter is easier to get paid for, because people know what they are paying and hate it. Money you save them is money into their pocket over and over again.

Sorting is something people write about optimizing because everybody understands what it means. Few people depend on how fast ordinary sorting is (or, anyway, know they do), but the techniques you need to make sorting fast are the same as you need to make other things fast. Studying lots of optimizations makes you better able to reason about what you can do to optimize.

meisel · on July 18, 2021

It appears that sorting is done on the memory-mapped buffer? That seems like it could have extra costs (lazily paging in from disk) that we wouldn’t see for an in-memory buffer, and which could affect the benchmark.

jandrewrogers · on July 18, 2021

The MAP_POPULATE flag instructs mmap() to load the underlying pages into physical memory immediately. There should be no paging from disk during the benchmark run.

ncmncm · on July 19, 2021

Not only that: after the first run, all the pages are still in memory, for all the subsequent runs. But you are making a private copy each time; that copying shows up as "system" time in the output.

ncmncm · on July 19, 2021

Author here. I didn't know this was on HN until just now.

api · on July 18, 2021

Branches matter a lot in many cases. In some cases compilers are smart enough to optimize them out and in some cases they aren’t.

TchoBeer · on July 18, 2021

The headline is misleading; the article isn't really about whether or not low-level optimisations matter. With that being said, the idea that one of the big speedups that computers have gotten in the past two decades is better caching and branch prediction is interesting, and the idea that we use neural nets for it is known to me but still really cool.

atq2119 · on July 18, 2021

My understanding is that the use of neural nets in branch prediction is greatly overstated for marketing purposes. It's really more that different branch predictors are combined into a single one using a simple function (read: linear combination) that is "learned" (read: coefficients are updated sightly whenever the branch is encountered, based on the predicted vs actual branch direction).

dragontamer · on July 18, 2021

AMD Zen 1 had a neural net branch predictor for real.

But Zen2 / Zen3 don't seem to have this idea referenced anywhere. I can only imagine that AMD managed to beat their neural-net branch predictor with some boring methodology.

EDIT: https://fuse.wikichip.org/news/2458/a-look-at-the-amd-zen-2-...

Wikichip has done an article on it.

ncmncm · on July 19, 2021

They all (still) have it, but it is no longer worth mentioning. Zen 2 has another scheme running at the same time, and they compete. Zen 3 must have refinements, and more bits thrown at it.

ncmncm · on July 19, 2021

Neural nets are themselves greatly overstated for marketing purposes. Any time you have an array of coefficients that are adjusted according to input history and combined for a result, you might as well call it that.

einpoklum · on July 18, 2021

> Do Low-Level Optimizations Matter?

1. They matter a _lot_.

2. (Mentioned but not covered in the article) adaptation to known features of your input distribution also matters a lot.

ncmncm · on July 19, 2021

They matter to the degree a factor-of-two in performance (or -of-ten, or -of-100, for other optimizations) matters. On most problems, for most people, that matters not at all. But there are people who pay a lot of money for smaller gains. Certain people make a lot of money delivering gains.

Often, it didn't matter before, and now does. Facebook was coded in PHP, so they obviously didn't care about performance when they wrote it. But now Facebook is stuck with millions of lines of PHP, and has people going to crazy lengths to get it a few percent faster.

makapuf · on July 18, 2021

The article is indeed interesting underlining several cache miss aspects and performance impacts, but sorting 1e8 ints in 7s with standard lib vs 4s (with complex and subtle insights, and maybe sensitive to cpu arch) might not be worth it if you're dealing with a database, or with not that experienced coworkers, or deaking with a more complex computation. So does 2x matter? The article doesn't say, but it is interesting nonetheless.