OK, owner of the repo here. This project was purely to show the macro differences between interpreted Ruby and compiled Crystal to beginner Ruby devs at a meetup.
I decided to add the top languages on GitHub to help give some idea of how Crystal performed against them.
I'm happy to see so much discussion about it and was taken by surprise when my inbox was full this morning. Thanks @anonfunction. ;-)
The breaking benchmark examples were just that, examples of how to break the benchmark. I didn't expect to get a memoized version of every language and really don't think comparing them in a performance benchmark makes much sense. Let me know if I'm wrong about that.
I am fine adding all of your change requests and will try to keep the benchmarks up to date.
Breaking benchmarks... Is that like the time I fiddled with the corresponding C program trying to get it to run at least in the same ballpark as the Rust version? That was until I determined that the Rust optimizer noted that the results of the 'meat' of the test were never used so it just optimized the whole thing away. :) (I thought it was a little disingenuous for the Rust folks to use this as an example of how performant Rust was.)
I used to benchmark using Fibonacci, but the recursive method is just awful because it does a lot of function calls and you're essentially benchmarking that. Then I switched to finding primes using the Sieve of Sundaram. It uses arrays, hashes/maps/dicts and two loops. It also wastes a lot of memory if you don't split your search domain into ranges. The surprise was that Go and D (in that order) turned out to be faster than Rust, mainly due to Rust's HashMap SipHash algorithm. I gave up trying to use other hash libraries (SeaHash specifically) that are not part of the standard lib, because it was quite frustrating compared to D.
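(For reference, a rough Python sketch of the Sieve of Sundaram idea, not my actual benchmark code; `sundaram_primes` is just an illustrative name. The trick is to mark every i + j + 2ij, and every unmarked m then yields the odd prime 2m + 1.)

def sundaram_primes(limit):
    """Sieve of Sundaram sketch: primes up to `limit`."""
    if limit < 3:
        return [2] if limit >= 2 else []
    k = (limit - 1) // 2                       # odd candidates are 2m + 1 for m in 1..k
    marked = [False] * (k + 1)
    for i in range(1, k + 1):
        j = i
        while i + j + 2 * i * j <= k:
            marked[i + j + 2 * i * j] = True   # 2*(i + j + 2ij) + 1 is composite
            j += 1
    return [2] + [2 * m + 1 for m in range(1, k + 1) if not marked[m]]

print(sundaram_primes(50))   # [2, 3, 5, 7, 11, ..., 47]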
So... before looking at the code, I compared the elapsed time between the fib.rs program compiled with `rustc -O` and `rustc -O -C lto`. This yielded, interestingly, a ~500ms difference: ~6.45s vs ~6.95s (with LTO winning).
Then I looked at the code, and it began to make no sense. Then I looked at the assembly, and it made even less sense: the main function, fib::fib, which is where the vast majority of the time is spent, is identical, except for addresses.
I was starting to think, well, this might be related to instruction-cache lines... and it looks like it's not that:
$ perf stat ./fib-lto
2971215073
Performance counter stats for './fib':
6474.410894 task-clock (msec) # 1.000 CPUs utilized
17 context-switches # 0.003 K/sec
0 cpu-migrations # 0.000 K/sec
112 page-faults # 0.017 K/sec
22,607,598,238 cycles # 3.492 GHz
55,325,512,417 instructions # 2.45 insn per cycle
13,021,149,180 branches # 2011.171 M/sec
76,737,179 branch-misses # 0.59% of all branches
6.474991613 seconds time elapsed
6.474816000 seconds user
0.000000000 seconds sys
$ perf stat ./fib
2971215073
Performance counter stats for './fib':
6956.534790 task-clock (msec) # 1.000 CPUs utilized
11 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
113 page-faults # 0.016 K/sec
24,290,924,647 cycles # 3.492 GHz
55,325,864,841 instructions # 2.28 insn per cycle
13,021,213,404 branches # 1871.796 M/sec
84,392,454 branch-misses # 0.65% of all branches
6.957260487 seconds time elapsed
6.956974000 seconds user
0.000000000 seconds sys
I didn't know an address difference of 0x30 could influence the number of branch misses so much.
Interestingly, the C and C++ versions have different performance for the same reason: they produce the exact same machine code, at different addresses. Thus one runs faster than the other. Edit: actually, there isn't a difference in branch-misses for those... only in cycles, which is even more intriguing.
Another interesting fact: valgrind's branch simulator doesn't show a difference in branch mispredictions between the two (it also shows a misprediction rate much larger than reality; it's likely based on an old model. Edit: the valgrind manual says Cachegrind simulates branch predictors intended to be typical of mainstream desktop/server processors of around 2004).
More edit: I hacked a linker script to place the fib function from the C++ implementation at a fixed address, and tried different addresses, with interesting results:
0x10000: 5.117084095 seconds time elapsed, 17,866,364,962 cycles
0x10010: 5.206242916 seconds time elapsed, 18,090,712,907 cycles
0x10020: 6.096635146 seconds time elapsed, 21,285,484,583 cycles
0x10030: 4.955020420 seconds time elapsed, 17,297,162,836 cycles
0x10040: 5.146043048 seconds time elapsed, 17,954,722,919 cycles
0x10050: 5.252477508 seconds time elapsed, 18,335,804,193 cycles
0x10060: 6.100806292 seconds time elapsed, 21,300,089,284 cycles
0x10070: 4.936397948 seconds time elapsed, 17,216,051,020 cycles
Even more edit: I vaguely remember there was someone doing some analysis (with performance counters) of some similar performance difference depending where the function was located, that was on HN a few months ago, but I can't find it anymore.
Essentially, some of the loop heuristics work on a 32-byte window, not a 16-byte window. An interesting statistic would be to find where all the loops are in the code as relative offsets, and find how these shift based on starting alignment.
Edit: It's interesting to note that there's only one recursive call to the function, not two (you get two calls if you compile with -O1). It's also interesting to note that -Os generates an even smaller version with only one recursion, but that ends up slower (~6.5s).
The compiler has determined that the second call `fib(n - 2)` is "almost" tail-recursive (there is probably a better term for this), and so was able to convert that call into the loop you see at 10081 to 10095. This loop repeatedly calls the `fib(n - 1)` part.
In other words, the function was transformed from:
uint64_t fib2(uint64_t n) {
    if (n <= 1) return 1;
    uint64_t sum = 1;
    for (; n > 1; n -= 2) {
        sum += fib2(n - 1);
    }
    return sum;
}
Unfortunately, the result isn't too great as the initial part of the function is pretty heavy, and this bottlenecks computation. It would be better to inline some copies of fib into itself which might even allow some small amount of CSE, but importantly would reduce the number of calls.
The alignment you show is "special" (fast) because it's the point at which the top of the main loop (10081) is near the start of a 32-byte region, which allows multiple uops to be efficiently dispatched after jumping there.
It looks like gcc is going backwards on this one (at least in performance, the gcc-8 code is smaller): gcc-8, which generates the compact code you show, is the slowest of all the versions I tested (at -O3). Here's what I got:
gcc 4.8: 4.1 s
gcc 4.9: 4.3 s
gcc 5.5: 4.3 s
gcc 7.3: 4.8 s
gcc 8.1: 6.6 s
The difference between the compilers isn't just alignment: gcc-8 is generating very different code than the others, which are generally generating a large fib function with several calls of fib inlined into it, so there are many fewer calls (around 300k calls for gcc 4.8 through 7, but around 3 million for gcc 8).
Many versions of GCC do, including the version of 5.5 I tested above, but the mitigations are optional (enabled by specific command-line arguments) and off by default, so they don't impact this test.
Sorry, but this is not how you do numeric benchmarks. Most real programs that deal with numbers also manipulate data structures and call many standard library methods. If those things are slow, your app is slow. This benchmark doesn't demonstrate anything that would translate into real-life performance.
It's definitely not a realistic measure of overall language performance, but I think it certainly has value if taken with the right caveats. It's probably a reasonably good measure of function call overhead in different languages, and I think it's good for getting a better intuition about the order of magnitude difference to expect from, say, C vs JS (~3x slower than C in simple cases) vs Python (~100x slower than C).
> It's probably a reasonably good measure of function call overhead in different languages
It's actually not. Fibonacci has a near-tail call in it that can convert some of the recursion into a loop. Furthermore, the runtime is dominated by useless recomputation that can be handled by memoization. So the distinction between languages is going to be dominated by their ability to do some moderately complex optimizations rather than by any intrinsic performance characteristics of their implementation.
Moreover, because the code is so small, there are likely to be major side effects as a result of effectively random differences--consider the effects of code placement that cause a spurious difference of about 20% between the C and C++ kernels. The JS version is going to get hit with a deoptimization at the very end (it might not impact performance) because the final result does not fit in an int32_t, and suddenly fib is no longer a well-typed function.
Microbenchmarking is _hard_, and it is all too easy for the microbenchmark to cease measuring the things that you want to measure.
But each has to return to the original caller because a summation still has to happen. The actual tail call is to the +/2 function.
F(N) can't complete until F(N-1) and F(N-2) have completed so it can sum up the values. If you pass the earlier computation, F(N-1), down to F(N-2) as a second parameter, you could get a tail call on the last part.
fib(0, Val) -> Val + 1;
fib(1, Val) -> Val + 1;
fib(N, Val) ->
    T = fib(N-1, Val),
    fib(N-2, T).
(I think I wrote that correctly, can't test here.)
But that's a non-obvious transformation. I really would like to see the compiler that recognized that this was a valid (computational) equivalent to the original.
I was intrigued to learn more about the winning language Nim to see how it beats C/C++, so I did a web search and was confounded to see Nim compiles to C/C++! What gives? Starting to doubt the methodology somewhat, though it was an interesting read.
Languages without tail recursion will perform worse. Memoization is borderline cheating, because it's a different implementation.
Fib is the poster boy for tail recursion but the reason for that is that recursion to implement fib is simply a bad choice. It's cute but that's about it. If the point is to measure function call overhead, then measure _that_?
But recursive Fibonacci isn't tail recursive. The final function call is to `+` (addition), which means that the two recursive calls must each be put on the stack and later returned so the sum can be computed. Tail recursion requires that there is no final operation other than exactly a single recursive call.
It can be a gateway for learning interesting things about the mechanics of a language, its compilation, and in turn how to make similar cases in other languages faster/cleaner/more-secure. e.g. why is Nim so fast in this case? Is it tail call optimisation, not doing overflow checks, static inlining, or something else? Nim compiles to C, so it is especially odd.
Why would languages without tail recursion optimisation perform worse when all languages use the naive implementation? It's not tail recursive, so it shouldn't make a difference, right?
The benchmark includes a C++ constexpr version, at https://github.com/drujensen/fib/blob/master/fib-constexpr.c... , with timing numbers under the section "Optimized code that breaks the benchmark" and the comment "all benchmarks will have some caveat."
Somewhat off topic, but I love the fact that the first post regarding compile-time Fibonacci in C++ was submitted in 2000, another user expanded on the original in 2008, and now it's being usefully referenced in 2018.
I stumbled across Everything2 recently, but in many ways it strikes me as a tiny bit of the golden age of the internet, unexpectedly preserved.
It would be interesting to include skip[1] as it has language level memoization. There was a recent HN discussion about the language [2]. I do not think that the naive recursive Fibonacci is a useful benchmark in any way though, it's a way too pessimized implementation.
> It's available in Haskell as a first class citizen.
Not really. Haskell's laziness makes it easier to write functions that are memoized, but it does not automagically memoize functions. And how could it without incurring a non-trivial runtime space/time cost?
Haskell's purity is what makes it easier to write libraries that facilitate building memoized versions of functions in a transparent way (and that are obviously correct). For instance, I usually reach for [data-memocombinators][0], which happens to have a fibonacci example at the top of the docs:
import qualified Data.MemoCombinators as Memo
fib = Memo.integral fib'
  where
    fib' 0 = 0
    fib' 1 = 1
    fib' x = fib (x-1) + fib (x-2)
[0]: http://hackage.haskell.org/package/data-memocombinators-0.5.1/docs/Data-MemoCombinators.html
But is there any reason the Haskell compiler couldn't decide to memoize a function call itself if it determined that it would speed things up? To me that sounds hard for the compiler to do heuristically but a Haskell compiler should still have more scope to do optimizations like that than a C compiler has.
> I do not think that the naive recursive Fibonacci is a useful benchmark in any way though, it's a way too pessimized implementation.
probably, but I would still like to see more benchmarks that use recursion. It's a fundamental technique from functional programming and so few benchmarks bother with it at all.
Why do we need language level memoization again? In Python memoizing a function is just one line, `@lru_cache()`. I can't see why this needs an improvement by adding yet another thing to the core language.
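For reference, that one-liner looks roughly like this (a minimal sketch using Python 3's functools.lru_cache, with the benchmark's fib(0) = fib(1) = 1 convention):

from functools import lru_cache

@lru_cache(maxsize=None)        # cache every result; bare @lru_cache() keeps the last 128
def fib(n):
    if n <= 1:
        return 1
    return fib(n - 1) + fib(n - 2)

print(fib(46))   # 2971215073, effectively instant instead of billions of recursive calls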
Just adding a @lru_cache() to a Python function isn't guaranteed to be correct; for example, if you're reading from an API the response could change between calls.
Skip tracks side effects, and will either (a) memoize automatically a pure function or (b) recognize that a function is impure and avoid the memoization.
Some time ago, Facebook introduced a JavaScript compiler which pre-compiled intermediate values. I think the point was to extend the spirit to backend apps. If memoization is first class, implementing pre-compilation should be a blast.
During my last technical interview, they asked me to write a Fibonacci implementation in Go. I wrote the code, then they asked me to test it with something like fib(180), and I was surprised how slow it was. Then I was asked to come up with an optimisation to make it fast.
I came up with exactly the same fib-mem.go provided in this benchmark.
I was hired!
I guess if the point is to optimize fibonacci then wouldn't the smartest move be to use the closed form solution where Fib(n) is the closest integer to phi^n / sqrt(5), where phi = (1 + sqrt(5))/2?
This is one of the examples in the beginning of SICP, if I remember correctly.
The identity involves going through real numbers to get the integer result, and if you're using floating point or fixed point calculations, you need a good amount of accuracy to get the correct integer out, which also won't be the fastest thing. Using https://en.wikipedia.org/wiki/Fibonacci_number#Matrix_form you can get a different way to calculate Fibonacci numbers recursively, which involves far fewer computations.
Fundamentally, the thing to do is to understand the N-th Fibonacci number as a + b, where φ^N = a + b * φ, where a and b are natural numbers and φ^2 = φ + 1.
It's not important to understand phi as a floating point value; we can just work in the abstract arithmetic where we've added to the natural numbers a new entity phi defined to satisfy φ^2 = φ + 1 (just like complex numbers are the abstract arithmetic where we've added to ordinary arithmetic an i defined to satisfy i^2 = -1).
Call these Fomplex numbers; a Fomplex number a + b * φ amounts to just a pair of natural numbers, and it's easy enough to add and multiply them with natural number arithmetic. To calculate the N-th Fibonacci number, just calculate φ^N and add together its coefficients, as noted. As for how to efficiently calculate φ^N, use the usual addition chain approach to exponentiation (e.g., "repeated squaring"), thus getting a result in Θ(log N) many additions and multiplications.
This is incidentally the same as the matrix approach, essentially, but perhaps a cleaner perspective on it; at any rate, it is a way of thinking which will serve as a useful tool in your back pocket for other general problems about linear recurrences.
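To make that concrete, here is a minimal Python sketch of the idea (the pair (a, b) stands for a + b*φ with φ² = φ + 1; the helper names are mine, and it uses the thread's f(0) = f(1) = 1 convention):

def mul(x, y):
    # (a + b*phi)(c + d*phi) = ac + (ad + bc)*phi + bd*phi^2
    #                        = (ac + bd) + (ad + bc + bd)*phi, since phi^2 = phi + 1
    a, b = x
    c, d = y
    return (a * c + b * d, a * d + b * c + b * d)

def phi_pow(n):
    """phi^n by repeated squaring: Theta(log n) multiplications."""
    result, base = (1, 0), (0, 1)        # 1 and phi
    while n:
        if n & 1:
            result = mul(result, base)
        base = mul(base, base)
        n >>= 1
    return result

def fib(n):
    a, b = phi_pow(n)
    return a + b                          # phi^n = a + b*phi  =>  fib(n) = a + b

print(fib(46))      # 2971215073
print(fib(10))      # 89, i.e. the sequence 1, 1, 2, 3, 5, ...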
How neat! Among its properties not the least interesting is that it has finally convinced me that the initial condition f(0) = f(1) = 1 is not arbitrary but perfectly natural.
Ah, but actually that part is still a little arbitrary. This technique is fully general; variants describe ANY linear recurrence with ANY initial values. In particular, the Nth value of the Fibonacci-type sequence with 0th value x and 1th value y is xa + yb where a + bϕ = ϕ^N. I just happen to have chosen the weights x and y to both be 1 in that discussion.
Or you write an iterative version of it that will not only be the simplest solution, it will be fast enough to compute fib(10000). There are constant time ones IIRC, but if someone comes up with such a solution in an interview without knowing it from before, they are probably overqualified for any job that uses the Fibonacci sequence as an interview question.
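For instance, the obvious bottom-up loop is a few lines in Python and handles fib(10000) comfortably (a sketch, using the benchmark's fib(0) = fib(1) = 1 convention):

def fib(n):
    a, b = 1, 1                 # fib(0), fib(1)
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(46))                  # 2971215073
print(len(str(fib(10000))))     # ~2090 digits, still effectively instant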
Any matrix implementation (with fast exponentiation) is going to be much faster, even in a "slow" language. For instance, in Ruby, `Matrix[[1,1],[1,0]] ** 46` is instantaneous (real time: 0.000131s).
We have to go up to the ten-millionth term to get something that is not instantaneous (0.486s), for a result which has 2089876 digits.
Are you sure `Matrix[[1,1],[1,0]] ** 46` isn't optimized into a constant value by the bytecode compiler? Idk Ruby much, but it seems like a possibility.
Nice to see the different implementations side-by-side. I do believe however that this is mainly a benchmark of startup time and initial JIT of the CRT/runtime/virtual machine/execution environment.
If you remove that time from the equation, I expect that the execution time will not differ a lot from each other.
Of all the benchmarks out there this is actually reasonable (though obviously limited). It runs the same thing in different languages and the processing time is long enough for VM startup not to matter.
Also, the performance shows the bytecode languages C# and Java doing fairly well compared to C and Go.
That has mostly to do with the fact that the code isn't manipulating objects at all. As soon as you start pointer chasing in languages where almost everything is an object, like JavaScript, performance will suffer.
I thought so as well, and benchmarked the Julia version inside the REPL (after calling the function once to make sure JIT compilation was done), and the difference was insignificant (maybe 1 to 5 percent - basically same order of magnitude as the measurement error, ie usual run time fluctuations). So Julia runtime was still a bit less than 2x the C/C++ runtime (but faster than C/C++ without the -O3 switch).
This shouldn't be surprising though. The point of Julia is that compilation time stays relatively constant while runtimes can grow enormous very easily. Normally with microbenchmarks you run multiple times to get rid of compilation time, because microbenchmarks run in <1 second so compile time matters. But this shows that, in any case where the user does begin to care about speed, compilation time is really minimal. The only place where it truly matters in practice is in the REPL: it can cause a bit of lag which can get annoying, but it's a tradeoff.
rjmacmini:~$ sbcl
This is SBCL 1.4.2, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (defun fib (n)
    (declare (fixnum n)
             (optimize (speed 3) (debug 0) (safety 0)))
    (if (<= n 1)
        1
        (the fixnum
             (+ (fib (- n 1))
                (fib (- n 2))))))
FIB
* (time (fib 46))
Evaluation took:
14.957 seconds of real time
14.947616 seconds of total run time (14.934818 user, 0.012798 system)
99.94% CPU
38,799,794,836 processor cycles
0 bytes consed
2971215073
* (SAVE-LISP-AND-DIE "/tmp/fiblisp" :toplevel (lambda (&rest args) (print (fib 46))) :executable t)
; in: SAVE-LISP-AND-DIE "/tmp/fiblisp"
; (LAMBDA (&REST ARGS) (PRINT (FIB 46)))
;
; caught STYLE-WARNING:
; The variable ARGS is defined but never used.
;
; compilation unit finished
; caught 1 STYLE-WARNING condition
[undoing binding stack and other enclosing state... done]
[defragmenting immobile space... 643+15263+734+344+25029+16756 objects... done]
[saving current Lisp image into /tmp/fiblisp:
writing 0 bytes from the read-only space at 0x20000000
writing 848 bytes from the static space at 0x20100000
writing 1863680 bytes from the immobile space at 0x20300000
writing 11472480 bytes from the immobile space at 0x21b00000
writing 26542080 bytes from the dynamic space at 0x1000000000
done]
rjmacmini:~$ time /tmp/fiblisp
2971215073
real 0m14.785s
user 0m14.755s
sys 0m0.020s
rjmacmini:~$
I don't think adding debug/safety 0 is really worth it. In this:
(declaim (optimize speed)
         (ftype (function (fixnum) fixnum) fib))

(defun fib (n)
  (if (<= n 1)
      1
      (+ (fib (- n 1))
         (fib (- n 2)))))

(print (fib 46))
adding `(safety 0) (debug 0)` took the time from 13.17s (with just the `speed` declaration) down to 12.94s for me. Is a 2% speed increase really worth the danger of `(safety 0)`?
EDIT: After some disassembly and experimenting, it seems like what we see here is some clever unrolling and memoisation. If I change 46 from a constant to a variable, it becomes twice as slow. Plus there is this in disasm:
Microbenchmarks like this can be difficult to perform in practice, as gccgo can perform optimizations on pure functions that prevent Go's benchmarks from actually fully repeating a test. Also, this is a special case in which gccgo shines, specifically because it is completely cpu-bound. Generally, there isn't nearly as much of a performance difference.
~$ sbcl
This is SBCL 1.4.11, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (defun fibonacci-tail-recursive (n &optional (a 1) (b 1))
    (declare (optimize (speed 3) (safety 0) (debug 0))
             (type fixnum n a b))
    (if (< n 1)
        a
        (fibonacci-tail-recursive (- n 1) b (+ a b))))
FIBONACCI-TAIL-RECURSIVE
* (time (fibonacci-tail-recursive 46))
Evaluation took:
0.000 seconds of real time
0.000001 seconds of total run time (0.000001 user, 0.000000 system)
100.00% CPU
2,513 processor cycles
0 bytes consed
2971215073
*
Because Fortran was listed as x.xxx, I figured I'd try it.
Taking the minimum of 100 runs of gcc/gfortran, 200 runs of g++, and 4 runs of Julia 1.1-dev:
gcc: 3.4866
g++: 3.4428
gfortran: 3.4953
Julia: 8.0378
I was doing other things while the benchmarks ran, which added noise. For the first run of g++, the mean was slightly higher than C's mean, while the standard deviation was much higher than C or Fortran's. So I reran it for C++, and then decided to just report the minimums for everything instead, because the minimum is probably the least biased measure.
So long as there are no allocations that occasionally get cleaned up by a garbage collector. If there were, the GC cost should get amortized over all the runs that contributed to it. We don't have to worry about this here.
While the C and C++ assembly for the fib function was the same, Fortran's was different.
Julia's assembly was extremely brief in comparison, because it doesn't have any recursion optimizations -- it is just the check and then two calls to itself.
Yes, Julia doesn't do tail call optimizations (TCO) which it could do with LLVM. Just not enabled yet. I find it funny when people try to say that Julia's website benchmarks are cherrypicked to look good when the first example shows that it's tracking an optimization which it's missing...
One can improve the Java result by allowing the JIT compiler to inline the recursive calls.
-XX:MaxRecursiveInlineLevel=1 (default): 6.667 s
-XX:MaxRecursiveInlineLevel=2: 6.141 s
-XX:MaxRecursiveInlineLevel=3: 5.768 s
-XX:MaxRecursiveInlineLevel=4: 5.400 s
-XX:MaxRecursiveInlineLevel=5: 5.361 s
-XX:MaxRecursiveInlineLevel=6: 5.072 s
What’s the point of even mentioning the memoized versions? OF COURSE the naive recursive version is slow and it’s very easy to write something much faster in any language whatsoever.
I would submit an entry for my own project, Snabl [0]; but it intentionally doesn't support the algorithm since naive recursion beyond a few levels doesn't make much sense in practice. I added a separate keyword to force tail calls since that's usually what you want.
- I thought that in Julia, startup and compile time could be a factor, but they're pretty negligible (maybe 1 to 5% or so of the runtime, for Julia 0.7).
- Without the -O3 switch, the runtime more or less doubles for C and C++, making these languages slower than Swift, Go, Java, Dart, Julia, etc. That surprised me.
Yeah, startup and compilation time in Julia is not a big deal. One could even make the compiler work a bit harder, but you would have to ask a lawyer if this still counts as recursion ;-)
julia> function fib(::Val{n}) where n
           if n <= 1 return 1 end
           return fib(Val(n - 1)) + fib(Val(n - 2))
       end
julia> fib(n) = fib(Val(n))
julia> @time fib(46) # Compilation and execution
0.095409 seconds
As for me, these benchmarks are strange because there are solutions that are not optimized -- in the Node.js implementation the memoized approach has been used (which outperforms the recursion), but in the Java solution (as an example) the recursive approach has been implemented.
I think the intent is that the main implementations are those in files like `[Ff]ib.*`. There are additional memoized implementations for some languages, but presumably these are only used in the "Optimized code that breaks the benchmark" section.
Agreed. Everybody uses TCO in Elixir. It's the standard way to write servers and store state. So it makes sense to use it in the benchmark, or it will be very un-Elixir.
There's a PR there to do just that. I don't follow the reasoning for why it hasn't been accepted. It's also using Elixir script instead of precompiled Elixir.
The algorithm being benchmarked is the naive double-recursive fibonacci algorithm, which ends up performing over a billion addition operations in this case. The tail-recursive function above is the iterative bottom-up fibonacci algorithm that does it in about 50 addition operations. It's true that both use recursion, but the second is definitely not the tail-recursive version of the first.
This reduces the number of total function calls but not the number of additions (since it's not changing the algorithm, just the function invocation strategy when running the algorithm).
Something isn't quite right. Nim 'compiles' into C code ready for compilation. It should not be faster than the raw C and C++ implementation. Frankly even the C/C++ implementation shouldn't be so different.
I think Rust also does not have tail call optimisation. Nim and Crystal, I'd heard, rely on the underlying compilers for it, and could sometimes fail to get it (perhaps more often than you'd expect when using a compiler more "directly"?).
-- Memoized variant is near instant even after 10000
memoized_fib :: Int -> Integer
memoized_fib = (map fib [0 ..] !!)
  where fib 0 = 0
        fib 1 = 1
        fib n = memoized_fib (n-2) + memoized_fib (n-1)
or
fibM = \n -> values !! n
  where values = [fibAux m | m <- [0..]]
        fibAux n | n <= 1 = n
                 | otherwise = fibM (n-2) + fibM (n-1)
or
fibM2 :: Int -> Integer
fibM2 = \n -> values !! n
  where values = [fibAux m | m <- [0..]]
        fibAux 0 = 0
        fibAux 1 = 1
        fibAux n = fibM2 (n-2) + fibM2 (n-1)
Run the following at the ghci Haskell prompt:
memoized_fib 47
fibM 47
fibM2 47
If you want to wait try:
-- Traditional implementation of fibonacci, hangs after about 30
slow_fib :: Int -> Integer
slow_fib 0 = 0
slow_fib 1 = 1
slow_fib n = slow_fib (n-2) + slow_fib (n-1)
I'd be interested to see how Elixir does with an OTP-optimised redo. Probably still not great, but I feel like the example isn't playing to its strengths.
OTP allows for work to be distributed across cores and get a speedup in some way proportional to that. It wouldn't be a fair comparison though, exactly like memoisation or constexpr.
It's also an option that isn't unique to Elixir. Go is the obvious other candidate where the programming language helps, but all of the languages have support for parallelism.
OTP won't help. The benchmark itself is too synthetic to really matter as an analog for real world use cases where you would deploy Elixir (or Go or Swift for that matter, if we're being honest). Elixir cares about developer ease, high uptime, lots of connections (network I/O). Do you need these things? Then you shouldn't care about this benchmark.
That's obviously a better choice if you actually want Fibonacci numbers, but it defeats the point of a benchmark if you implement a completely different algorithm in one of the languages.
What is OTP? Fibonacci optimised by memoizing already calculated values is linear time, and constant time if you use the analytical formula for it. But then it stops being an interesting micro benchmark (as is pointed out on the website).
Since we’re talking about recursion, it would be neat to add benchmarks for languages that encourage recursion over iteration (not top 10, but could be fun):
You can't really compute it in constant time since the number of bits in the nth Fibonacci number is O(n), so you need to take at least that long just to write the result out.
Computing with Binet's formula is also rather tricky. You just need to round φ^n/√5, but how many bits of √5 do you need to use?
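As a rough illustration of how soon plain double precision gives out (a sketch in Python; the helper names are mine and the exact crossover depends on the platform's floating-point rounding):

from math import sqrt

def fib_exact(n):               # standard F(0) = 0, F(1) = 1 here
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fib_binet(n):
    phi = (1 + sqrt(5)) / 2
    return round(phi ** n / sqrt(5))

# Find the first n where 64-bit floats no longer give the right integer.
n = 0
while fib_binet(n) == fib_exact(n):
    n += 1
print("Binet's formula in 64-bit floats first disagrees at n =", n)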
This is bullshit.
IMHO, the recursive pattern is one of the worst constructs, this is really bad engineering and should not be used for benchmarking...
So you're saying never use recursion? Or never benchmark it? I don't get it, it's a tool; a pretty good tool when combined with tail call optimization. Yes, the algorithm they use is naive; I believe that's part of the point.
Where did you hear that? 3 hasn’t even been competitive vs 2 until very recently.
Reasons for the switch have always been better Unicode handling and not being left behind as the community marches on (and, soon, the lack of security updates for 2).
I'm just going to put this out there: this benchmark is silly, and only good for making languages with specific types of calling styles look good. It doesn't predict any real world performance. It doesn't speak to real world optimizer outcomes. It really just measures how shallow your call abstraction is. Several languages here perform much worse than equivalent code because the way the language interprets and executes naive recursion is different.
I've seen this game before, so the very first place I looked was the Haskell version. Sure enough, it doesn't even try to force the call graph at all. There's even a section entitled "breaks the benchmark", and it doesn't note that Haskell and other languages with different call semantics are not doing the same thing at all. It just says Haskell doesn't terminate.
Confusingly, there is then a "mem" benchmark which seems to ask "What happens if we do this with even the vaguest damn about algorithmic complexity?" But these are so all-over-the-map they don't even come close to measuring the same thing and have different memory usage.
What's frustrating about this is that the double-uncached is always the wrong way to write this code. It's never good, it really doesn't benchmark anything real world. It's not even a very good compiler benchmark because many optimizers actually go under the hood and rewrite code that is in this obvious style to be something else entirely, just to do better on these benchmarks.
For the Haskell benchmark as an example, the best way to write this lazily is to write a function that consumes and drops a list like so:
fibF n = head $ drop n fibs
  where
    fibs = 0 : 1 : next fibs
    next (first : rest) = (first + head rest) : next rest
-- Or using a library function most Haskell FP devs know.
fibZip :: Int -> Int
fibZip n = head $ drop n fibz
  where fibz = 0 : 1 : zipWith (+) fibz (tail fibz)
This is fast (it gets fused down to essentially a for loop in Haskell, and others like JavaScript & Clojure can use this technique for a memory-efficient approach), it's very straightforward, and it's also completely outside the world of something you can write naturally in C or Java (there is a natural Golang expression, but I don't see people using it).
This approach isn't even particularly fair to other languages that are good at producing performant machine code, like Rust.
These games are not really indicative of anything. They waste energy. Folks should understand what their language runtimes and compilers are capable of, rather than asking, "How well does this fare on the worst possible algorithm for an operation whose semantics are only loosely defined?"
I couldn't find a list of the hardware used in the benchmarks, so comparing is difficult, though before testing, I'd lean towards agreeing with you. Luajit is often on par with C, D or Go.
However, as C was one of the faster, I'll use it as a comparison.
fib.c, compiled with -O3: 10.49user 0.28system 0:11.62elapsed
#include <stdio.h>

long fib(long n) {
    if (n <= 1) return 1;
    return fib(n - 1) + fib(n - 2);
}

int main(void) {
    printf("%li\n", fib(46));
    return 0;
}
Lua:
function fib(n)
  if n <= 1 then
    return 1
  else
    return fib(n - 1) + fib(n - 2)
  end
end
print(fib(46))
Luajit: 48.66user 0.11system 0:52.95elapsed
Lua 5.3: 717.21user 8.40system 14:19.47elapsed
Luajit was so much slower than C for this, which can be somewhat surprising.
Luajit would probably beat Ruby for its interpreted crown, but without optimisation, it won't beat the big boys.
I would say, that Lua isn't a good fit for solving this kind of problem, with these constraints, because every function call requires a hash lookup, which is irritating.
Of course, you could use Luajit's FFI to use C's implementation, which would be somewhat faster. Or expose the C implementation as a Lua library.
However, Lua is probably also a really good fit for memoization, and other techniques like that.
local nums = {}
local fib
fib = function(n)
  if n <= 1 then
    return 1
  else
    if nums[n] then
      return nums[n]
    else
      nums[n] = fib(n - 1) + fib(n - 2)
      return nums[n]
    end
  end
end
print(fib(46))
This is a fairly naive implementation, but it has the same final result as the previous examples... And 'time' is unable to measure how fast it is (both Luajit and Lua 5.3). For all intents and purposes, it's instant.
Using generators in Python you can do this much much faster :D.
#!/usr/bin/env python
"""
Calculate fibonacci numbers using a generator

Usage: fibonacci.py [options] <N>

Arguments:
    <N>  Print all Fibonacci numbers up to <N>-th

Options:
    -h, --help  This help
"""
from docopt import docopt

def fibonacci():
    """generate fibonacci numbers"""
    a, b = 0, 1
    while 1:
        yield a
        a, b = b, a + b

if __name__ == '__main__':
    try:
        args = docopt(__doc__)
        fib = fibonacci()
        for i in range(int(args["<N>"])):
            print fib.next()
    except ValueError:
        print "You must specify valid integer"
    except KeyboardInterrupt:
        print "Good-bye"
I'm perfectly aware it's meant to be recursive, but it's also a completely pointless test. It's not tail recursive, so you are measuring function call overhead in various languages. Various languages have options to perform much faster anyway, but the solution is of course going to be language specific.