My takeaway is that the reason this is possible is that they care about data structures. A language can give you an order-of-magnitude performance improvement, but (according to Ville) you can get an almost infinite improvement if you rethink the algorithms and data structures.
Cool use of Numba. Does anyone know if there's more information available about their query language and what kinds of Python expressions they dynamically generate?
From my understanding of the video, analysts issue queries either through a frontend that generates SQL or by writing SQL by hand. The SQL query is parsed by PostgreSQL, which forwards it via FDW[1] to Multicorn[2]. Their custom data storage and processing backend implements the API expected by Multicorn (i.e., you implement the multicorn.ForeignDataWrapper interface); this is where they transform the parsed, serialized SQL into their custom DSL (the metaprogramming bit), which compiles to LLVM.
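For anyone curious about the Multicorn side, the wrapper is roughly this shape. This is a minimal sketch based on Multicorn's documented ForeignDataWrapper interface, not their actual code; run_compiled_query and the "table" option are made-up stand-ins for their storage/compilation backend:

    from multicorn import ForeignDataWrapper

    def run_compiled_query(table, columns, plan):
        # Stub for the real backend: a real implementation would generate
        # code from the plan, compile it (e.g. with Numba/LLVM), and scan
        # the stored data. Here we just return one empty row.
        return [{c: None for c in columns}]

    class LogEventFDW(ForeignDataWrapper):
        # PostgreSQL parses the SQL and hands Multicorn the requested
        # columns plus the WHERE-clause qualifiers ("quals").
        def __init__(self, options, columns):
            super(LogEventFDW, self).__init__(options, columns)
            self.columns = columns
            self.table = options.get("table", "events")  # hypothetical option name

        def execute(self, quals, columns):
            # Translate the quals into whatever the backend's DSL expects.
            plan = [(q.field_name, q.operator, q.value) for q in quals]
            for row in run_compiled_query(self.table, columns, plan):
                yield row  # Multicorn expects dicts of column name -> value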
Yeah, that's pretty much how it works. The frontend is anything that supports PostgreSQL as a database. Right now we use Tableau, but we used to have a custom WebUI on top of this service; it was very functional-inspired.
Unfortunately, Ville is on vacation right now; otherwise he'd be glad to dive deeper into the details of how that piece worked.
Any interest in also trying Parakeet (https://github.com/iskandr/parakeet) for the backend? I'm curious to see how the performance would compare with Numba. I also have a semi-usable Builder API which constructs typed functions at a higher level than llvmpy.
Thanks! Our approach supports both discrete and continuous values. It is mainly optimized for the use case where you want to aggregate continuous variables over discrete filters.
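To make "aggregate continuous variables over discrete filters" concrete, the compiled kernels presumably end up looking something like this. Just an illustrative Numba sketch under that assumption, not their code:

    import numpy as np
    from numba import njit

    @njit
    def sum_where(values, codes, wanted_code):
        # Sum a continuous column over the rows whose discrete code
        # matches the filter; after JIT compilation this is one tight loop.
        total = 0.0
        for i in range(values.shape[0]):
            if codes[i] == wanted_code:
                total += values[i]
        return total

    # Toy columns: float measurements plus small-integer category codes.
    values = np.random.rand(1000000)
    codes = np.random.randint(0, 16, size=1000000)
    print(sum_where(values, codes, 3))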
Sorry, not yet. I hope to be able to make more information available soon.
The query language is very straightforward. More interestingly, this approach makes it easy to implement various algorithms for machine learning / data mining, at least compared to MapReduce.
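Since a couple of people asked what the dynamically generated Python might look like: here's a toy version of the idea, where a tiny query spec is turned into Python source and JIT-compiled with Numba. Purely illustrative; compile_filter and the generated kernel are invented for this sketch and say nothing about their actual DSL:

    import numpy as np
    from numba import njit

    def compile_filter(op, threshold):
        # Generate Python source for a filtered aggregation, exec it,
        # and hand the resulting function to Numba for compilation.
        template = (
            "def kernel(col):\n"
            "    out = 0.0\n"
            "    for i in range(col.shape[0]):\n"
            "        if col[i] %s %s:\n"
            "            out += col[i]\n"
            "    return out\n"
        )
        src = template % (op, threshold)
        namespace = {}
        exec(src, namespace)
        return njit(namespace["kernel"])

    kernel = compile_filter(">", 10.0)          # e.g. sum(col) over rows where col > 10
    print(kernel(np.random.rand(1000) * 20.0))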
So, in essence, they do tons of pre-processing on their data. I wonder how long the pre-processing takes compared to the speed gains it buys them.
It doesn't take that long, actually: about 1 hour per day of data, and every day we process about 10TB of uncompressed log files. The result can be stored and reused as many times as you need.