My takeaway is that the reason this is possible is that they care about data structures. A language can give you an order-of-magnitude performance improvement, but (according to Ville) you can get an almost infinite improvement if you rethink the algorithms and data structures.
Cool use of Numba. Does anyone know if there's more information available about their query language and what kinds of Python expressions they dynamically generate?
From my understanding of the video, analysts issue queries either through a frontend that generates SQL or by writing SQL by hand. The SQL query is parsed by PostgreSQL, which forwards it via FDW[1] to Multicorn[2]. Their custom data storage and processing backend implements the API expected by Multicorn (i.e., you implement the multicorn.ForeignDataWrapper interface); this is where they transform the parsed, serialized SQL into their custom DSL (the metaprogramming bit), which compiles to LLVM.
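For anyone curious about the Multicorn side, the wrapper is roughly this shape. This is a minimal sketch based on Multicorn's documented ForeignDataWrapper interface, not their actual code; run_compiled_query and the "table" option are made-up stand-ins for their storage/compilation backend:

    from multicorn import ForeignDataWrapper

    def run_compiled_query(table, columns, plan):
        # Stub for the real backend: a real implementation would generate
        # code from the plan, compile it (e.g. with Numba/LLVM), and scan
        # the stored data. Here we just return one empty row.
        return [{c: None for c in columns}]

    class LogEventFDW(ForeignDataWrapper):
        # PostgreSQL parses the SQL and hands Multicorn the requested
        # columns plus the WHERE-clause qualifiers ("quals").
        def __init__(self, options, columns):
            super(LogEventFDW, self).__init__(options, columns)
            self.columns = columns
            self.table = options.get("table", "events")  # hypothetical option name

        def execute(self, quals, columns):
            # Translate the quals into whatever the backend's DSL expects.
            plan = [(q.field_name, q.operator, q.value) for q in quals]
            for row in run_compiled_query(self.table, columns, plan):
                yield row  # Multicorn expects dicts of column name -> value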
Yeah, that's pretty much how it works. The frontend is anything that supports PostgreSQL as a database. Right now we use Tableau, but we used to have a custom WebUI on top of this service; it was very functional-inspired.
Unfortunately, Ville is on vacation right now; otherwise he'd be glad to dive deeper into the details of how that piece worked.
Any interest in also trying Parakeet (https://github.com/iskandr/parakeet) for the backend? I'm curious to see how the performance would compare with Numba. I also have a semi-usable Builder API which constructs typed functions at a higher level than llvmpy.
Thanks! Our approach supports both discrete and continuous values. It is mainly optimized for the use case where you want to aggregate continuous variables over discrete filters.
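To make "aggregate continuous variables over discrete filters" concrete, the compiled kernels presumably end up looking something like this. Just an illustrative Numba sketch under that assumption, not their code:

    import numpy as np
    from numba import njit

    @njit
    def sum_where(values, codes, wanted_code):
        # Sum a continuous column over the rows whose discrete code
        # matches the filter; after JIT compilation this is one tight loop.
        total = 0.0
        for i in range(values.shape[0]):
            if codes[i] == wanted_code:
                total += values[i]
        return total

    # Toy columns: float measurements plus small-integer category codes.
    values = np.random.rand(1000000)
    codes = np.random.randint(0, 16, size=1000000)
    print(sum_where(values, codes, 3))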
Sorry, not yet. I hope to be able to make more information available soon.
The query language is very straightforward. More interestingly, this approach makes it easy to implement various algorithms for machine learning / data mining, at least compared to MapReduce.
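Since a couple of people asked what the dynamically generated Python might look like: here's a toy version of the idea, where a tiny query spec is turned into Python source and JIT-compiled with Numba. Purely illustrative; compile_filter and the generated kernel are invented for this sketch and say nothing about their actual DSL:

    import numpy as np
    from numba import njit

    def compile_filter(op, threshold):
        # Generate Python source for a filtered aggregation, exec it,
        # and hand the resulting function to Numba for compilation.
        template = (
            "def kernel(col):\n"
            "    out = 0.0\n"
            "    for i in range(col.shape[0]):\n"
            "        if col[i] %s %s:\n"
            "            out += col[i]\n"
            "    return out\n"
        )
        src = template % (op, threshold)
        namespace = {}
        exec(src, namespace)
        return njit(namespace["kernel"])

    kernel = compile_filter(">", 10.0)          # e.g. sum(col) over rows where col > 10
    print(kernel(np.random.rand(1000) * 20.0))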
So, in essence, they do tons of pre-processing on their data. I wonder how long the pre-processing takes compared to the speed gains it buys them.
It doesn't take that long, actually: about 1 hour per day of data, and every day we process about 10TB of uncompressed log files. The result can be stored and reused as many times as you need.