I'm sure I'm in the wrong here, but I'm struggling to understand Arrow's USP. I (originally) assumed it meant Python/R users would be able to get around memory limitations when model-fitting, but all the examples I've come across are just data manipulation, and none of the main modeling packages support it. For those who are using it: what am I missing?
Arrow eliminates serialisation/deserialisation ("ser/der"), and if all actors in the workflow use the format you can see drastic performance improvements for a wide variety of workloads. I've seen ser/der for multi-GB+ processes take up half of the total clock time of the task.
Arrow is a language-independent memory layout. It’s designed so that you can stream memory from (for example) a Rust data source to Spark, DataFusion, Python, or whatever else, with faster throughput, support for zero-copy reads, and no serialisation/deserialisation overhead. Having the same memory model also ensures better type and layout consistency, and means that query engines can get on with optimising and running queries rather than having to worry about IO optimisation as well.
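To make the zero-copy / no-ser-der point concrete, here's a minimal Python sketch using pyarrow (the table contents and names are made up for illustration). The producer writes the table's in-memory layout to an Arrow IPC stream, and the consumer reconstructs it without a per-value decode step; any other Arrow implementation (Rust, Java, R, ...) could read the same bytes.

    import pyarrow as pa

    # Producer: build a table and write its memory layout to an IPC stream.
    table = pa.table({"id": list(range(1_000_000)), "value": [0.5] * 1_000_000})
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    buf = sink.getvalue()

    # Consumer: reconstruct the table straight from the buffer. The columns
    # reference slices of the buffer rather than being decoded value by value.
    reader = pa.ipc.open_stream(buf)
    same_table = reader.read_all()
    print(same_table.num_rows)  # 1000000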
I’m using DataFusion (via Rust) and it’s pretty fantastic. Would love to swap out some Spark stuff for DataFusion/Ballista stuff at some point as well.
As a data scientist, I also found this pretty confusing, so I spent some time trying to understand it better. I wrote it up as a blog post:
Demystifying Apache Arrow
https://www.robinlinacre.com/demystifying_arrow/
My use case is that Arrow preserves all data types when it dumps the in-memory table to disk, which lets me back up my work and later reload the data and keep going. Reading and writing the data is very fast, and it’s much better than using HDF5 for me in that regard.
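For what it's worth, a minimal sketch of that checkpoint workflow with pyarrow's Feather (Arrow IPC) format; the file name and columns are just placeholders:

    import pandas as pd
    import pyarrow.feather as feather

    # Save a checkpoint of the in-memory data; dtypes survive the round trip.
    df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})
    feather.write_feather(df, "checkpoint.feather")

    # ...later, in a new session: reload and keep going.
    df_again = feather.read_feather("checkpoint.feather")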
And in most cases, if you memory-map the file (mmap on Linux/BSD, MapViewOfFile on Windows), it’s way faster than reading it: you only ever read the pages you actually need, and they stay in the OS cache between invocations.
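A sketch of the memory-mapped read path in pyarrow (the file name and data are hypothetical): the table's columns end up backed by the mapped file, so only the pages you actually touch get read, and they stay warm in the page cache between runs.

    import pyarrow as pa

    # Write a table once in the Arrow IPC file format.
    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    with pa.OSFile("data.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map it on read: nothing is copied into process memory up front.
    with pa.memory_map("data.arrow", "r") as source:
        mapped = pa.ipc.open_file(source).read_all()
        print(mapped.num_rows)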