I'm sure I'm in the wrong here, but I'm struggling to understand Arrow's USP. I (originally) assumed it meant Python/R users would be able to get around memory limitations when model-fitting, but all the examples I've come across are just data manipulation, and none of the main modeling packages support it. For those who are using it: what am I missing?
Arrow eliminates serialisation/deserialisation ("ser/der"), and if all actors in the workflow use the format you can see drastic performance improvements for a wide variety of workloads. I've seen ser/der for multi-GB+ processes take up half of the total clock time of the task.
Arrow is a language-independent memory layout. It’s designed so that you can stream memory from (for example) a Rust data source to Spark, DataFusion, Python, or whatever else, with faster throughput, support for zero-copy reads, and no serialisation/deserialisation overhead. Having the same memory model also ensures better type and layout consistency, and means that query engines can get on with optimising and running queries rather than having to worry about IO optimisation as well.
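To make the zero-copy / no-ser-der point concrete, here's a minimal Python sketch using pyarrow (the table contents and names are made up for illustration). The producer writes the table's in-memory layout to an Arrow IPC stream, and the consumer reconstructs it without a per-value decode step; any other Arrow implementation (Rust, Java, R, ...) could read the same bytes.

    import pyarrow as pa

    # Producer: build a table and write its memory layout to an IPC stream.
    table = pa.table({"id": list(range(1_000_000)), "value": [0.5] * 1_000_000})
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    buf = sink.getvalue()

    # Consumer: reconstruct the table straight from the buffer. The columns
    # reference slices of the buffer rather than being decoded value by value.
    reader = pa.ipc.open_stream(buf)
    same_table = reader.read_all()
    print(same_table.num_rows)  # 1000000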
I’m using DataFusion (via Rust) and it’s pretty fantastic. Would love to swap out some Spark stuff for DataFusion/Ballista stuff at some point as well.
As a data scientist, I also found this pretty confusing, so I spent some time trying to understand it better. I wrote it up as a blog post:
Demystifying Apache Arrow
https://www.robinlinacre.com/demystifying_arrow/
My use case is that Arrow preserves all data types when it dumps the in-memory table to disk, which lets me back up my work and later reload the data and keep going. Reading and writing the data is very fast, and it’s much better than using HDF5 for me in that regard.
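For what it's worth, a minimal sketch of that checkpoint workflow with pyarrow's Feather (Arrow IPC) format; the file name and columns are just placeholders:

    import pandas as pd
    import pyarrow.feather as feather

    # Save a checkpoint of the in-memory data; dtypes survive the round trip.
    df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})
    feather.write_feather(df, "checkpoint.feather")

    # ...later, in a new session: reload and keep going.
    df_again = feather.read_feather("checkpoint.feather")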
And in most cases, if you memory-map the file (mmap on Linux/BSD, MapViewOfFile on Windows), it’s way faster than reading it: you only ever read the pages you actually need, and they stay in the OS cache between invocations.
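A sketch of the memory-mapped read path in pyarrow (the file name and data are hypothetical): the table's columns end up backed by the mapped file, so only the pages you actually touch get read, and they stay warm in the page cache between runs.

    import pyarrow as pa

    # Write a table once in the Arrow IPC file format.
    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    with pa.OSFile("data.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map it on read: nothing is copied into process memory up front.
    with pa.memory_map("data.arrow", "r") as source:
        mapped = pa.ipc.open_file(source).read_all()
        print(mapped.num_rows)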