The premise of Arrow is that when you want to share data with another system, or between processes on the same machine, most of the compute time is spent serializing and deserializing data. Arrow removes that step by defining a common columnar format that can be used across many programming languages. There's more to Arrow than just the file format that makes working with data easier, like better over-the-wire transfers (Arrow Flight). How would this manifest for customers using your applications? They'd likely see speeds increase. Arrow makes a lot of sense when working with lots of data in analytical or data science use cases.
Not only between processes, but also between languages within a single process. In this POC I spun up a Python interpreter in a Go process and passed the Arrow data buffer between the two languages in constant time. https://github.com/nickpoorman/go-py-arrow-bridge
Read a Parquet file into a Pandas DataFrame. Then read the Pandas DataFrame into a Spark DataFrame. Spark & Pandas are using the same Arrow memory format, so no serde is needed.
Is this merely planned, or does pandas now use Arrow’s format? I was under the impression that pandas was mostly numpy under the hood with some tweaks to handle some of the newer functionality like nullable arrays. But you’re saying that arrow data can be used by pandas without conversion or copying into new memory?
Pandas is still numpy under the hood, but you can create a numpy array that points to memory that was allocated elsewhere, so conversion to pandas can be done without copies in nice cases where the data model is the same (simple data type, no nulls, etc.): https://arrow.apache.org/docs/python/pandas.html#zero-copy-s...
I hope you don’t mind me asking dumb questions, but how does this differ from the role that say Protocol Buffers fills? To my ears they both facilitate data exchange. Are they comparable in that sense?
Better to compare it to Cap'n Proto instead. Arrow data is already laid out in a usable way. For example, an Arrow column of int64s is an 8-byte aligned memory region of size 8*N bytes (plus a bit vector for nullity), ready for random access or vectorized operations.
Protobuf, on the other hand, would encode those values as variable-width integers. This saves a lot of space, which might be better for transfer over a network, but means that writers have to take a usable in-memory array and serialize it, and readers have to do the reverse on their end.
Think of Arrow as standardized shared memory using struct-of-arrays layout, Cap'n Proto as standardized shared memory using array-of-structs layout, and Protobuf as a lightweight purpose-built compression algorithm for structs.
> Think of Arrow as standardized shared memory using struct-of-arrays layout, Cap'n Proto as standardized shared memory using array-of-structs layout
I just want to say thank you for this part of the sentence. I understand struct-of-arrays vs array-of-structs, and now I finally understand what the heck Arrow is.
Protobuf provides the fixed64 type, and when combined with `packed` (the default in proto3, optional in proto2) it gives you a linear layout of fixed-size values. You would not get natural alignment from protobuf's wire format if you read it from an arbitrary disk or network buffer; to get alignment you'd need to move or copy the vector. Protobuf's C++ generated code provides RepeatedField, which behaves in most respects like std::vector, but insofar as protobuf is partly a wire format and partly a library, users are free to ignore the library and use whatever code is most convenient to their application.
TL;DR variable-width numbers in protobuf are optional.
protobufs still get encoded and decoded by each client when loaded into memory. arrow is a little bit more like "flatbuffers, but designed for common data-intensive columnar access patterns"
A second important point is the recognition that data tooling often re-implements the same algorithms again and again, often in ways which are not particularly optimised, because the in-memory representation of data is different between tools. Arrow offers the potential to do this once, and do it well. That way, future data analysis libraries (e.g. a hypothetical pandas 2) can concentrate on good API design without having to re-invent the wheel.
And a third is that Arrow allows data to be chunked and batched (within a particular tool), meaning that computations can be streamed through memory rather than the whole dataframe needing to be held in memory at once. A little bit like how Spark partitions data and sends it to different nodes for computation, except all on the same machine. This also enables parallelisation by default. With the growing core counts of modern CPUs, this means Arrow is likely to be extremely fast.
Re this second point: Arrow opens up a great deal of language and framework flexibility for data engineering-type tasks. Pre-Arrow, common kinds of data warehouse ETL tasks like writing Parquet files with explicit control over column types, compression, etc. often meant you needed to use Python, probably with PySpark, or maybe one of the other Spark API languages. With Arrow now there are a bunch more languages where you can code up tasks like this, with consistent results. Less code switching, lower complexity, less cognitive overhead.
> most of the compute time spent is in serializing and deserializing data.
This is to be viewed in light of how hardware evolves now. CPU compute power is no longer growing as fast (at least for individual cores).
But one thing that's still doubling on a regular basis is memory capacity of all kinds (RAM, SSD, etc) and bandwidth of all kinds (PCIe lanes, networking, etc). This divide is getting large and will only continue to increase.
Which brings me to my main point:
You can't be serializing/deserializing data on the CPU. What you want is to have the CPU coordinate the SSD to copy chunks directly, as is, to the NIC/app/etc.
Short of having your RAM doing compute work*, you would be leaving performance on the table.
> What you want is to have the CPU coordinate the SSD to copy chunks directly -and as is- to the NIC/app/etc.
Isn't that what DMA is supposed to be?
Also, there's work on getting GPUs to load data straight from NVMe drives, bypassing both the CPU and system memory. So you could certainly do similar things with the PCIe bus.
A big problem is that a lot of data isn't laid out in a way that's ready to be stuffed into memory. When you see a game spending a long time loading data, that's usually why: the CPU is doing a bunch of processing to map on-disk data structures to a more efficient in-memory representation.
If you can improve the on-disk representation to more closely match what's in memory, then CPUs are generally more than fast enough to copy bytes around. They are definitely faster than system RAM.
This is backward -- this sort of serialization is overwhelmingly bottlenecked on bandwidth (not CPU). (Multi-core) compute improvements have been outpacing bandwidth improvements for decades and have not stopped. Serialization is a bottleneck because compute is fast/cheap and bandwidth is precious. This is also reflected in the relative energy to move bytes being increasingly larger than the energy to do some arithmetic on those bytes.
An interesting perspective on the future of computer architecture but it doesn't align well with my experience. CPUs are easier to build and although a lot of ink has been spilled about the end of Moore's Law, it remains the case that we are still on Moore's curve for number of transistors, and since about 15 years ago we are now also on the same slope for # of cores per CPU. We also still enjoy increasing single-thread performance, even if not at the rates of past innovation.
DRAM, by contrast, is currently stuck. We need materials science breakthroughs to get beyond the capacitor aspect ratio challenge. RAM is still cheap but as a systems architect you should get used to the idea that the amount of DRAM per core will fall in the future, by amounts that might surprise you.
I'm curious too. Does this mean data is first normalized into this "columnar format" as the primary source and all applications are purely working off this format?
I don't yet see clearly how the data is being transferred if no serializing/deserializing is taking place; can someone here help fill that in? It almost sounds like there is some specialized bridge for the data transfer and I don't have the right words for it.
I think you've got it. Data is shared by passing a pointer to it, so the data doesn't need to be copied to different spots in memory (or if it is it's an efficient block copy not millions of tiny copies).
is there a strong use case around passing data from a backend to a frontend, e.g. from pandas data frame on the server into a js implementation on the client side, to use in a UI? As opposed to data pipelining among processing servers.
Yes definitely. For example, the perspective pivoting engine (https://github.com/finos/perspective) supports arrow natively, so you can stream arrow buffers from a big data system directly to be manipulated in your browser (treating the browser as a thick client application). It's a single copy from network buffer into the webassembly heap to get it into the library.
From https://arrow.apache.org/faq/:
"Parquet files cannot be directly operated on but must be decoded in large chunks... Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is... laid out in natural format for the CPU, so that data can be accessed at arbitrary places at full speed."