
Uhh.. maybe. It's a serde that's trying to be cross-language / platform.

I guess it also offers some APIs to process the data so you can minimize serde operations. But, I dunno. It's been hard to understand the benefit of the library, and the posts here don't help.



There's no serde by design (aside from inspecting a tiny piece of metadata indicating the location of each constituent block of memory). So data processing algorithms execute directly against the Arrow wire format without any deserialization.


Of course there is. There is always deserialization. The data format is most definitely not native to the CPU.


I challenge you to have a closer look at the project.

Deserialization by definition requires bytes or bits to be relocated from their position in the wire protocol to other data structures which are used for processing. Arrow does not require any bytes or bits to be relocated. So if a "C array of doubles" is not native to the CPU, then I don't know what is.
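To make the "no relocation" claim concrete, here's a minimal stdlib-only Python sketch (an analogy, not actual Arrow code): a contiguous buffer of doubles is reinterpreted in place with `memoryview.cast`, with no per-element decoding and no copy.

```python
import struct
import sys

# Simulate a wire buffer holding four doubles in the host's byte order.
fmt = "<4d" if sys.byteorder == "little" else ">4d"
wire = struct.pack(fmt, 1.0, 2.0, 3.0, 4.0)

# Reinterpret the bytes in place: no relocation, no per-element decode.
doubles = memoryview(wire).cast("d")
assert list(doubles) == [1.0, 2.0, 3.0, 4.0]
```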


Perhaps "zero-copy" is a more precise or well-defined term?


CPUs come in many flavors. One area where they differ is in the way that the bytes of a word are represented in memory. Two common formats are Big Endian and Little Endian. This is an example where a "C array of doubles" would be incompatible and some form of deserialization would be needed.
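For illustration (stdlib Python, not Arrow itself): the same double has mirrored byte layouts under the two orders, which is exactly where a byte-swap step would be needed.

```python
import struct

# 1.0 serialized as a 64-bit IEEE double in each byte order.
little = struct.pack("<d", 1.0)
big = struct.pack(">d", 1.0)

# Same value, mirrored byte layouts: a reader on the "wrong" architecture
# must swap bytes before treating the buffer as a native array of doubles.
assert little == big[::-1]
assert little != big
```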

My understanding is that an Apache Arrow library provides an API to manipulate the format in a platform-agnostic way. But to claim that it eliminates deserialization is false.


I wish I had your confidence, to argue with Wes McKinney about the details of how Arrow works.


Iunno. That looks like "where angels fear to tread" territory to me.


Hiya, a bit of OT (again, last one promise!): I saw your comment about type systems in data science the other day (https://news.ycombinator.com/item?id=25923839). From what I understood, it seems you want a contract system, wouldn't you think? The reason I'm asking is that I'm fishing for opinions on building data science infra in Racket (and saw your deleted comment in https://news.ycombinator.com/item?id=26008869 so thought you'd perhaps be interested), and Racket (and R) dataframes happen to support contracts on their columns.


You are right that if you want to do this in a heterogeneous computing environment, one of the layouts is going to be "wrong" and require an extra step regardless of how you do this.

But ... (a) this is way less common than it was decades ago (these are rare use cases we're talking about here), and (b) it seems to be addressed in a sensible way (i.e. Arrow defaults to little-endian, but you can swap it on a big-endian network). I think it includes utility functions for conversion as well.

So the usual case incurs no overhead, and the corner cases are covered. I'm not sure exactly what you are complaining about, unless it's the lack of liberally sprinkled caveats like "no deserialization in most use cases" around the comments?


Big endian is pretty rare among anything you'd be doing in-memory analytics on. Looks like you can choose the endianness of the format if you need to, but it's little endian by default: https://arrow.apache.org/docs/format/Columnar.html. I'd suggest reading up on the format; it covers the properties it provides to be friendly to direct random access.


If it works as a universal intermediate exchange language, it could help standardize connections among disparate systems.

When you have N systems, it takes on the order of N^2 translators (one per pair of systems) to build direct connections to transfer data between them; but it only takes N translators if all of them can talk the same exchange language.
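A quick back-of-the-envelope sketch in Python (the exact count depends on whether you need a separate translator per direction):

```python
# Direct connections: one translator per ordered pair of systems.
def direct_translators(n):
    return n * (n - 1)          # grows quadratically with n

# Shared exchange format: one translator per system, to/from the standard.
def hub_translators(n):
    return n

assert direct_translators(10) == 90
assert hub_translators(10) == 10
```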


Can you define what a translator is? I don't understand the complexity you're constructing. I have N systems and they talk protobuf. What's the problem?


By a translator, I mean a library that allows accessing data from different subsystems (either languages or OS processes).

In this case, the advantages are that 1) Arrow is language agnostic, so it's likely that it can be used as a native library in your program and 2) it doesn't copy data to make it accessible to another process, so it saves a lot of marshalling / unmarshalling steps (assuming both sides use data in tabular format, which is typical of data analysis contexts).


It's not just a serde. One of its key use cases is eliminating serde.


Arrow's selling point is being non-serde. But I am wondering: how does it achieve no-serde with Python? By allocating PyObjects cleverly within a network packet buffer?

If I convert an Arrow int8 array to a normal Python list of ints, will this involve copying?


I think the idea is that you'd have a Python object that behaves exactly like a list of ints would, but with Arrow as its backing store.


This is similar to Numpy. You operate on Python objects that describe the Numpy structure, but the actual memory is stored in C objects, not Python objects.


It would have to, but to a numpy vector, maybe not.
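The copy-vs-view distinction in this subthread can be illustrated with Python's stdlib `array` module (an analogy, not pyarrow itself): materializing a Python list boxes every element into a new PyObject, while a view shares the original buffer.

```python
import array

buf = array.array("b", [1, 2, 3])   # contiguous int8 storage

# A view shares the underlying memory: zero-copy.
view = memoryview(buf)
buf[0] = 42
assert view[0] == 42                # the view sees the mutation

# A Python list is a copy: each element becomes a separate PyObject.
as_list = buf.tolist()
buf[0] = 7
assert as_list[0] == 42             # the list kept the earlier value
```

In pyarrow the analogous calls are, if I'm reading the docs right, `Array.to_pylist()` (copies into Python objects) and `Array.to_numpy()` (zero-copy for primitive types without nulls).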


I just don't believe you. My CPU doesn't understand Apache Arrow 3.0.


Not GP post, but it might have been better stated as 'eliminating serde overhead'. Arrow's RPC serialization [1] is basically Protobuf, with a whole lot of hacks to eliminate copies on both ends of the wire. So it's still 'serde', but markedly more efficient for large blocks of tabular-ish data.

[1]: https://arrow.apache.org/docs/format/Flight.html


> Arrow's serialization is Protobuf

Incorrect. Only Arrow Flight embeds the Arrow wire format in a Protocol Buffer, but the Arrow protocol itself does not use Protobuf.


Apologies, off base there. Edited with a pointer to Flight :)


So, there are several components to Arrow. One of them transfers data using IPC, and naturally needs to serialize. The other uses shared memory, which eliminates the need for serde.

Sadly, the latter isn't (yet) well supported anywhere but Python and C++. If you can/do use it, though, data are just kept as arrays in memory. Which is exactly what the CPU wants to see.
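As a rough stdlib-only illustration of the "arrays in memory" idea (not Arrow's actual IPC machinery): a column of doubles written to a file can be memory-mapped and read in place, with no parse step.

```python
import mmap
import os
import struct
import tempfile

# Write a "column" of doubles to a file in native byte order.
vals = (1.5, 2.5, 3.5)
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(struct.pack("=3d", *vals))

# Map it in and reinterpret the bytes in place: no copy, no decode.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    col = memoryview(mm).cast("d")
    assert tuple(col) == vals
    col.release()
    mm.close()
os.unlink(path)
```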


Shared memory format is supported in Julia too!


Oh, that's fantastic to hear. Right now I'm living in Python because that's the galactic center, but I've also been anxious to find a low-cost escape hatch that doesn't just lead to C++.


You should definitely check out Julia then. There are a few parts of the language that use C/C++ libraries (blas and mpfr are the main ones), but 95% of the time, your stack will be Julia all the way down.


In the best case, your CPU needn't really be involved beyond a call to set up mapped memory. It is (among other things) a common memory format, so you can be working with it in one framework, then start working with it in another, and its location and format in memory haven't changed at all.


For those wondering what a SerDe is: https://docs.serde.rs/serde/


The term likely predates the Rust implementation. SerDe is Serializer & Deserializer, which could be any framework or tool that allows the serialization and deserialization of data.
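In that generic sense, any round-trip like this stdlib JSON one counts as a SerDe:

```python
import json

record = {"id": 7, "name": "arrow"}

wire = json.dumps(record)   # serialize: objects -> text for the wire
back = json.loads(wire)     # deserialize: text -> freshly built objects

assert back == record
assert back is not record   # a copy was made, by construction
```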

I first came across the concept in Apache Hive.


I first came across the concept when AWS Glue was messing up my CSVs on import


In case you weren't aware, Glue's metastore is an implementation of Apache Hive, so not too different from the person you're replying to.


I meant serialize / deserialize in the literal sense.



