
If you don't speak press-release: this is a cool project to create an in-memory interop format for columnar data, so various tools can share data without a serialize/deserialize step, which is very expensive compared to copying or sharing memory on the same machine or within a local network.

https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=blob;f...

(edited post because I fail reading git and didn't notice the java implementation)



Disclosure: I am a committer on Apache Drill and Apache Arrow.

This isn't actually true. The java implementation has been complete and used in Apache Drill, a distributed SQL engine, for the past few years. While we anticipate a few small changes to make sure the standard works well across new systems, this is by no means an announcement without tested code.

https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=commit...


I stumbled upon Google's whitepaper on Dremel. IIRC it explained how to store the data in columnar format, but I didn't quite get how that translated into quick queries. Happen to know where I can look to better understand how it works?


I'll take a stab at explaining it myself: "transactional queries" are faster in a traditional format because they access many columns in few rows. For instance, if you want to log in to a website, you access the username, password, and possibly other authentication factors for a single user: this ends up being faster if you can go to the user's row, and then read all those fields in a contiguous scan.

"Analytical queries" are faster in a column-based format, because you're doing things like computing the correlation between 2 variables. Instead of looking at many columns in a specific row, you're looking at most of the data in a few columns. So instead of reading a whole row at once, it would be nice to skip the columns you don't care about and just grab big chunks of two or three columns.

Does that make sense?
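To make the contrast concrete, here's a minimal sketch (all names and values are made up) of the same tiny table stored both ways, with one "transactional" lookup and one "analytical" scan:

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"user": "alice", "pw_hash": "h1", "age": 34},
    {"user": "bob",   "pw_hash": "h2", "age": 27},
    {"user": "carol", "pw_hash": "h3", "age": 41},
]

# Column-oriented layout: one contiguous list per column.
cols = {
    "user":    ["alice", "bob", "carol"],
    "pw_hash": ["h1", "h2", "h3"],
    "age":     [34, 27, 41],
}

# Transactional query (login): many fields, one row -> the row layout
# hands you everything about "bob" in one contiguous record.
login_row = next(r for r in rows if r["user"] == "bob")

# Analytical query (average age): one field, all rows -> the column
# layout lets you scan just the "age" list and skip the rest entirely.
avg_age = sum(cols["age"]) / len(cols["age"])  # 34.0
```

In a real engine the columnar scan wins because the `age` values are physically adjacent in memory, so the CPU streams through them without dragging the unused `user` and `pw_hash` bytes into cache.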

edit: For a long time I was confused about why HBase was described as a columnar data store when access was still pretty row-based. I think the reason is that you group columns into column families which can be stored and retrieved separately, so you still get some of the benefit of a true column-oriented store.


> For a long time I was confused about why HBase was described as a columnar data store when access was still pretty row-based

The term was overloaded by column-family stores, which were often referred to as just 'column stores', probably by people who were not aware of systems like Vertica and MonetDB.

http://dbmsmusings.blogspot.co.uk/2010/03/distinguishing-two...


Column-based tables store the values of each column together. Two properties make them suitable for fast queries against a column: data locality and compressible data.

Since a column's values are stored together, more of them can be crammed into a data page. A query over one column processes far more relevant data per page load than in a row-based layout, where each data page also carries all the other columns.

The values of a column can also be stored in sorted order, which is highly compressible, enabling even more data to fit into a data page. E.g. the column Name is stored sorted, along with the associated row ids:

    Name, (row id)
    --------------
    Joan, (101)
    Joanna, (307)
    John, (15)
    John, (32)
    John, (6)
    Johnson, (31)
    Johnson, (44)
The duplicate names can be stored as one:

    Name, (row id)
    --------------
    Joan, (101)
    Joanna, (307)
    John, (15,32,6)
    Johnson, (31,44)
Prefix encoding can further compress the sorted data:

    Name, (row id)
    --------------
    Joan, (101)
    4+na, (307)
    2+hn, (15,32,6)
    4+son, (31,44)
Now imagine the query: select count(Name) from Table. Scanning the Name column touches only 4 records, right next to each other in a data page, and sums the number of row ids referenced by each record.

Select Name from Table where Name like "John%" would do a binary search down the sorted names, load two records 2+hn and 4+son, and expand them into 5 names.
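The two encoding steps above (deduplicating sorted names into row-id lists, then prefix-encoding each name against its predecessor) can be sketched like this; the data is the example table from this comment, and the helper names are my own:

```python
# Sorted (name, row_id) pairs from the example above.
sorted_pairs = [("Joan", 101), ("Joanna", 307), ("John", 15),
                ("John", 32), ("John", 6), ("Johnson", 31), ("Johnson", 44)]

# Step 1: merge duplicate names into one record with a row-id list.
dedup = []
for name, rid in sorted_pairs:
    if dedup and dedup[-1][0] == name:
        dedup[-1][1].append(rid)
    else:
        dedup.append([name, [rid]])

# Step 2: prefix-encode each name against the previous one, storing
# (shared_prefix_length, remaining_suffix, row_ids).
def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

encoded = []
prev = ""
for name, rids in dedup:
    k = common_prefix_len(prev, name)
    encoded.append((k, name[k:], rids))
    prev = name
# encoded -> [(0, 'Joan', [101]), (4, 'na', [307]),
#             (2, 'hn', [15, 32, 6]), (4, 'son', [31, 44])]

# count(Name): sum row-id list lengths without decoding any names.
total = sum(len(rids) for _, _, rids in encoded)  # 7
```

Note that the count never needs to reconstruct the names at all, which is part of why aggregates over a compressed column are so cheap.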


This is a good high level overview from Twitter (who created Parquet): https://blog.twitter.com/2013/dremel-made-simple-with-parque...


To be fair, this isn't the best UI for git. We were waiting for the mirror to GitHub to be set up, so we decided to just include the link to the Apache git repo UI to be sure the link was live when the press announcements went out.

The GitHub mirror is now live; you can see it here: https://github.com/apache/arrow


As best I can tell, you don't have a "git clone" URL visible anywhere on the site. I have the vague impression that gitweb is capable of showing one but you have to configure it(?).


We are just using the site as set up by the Apache Infrastructure team. I'll file a ticket to see if they can add the clone URL there.

Here is a page with the clone URLs: https://git.apache.org/


Does it mean that other engines could/should use that format?


Where are the API docs? This seems rather useless: https://github.com/apache/arrow



