BtrBlocks: Efficient Columnar Compression for Data Lakes [pdf] (tum.de)
138 points by eatonphil on Sept 16, 2023 | 27 comments


So, something between Parquet/ORC (a bit more compressed) and Arrow (very CPU scan-friendly).

I wish we could specify what we intend to do with the data, craft a cost model, and let a fourth-generation system optimize around that: something that would pick and choose between the different compression techniques in this paper and also the ones from Arrow and Parquet.
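Very roughly what I have in mind (a hypothetical sketch; the encodings, numbers, and weights are all made up, and a real system would measure them on a per-column sample):

    # Hypothetical sketch: pick a per-column encoding by minimizing a crude
    # cost over declared workload intent. All figures are invented.
    from dataclasses import dataclass

    @dataclass
    class EncodingEstimate:
        name: str
        compressed_bytes: int       # size after compressing a sample
        scan_ns_per_value: float    # measured sequential decode speed
        lookup_ns_per_value: float  # measured random-access speed

    def pick_encoding(candidates, scans_per_month, lookups_per_month,
                      storage_weight=1e-6):
        def cost(e):
            return (e.compressed_bytes * storage_weight
                    + scans_per_month * e.scan_ns_per_value
                    + lookups_per_month * e.lookup_ns_per_value)
        return min(candidates, key=cost)

    candidates = [
        EncodingEstimate("dictionary", 120_000, 0.8, 2.0),
        EncodingEstimate("zstd-block",  90_000, 4.0, 40.0),
        EncodingEstimate("plain",      400_000, 0.3, 0.3),
    ]
    print(pick_encoding(candidates, scans_per_month=30,
                        lookups_per_month=10_000).name)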


While I wish there were good principled methods for making these choices, isn't a meaningful limitation that when you first start writing the data, you typically don't know all of the ways it will eventually be used? Sometimes the team tasked with making data available for various forms of bulk processing has to guess at what teams and mandates might exist years after they set up the data lake, or even what compliance requirements will arise.

E.g., as different jurisdictions introduce data protection laws, perhaps your data lake, which could previously assume that records are written but never deleted, now has to support queries to find PII values associated with a person who has issued a delete request, and you're forced to rewrite lots of files. Or a downstream team starts using vector DBs and wants to assemble datasets by querying for records that match any of 100k IDs with associated vectors in regions of interest, and you need something that's neither a single scan nor a large number of point lookups. Etc., etc.

How do you have a principled method for optimizing for unknown future use cases? Does it make sense to talk about a probability distribution over future queries?


This sounds a lot like Codd's framing of the relational model "problem" in his 1970 ACM paper. [0] His solution was to decouple logical from physical access. That framing and the solution still seem correct to me for the data lake use case.

This implies that any lower-level storage format we pick can morph into better-optimized forms in a way that does not alter query semantics. It's not possible to solve this without adding refinements to the framing, such as whether data is mutable or immutable, and adding infrastructure to the solution, like metadata management and alternate projections of the data. Vertica/C-Store introduced the latter a couple of decades ago. [1]

Plus ça change...

[0] https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

[1] https://web.stanford.edu/class/cs345d-01/rl/cstore.pdf


As with most frustrations in my career, many of the expensive bits come down to hoarding. We don’t know what will spark joy so the system has to be able to do anything at any time. This is not free. Sometimes it’s goddamned expensive.

I have in this decade encountered systems that still have to run overnight. Meanwhile I’m spending multiple developer salaries maintaining a system that might be asked to answer a question in ten seconds, or might not be asked any questions for days at a time.

Somebody save me.


I would be happy if Parquet or ORC were used in most DWHs. The difference between JSON and Parquet is much bigger than the difference between Parquet and BtrBlocks.


We would probably be even better off leaving the cost-model craft to the system.

I recently had to look into various TPC benchmarks and some of them are very non-trivial to cost-estimate. I found several queries in TPC-DS that join the same table to itself four (4) times. Even triangles (a table joined with itself three times) are hard; squares like these in TPC-DS are even harder.


Many years back I found that the actual performance of ORC, Parquet, and Avro was dominated by the optimisations of the client libraries you used to access them rather than the theoretical pros and cons of the formats themselves. (IIRC, on Spark you chose Parquet, on Presto you chose ORC, etc.?)

Hopefully the gaping performance gaps have closed.


This came up in a Twitter discussion, too, which contains a link to an interesting paper that I copied below.

> As someone who work with C++ parquet readers and writers I say that different configurations can easily result in 10x differences in size -- the default behavior is usually very poor (Tony Wang / @marsupialtail_2) https://twitter.com/marsupialtail_2/status/17021850155038883...

> TUM have written a great summary about performance niches around parquet file IO performance, @DatabendLabs adopted a lot of optimizations mentioned in the summary and result is great in practice. https://dl.gi.de/server/api/core/bitstreams/9c8435ee-d478-4b... (zhihanz / @zhihanz1205) https://twitter.com/zhihanz1205/status/1702196118472536166
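For a concrete flavor of that 10x point, here is an illustrative (not from the linked paper) comparison of default vs. tuned writer settings through pyarrow; whether the list-valued forms of use_dictionary and use_byte_stream_split are accepted depends on the pyarrow version:

    # Illustration only: the same table written with different Parquet
    # writer settings can differ hugely in size; defaults are rarely optimal.
    import random
    import pyarrow as pa
    import pyarrow.parquet as pq

    random.seed(0)
    n = 1_000_000
    table = pa.table({
        "id": list(range(n)),
        "price": [round(random.uniform(0, 100), 2) for _ in range(n)],
        "country": [random.choice(["DE", "US", "FR"]) for _ in range(n)],
    })

    pq.write_table(table, "default.parquet")   # library defaults

    pq.write_table(                             # tuned for this data
        table, "tuned.parquet",
        compression="zstd", compression_level=9,
        row_group_size=n,
        use_dictionary=["country"],             # low-cardinality column only
        use_byte_stream_split=["price"],        # helps the float column
    )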


I have spent hours trying to find what I believe must be a memory leak in Snowflake's Python client: a query that gets the serialized data (<10 GB) locally in under 10 seconds still takes over a minute to return to the caller. So, yep, this has been my conclusion too!


Which Python Client? The one using ADBC/nanoarrow? If not, the issue might be related to converting columnar data to row-based.

https://www.youtube.com/watch?v=9FOp160CZYE
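If it's the plain snowflake-connector-python, it may also be worth checking whether keeping the result columnar avoids the conversion cost entirely; an untested sketch (connection arguments are placeholders):

    # Untested sketch: fetch the result as one Arrow table instead of
    # materializing millions of Python row tuples.
    import snowflake.connector

    conn = snowflake.connector.connect(account="...", user="...", password="...")
    cur = conn.cursor()
    cur.execute("SELECT * FROM my_table")

    rows = cur.fetchall()           # row tuples: lots of Python objects
    # vs.
    table = cur.fetch_arrow_all()   # stays columnar end to end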


> CONCLUSION

> We introduced BtrBlocks, an open columnar compression format for data lakes. By analyzing a collection of real-world datasets, we selected a pool of fast encoding schemes for this use case. Additionally, we introduced Pseudodecimal Encoding, a novel compression scheme for floating-point numbers. Using our sample-based compression scheme selection algorithm and our generic framework for cascading compression, we showed that, compared to existing data lake formats, BtrBlocks achieves a high compression factor, competitive compression speed and superior decompression performance. BtrBlocks is open source and available at https://github.com/maxi-k/btrblocks.
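My reading of the selection step, paraphrased as a toy (these encoders are stand-ins, not BtrBlocks' actual schemes, and the real format then cascades the choice onto the chosen encoder's output):

    # Toy paraphrase of sample-based encoding selection; not the paper's code.
    import random
    import zlib

    def rle(values):                 # run-length encoding
        out, prev, run = [], None, 0
        for v in values:
            if v == prev:
                run += 1
            else:
                if prev is not None:
                    out.append((prev, run))
                prev, run = v, 1
        if prev is not None:
            out.append((prev, run))
        return repr(out).encode()

    def dictionary(values):          # dictionary encoding
        codes = {v: i for i, v in enumerate(dict.fromkeys(values))}
        return repr((list(codes), [codes[v] for v in values])).encode()

    ENCODERS = {"rle": rle, "dict": dictionary,
                "plain": lambda v: repr(v).encode()}

    def choose_encoding(column, sample_size=1_000):
        # sample a contiguous slice so run structure survives
        start = random.randrange(max(1, len(column) - sample_size + 1))
        sample = column[start:start + sample_size]
        sizes = {name: len(zlib.compress(enc(sample)))  # proxy for final size
                 for name, enc in ENCODERS.items()}
        return min(sizes, key=sizes.get)

    print(choose_encoding(["US"] * 900 + ["DE"] * 100))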


It is interesting and I'd love to look over some detailed benchmarks on the differences. Storing floats as integers overcomes several of their challenges. The example of dollar units would be a good candidate for short delta compression.
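A toy version of the dollar-units idea, assuming values are exact to the cent:

    # Toy: store dollar amounts as integer cents, then delta-encode;
    # small integer deltas compress far better than raw doubles.
    def to_deltas(prices):
        cents = [round(p * 100) for p in prices]
        return cents[0], [b - a for a, b in zip(cents, cents[1:])]

    def from_deltas(first, deltas):
        cents = [first]
        for d in deltas:
            cents.append(cents[-1] + d)
        return [c / 100 for c in cents]

    prices = [19.99, 20.49, 20.49, 19.99, 21.00]
    first, deltas = to_deltas(prices)    # deltas: [50, 0, -50, 101]
    assert from_deltas(first, deltas) == prices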

I doubt I'd ever use columnar compression again, as I felt it was too difficult to fight DBAs on keeping the original sorting and schema preserved in an optimal way. I do find it really interesting, though.


Viktor Leis's working group consistently produces interesting fundamental database research. Adaptive Radix Trees (ART), Height Optimized Tries (HOT), and the Umbra database system all come from them.


DuckDB uses Adaptive Radix Trees/Tries:

https://duckdb.org/2022/07/27/art-storage.html


ART is an elegant way of thinking about the problem of index representation. I've been using ART variants for almost 15 years now. The algorithm was somewhat common in supercomputing/HPC (used to index some types of sparse data models), which is where I originally came across it. In addition to being performant, ART is considerably more expressive and flexible than people might assume.
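For anyone who hasn't looked at it, the core trick is that inner nodes change their physical layout with fan-out. A stripped-down flavor of that one idea (nothing like a production ART: no Node16/Node48, no path compression, no SIMD):

    # Stripped-down flavor of ART's adaptive inner nodes: a tiny key array
    # (Node4-style) that grows into a 256-slot array once it fills up.
    class Node4:
        def __init__(self):
            self.keys, self.children = [], []

        def find(self, byte):
            return (self.children[self.keys.index(byte)]
                    if byte in self.keys else None)

        def insert(self, byte, child):
            if len(self.keys) == 4:        # grow to the next node size
                big = Node256()
                for k, c in zip(self.keys, self.children):
                    big.insert(k, c)
                big.insert(byte, child)
                return big                 # caller swaps the node in place
            self.keys.append(byte)
            self.children.append(child)
            return self

    class Node256:
        def __init__(self):
            self.children = [None] * 256

        def find(self, byte):
            return self.children[byte]

        def insert(self, byte, child):
            self.children[byte] = child
            return self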


Sad that the authors seem to be unaware of similar work in high energy physics: https://iopscience.iop.org/article/10.1088/1742-6596/2438/1/... (also on arXiv: https://arxiv.org/abs/2204.09043).


How would someone who doesn't subscribe to a journal of physics keep up on these sorts of developments?


By searching for "columnar data" in Google Scholar or even just plain Google, for example. Thorough literature review is a must for any researcher.


I mean, that supposes you're a "researcher". It's a bit of a stretch for a software engineer looking to make their database better to dig into physics work. Which is to say, I can't say it's "sad" that someone wouldn't stray that far outside their field to find research that isn't well publicized. It's like saying "do your own research" to a non-technical person about a deeply technical subject.


Corresponding source code: https://github.com/maxi-k/btrblocks


I've been really excited about this paper since I went to the presentation at SIGMOD for it. I think the key innovation here is that it's the first open standard for a format that's just a simple combination of lightweight encodings. As far as I'm aware, most columnar DBMSs already have storage like this, so it's not very novel from a technical perspective.

However, the data science/big data world has been bogged down by inefficient, clunky formats like Parquet for quite a while and so I applaud any steps towards bridging the gap between the two worlds.


On the other hand, the popularity and language-independence of parquet has seen a fantastic explosion of tools/products/libraries that I think is really healthy.

Generally with "data stuff" you're chained to whatever the dominant language and tools do (for the longest time, Python and Spark) and you'd have to accommodate them and their idiosyncrasies directly. Nowadays, though, basically everything has a Parquet lib, and there are tools at all scales: DuckDB, ClickHouse, Databend, etc. Arrow as a common in-memory representation for Parquet data means we've escaped the dominance of single massive packages (e.g. pandas and Spark dataframes) as the only way to deal with certain datasets.
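For instance, the same Parquet file is a one-liner away from Python via DuckDB these days (the file name here is just a placeholder):

    # Querying a Parquet file directly with DuckDB's Python API.
    import duckdb

    df = duckdb.sql(
        "SELECT country, avg(price) FROM 'events.parquet' GROUP BY country"
    ).df()
    print(df)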

I think it’s great what we’ve managed to get via Parquet, even if it’s a little bit clunky.


Yeah I agree with both of you. The data ... milieu is both very frustrating and very exciting at the moment, in my opinion. There is a ton of noise - lots of vendors with big marketing budgets, endless blog spam, etc. - but also quite a bit of real signal - new or remixed stuff compared with 10 or even 5 years ago, that is really useful.


Yeah the signal/noise ratio is super interesting. The advertising budgets screech real loud, but the interesting things that make it through are real interesting.

What I'm very much hoping is that things stay in the current style of "mostly" language-independent formats and tools. I've written more than enough Python and Scala/Spark, and I've moved on enough that I don't ever want to go back to the abysmal landscape that was "Python or bust".


It's strange to see no comparison with ClickHouse's MergeTree format in the article.


Turn that data lake into a data lagoon. Are you with me?


Only if I can be your data dragoon.



