Yes and no. We already know there were too many named metrics to give each one its own column on the system they were using (Parquet on a data lake), so what are they left with?

Does a column store like Parquet make a good time series DB? Trendy named time series databases I’ve had the displeasure of using would all fail miserably with high-cardinality series too, so I’m not convinced there is actually anything better than files on a lake for this stuff.

So, use some format to name the metric in each row. If Parquet, use dictionary encoding on that column and sort or cluster the rows on it ... that will give you min/max pruning etc.
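A minimal sketch of that layout, assuming a hypothetical narrow schema (metric_name, ts, value) and pyarrow for the writing side; the column names, file name and row group size are placeholders, not anything from the original setup:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # One row per observation; the metric name is just another column.
    table = pa.table({
        "metric_name": ["cpu.user", "cpu.user", "mem.rss", "mem.rss"],
        "ts":          [1700000000, 1700000060, 1700000000, 1700000060],
        "value":       [0.42, 0.47, 1.2e9, 1.3e9],
    })

    # Sort (or otherwise cluster) on the metric name so each row group covers
    # a narrow range of names; per-row-group min/max statistics then let a
    # reader skip row groups that can't contain the metric being filtered on.
    table = table.sort_by([("metric_name", "ascending"), ("ts", "ascending")])

    # Dictionary-encode the metric column so repeated names are stored once
    # per row group instead of once per row.
    pq.write_table(
        table,
        "metrics.parquet",
        use_dictionary=["metric_name"],
        row_group_size=128 * 1024,  # tune to your data volume
    )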

But Presto is currently 5x slower to chew through Parquet vs ORC, so perhaps simply use ORC. Or, for this data, Avro or JSON lines.
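For illustration, the same long-format table can be written to ORC or JSON lines as well; this assumes pyarrow's ORC writer and pandas for the JSON side, and the file names are again placeholders:

    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({
        "metric_name": ["cpu.user", "mem.rss"],
        "ts":          [1700000000, 1700000000],
        "value":       [0.42, 1.2e9],
    })

    # ORC with the same columns as the Parquet version.
    orc.write_table(table, "metrics.orc")

    # JSON lines via pandas: one record per line, heavier on disk but
    # readable by nearly anything.
    table.to_pandas().to_json("metrics.jsonl", orient="records", lines=True)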

And then, once you’ve used Presto to discover interesting metrics, you can always use Presto (or Scalding, or whatever your poison is) to extract the metrics you’ve identified for closer examination and put them into separate datasets etc.
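As a sketch of that extraction step, here is the same idea with pyarrow.dataset standing in for Presto/Scalding; the metric names and output paths are hypothetical:

    import os
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    lake = ds.dataset("metrics.parquet", format="parquet")
    os.makedirs("extracted", exist_ok=True)

    for metric in ["cpu.user", "mem.rss"]:
        # The filter is pushed down to the Parquet reader, so sorted/clustered
        # row groups with min/max stats get skipped cheaply.
        subset = lake.to_table(filter=ds.field("metric_name") == metric)
        pq.write_table(subset, f"extracted/{metric}.parquet")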

I’m just outlining standard approaches to these kinds of problems.


