At last! This was the only feature I was missing in Postgres for years.
Postgres is so amazing I still find it hard to believe it's free. I abused it many times (as a graph database; as a real-time financial data analytics engine with thousands of new ticks coming in each second; as a document storage with several TBs of data per node), and amazingly, it just worked. Magic.
They have used adjacency lists, which looks strange to me — this is a common way of representing graph data in memory, but doesn't map well to the relational model. I have used "extreme normalization", where all edges are represented as many-to-many relationships with associative tables.
Though I stored document relationships and social graphs, not semantic graphs as they describe in this paper. I have little experience with semantic graphs; perhaps they have different requirements.
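For illustration, the "extreme normalization" approach might look like this (all table and column names here are hypothetical, just a sketch of the idea):

```sql
-- Nodes of the graph; the payload column is incidental.
CREATE TABLE node (
    id      bigserial PRIMARY KEY,
    payload jsonb
);

-- Each edge is one row in an associative (many-to-many) table.
CREATE TABLE edge (
    src_id bigint NOT NULL REFERENCES node (id),
    dst_id bigint NOT NULL REFERENCES node (id),
    PRIMARY KEY (src_id, dst_id)
);

-- The one-hop neighbourhood of node 42 is then a plain join:
SELECT n.*
FROM edge e
JOIN node n ON n.id = e.dst_id
WHERE e.src_id = 42;
```

Deeper traversals can be expressed with recursive CTEs, at the cost of one self-join per hop.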
How many joins did it have (sounds very joinery :-D), or how deep was your graph? Presumably it was doing full table scans for everything. Sounds very interesting anyway.
I have a Postgres database with 25 bn rows on a single host with application level partitioning and a parallel query infrastructure. Surely abuse, but Postgres has handled it beautifully.
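A minimal sketch of what application-level partitioning can look like (hypothetical names; the application, not Postgres, routes queries to per-period tables and merges the results):

```sql
-- Hypothetical per-month shard tables managed by the application.
CREATE TABLE ticks_2016_01 (ts timestamptz, symbol text, price numeric);
CREATE TABLE ticks_2016_02 (LIKE ticks_2016_01);

-- The application knows which shards a time range touches and issues
-- one query per shard (possibly concurrently on several connections):
SELECT symbol, avg(price)
FROM ticks_2016_01
WHERE ts >= '2016-01-10' AND ts < '2016-01-20'
GROUP BY symbol;
-- ...then combines the per-shard results client-side.
```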
I built something very similar once. My tip would be to put an HTTP interface in front of it for some cache guarantees (i.e. no arbitrary writes, so you can trivially flush, etc.).
While I was listening to this presentation, I thought it was quite a weird piece of functionality that overlaps a lot with the transaction visibility rules MVCC already has. So I'm curious: what is your use case?
Analysing open data portals. PG holds status data on the retrievability of links. Depicting the evolution of such an open data portal over time is possible with custom datetime columns and window functions; it's just tedious and error-prone.
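A sketch of the tedious part (the schema here is hypothetical): finding the points in time where a link's status changed, using `lag()`. The subquery is needed because window functions can't appear in `WHERE`:

```sql
-- Hypothetical table: one row per link check.
-- link_status(link_id bigint, checked_at timestamptz, reachable boolean)
SELECT *
FROM (
    SELECT link_id,
           checked_at,
           reachable,
           lag(reachable) OVER (PARTITION BY link_id
                                ORDER BY checked_at) AS prev_reachable
    FROM link_status
) t
WHERE reachable IS DISTINCT FROM prev_reachable;  -- status transitions only
```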
Do you know any commercial database with comparable level of documentation and community support?
I worked with a certain other database from a company starting with "O" and ending with "racle", and the support/documentation quality was abysmal. Our support contract was quite basic, but still, I expected it to be better than the free alternatives.
In comparison to what? MySQL? Perhaps. MySQL is a little simpler and quicker to pick up, but one must remember that the original intent of MySQL was to stop devs from using text files for storage...
But in comparison to an enterprise solution like DB2 or Oracle? Not even close...
If you are trying to pull nosql into the argument, that is not even apples to oranges, plus with nosql, you get to retool your entire architecture.
It seems the parallel aggregation increases efficiency of queries linearly with regard to CPUs, close to "perfect parallelisation." I have some 8-core boxes running PostgreSQL and this is still good to know.
I'm not sure "efficiency" is the right term here. My understanding is that to get "more efficient" execution, you'd have to perform better with the same amount of resources.
But parallelism allows using more resources (particularly CPU cores) for a query, while the overall efficiency (the total number of instructions, CPU time etc.) remains about the same. It's just that it's split over multiple cores.
So if all CPU cores are already saturated on your machine (e.g. because you're already running multiple queries), it's not going to improve things.
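To see the distinction in practice (`big_table` is a hypothetical table; the plan shape below is the typical one for PostgreSQL 9.6 parallel aggregation, though details vary):

```sql
-- Allow up to 4 workers per Gather node (0 disables parallelism).
SET max_parallel_workers_per_gather = 4;

EXPLAIN SELECT count(*) FROM big_table;
-- A parallel aggregate plan typically has this shape:
--   Finalize Aggregate
--     ->  Gather  (Workers Planned: 4)
--           ->  Partial Aggregate
--                 ->  Parallel Seq Scan on big_table
```

Each worker scans a slice of the table and produces a partial count; the leader's Finalize Aggregate combines them. Roughly the same total work is done, just spread over more cores at once.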
I just love PostgreSQL. It is an awesome gift to the open source community. I just can't believe it is free. Seriously, if you have not tried PostgreSQL before, please find time to check it out.
I'm looking at evaluating Postgres for the data work I do at work, which would replace the (expensive) MS SQL server we are currently using. Aside from performance boosts from being able to throw more hardware at the problem due to lower costs, is Postgres as performant as MS SQL?
I think that is a tough question to answer because it really depends on what you're doing with the data/database. Some queries may run faster on one RDBMS and some may run faster on another RDBMS. I would like an answer to this question as well. I did a quick search and I haven't found any modern comparisons of performance between databases, but it would be great if there was a more modern and expanded version of the SQLite comparison[0].
I can say that I use Postgres and have not been disappointed with performance, and it costs me $0. I encourage you to try it out. If you're a heavy user of SSMS, then you will have to change your flow a bit to accommodate psql. However, once you get the hang of psql, it's actually a pleasure to use.
- disks (SSDs) are very fast now, so cores are saturated more easily (when queries actually process the data instead of just reading it)
- multiple parallel (random) reads will likely be faster on HDD and SSD to some extent (esp. on larger RAID setups)
- the best optimization is still lots of RAM and people have that these days, so 100% CPU utilization during queries happens more often than not (the benchmarking setup seems suitable for more than 1 billion rows...).
Query engines typically aggressively cache data in memory (mmap, CreateFileMapping or such). Their testbed has 256GB of RAM meaning that the cache is likely hot the majority of the time. Even if you don't have that much memory, there might be a chance that the parallel workers are working on the same pages resulting in a non-proportional relationship between worker count and I/O ops.
Actually it's not really that expensive: when we buy Dell machines that cost ~4000 Euro, we get something like 16 cores. The most expensive thing in a server is probably the SAS disks.
Yeah, you can get an 18-core Xeon E7, plop four of them in a quad-socket mobo and you've got 72 cores, 144 HT. They're expensive as hell but if that's what you need…
I know EC2 people find it extraordinary having 32 cores, but on OVH you can get a machine with 20 cores / 40 hyperthreads and 256 GB of RAM for less than $350 a month.
> I know ec2 people find it extraordinary having 32 cores
How do you figure? As an 'EC2 person' I found that EC2 has made arbitrary specs utterly mundane and that is kind of the whole awesome point of EC2.
Need 200GB of RAM for the afternoon to process some data? No problem, just press a button and a minute later you have 200GB of RAM for the afternoon to process some data. Then just turn it off when you are done and forget about it.
If you need it, it's a bit over $1000/month. On the one hand that's perhaps 2-3x more than renting a dedicated server. On the other hand it's only 2-3x more, and if you have other stuff on AWS and make use of the rest of their features, it might be worthwhile.
Yup. Small multiples of price inefficiency mostly don't bear thinking about until they group together in large batches. The time it takes your engineers to work out the difference is rarely worth it.
They're not always sitting idle waiting for one big OLAP query (well suited to parallel aggregation) to come in to exhibit a 25x speedup.
Let's say you've got a web application with some cookie-cutter OLTP queries (which are well suited to serial plans ... no parallel aggregation benefit here). If your CPUs are running at 50%, then some big OLAP query that comes in only has roughly half the cores to itself, and gets maybe 25 × 0.5 ≈ 13x faster.
Which is a nice performance improvement, but not 25x.
These numbers are a good validation of the work done in PostgreSQL, but you can't look at them, look at your X-core machine, and say "my workload is going to be 25x faster!"
> Let's say you've got a web application with some cookie-cutter OLTP queries (which are well suited to serial plans ... no parallel aggregation benefit here).
In my case, the OLAP queries run against a dedicated slave that's mostly busy doing nothing, waiting for the next OLAP query to happen. So for me this will be very, very nice :-)
I was not trying to prove a point. It's just that giving a parallel speedup for a mostly CPU-bound task without mentioning the number of cores is not very interesting. 25x on 64 cores is not as interesting as 25x on 32 cores.
If you know the number of cores, you can compute that the parallel efficiency is about 80% with respect to linear scaling (25/32 ≈ 78%), which is very good.
The blog post is very good; my comment was just about adding the missing info to the HN post title.
Edit: the title of the HN post has changed since I made this comment... probably by being merged with an earlier submission of the same URL. Now I understand why HN readers are misinterpreting my comment. I did not intend to be negative.
If in doubt, choose Postgres.