At last! This was the only feature I was missing in Postgres for years.
Postgres is so amazing I still find it hard to believe it's free. I abused it many times (as a graph database; as a real-time financial data analytics engine with thousands of new ticks coming in each second; as a document storage with several TBs of data per node), and amazingly, it just worked. Magic.
They have used adjacency lists, which looks strange to me — this is a common way of representing graph data in memory, but doesn't map well to the relational model. I have used "extreme normalization", where all edges are represented as many-to-many relationships with associative tables.
Though I stored document relationships and social graphs, not semantic graphs as they describe in this paper. I have little experience with semantic graphs; perhaps they have different requirements.
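For illustration, the "extreme normalization" approach might look like this (all table and column names here are hypothetical, just a sketch of the idea):

```sql
-- Nodes of the graph; the payload column is incidental.
CREATE TABLE node (
    id      bigserial PRIMARY KEY,
    payload jsonb
);

-- Each edge is one row in an associative (many-to-many) table.
CREATE TABLE edge (
    src_id bigint NOT NULL REFERENCES node (id),
    dst_id bigint NOT NULL REFERENCES node (id),
    PRIMARY KEY (src_id, dst_id)
);

-- The one-hop neighbourhood of node 42 is then a plain join:
SELECT n.*
FROM edge e
JOIN node n ON n.id = e.dst_id
WHERE e.src_id = 42;
```

Deeper traversals can be expressed with recursive CTEs, at the cost of one self-join per hop.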
How many joins did it have (sounds very joinery :-D), or how deep was your graph? Presumably it was doing full table scans for everything. Sounds very interesting anyway.
I have a Postgres database with 25 bn rows on a single host with application level partitioning and a parallel query infrastructure. Surely abuse, but Postgres has handled it beautifully.
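A minimal sketch of what application-level partitioning can look like (hypothetical names; the application, not Postgres, routes queries to per-period tables and merges the results):

```sql
-- Hypothetical per-month shard tables managed by the application.
CREATE TABLE ticks_2016_01 (ts timestamptz, symbol text, price numeric);
CREATE TABLE ticks_2016_02 (LIKE ticks_2016_01);

-- The application knows which shards a time range touches and issues
-- one query per shard (possibly concurrently on several connections):
SELECT symbol, avg(price)
FROM ticks_2016_01
WHERE ts >= '2016-01-10' AND ts < '2016-01-20'
GROUP BY symbol;
-- ...then combines the per-shard results client-side.
```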
I built something very similar once. My tip would be to put an HTTP interface in front of it for some cache guarantees (i.e. no arbitrary writes, so you can trivially flush, etc.).
While I was listening to this presentation, I thought it was quite a weird piece of functionality that overlaps a lot with the transaction visibility rules MVCC already has. So I'm curious: what is your use case?
Analysing open data portals. PG holds status data on the retrievability of links. Depicting the evolution of such an open data portal over time is possible with custom datetime columns and window functions; it's just tedious and error-prone.
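A sketch of the tedious part (the schema here is hypothetical): finding the points in time where a link's status changed, using `lag()`. The subquery is needed because window functions can't appear in `WHERE`:

```sql
-- Hypothetical table: one row per link check.
-- link_status(link_id bigint, checked_at timestamptz, reachable boolean)
SELECT *
FROM (
    SELECT link_id,
           checked_at,
           reachable,
           lag(reachable) OVER (PARTITION BY link_id
                                ORDER BY checked_at) AS prev_reachable
    FROM link_status
) t
WHERE reachable IS DISTINCT FROM prev_reachable;  -- status transitions only
```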
Do you know any commercial database with comparable level of documentation and community support?
I worked with a certain other database from a company starting with "O" and ending with "racle", and the support/documentation quality was abysmal. Our support contract was quite basic, but still, I expected it to be better than the free alternatives.
In comparison to what? MySQL? Perhaps. MySQL is a little simpler and quicker to pick up, but one must remember that the original intent of MySQL was to stop devs from using text files for storage...
But in comparison to an enterprise solution like DB2 or Oracle? Not even close...
If you are trying to pull nosql into the argument, that is not even apples to oranges, plus with nosql, you get to retool your entire architecture.
It seems the parallel aggregation increases efficiency of queries linearly with regard to CPUs, close to "perfect parallelisation." I have some 8-core boxes running PostgreSQL and this is still good to know.
I'm not sure "efficiency" is the right term here. My understanding is that to get "more efficient" execution, you'd have to perform better with the same amount of resources.
But parallelism allows using more resources (particularly CPU cores) for a query, while the overall efficiency (the total number of instructions, CPU time etc.) remains about the same. It's just that it's split over multiple cores.
So if all CPU cores are already saturated on your machine (e.g. because you're already running multiple queries), it's not going to improve things.
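To see the distinction in practice (`big_table` is a hypothetical table; the plan shape below is the typical one for PostgreSQL 9.6 parallel aggregation, though details vary):

```sql
-- Allow up to 4 workers per Gather node (0 disables parallelism).
SET max_parallel_workers_per_gather = 4;

EXPLAIN SELECT count(*) FROM big_table;
-- A parallel aggregate plan typically has this shape:
--   Finalize Aggregate
--     ->  Gather  (Workers Planned: 4)
--           ->  Partial Aggregate
--                 ->  Parallel Seq Scan on big_table
```

Each worker scans a slice of the table and produces a partial count; the leader's Finalize Aggregate combines them. Roughly the same total work is done, just spread over more cores at once.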
I just love PostgreSQL. It is an awesome gift to the open source community. I just can't believe it is free. Seriously, if you have not tried PostgreSQL before, please find time to check it out.
I'm looking at evaluating Postgres for the data work I do at work, which would replace the (expensive) MS SQL server we are currently using. Aside from performance boosts from being able to throw more hardware at the problem due to lower costs, is Postgres as performant as MS SQL?
I think that is a tough question to answer because it really depends on what you're doing with the data/database. Some queries may run faster on one RDBMS and some may run faster on another RDBMS. I would like an answer to this question as well. I did a quick search and I haven't found any modern comparisons of performance between databases, but it would be great if there was a more modern and expanded version of the SQLite comparison[0].
I can say that I use Postgres and have not been disappointed with performance, and it costs me $0. I encourage you to try it out. If you're a heavy user of SSMS, then you will have to change your flow a bit to accommodate psql. However, once you get the hang of psql, it's actually a pleasure to use.
- disks (SSDs) are very fast now, so cores are saturated more easily (when queries actually process the data instead of just reading it)
- multiple parallel (random) reads will likely be faster on HDD and SSD to some extent (esp. on larger RAID setups)
- the best optimization is still lots of RAM and people have that these days, so 100% CPU utilization during queries happens more often than not (the benchmarking setup seems suitable for more than 1 billion rows...).
Query engines typically aggressively cache data in memory (mmap, CreateFileMapping or such). Their testbed has 256GB of RAM meaning that the cache is likely hot the majority of the time. Even if you don't have that much memory, there might be a chance that the parallel workers are working on the same pages resulting in a non-proportional relationship between worker count and I/O ops.
Actually it's not really that expensive: when we buy Dell machines that cost ~4000 Euro, we get something like 16 cores. The most expensive thing in a server is probably the SAS disks.
Yeah, you can get an 18-core Xeon E7, plop four of them in a quad-socket mobo and you've got 72 cores, 144 HT. They're expensive as hell but if that's what you need…
I know EC2 people find it extraordinary having 32 cores, but on OVH you can get a machine with 20 cores / 40 hyperthreads and 256 GB of RAM for less than $350 a month.
> I know ec2 people find it extraordinary having 32 cores
How do you figure? As an 'EC2 person' I found that EC2 has made arbitrary specs utterly mundane and that is kind of the whole awesome point of EC2.
Need 200GB of RAM for the afternoon to process some data? No problem, just press a button and a minute later you have 200GB of RAM for the afternoon to process some data. Then just turn it off when you are done and forget about it.
If you need it, it's a bit over $1000/month. On the one hand that's perhaps 2-3x more than renting a dedicated server. On the other hand it's only 2-3x more, and if you have other stuff on AWS and make use of the rest of their features, it might be worthwhile.
Yup. Small multiples of price inefficiency mostly don't bear thinking about until they group together in large batches. The time it takes your engineers to work out the difference is rarely worth it.
They're not always sitting idle waiting for one big OLAP query (well suited to parallel aggregation) to come in to exhibit a 25x speedup.
Let's say you've got a web application with some cookie-cutter OLTP queries (which are well suited to serial plans ... no parallel aggregation benefit here). If your CPUs are running at 50%, then some big OLAP query that comes in only has roughly half the cores to itself, and gets maybe 25 × 0.5 ≈ 13x faster.
Which is a nice performance improvement, but not 25x.
These numbers are a good validation of the work done in PostgreSQL, but you can't look at them, look at your X-core machine, and say "my workload is going to be 25x faster!"
> Let's say you've got a web application with some cookie-cutter OLTP queries (which are well suited to serial plans ... no parallel aggregation benefit here).
In my case, the OLAP queries run against a dedicated slave that's mostly busy doing nothing, waiting for the next OLAP query to happen. So for me this will be very, very nice :-)
I was not trying to prove a point. It's just that giving a parallel speedup for a mostly CPU-bound task without mentioning the number of cores is not very interesting. 25x on 64 cores is not as interesting as 25x on 32 cores.
If you know the number of cores, you can compute that the parallel efficiency is about 80% with respect to linear scaling (25/32 ≈ 78%), which is very good.
The blog post is very good; my comment was just about adding the missing info to the HN post title.
Edit: the title of the HN post has changed since I made this comment... probably by being merged with an earlier submission of the same URL. Now I understand why HN readers are misinterpreting my comment. I did not intend to be negative.
If in doubt, choose Postgres.