Seems like a saner option (in that behavior is defined) than MySQL's ability to select columns in a group-by that are not aggregates / grouped-on.
e.g. this works in MySQL, and has for many years:
select a, b, c from table group by a
and you'll get undefined (but valid) b and c items from the grouped results per `a`. It's absurdly useful at times - if the non-grouped columns are just denormalized data, it doesn't matter which row they came from, and there's no need to spend time sorting/maxing/etc. And sometimes you truly don't care which entry you get, you just want an entry, e.g. to show a sample of what's in the group.
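Sketched concretely (table and column names made up): since MySQL 5.7 the default ONLY_FULL_GROUP_BY mode rejects the bare-column form, but ANY_VALUE() restores the "any row will do" behavior explicitly:

```sql
-- Pre-5.7 default (or with ONLY_FULL_GROUP_BY disabled):
-- b and c come from an arbitrary row within each group.
SELECT a, b, c FROM t GROUP BY a;

-- With ONLY_FULL_GROUP_BY on (the 5.7+ default), the above errors out;
-- ANY_VALUE() states the "I don't care which row" intent explicitly.
SELECT a, ANY_VALUE(b), ANY_VALUE(c) FROM t GROUP BY a;
```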
Thankfully, though, Postgres' DISTINCT ON does play nicely with ordering, and that's where I think it shines. If you have revisioned historical data, this is one of the best ways to clearly and expressively pull off most-recent records. It also works well for MIN/MAX-based selections or more complicated ordering schemes - in my workday setting we have a common ordered data set that we order along a three-component index, and we've ended up wrapping one of our most common DISTINCT ON expressions in a view.
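A sketch of that "most recent revision per record" pattern (schema hypothetical): DISTINCT ON keeps the first row of each group under the query's ORDER BY, so sorting revisions newest-first per document yields exactly one latest row per document:

```sql
-- One row per document: the revision with the highest revised_at.
SELECT DISTINCT ON (doc_id)
       doc_id, revised_at, body
FROM doc_revisions
ORDER BY doc_id, revised_at DESC;
```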
Yeah, it just continues a general pattern I've seen between MySQL and Postgres.
MySQL has quite a few "impure" but useful features. Postgres pretty often comes up with a sane, predictable version some years later (which often does not perform quite as well, but I'll happily take that tradeoff).
I had no idea this existed, and that was the first use case that came to mind. I can see myself using this a lot now. I have one especially terrible dataset I’m working with at the moment where table A has a primary key, and table B is a set of denormalized transaction logs. If I want to know the status of a record in table A, I have to JOIN on table B (SELECT ... MAX) to get the most recent transaction for that PK. It’s ugly, confusing to anybody else who reads it, and inefficient. DISTINCT ON seems like a significant improvement for that.
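The two shapes side by side, against a hypothetical version of that schema (all names assumed):

```sql
-- The JOIN-on-MAX shape: join back to find the row holding MAX(created_at).
SELECT a.id, b.status
FROM table_a a
JOIN table_b b ON b.a_id = a.id
JOIN (
  SELECT a_id, MAX(created_at) AS created_at
  FROM table_b
  GROUP BY a_id
) latest USING (a_id, created_at);

-- The DISTINCT ON version: one pass over the log, and the intent is obvious.
SELECT a.id, b.status
FROM table_a a
JOIN (
  SELECT DISTINCT ON (a_id) a_id, status
  FROM table_b
  ORDER BY a_id, created_at DESC
) b ON b.a_id = a.id;
```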
Note: it would be theoretically trivial to actually normalize this schema, but sadly that’s beyond my remit. At first glance I can’t think of a scenario where I would choose this design pattern unless I was forced to work within similar constraints (which does happen all the time in the real world).
If performance is a concern a MATERIALIZED VIEW might be a good tool to use - I've worked in similar sounding legacy environments and usually, while the underlying structures are immutable due to layers of code, adding a mat view or two on top is do-able.
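A minimal sketch of that approach (names carried over from the hypothetical schema above): the mat view caches the DISTINCT ON result, and a unique index lets you refresh without blocking readers:

```sql
-- Materialize the "latest status per record" query; refresh on whatever
-- cadence the staleness budget allows.
CREATE MATERIALIZED VIEW latest_status AS
SELECT DISTINCT ON (a_id) a_id, status, created_at
FROM table_b
ORDER BY a_id, created_at DESC;

-- CONCURRENTLY avoids locking readers, but requires a unique index.
CREATE UNIQUE INDEX ON latest_status (a_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY latest_status;
```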
This particular use-case wouldn’t be suited to MVs, because it’s infrequently accessed, and cannot tolerate stale data. So refreshing the view just kinda kicks the can down the road a little bit. Other DB engines support a FAST or INCREMENTAL refresh, which can even support queries with aggregates (as long as certain conditions are met). For instance Oracle achieves this with MATERIALIZED VIEW LOGs, which are essentially just transaction logs for the MV.
It’s a bit of a moot point for me in this case though, as I’m not allowed to make any DDL changes to this DB, I just have to produce periodic reports. Which is perhaps a bit of a contrived situation, but it’s absolutely not my first time dealing with it.

In reality there are other, even more trivial, solutions to this problem. It could be solved by adding a column to table A (or creating a new related table), and then adding an update to the code that inserts the transaction logs (or even just with a table trigger - though that’s a bit more spaghetti). That would be all that’s required to bring this schema (or at least this particular portion of it) in line with a normal form. This particular DB is a large component in a generally spaghettified architecture, which lots of stakeholders write to and consume data from, so over the years it’s grown too complex to reliably tolerate simple changes. I imagine constraints similar to these are what drive a lot of use for features like DISTINCT ON, where the more “correct” approach might just be to fix your schema.
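The trigger variant mentioned above might look something like this (all names hypothetical; EXECUTE FUNCTION is the PG 11+ spelling, older versions use EXECUTE PROCEDURE):

```sql
-- Keep a current_status column on table_a in sync as log rows arrive.
CREATE FUNCTION sync_status() RETURNS trigger AS $$
BEGIN
  UPDATE table_a SET current_status = NEW.status WHERE id = NEW.a_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_status_trg
AFTER INSERT ON table_b
FOR EACH ROW EXECUTE FUNCTION sync_status();
```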
I’ve created a custom aggregate in Postgres that lets you essentially get the same result as what MySQL does:
select a, last(b), last(c) from table group by a
As its name implies, it returns the last value that the aggregate encountered for the group. You can add an “order by” within the parentheses to make it deterministic.
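One common way to define such an aggregate (this is a well-known snippet, not necessarily the parent's exact definition):

```sql
-- State transition function: always replace the state with the newest input.
CREATE OR REPLACE FUNCTION last_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE AS
$$ SELECT $2 $$;

CREATE AGGREGATE last (anyelement) (
  SFUNC = last_agg,
  STYPE = anyelement
);

-- Deterministic when given an ORDER BY inside the call:
SELECT a, last(b ORDER BY x), last(c ORDER BY x)
FROM t GROUP BY a;
```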
Isn’t this what the LAST_VALUE window function does? If so, that’s a standardized SQL syntax which is supported in both postgresql and recent versions of MySQL.
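Close, though LAST_VALUE has a classic footgun: the default window frame ends at the current row, so it just returns the current row's value unless you widen the frame explicitly (sketch, names assumed):

```sql
-- The default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
-- so LAST_VALUE would only ever see up to the current row. Widen it:
SELECT DISTINCT a,
       LAST_VALUE(b) OVER (
         PARTITION BY a ORDER BY x
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS last_b
FROM t;
```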
I'm curious about that aggregate and a bit scared by the different ORDER BYs, it looks like the function would be unable to re-use index reads for other column values.
Do you have much experience with Postgres WINDOW[1] expressions? That might be a useful tool for you even if it takes a bit of time to fully grok. I don't use WINDOW often, but one place I found it insanely useful was for supplying a tabular page with aggregate data points outside of the data currently being paged to the client - we have a query that returns row information in a paged manner, where the query driving it narrows the row scope to fit the page-size request while also using WINDOW functions over the un-partitioned data to calculate a few business-useful averages.
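A sketch of that paging pattern (table and column names hypothetical): the window aggregates are computed over the full filtered set before LIMIT/OFFSET applies, so each page carries whole-result-set numbers:

```sql
SELECT id, amount,
       count(*)    OVER () AS total_rows,   -- across all matching rows,
       avg(amount) OVER () AS avg_amount    -- not just this page
FROM orders
WHERE status = 'open'
ORDER BY id
LIMIT 50 OFFSET 100;
```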
> it looks like the function would be unable to re-use index reads for other column values.
PostgreSQL would have to do a separate sort for each different ORDER BY, yes.
Also, it can't use the index to scan only some values as it could for DISTINCT ON, because the aggregate is opaque. That may or may not actually matter though.
Another option (though more verbose) is LATERAL plus LIMIT 1.
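That LATERAL form, sketched against a hypothetical schema - one index-assisted "top 1" probe per distinct key:

```sql
SELECT g.a, latest.b, latest.c
FROM (SELECT DISTINCT a FROM t) g
CROSS JOIN LATERAL (
  SELECT b, c
  FROM t
  WHERE t.a = g.a
  ORDER BY x DESC
  LIMIT 1
) latest;
```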
> Do you have much experience with Postgres WINDOW expressions?
Yes.
SELECT DISTINCT
a,
first_value(b) OVER (PARTITION BY a ORDER BY x),
first_value(c) OVER (PARTITION BY a ORDER BY x DESC)
FROM table
One downside is that I can't use FILTER with those as with aggregates, though sometimes you can hack something together with PARTITION/ORDER BY/RANGE
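For reference, the aggregate FILTER form being contrasted here (hypothetical columns) - FILTER attaches to aggregate calls, but not to true window functions like first_value:

```sql
-- Works: FILTER on a plain aggregate.
SELECT a,
       min(b) FILTER (WHERE c > 0) AS min_positive_b
FROM t
GROUP BY a;

-- Does not work: first_value(...) FILTER (...) OVER (...) is rejected.
```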
Yea, that's much clearer about intent. The other form is rather accident-prone.
do you know if it allows different row values for a and b though? (tbh I'm not sure if the old version did or did not, but I've never seen it return data blended from multiple rows...)
One really useful callout is that it supports expressions, not just columns. You can create derived entities via case statements (think binning, entity normalization, etc) and leverage the super handy functionality against those derived entities.
It also allows you to use multiple entities/expressions in the distinct on, so you can pivot the logic between granularity levels. Which comes in handy when needing to pull latest/first/biggest/smallest style rollups from varying perspectives.
I threw together a Fiddle that demonstrates it here[1]. It shows the basic single-column version, a multi-column distinct on, and an example that uses a case statement for a derived column. The only thing with the third example is that I don't believe you can alias the derived column within the distinct on field itself, so you have to repeat the expression - once within the distinct on and again in the column list of the select statement.
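A sketch of that derived-column version (binning by CASE; table and column names hypothetical). Note the expression is repeated, as described, and must also lead the ORDER BY:

```sql
-- One row per bucket: the largest order in each derived bin.
SELECT DISTINCT ON (CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END)
       CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END AS bucket,
       id, amount
FROM orders
ORDER BY CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END,
         amount DESC;
```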
Distinct on is for sure one of the most useful Postgres extensions to SQL, given how concise it makes queries that would otherwise suck to write.
I would have liked to see equivalent queries written with GROUP BY so people could understand and translate that knowledge even if they don't have direct Postgres experience.
SELECT DISTINCT ON (location) location, time, report
FROM weather_reports
ORDER BY location, time DESC;
could also be written with a self-join like so
SELECT location, time, report
FROM weather_reports
JOIN (
SELECT location, max(time) AS time
FROM weather_reports
GROUP BY location
) as _ USING (location, time)
ORDER BY location;
though I'd be curious what the query-plan differences are. I suspect the DISTINCT ON might translate more closely to a lateral join
SELECT location, time, report
FROM (
SELECT DISTINCT location
FROM weather_reports
) AS locations
JOIN LATERAL (
SELECT time, report
FROM weather_reports
WHERE weather_reports.location = locations.location
ORDER BY time DESC
LIMIT 1
) AS reports ON TRUE
ORDER BY location;
In general if you query a table twice, the query planner will visit that table twice. So DISTINCT ON is almost always more efficient.
The main exception I’ve seen is when using it in a view, if you try to filter the results of the view, it usually can’t optimize a DISTINCT ON, but sometimes can for GROUP BY (though that might have improved on newer versions)
> if an aggregate query contains a single min() or max() function, then the values of columns used in the output are taken from the row where the min() or max() value was achieved. If two or more rows have the same min() or max() value, then the columns values will be chosen arbitrarily from one of those rows.
> When the min() or max() aggregate functions are used in an aggregate query, all bare columns in the result set take values from the input row which also contains the minimum or maximum. [...] There is still an ambiguity if two or more of the input rows have the same minimum or maximum value or if the query contains more than one min() and/or max() aggregate function.
Would have been nice if adding an ORDER BY would allow disambiguating.
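The quoted SQLite behavior, demonstrated concretely (table and data made up for the demo):

```sql
-- SQLite: the bare column `value` is taken from the row where
-- max(ts) was achieved within each group.
CREATE TABLE readings (device TEXT, ts INTEGER, value REAL);
INSERT INTO readings VALUES ('a', 1, 10.0), ('a', 2, 20.0), ('b', 1, 5.0);

SELECT device, value, max(ts)
FROM readings
GROUP BY device;
-- yields: a|20.0|2 and b|5.0|1
```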
I've been struggling with a similar query recently perhaps better suited for a time series DB. Can DISTINCT ON be used to ask the query "latest record per customer per day". For example, let's say we have a bunch of devices that send sensor readings arbitrarily throughout each day. Could DISTINCT ON be used to query the last reading of each day per device?
Absolutely. You'll need an expression to get the date out of the timestamp, then you want rows that are distinct on customer_id & date order by timestamp desc.
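Sketched with an assumed schema - the date expression appears in both the DISTINCT ON list and the ORDER BY, with the timestamp as the tiebreaker:

```sql
-- Last reading of each day per device.
SELECT DISTINCT ON (device_id, ts::date)
       device_id, ts, reading
FROM sensor_readings
ORDER BY device_id, ts::date, ts DESC;
```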
I've been using Timescale DB, a postgres extension, to store our sensor-produced timeseries data and it handles this pretty well. DISTINCT ON seems to ruin the query plan there, by forcing sequential, non-index scans of each subtable (timescale breaks tables into time-buckets behind the scenes, so it can skip searching tables in time-filtered queries). I've had good performance with their "time_bucket" and "last" functions, though, and not having to use some NoSQL time series database has been really nice.
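For comparison, the time_bucket/last version (TimescaleDB functions; schema assumed) - last(reading, ts) returns the reading from the row with the greatest ts in each group:

```sql
SELECT device_id,
       time_bucket('1 day', ts) AS day,
       last(reading, ts) AS last_reading
FROM sensor_readings
GROUP BY device_id, day
ORDER BY device_id, day;
```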
DISTINCT ON, when I stumbled upon it rather accidentally a few years ago, was a life saver in some complicated queries in some even more complicated stored functions in our telephony infrastructure product, where only the first route offered by any vendor that otherwise matches a set of criteria is desired.
The alternative would have been a lot less concise and would have required a much longer stored procedure.
This advice grew out of the days when database platforms cost enormous amounts of money. It is still as true as it ever was, but since then the biggest pressure that led to it no longer applies.
Back before PG and others were viable competitors to the big commercial databases, about the only leverage customers had against aggressive sales people was the threat to switch to a competitor. So people would preserve that as a credible threat by not using proprietary SQL extensions.
No vampiric salesdroid from Postgres is going to knock on your door demanding 8 figures.
Now, there may be other reasons you want to switch away from it, or you may want to support a variety of DBs in whatever you're building, or some other reason. So it is good to be aware of. But shying away from SQL extensions is usually no longer a matter of budgetary apocalypse.
Do you mean portable as in able to run on different databases? Your application would have to be fairly simple to achieve that goal. We moved an application from Oracle to Postgres. It took about 2 years to complete.
Many applications are built on ORMs which are pretty portable. About the only thing I regularly depend on that is proprietary per RDBMS is "last insert autoincrement id" since it's not part of the SQL standard. Binary blobs are another big one.
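The last-insert-id dance is indeed spelled differently everywhere (illustrative; table name made up):

```sql
-- MySQL
INSERT INTO users (name) VALUES ('x');
SELECT LAST_INSERT_ID();

-- PostgreSQL: RETURNING avoids the extra round trip entirely
INSERT INTO users (name) VALUES ('x') RETURNING id;

-- SQLite
INSERT INTO users (name) VALUES ('x');
SELECT last_insert_rowid();
```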
If you start out minimizing proprietary features, it's a lot easier to do. Oracle is filled with proprietary features so I'm not surprised it was a lot of work. My application is pretty simple and I converted from mysql to postgres in an afternoon.
At my first job I worked on a program that supported MySQL, MSSQL and Postgres[1]. Now I work on a project that's Postgres only and, honestly, I don't see the need to put any effort into SQL neutrality these days.
1. Only because I hacked it in during my free-time since it is such a trivial dialect to support.