Seems like a saner option (in that behavior is defined) than MySQL's ability to select columns in a group-by that are not aggregates / grouped-on.
e.g. this works in MySQL, and has for many years:
select a, b, c from table group by a
and you'll get undefined (but valid) b and c items from the grouped results per `a`. It's absurdly useful at times - if the non-grouped columns are just denormalized data, it doesn't matter which row they came from, and there's no need to spend time sorting/maxing/etc. And sometimes you truly don't care which entry you get, you just want an entry, e.g. to show a sample of what's in the group.
Thankfully, though, Postgres' DISTINCT ON plays nicely with ordering, and that's where I think it shines. If you have revisioned historical data, it's one of the clearest and most expressive ways to pull the most recent record per key. It also works well for MIN/MAX-based selections or more complicated ordering schemes - at my day job we have a common data set ordered along a three-component index, and we've ended up wrapping one of our most common DISTINCT ON expressions in a view.
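As a sketch of the "most recent record per key" pattern (table and column names here are made up for illustration):

```sql
-- One row per doc_id: the revision with the latest revised_at.
-- ORDER BY must lead with the DISTINCT ON expression(s).
SELECT DISTINCT ON (doc_id)
       doc_id, body, revised_at
FROM   doc_revisions
ORDER  BY doc_id, revised_at DESC;
```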
Yeah, it just continues a general pattern I've seen between MySQL and Postgres.
MySQL has quite a few "impure" but useful features. Postgres pretty often comes up with a sane, predictable version some years later (which often does not perform quite as well, but I'll happily take that tradeoff).
I had no idea this existed, and that was the first use case that came to mind. I can see myself using this a lot now. I have one especially terrible dataset I'm working with at the moment where table A has a primary key, and table B is a set of denormalized transaction logs. If I want to know the status of a record in table A, I have to JOIN on table B (SELECT ... MAX) to get the most recent transaction for that PK. It's ugly, confusing to anybody else who reads it, and inefficient. DISTINCT ON seems like a significant improvement for that.
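Roughly the two shapes being compared, with hypothetical names standing in for the real schema:

```sql
-- The MAX-subquery version: find each key's latest timestamp,
-- then join back to fetch the matching row.
SELECT a.id, b.status
FROM   table_a a
JOIN   table_b b ON b.a_id = a.id
JOIN  (SELECT a_id, MAX(logged_at) AS logged_at
       FROM   table_b
       GROUP  BY a_id) latest
       ON latest.a_id = b.a_id AND latest.logged_at = b.logged_at;

-- The DISTINCT ON rewrite: one pass, intent is obvious.
SELECT a.id, b.status
FROM   table_a a
JOIN  (SELECT DISTINCT ON (a_id) a_id, status
       FROM   table_b
       ORDER  BY a_id, logged_at DESC) b ON b.a_id = a.id;
```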
Note: it would be theoretically trivial to actually normalize this schema, but sadly that’s beyond my remit. At first glance I can’t think of a scenario where I would choose this design pattern unless I was forced to work within similar constraints (which does happen all the time in the real world).
If performance is a concern a MATERIALIZED VIEW might be a good tool to use - I've worked in similar sounding legacy environments and usually, while the underlying structures are immutable due to layers of code, adding a mat view or two on top is do-able.
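A minimal sketch of that approach, reusing the hypothetical names from above; note CONCURRENTLY needs a unique index on the view:

```sql
CREATE MATERIALIZED VIEW latest_status AS
SELECT DISTINCT ON (a_id) a_id, status, logged_at
FROM   table_b
ORDER  BY a_id, logged_at DESC;

-- Required before REFRESH ... CONCURRENTLY can be used.
CREATE UNIQUE INDEX ON latest_status (a_id);

-- Run on whatever schedule staleness tolerates; doesn't block readers.
REFRESH MATERIALIZED VIEW CONCURRENTLY latest_status;
```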
This particular use-case wouldn’t be suited to MVs, because it’s infrequently accessed, and cannot tolerate stale data. So refreshing the view just kinda kicks the can down the road a little bit. Other DB engines support a FAST or INCREMENTAL refresh, which can even support queries with aggregates (as long as certain conditions are met). For instance Oracle achieves this with MATERIALIZED VIEW LOGs, which are essentially just transaction logs for the MV.
It’s a bit of a moot point for me in this case though, as I’m not allowed to make any DDL changes to this DB, I just have to produce periodic reports. Which is perhaps a bit of a contrived situation, but it’s absolutely not my first time dealing with it. In reality there are other, even more trivial solutions to this problem. It could be solved by adding a column to table A (or creating a new related table), and then adding an update to the code that inserts the transaction logs (or even just with a table trigger - though that’s a bit more spaghetti). That would be all that’s required to bring this schema (or at least this particular portion of it) in line with a normal form. This particular DB is a large component in a generally spaghettified architecture, that lots of stakeholders write to and consume data from, so over the years it’s grown too complex to reliably tolerate simple changes. I imagine constraints like these are what drive a lot of use for features like DISTINCT ON, where the more “correct” approach might just be to fix your schema.
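The trigger variant might look something like this (a sketch only - names are hypothetical, and EXECUTE FUNCTION assumes Postgres 11+):

```sql
-- Keep a denormalized current_status column on table_a in sync
-- as rows land in the transaction log.
ALTER TABLE table_a ADD COLUMN current_status text;

CREATE FUNCTION sync_status() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  UPDATE table_a SET current_status = NEW.status WHERE id = NEW.a_id;
  RETURN NEW;
END $$;

CREATE TRIGGER table_b_sync
AFTER INSERT ON table_b
FOR EACH ROW EXECUTE FUNCTION sync_status();
```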
I’ve created a custom aggregate in Postgres that lets you essentially get the same result as what MySQL does:
select a, last(b), last(c) from table group by a
As its name implies, it returns the last value the aggregate encountered for the group. You can add an “order by” within the parentheses to make it deterministic.
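For reference, a common recipe for such an aggregate (essentially the one that circulates on the Postgres wiki) is just a state function that keeps the most recent value seen:

```sql
-- Transition function: discard the accumulated state, keep the new value.
CREATE OR REPLACE FUNCTION last_agg(anyelement, anyelement)
RETURNS anyelement LANGUAGE sql IMMUTABLE AS
$$ SELECT $2 $$;

CREATE AGGREGATE last (anyelement) (
  SFUNC = last_agg,
  STYPE = anyelement
);

-- ORDER BY inside the call pins down which row "last" means:
SELECT a, last(b ORDER BY x), last(c ORDER BY x)
FROM   t
GROUP  BY a;
```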
Isn’t this what the LAST_VALUE window function does? If so, that’s a standardized SQL syntax which is supported in both postgresql and recent versions of MySQL.
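One caveat if you try it: LAST_VALUE’s default window frame only extends to the current row, so without an explicit frame it just returns the current row’s value. Something like this (hypothetical table) is needed to see the whole partition:

```sql
SELECT a,
       last_value(b) OVER (
         PARTITION BY a ORDER BY x
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       )
FROM t;
```

And since it’s a window function rather than an aggregate, it returns one row per input row, so you still need a DISTINCT or GROUP BY on top to collapse to one row per group.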
I'm curious about that aggregate and a bit scared by the different ORDER BYs - it looks like the function would be unable to re-use index reads for other column values.
Do you have much experience with Postgres WINDOW[1] expressions? That might be a useful tool for you even if it takes a bit of time to fully grok it. I don't use WINDOW often but one place I found it was insanely useful was for supplying a tabular page with aggregate data points outside of the data currently being paged to the client - we have a query that returns row information in a paged manner but the query driving it both narrows the row scope to fit the page size request while also using WINDOW functions over the un-partitioned data to calculate a few business useful averages.
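That pattern looks roughly like this - hypothetical table and columns, but the trick is that an empty OVER () computes across every row matching the WHERE clause, before LIMIT/OFFSET trims to the page:

```sql
SELECT id,
       amount,
       count(*)    OVER () AS total_rows,   -- over all matching rows
       avg(amount) OVER () AS overall_avg   -- not just this page
FROM   orders
WHERE  status = 'open'
ORDER  BY created_at DESC
LIMIT  25 OFFSET 50;
```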
> it looks like the function would be unable to re-use index reads for other column values.
PostgreSQL would have to do a separate sort for each different ORDER BY, yes.
Also, it can't use the index to scan only some values as it could for DISTINCT ON, because the aggregate is opaque. That may or may not actually matter though.
Another option (though more verbose) is LATERAL plus LIMIT 1.
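A sketch of that shape, with made-up names; a matching index on (a_id, logged_at DESC) is what makes the per-row LIMIT 1 cheap:

```sql
SELECT a.id, b.status
FROM   table_a a
CROSS  JOIN LATERAL (
  SELECT status
  FROM   table_b
  WHERE  table_b.a_id = a.id
  ORDER  BY logged_at DESC
  LIMIT  1
) b;
```

(Use LEFT JOIN LATERAL ... ON true instead if you need to keep table_a rows that have no log entries.)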
> Do you have much experience with Postgres WINDOW expressions?
Yes.
SELECT DISTINCT
a,
first_value(b) OVER (PARTITION BY a ORDER BY x),
first_value(c) OVER (PARTITION BY a ORDER BY x DESC)
FROM table
One downside is that I can't use FILTER with those as with aggregates, though sometimes you can hack something together with PARTITION/ORDER BY/RANGE.
yea, that's much clearer about intent. the other form is rather accident-prone.
do you know if it allows different row values for a and b though? (tbh I'm not sure if the old version did or did not, but I've never seen it return data blended from multiple rows...)