Seems like a saner option (in that behavior is defined) than MySQL's ability to select columns in a group-by that are not aggregates / grouped-on.
e.g. this works in MySQL, and has for many years:
select a, b, c from table group by a
and you'll get undefined (but valid) b and c items from the grouped results per `a`. It's absurdly useful at times - if the non-grouped columns are just denormalized data, it doesn't matter which row they came from, and there's no need to spend time sorting/maxing/etc. And sometimes you truly don't care which entry you get, you just want an entry, e.g. to show a sample of what's in the group.
Thankfully, though, Postgres' DISTINCT ON plays nicely with ordering, and that's where I think it shines. If you have revisioned historical data, it's one of the clearest and most expressive ways to pull the most recent record per key. It also works well for MIN/MAX-based selections or more complicated ordering schemes - at my day job we have a common data set ordered along a three-component index, and we've ended up wrapping one of our most common DISTINCT ON expressions in a view.
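As a sketch of the "most recent record per key" pattern (table and column names here are made up for illustration):

```sql
-- One row per doc_id: the revision with the latest revised_at.
-- ORDER BY must lead with the DISTINCT ON expression(s).
SELECT DISTINCT ON (doc_id)
       doc_id, body, revised_at
FROM   doc_revisions
ORDER  BY doc_id, revised_at DESC;
```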
Yeah, it just continues a general pattern I've seen between MySQL and Postgres.
MySQL has quite a few "impure" but useful features. Postgres pretty often comes up with a sane, predictable version some years later (which often does not perform quite as well, but I'll happily take that tradeoff).
I had no idea this existed, and that was the first use case that came to mind. I can see myself using this a lot now. I have one especially terrible dataset I'm working with at the moment where table A has a primary key, and table B is a set of denormalized transaction logs. If I want to know the status of a record in table A, I have to JOIN on table B (SELECT ... MAX) to get the most recent transaction for that PK. It's ugly, confusing to anybody else who reads it, and inefficient. DISTINCT ON seems like a significant improvement for that.
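Roughly the two shapes being compared, with hypothetical names standing in for the real schema:

```sql
-- The MAX-subquery version: find each key's latest timestamp,
-- then join back to fetch the matching row.
SELECT a.id, b.status
FROM   table_a a
JOIN   table_b b ON b.a_id = a.id
JOIN  (SELECT a_id, MAX(logged_at) AS logged_at
       FROM   table_b
       GROUP  BY a_id) latest
       ON latest.a_id = b.a_id AND latest.logged_at = b.logged_at;

-- The DISTINCT ON rewrite: one pass, intent is obvious.
SELECT a.id, b.status
FROM   table_a a
JOIN  (SELECT DISTINCT ON (a_id) a_id, status
       FROM   table_b
       ORDER  BY a_id, logged_at DESC) b ON b.a_id = a.id;
```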
Note: it would be theoretically trivial to actually normalize this schema, but sadly that’s beyond my remit. At first glance I can’t think of a scenario where I would choose this design pattern unless I was forced to work within similar constraints (which does happen all the time in the real world).
If performance is a concern a MATERIALIZED VIEW might be a good tool to use - I've worked in similar sounding legacy environments and usually, while the underlying structures are immutable due to layers of code, adding a mat view or two on top is do-able.
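A minimal sketch of that approach, reusing the hypothetical names from above; note CONCURRENTLY needs a unique index on the view:

```sql
CREATE MATERIALIZED VIEW latest_status AS
SELECT DISTINCT ON (a_id) a_id, status, logged_at
FROM   table_b
ORDER  BY a_id, logged_at DESC;

-- Required before REFRESH ... CONCURRENTLY can be used.
CREATE UNIQUE INDEX ON latest_status (a_id);

-- Run on whatever schedule staleness tolerates; doesn't block readers.
REFRESH MATERIALIZED VIEW CONCURRENTLY latest_status;
```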
This particular use-case wouldn’t be suited to MVs, because it’s infrequently accessed, and cannot tolerate stale data. So refreshing the view just kinda kicks the can down the road a little bit. Other DB engines support a FAST or INCREMENTAL refresh, which can even support queries with aggregates (as long as certain conditions are met). For instance Oracle achieves this with MATERIALIZED VIEW LOGs, which are essentially just transaction logs for the MV.
It’s a bit of a moot point for me in this case though, as I’m not allowed to make any DDL changes to this DB, I just have to produce periodic reports. Which is perhaps a bit of a contrived situation, but it’s absolutely not my first time dealing with it. In reality there are other, even more trivial solutions to this problem. It could be solved by adding a column to table A (or creating a new related table), and then adding an update to the code that inserts the transaction logs (or even just with a table trigger - though that’s a bit more spaghetti). That would be all that’s required to bring this schema (or at least this particular portion of it) in line with a normal form. This particular DB is a large component in a generally spaghettified architecture, that lots of stakeholders write to and consume data from, so over the years it’s grown too complex to reliably tolerate simple changes. I imagine constraints like these are what drive a lot of use for features like DISTINCT ON, where the more “correct” approach might just be to fix your schema.
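The trigger variant might look something like this (a sketch only - names are hypothetical, and EXECUTE FUNCTION assumes Postgres 11+):

```sql
-- Keep a denormalized current_status column on table_a in sync
-- as rows land in the transaction log.
ALTER TABLE table_a ADD COLUMN current_status text;

CREATE FUNCTION sync_status() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  UPDATE table_a SET current_status = NEW.status WHERE id = NEW.a_id;
  RETURN NEW;
END $$;

CREATE TRIGGER table_b_sync
AFTER INSERT ON table_b
FOR EACH ROW EXECUTE FUNCTION sync_status();
```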
I’ve created a custom aggregate in Postgres that lets you essentially get the same result as what MySQL does:
select a, last(b), last(c) from table group by a
As its name implies, it returns the last value the aggregate encountered for the group. You can add an “order by” within the parentheses to make it deterministic.
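For reference, a common recipe for such an aggregate (essentially the one that circulates on the Postgres wiki) is just a state function that keeps the most recent value seen:

```sql
-- Transition function: discard the accumulated state, keep the new value.
CREATE OR REPLACE FUNCTION last_agg(anyelement, anyelement)
RETURNS anyelement LANGUAGE sql IMMUTABLE AS
$$ SELECT $2 $$;

CREATE AGGREGATE last (anyelement) (
  SFUNC = last_agg,
  STYPE = anyelement
);

-- ORDER BY inside the call pins down which row "last" means:
SELECT a, last(b ORDER BY x), last(c ORDER BY x)
FROM   t
GROUP  BY a;
```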
Isn’t this what the LAST_VALUE window function does? If so, that’s a standardized SQL syntax which is supported in both postgresql and recent versions of MySQL.
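One caveat if you try it: LAST_VALUE’s default window frame only extends to the current row, so without an explicit frame it just returns the current row’s value. Something like this (hypothetical table) is needed to see the whole partition:

```sql
SELECT a,
       last_value(b) OVER (
         PARTITION BY a ORDER BY x
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       )
FROM t;
```

And since it’s a window function rather than an aggregate, it returns one row per input row, so you still need a DISTINCT or GROUP BY on top to collapse to one row per group.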
I'm curious about that aggregate and a bit scared by the different ORDER BYs - it looks like the function would be unable to re-use index reads for other column values.
Do you have much experience with Postgres WINDOW[1] expressions? That might be a useful tool for you even if it takes a bit of time to fully grok it. I don't use WINDOW often but one place I found it was insanely useful was for supplying a tabular page with aggregate data points outside of the data currently being paged to the client - we have a query that returns row information in a paged manner but the query driving it both narrows the row scope to fit the page size request while also using WINDOW functions over the un-partitioned data to calculate a few business useful averages.
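That pattern looks roughly like this - hypothetical table and columns, but the trick is that an empty OVER () computes across every row matching the WHERE clause, before LIMIT/OFFSET trims to the page:

```sql
SELECT id,
       amount,
       count(*)    OVER () AS total_rows,   -- over all matching rows
       avg(amount) OVER () AS overall_avg   -- not just this page
FROM   orders
WHERE  status = 'open'
ORDER  BY created_at DESC
LIMIT  25 OFFSET 50;
```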
> it looks like the function would be unable to re-use index reads for other column values.
PostgreSQL would have to do a separate sort for each different ORDER BY, yes.
Also, it can't use the index to scan only some values as it could for DISTINCT ON, because the aggregate is opaque. That may or may not actually matter though.
Another option (though more verbose) is LATERAL plus LIMIT 1.
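A sketch of that shape, with made-up names; a matching index on (a_id, logged_at DESC) is what makes the per-row LIMIT 1 cheap:

```sql
SELECT a.id, b.status
FROM   table_a a
CROSS  JOIN LATERAL (
  SELECT status
  FROM   table_b
  WHERE  table_b.a_id = a.id
  ORDER  BY logged_at DESC
  LIMIT  1
) b;
```

(Use LEFT JOIN LATERAL ... ON true instead if you need to keep table_a rows that have no log entries.)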
> Do you have much experience with Postgres WINDOW expressions?
Yes.
SELECT DISTINCT
a,
first_value(b) OVER (PARTITION BY a ORDER BY x),
first_value(c) OVER (PARTITION BY a ORDER BY x DESC)
FROM table
One downside is that I can't use FILTER with those as with aggregates, though sometimes you can hack something together with PARTITION/ORDER BY/RANGE.
yea, that's much clearer about intent. the other form is rather accident-prone.
do you know if it allows different row values for a and b though? (tbh I'm not sure if the old version did or did not, but I've never seen it return data blended from multiple rows...)