Seems like a saner option (in that behavior is defined) than MySQL's ability to select columns in a group-by that are not aggregates / grouped-on.
e.g. this works in MySQL, and has for many years:
select a, b, c from table group by a
and you'll get undefined (but valid) b and c items from the grouped results per `a`. It's absurdly useful at times - if the non-grouped columns are just denormalized data, it doesn't matter which row they came from, and there's no need to spend time sorting/maxing/etc. And sometimes you truly don't care which entry you get, you just want an entry, e.g. to show a sample of what's in the group.
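Sketched concretely (table and column names made up): since MySQL 5.7 the default ONLY_FULL_GROUP_BY mode rejects the bare-column form, but ANY_VALUE() restores the "any row will do" behavior explicitly:

```sql
-- Pre-5.7 default (or with ONLY_FULL_GROUP_BY disabled):
-- b and c come from an arbitrary row within each group.
SELECT a, b, c FROM t GROUP BY a;

-- With ONLY_FULL_GROUP_BY on (the 5.7+ default), the above errors out;
-- ANY_VALUE() states the "I don't care which row" intent explicitly.
SELECT a, ANY_VALUE(b), ANY_VALUE(c) FROM t GROUP BY a;
```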
Thankfully, though, Postgres' DISTINCT ON does play nicely with ordering, and that's where I think it shines. If you have revisioned historical data, this is one of the best ways to clearly and expressively pull off most-recent records. It also works well for MIN/MAX-based selections or more complicated ordering schemes - in my workday setting we have a common ordered data set that we order along a three-component index, and we've ended up wrapping one of our most common DISTINCT ON expressions in a view.
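A sketch of that "most recent revision per record" pattern (schema hypothetical): DISTINCT ON keeps the first row of each group under the query's ORDER BY, so sorting revisions newest-first per document yields exactly one latest row per document:

```sql
-- One row per document: the revision with the highest revised_at.
SELECT DISTINCT ON (doc_id)
       doc_id, revised_at, body
FROM doc_revisions
ORDER BY doc_id, revised_at DESC;
```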
Yeah, it just continues a general pattern I've seen between MySQL and Postgres.
MySQL has quite a few "impure" but useful features. Postgres pretty often comes up with a sane, predictable version some years later (which often does not perform quite as well, but I'll happily take that tradeoff).
I had no idea this existed, and that was the first use case that came to mind. I can see myself using this a lot now. I have one especially terrible dataset I’m working with at the moment where table A has a primary key, and table B is a set of denormalized transaction logs. If I want to know the status of a record in table A, I have to JOIN on table B (SELECT ... MAX) to get the most recent transaction for that PK. It’s ugly, confusing to anybody else who reads it, and inefficient. DISTINCT ON seems like a significant improvement for that.
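The two shapes side by side, against a hypothetical version of that schema (all names assumed):

```sql
-- The JOIN-on-MAX shape: join back to find the row holding MAX(created_at).
SELECT a.id, b.status
FROM table_a a
JOIN table_b b ON b.a_id = a.id
JOIN (
  SELECT a_id, MAX(created_at) AS created_at
  FROM table_b
  GROUP BY a_id
) latest USING (a_id, created_at);

-- The DISTINCT ON version: one pass over the log, and the intent is obvious.
SELECT a.id, b.status
FROM table_a a
JOIN (
  SELECT DISTINCT ON (a_id) a_id, status
  FROM table_b
  ORDER BY a_id, created_at DESC
) b ON b.a_id = a.id;
```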
Note: it would be theoretically trivial to actually normalize this schema, but sadly that’s beyond my remit. At first glance I can’t think of a scenario where I would choose this design pattern unless I was forced to work within similar constraints (which does happen all the time in the real world).
If performance is a concern a MATERIALIZED VIEW might be a good tool to use - I've worked in similar sounding legacy environments and usually, while the underlying structures are immutable due to layers of code, adding a mat view or two on top is do-able.
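A minimal sketch of that approach (names carried over from the hypothetical schema above): the mat view caches the DISTINCT ON result, and a unique index lets you refresh without blocking readers:

```sql
-- Materialize the "latest status per record" query; refresh on whatever
-- cadence the staleness budget allows.
CREATE MATERIALIZED VIEW latest_status AS
SELECT DISTINCT ON (a_id) a_id, status, created_at
FROM table_b
ORDER BY a_id, created_at DESC;

-- CONCURRENTLY avoids locking readers, but requires a unique index.
CREATE UNIQUE INDEX ON latest_status (a_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY latest_status;
```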
This particular use-case wouldn’t be suited to MVs, because it’s infrequently accessed, and cannot tolerate stale data. So refreshing the view just kinda kicks the can down the road a little bit. Other DB engines support a FAST or INCREMENTAL refresh, which can even support queries with aggregates (as long as certain conditions are met). For instance Oracle achieves this with MATERIALIZED VIEW LOGs, which are essentially just transaction logs for the MV.
It’s a bit of a moot point for me in this case though, as I’m not allowed to make any DDL changes to this DB, I just have to produce periodic reports. Which is perhaps a bit of a contrived situation, but it’s absolutely not my first time dealing with it.

In reality there are other, even more trivial, solutions to this problem. It could be solved by adding a column to table A (or creating a new related table), and then adding an update to the code that inserts the transaction logs (or even just with a table trigger - though that’s a bit more spaghetti). That would be all that’s required to bring this schema (or at least this particular portion of it) in line with a normal form. This particular DB is a large component in a generally spaghettified architecture, which lots of stakeholders write to and consume data from, so over the years it’s grown too complex to reliably tolerate simple changes. I imagine constraints similar to these are what drive a lot of use for features like DISTINCT ON, where the more “correct” approach might just be to fix your schema.
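The trigger variant mentioned above might look something like this (all names hypothetical; EXECUTE FUNCTION is the PG 11+ spelling, older versions use EXECUTE PROCEDURE):

```sql
-- Keep a current_status column on table_a in sync as log rows arrive.
CREATE FUNCTION sync_status() RETURNS trigger AS $$
BEGIN
  UPDATE table_a SET current_status = NEW.status WHERE id = NEW.a_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_status_trg
AFTER INSERT ON table_b
FOR EACH ROW EXECUTE FUNCTION sync_status();
```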
I’ve created a custom aggregate in Postgres that lets you essentially get the same result as what MySQL does:
select a, last(b), last(c) from table group by a
As its name implies, it returns the last value that the aggregate encountered for the group. You can add an “order by” within the parentheses to make it deterministic.
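One common way to define such an aggregate (this is a well-known snippet, not necessarily the parent's exact definition):

```sql
-- State transition function: always replace the state with the newest input.
CREATE OR REPLACE FUNCTION last_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE AS
$$ SELECT $2 $$;

CREATE AGGREGATE last (anyelement) (
  SFUNC = last_agg,
  STYPE = anyelement
);

-- Deterministic when given an ORDER BY inside the call:
SELECT a, last(b ORDER BY x), last(c ORDER BY x)
FROM t GROUP BY a;
```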
Isn’t this what the LAST_VALUE window function does? If so, that’s a standardized SQL syntax which is supported in both postgresql and recent versions of MySQL.
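Close, though LAST_VALUE has a classic footgun: the default window frame ends at the current row, so it just returns the current row's value unless you widen the frame explicitly (sketch, names assumed):

```sql
-- The default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
-- so LAST_VALUE would only ever see up to the current row. Widen it:
SELECT DISTINCT a,
       LAST_VALUE(b) OVER (
         PARTITION BY a ORDER BY x
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS last_b
FROM t;
```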
I'm curious about that aggregate and a bit scared by the different ORDER BYs, it looks like the function would be unable to re-use index reads for other column values.
Do you have much experience with Postgres WINDOW[1] expressions? That might be a useful tool for you even if it takes a bit of time to fully grok. I don't use WINDOW often, but one place I found it insanely useful was for supplying a tabular page with aggregate data points outside of the data currently being paged to the client - we have a query that returns row information in a paged manner, where the query driving it narrows the row scope to fit the page-size request while also using WINDOW functions over the un-partitioned data to calculate a few business-useful averages.
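A sketch of that paging pattern (table and column names hypothetical): the window aggregates are computed over the full filtered set before LIMIT/OFFSET applies, so each page carries whole-result-set numbers:

```sql
SELECT id, amount,
       count(*)    OVER () AS total_rows,   -- across all matching rows,
       avg(amount) OVER () AS avg_amount    -- not just this page
FROM orders
WHERE status = 'open'
ORDER BY id
LIMIT 50 OFFSET 100;
```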
> it looks like the function would be unable to re-use index reads for other column values.
PostgreSQL would have to do a separate sort for each different ORDER BY, yes.
Also, it can't use the index to scan only some values as it could for DISTINCT ON, because the aggregate is opaque. That may or may not actually matter though.
Another option (though more verbose) is LATERAL plus LIMIT 1.
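That LATERAL form, sketched against a hypothetical schema - one index-assisted "top 1" probe per distinct key:

```sql
SELECT g.a, latest.b, latest.c
FROM (SELECT DISTINCT a FROM t) g
CROSS JOIN LATERAL (
  SELECT b, c
  FROM t
  WHERE t.a = g.a
  ORDER BY x DESC
  LIMIT 1
) latest;
```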
> Do you have much experience with Postgres WINDOW expressions?
Yes.
SELECT DISTINCT
a,
first_value(b) OVER (PARTITION BY a ORDER BY x),
first_value(c) OVER (PARTITION BY a ORDER BY x DESC)
FROM table
One downside is that I can't use FILTER with those as with aggregates, though sometimes you can hack something together with PARTITION/ORDER BY/RANGE
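For reference, the aggregate FILTER form being contrasted here (hypothetical columns) - FILTER attaches to aggregate calls, but not to true window functions like first_value:

```sql
-- Works: FILTER on a plain aggregate.
SELECT a,
       min(b) FILTER (WHERE c > 0) AS min_positive_b
FROM t
GROUP BY a;

-- Does not work: first_value(...) FILTER (...) OVER (...) is rejected.
```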
Yea, that's much clearer about intent. The other form is rather accident-prone.
do you know if it allows different row values for a and b though? (tbh I'm not sure if the old version did or did not, but I've never seen it return data blended from multiple rows...)
One really useful callout is that it supports expressions, not just columns. You can create derived entities via case statements (think binning, entity normalization, etc) and leverage the super handy functionality against those derived entities.
It also allows you to use multiple entities/expressions in the distinct on, so you can pivot the logic between granularity levels. Which comes in handy when needing to pull latest/first/biggest/smallest style rollups from varying perspectives.
I threw together a Fiddle that demonstrates it here[1]. It shows the basic single-column version, a multi-column distinct on, and an example that uses a case statement for a derived column. The only thing with the third example is that I don't believe you can alias the derived column within the distinct on field itself, so you have to repeat the expression - once within the distinct on and again in the column list of the select statement.
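A sketch of that derived-column version (binning by CASE; table and column names hypothetical). Note the expression is repeated, as described, and must also lead the ORDER BY:

```sql
-- One row per bucket: the largest order in each derived bin.
SELECT DISTINCT ON (CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END)
       CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END AS bucket,
       id, amount
FROM orders
ORDER BY CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END,
         amount DESC;
```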
Distinct on is for sure one of the most useful Postgres extensions to SQL, given how concise it makes queries that would otherwise suck to write.
I would have liked to see equivalent queries written with GROUP BY so people could understand and translate that knowledge even if they don't have direct Postgres experience.
SELECT DISTINCT ON (location) location, time, report
FROM weather_reports
ORDER BY location, time DESC;
could also be written with a self-join like so
SELECT location, time, report
FROM weather_reports
JOIN (
SELECT location, max(time) AS time
FROM weather_reports
GROUP BY location
) as _ USING (location, time)
ORDER BY location;
though I'd be curious what the query-plan differences are. I suspect the DISTINCT ON might translate more closely to a lateral join
SELECT location, time, report
FROM (
SELECT DISTINCT location
FROM weather_reports
) AS locations
JOIN LATERAL (
SELECT time, report
FROM weather_reports
WHERE weather_reports.location = locations.location
ORDER BY time DESC
LIMIT 1
) AS reports ON TRUE
ORDER BY location;
In general if you query a table twice, the query planner will visit that table twice. So DISTINCT ON is almost always more efficient.
The main exception I’ve seen is when using it in a view, if you try to filter the results of the view, it usually can’t optimize a DISTINCT ON, but sometimes can for GROUP BY (though that might have improved on newer versions)
> if an aggregate query contains a single min() or max() function, then the values of columns used in the output are taken from the row where the min() or max() value was achieved. If two or more rows have the same min() or max() value, then the columns values will be chosen arbitrarily from one of those rows.
> When the min() or max() aggregate functions are used in an aggregate query, all bare columns in the result set take values from the input row which also contains the minimum or maximum. [...] There is still an ambiguity if two or more of the input rows have the same minimum or maximum value or if the query contains more than one min() and/or max() aggregate function.
Would have been nice if adding an ORDER BY would allow disambiguating.
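The quoted SQLite behavior, demonstrated concretely (table and data made up for the demo):

```sql
-- SQLite: the bare column `value` is taken from the row where
-- max(ts) was achieved within each group.
CREATE TABLE readings (device TEXT, ts INTEGER, value REAL);
INSERT INTO readings VALUES ('a', 1, 10.0), ('a', 2, 20.0), ('b', 1, 5.0);

SELECT device, value, max(ts)
FROM readings
GROUP BY device;
-- yields: a|20.0|2 and b|5.0|1
```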
I've been struggling with a similar query recently perhaps better suited for a time series DB. Can DISTINCT ON be used to ask the query "latest record per customer per day". For example, let's say we have a bunch of devices that send sensor readings arbitrarily throughout each day. Could DISTINCT ON be used to query the last reading of each day per device?
Absolutely. You'll need an expression to get the date out of the timestamp, then you want rows that are distinct on customer_id & date order by timestamp desc.
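Sketched with an assumed schema - the date expression appears in both the DISTINCT ON list and the ORDER BY, with the timestamp as the tiebreaker:

```sql
-- Last reading of each day per device.
SELECT DISTINCT ON (device_id, ts::date)
       device_id, ts, reading
FROM sensor_readings
ORDER BY device_id, ts::date, ts DESC;
```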
I've been using Timescale DB, a postgres extension, to store our sensor-produced timeseries data and it handles this pretty well. DISTINCT ON seems to ruin the query plan there, by forcing sequential, non-index scans of each subtable (timescale breaks tables into time-buckets behind the scenes, so it can skip searching tables in time-filtered queries). I've had good performance with their "time_bucket" and "last" functions, though, and not having to use some NoSQL time series database has been really nice.
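For comparison, the time_bucket/last version (TimescaleDB functions; schema assumed) - last(reading, ts) returns the reading from the row with the greatest ts in each group:

```sql
SELECT device_id,
       time_bucket('1 day', ts) AS day,
       last(reading, ts) AS last_reading
FROM sensor_readings
GROUP BY device_id, day
ORDER BY device_id, day;
```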
DISTINCT ON, when I stumbled upon it rather accidentally a few years ago, was a life saver in some complicated queries in some even more complicated stored functions in our telephony infrastructure product, where only the first route offered by any vendor that otherwise matches a set of criteria is desired.
The alternative would have been a lot less concise and would have required a much longer stored procedure.
This advice grew out of the days when database platforms cost enormous amounts of money. It is still as true as it ever was, but since then the biggest pressure that led to it no longer applies.
Back before PG and others were viable competitors to the big commercial databases, about the only leverage customers had against aggressive sales people was the threat to switch to a competitor. So people would preserve that as a credible threat by not using proprietary SQL extensions.
No vampiric salesdroid from Postgres is going to knock on your door demanding 8 figures.
Now, there may be other reasons you want to switch away from it, or you may want to support a variety of DBs in whatever you're building, or some other reason. So it is good to be aware of. But shying away from SQL extensions is usually no longer a matter of budgetary apocalypse.
Do you mean portable as in able to run on different databases? Your application would have to be fairly simple to achieve that goal. We moved an application from Oracle to Postgres. It took about 2 years to complete.
Many applications are built on ORMs which are pretty portable. About the only thing I regularly depend on that is proprietary per RDBMS is "last insert autoincrement id" since it's not part of the SQL standard. Binary blobs are another big one.
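The last-insert-id dance is indeed spelled differently everywhere (illustrative; table name made up):

```sql
-- MySQL
INSERT INTO users (name) VALUES ('x');
SELECT LAST_INSERT_ID();

-- PostgreSQL: RETURNING avoids the extra round trip entirely
INSERT INTO users (name) VALUES ('x') RETURNING id;

-- SQLite
INSERT INTO users (name) VALUES ('x');
SELECT last_insert_rowid();
```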
If you start out minimizing proprietary features, it's a lot easier to do. Oracle is filled with proprietary features so I'm not surprised it was a lot of work. My application is pretty simple and I converted from mysql to postgres in an afternoon.
At my first job I worked on a program that supported MySQL, MSSQL and Postgres[1]. Now I work on a project that's Postgres only and, honestly, I don't see the need to put any effort into SQL neutrality these days.
1. Only because I hacked it in during my free-time since it is such a trivial dialect to support.