Many popular NoSQL DBs do support joins now. However, the fact that with NoSQL, you tend to structure your data so joins aren't necessary IS one way of solving the problem relational DBs have with "join performance".
The problem is basically:
- Say you have a query filtering on 2 fields, e.g. `WHERE x = ... AND y = ...`. If you have separate indexes on x and y and both filters match a lot of records, you'll have perf problems. Internally, the DB needs to fetch all of the records where `x = ...` AND all of the records where `y = ...`, and then intersect them in memory. This is very, very slow with large data volumes, even if very few records satisfy the combination (e.g. lots of records where `x = ...`, lots of records where `y = ...`, but few records where both `x = ... AND y = ...`)
- The same is true for many related problems, like filtering on 1 index, then sorting on another (e.g. the common `WHERE x = ... ORDER BY y LIMIT ...` queries)
- If your data is all in a single table/collection/whatever, then you can fix these performance issues with a compound index on (x,y), which allows you to efficiently filter/sort on the combination of x and y, without having to fetch huge intermediate record sets. But if your data is in different tables/collections/whatever, I'm not aware of any popular DBs that allow the creation of cross-table compound indexes, so that's out (there's a small sketch of this just after the list)
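To make that concrete, here's a minimal sketch using SQLite from Python's standard library; the `events` table and its `x`/`y` columns are made up purely for illustration:

```python
# Hypothetical illustration of single-column vs compound indexes in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, x INTEGER, y INTEGER)")

# Separate single-column indexes: for `WHERE x = ? AND y = ?` the planner
# has to pick one index (or intersect both), wading through every row that
# matches that one predicate before it can apply the other.
conn.execute("CREATE INDEX idx_x ON events (x)")
conn.execute("CREATE INDEX idx_y ON events (y)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE x = 1 AND y = 2"
).fetchall())

# A compound index on (x, y) lets it jump straight to the few rows matching
# BOTH predicates - no big intermediate record sets.
conn.execute("CREATE INDEX idx_xy ON events (x, y)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE x = 1 AND y = 2"
).fetchall())
```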
NoSQL's hierarchical, join-free approach means you CAN create that compound index most of the time, while with relational DBs, you much more often can't. With NoSQL, it's much more likely that when these slow queries emerge, fields x and y are already in the same "collection". Traditionally, with relational DBs, people have solved these perf issues via denormalizing - if fields x and y live in different tables, you keep a copy of field y on x's table as well, which lets you speed up the query with a compound (x,y) index. It works, but denormalizing sucks. NoSQL DBs are another approach - basically eliminating joins by making the data more hierarchical - but that has issues too (including, very often, requiring you to denormalize!).
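Very roughly, that denormalization move looks something like this (hypothetical `orders`/`customers` tables, SQLite syntax, names invented just to make it concrete):

```python
# Hypothetical sketch of the denormalization workaround: copy "y" (a
# customer's region) onto "x"'s table (orders) so a compound index is possible.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);  -- "y" lives here
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        status TEXT,                                   -- "x" lives here
        customer_id INTEGER REFERENCES customers(id),
        customer_region TEXT                           -- denormalized copy of "y"
    );

    -- The copy is what makes the compound index possible, so
    -- `WHERE status = ? AND customer_region = ?` no longer needs a join
    -- plus an in-memory intersection.
    CREATE INDEX idx_status_region ON orders (status, customer_region);

    -- The ugly part: the copy has to be kept in sync, e.g. via a trigger
    -- (or in application code).
    CREATE TRIGGER sync_region AFTER UPDATE OF region ON customers
    BEGIN
        UPDATE orders SET customer_region = NEW.region WHERE customer_id = NEW.id;
    END;
""")
```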
"Worst-case optimal joins" are a way to have relational DBs solve this for you, without needing to resort to denormalization and compound indexes, which is very exciting! Instead of saying "fetch all the data where x = ..., and all the data where y = ..., then find the intersection in memory", they try hard to avoid this. e.g. intelligently fetch a little from x, a little from y, see if that's enough. It's a great concept, but only recently seeing much serious attention, and most big relational DBs don't support them yet.
Are join algorithms like this really the alternative to compound indexes? I understand that they reduce the memory required for joins, but the amount of data involved in a join is the same.