What they should have done is not that strange. Text is merely an ordered collection of characters. If you just assign each character (aka grapheme) a number, text becomes a sequence of numbers. The first two questions you pose, "how many letters does this have" and "are these two pieces of text different", are trivially answered by such a representation. Unicode's fuck-up is that they managed to come up with something that cannot reliably answer those two questions.
In fact what Unicode has ended up with is so horrible, it's a major exercise in coding just to answer a simple question like "is there an 'o' in this sentence": Python 3's "'o' in sentence" does not always return the right result.
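To make that concrete, here is a minimal Python 3 sketch (the example word is invented for illustration) of how `in` gives a misleading answer once combining characters are involved:

```python
# "schön" built with a combining diaeresis: 'o' followed by U+0308.
decomposed = "scho\u0308n"

# At the codepoint level there *is* an 'o', even though the rendered
# text shows only 'ö' -- so `in` answers True for a letter that the
# reader never sees standing alone.
print('o' in decomposed)        # True

# The same word with the precomposed ö (U+00F6):
precomposed = "sch\u00f6n"

# Searching for the precomposed ö inside the decomposed string fails,
# even though both strings display identically.
print('\u00f6' in decomposed)   # False
print('\u00f6' in precomposed)  # True
```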
Unicode's starting point was all wrong. There was already an encoding that did a perfectly good job of mapping graphemes to numbers: ISO-10646. In fact Unicode is based on it, but then committed its original sin: they decided all the proposed ISO-10646 encodings (ie, how the numbers are encoded into byte streams) were crap, so they released a standard that combined two concepts that should have remained orthogonal: codepoints, and the encoding of those codepoints into a binary stream.
Now, it's true the proposed ISO-10646 encodings were undercooked. That became painfully apparent when Ken Thompson came up with UTF-8. But no biggie, right? UTF-8 was just another ISO-10646 encoding; just let it take over naturally. The Unicode solution to the encoding problem was instead to first decide we would never need more than 2^16 codepoints, then wrap that up in "one true encoding everyone can use": UCS2. Windows and Java, among others, bought the concept, and have paid the price ever since.
They were wrong of course. 2^16 was not enough. So they replaced the UCS2 encoding with UTF-16, which was sort of backwards compatible. But not just one UTF-16, oh no, that would be too simple. We got UTF-16LE and UTF-16BE. Notice what has happened here: take identical pieces of text, encode them as valid Unicode, and you end up with two binary objects that are different. Way to go boys!
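The LE/BE split is easy to demonstrate; a small sketch of identical text producing two different byte sequences, both valid Unicode:

```python
text = "Aö"                       # ö is U+00F6

# The same two characters, serialised with each byte order.
le = text.encode("utf-16-le")     # b'A\x00\xf6\x00'
be = text.encode("utf-16-be")     # b'\x00A\x00\xf6'

# Identical text, two different binary objects.
print(le == be)                   # False
```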
But that wasn't the worst of it: they managed to screw up UTF-16 so badly it didn't expand the code space to 2^32 code points, just to about 2^20. And in case you can't guess what happens next, I'll tell you: it turns out there are more than 2^20 graphemes out there.
What to do? Well, there are a lot of characters that are "minor variants" of each other, like o and ö. Now, Unicode already had a single code point for ö, but to make it all fit and be uniform they decided "Combining Diaeresis" was the way these things should be done in future. So now the correct way to represent ö is the code point for o followed by a code point that says "add an umlaut to the preceding character". But as the original code point for ö still exists, we can have two identical graphemes that don't compare as equal under Unicode, which is how we get to ö ≠ ö.
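The ö ≠ ö result is trivially reproducible, and the standard escape hatch is normalisation (here NFC, which folds combining sequences back into precomposed code points where one exists):

```python
import unicodedata

precomposed = "\u00f6"    # ö as a single code point (U+00F6)
combining   = "o\u0308"   # ö as 'o' + COMBINING DIAERESIS (U+0308)

# The two strings render identically, yet compare unequal:
print(precomposed == combining)   # False -- hence ö != ö

# Normalising both to NFC reconciles them:
nfc = unicodedata.normalize("NFC", combining)
print(nfc == precomposed)         # True
```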
So it's not only Python 3's "'o' in sentence" that doesn't always work. We've arrived at the point where "'ö' in sentence" can't be done without heavy lifting that must be done by a library. Just to make it plain: some CPUs can do the core of "'o' in sentence" in a single instruction. That one design decision has cost us orders of magnitude in CPU efficiency.
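The "heavy lifting" in practice means normalising both strings before comparing. A minimal sketch (the `contains` helper is invented here, and note that NFC handles the precomposed-vs-combining case but still falls short of full grapheme-cluster-aware search):

```python
import unicodedata

def contains(needle: str, haystack: str) -> bool:
    """Substring test that survives precomposed vs. combining
    representations by normalising both sides to NFC first."""
    nfc = lambda s: unicodedata.normalize("NFC", s)
    return nfc(needle) in nfc(haystack)

# Plain `in` misses this; the normalised version finds it.
print('\u00f6' in "scho\u0308n")          # False
print(contains('\u00f6', "scho\u0308n")) # True
```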
I know these are strong words, but IMO this is a brain-dead, monumental fuckup, making things like acre-feet and furlong-fortnights look positively sane. It's time to abandon Unicode, and its "Combining Diaeresis" in particular, and go back to basics: ISO-10646 and UTF-8. UTF-8 as originally designed provides a 31-bit encoding space, which is more than enough to realise the single guiding principle that ISO-10646 was founded on: one codepoint per grapheme.
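For the record, the capacity claim checks out on the original 1993 UTF-8 design, which allowed sequences up to 6 bytes (RFC 3629 later cut it back to 4 bytes / U+10FFFF purely for UTF-16 compatibility). A sketch of the payload arithmetic:

```python
def utf8_payload_bits(n_bytes: int) -> int:
    """Payload bits in an n-byte sequence of the original UTF-8 design:
    1 byte is 0xxxxxxx; an n-byte lead byte carries (7 - n) bits and
    each of the (n - 1) continuation bytes carries 6 bits."""
    if n_bytes == 1:
        return 7
    return (7 - n_bytes) + 6 * (n_bytes - 1)

for n in range(1, 7):
    print(n, utf8_payload_bits(n))
# 6-byte sequences carry 31 payload bits: a 2^31 code space,
# dwarfing UTF-16's roughly 2^20.
```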
It won't happen, of course, so as a programmer I'll have to deal with the shit sandwich the Unicode consortium has served up for the rest of my life.
While one codepoint per grapheme would be nice, it still wouldn't solve text. There are also problems like RTL and LTR writing systems that need to be mixed within the same text.
And, many of the examples I gave earlier will not go away. The problem of similar URLs using different characters would be smaller, but not gone - microsoft.com and mícrosoft.com still look too similar. Text search should still support alternate spellings (color and colour). People's names would still have multiple legally identical spellings.
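The homoglyph point is easy to see at the codepoint level; a small sketch (the spoofed domain is of course made up):

```python
import unicodedata

legit = "microsoft.com"
spoof = "m\u00edcrosoft.com"   # í (U+00ED) standing in for i

# Visually near-identical, but different strings -- and no amount
# of normalisation makes them equal, since í and i really are
# distinct characters.
print(legit == spoof)                 # False
print(unicodedata.name("\u00ed"))     # LATIN SMALL LETTER I WITH ACUTE
```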