Is java.lang.String still UTF-16? Is there any plan to fix that? Once Windows and Java take care of it, I can't think of any other major UTF-16 uses left. Are there any that I've forgotten about?
I don't think they can fix that without completely breaking backwards compatibility. The basic char type in Java is defined as a 16-bit unsigned integer value, and String doesn't abstract over that.
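To make that concrete, here's a small sketch (hypothetical class name) of how the char-based definition leaks through: a single supplementary code point occupies two char code units, and charAt hands you raw surrogates.

```java
public class CharDemo {
    public static void main(String[] args) {
        // U+1F600 (emoji): one Unicode code point, two UTF-16 code units
        String s = "\uD83D\uDE00";
        System.out.println(s.length());                      // 2 - counts char units
        System.out.println(s.codePointCount(0, s.length())); // 1 - counts code points
        System.out.println((int) s.charAt(0) == 0xD83D);     // true - a high surrogate
    }
}
```

Any change to String's representation would have to keep length() and charAt() meaning "UTF-16 code units" forever, which is the compatibility trap.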
Only for Latin-1 text. There is still no UTF-8 support (it's even called out as a non-goal in the JEP: "It is not a goal to use alternate encodings such as UTF-8 in the internal representation of strings.")
I don't think it's a big deal for Java because it's always easy to transfer in from and out to UTF-8. Very few Java programs use UTF-16 as a persistence format, and Java-native applications can directly marshal strings around as they are a first-class datatype.
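For illustration, a minimal round-trip sketch (hypothetical class name): converting a valid Java string out to UTF-8 and back is lossless, even though the two forms count lengths differently.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        // "héllo" plus a supplementary code point: ASCII, Latin-1, and beyond-BMP
        String original = "h\u00E9llo \uD83C\uDF0D";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);   // out to UTF-8
        String decoded = new String(utf8, StandardCharsets.UTF_8); // back in
        System.out.println(decoded.equals(original)); // true - lossless round trip
        // The two lengths measure different things:
        System.out.println(original.length()); // 8 UTF-16 code units
        System.out.println(utf8.length);       // 11 UTF-8 bytes
    }
}
```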
You’re right! I’m surprised I didn’t know that. It looks like it can also be UCS-2, going by the spec:
> A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it is presumed to be the UTF-16 encoding form.
> A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.
That's not really possible, as strings are defined in terms of char and guarantee O(1) access to UTF-16 code units. They might try to switch to "indexed UTF-8" (as PyPy did in the Python ecosystem, whereas "CPython proper" refused to switch to UTF-8 with the Python 3 upheaval and went with the death trap that is PEP 393 instead).
However, it's not quite unequivocal. Windows still uses UTF-16 in the kernel (or actually an array of 16-bit integers, but UTF-16 is a very strong convention). Setting the code page will often let the Win32 API perform the conversion back and forth instead of your application doing it.
AFAICT, it's not only "internal representation". .NET strings are defined as a sequence of UTF-16 units, including the definition of the Char type representing a single UTF-16 code unit. I can't imagine how such a change could be implemented (other than changing the internal representation but converting on all accesses which would be nonsense, I think).
Basically WTF-16 is any sequence of 16-bit integers, and is thus a superset of UTF-16 (because UTF-16 doesn't allow certain combinations of integers, mainly surrogate code points that exist outside of surrogate pairs).
Then WTF-8 is what you get if you naively transform invalid UTF-16 into UTF-8. It is a superset of UTF-8.
This is very useful when dealing with platforms like Java and JavaScript that treat strings as sequences of 16-bit code units, even though not all such strings are valid UTF-16.
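To make that concrete, a hypothetical Java sketch: a string containing a lone surrogate is a perfectly legal Java value (WTF-16 territory) but not valid UTF-16, so a standard UTF-8 round trip is lossy — which is exactly the case WTF-8 is designed to preserve.

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogate {
    public static void main(String[] args) {
        // A lone high surrogate: valid as a Java String, invalid as UTF-16 text.
        String wtf16 = "a\uD800b";
        byte[] utf8 = wtf16.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        // Standard UTF-8 cannot represent the unpaired surrogate, so the
        // encoder substitutes a replacement and the round trip is lossy.
        System.out.println(back.equals(wtf16)); // false
    }
}
```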
> Basically WTF-16 is any sequence of 16-bit integers, and is thus a superset of UTF-16 (because UTF-16 doesn't allow certain combinations of integers, mainly surrogate code points that exist outside of surrogate pairs).
If WTF-16 means the ability in potentia to store and return invalid UTF-16 without signalling errors, I don't know that there's any actual UTF-16 system out there, with the possible exception of… HFS+, maybe?
That's good news. Last time I looked, more than a decade ago admittedly, that bug was WONTFIX.
In fact I was so surprised I just wrote a test program. They have fixed it!
It was the dumbest bug I ever saw in Windows. It was special-case code in the console output code path of the user-mode part of WriteFile. It only existed to make UTF-8 work, and it didn't even do that.
Ah, that's surprising, Microsoft was very stubbornly not doing that for at least a decade and a half.
In fact, the FAQ in TFA (questions 9 and 20) mentions that there are still problems with CP_UTF8 (65001). Is the article out of date? Can someone respond to those statements?
The article is outdated; it's from 2012. Not only did they fix the problems, but in Windows 10 1803 they also added an option to globally and permanently set both the OEM and ANSI(!) code pages to 65001.
It can be enabled by checking "Beta: Use Unicode UTF-8 for worldwide language support" checkbox in region settings.
They now recommend using the UTF-8 "code page" in new code.