Is java.lang.String still UTF-16? Is there any plan to fix that? Once Windows and Java take care of it, I can't think of any other major UTF-16 uses left. Are there any that I've forgotten about?
I don't think they can fix that without completely breaking backwards compatibility. The basic char type in Java is defined as a 16-bit unsigned integer value, and String doesn't abstract over that.
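To make that concrete, here's a small sketch (hypothetical class name) of how the char-based definition leaks through: a single supplementary code point occupies two char code units, and charAt hands you raw surrogates.

```java
public class CharDemo {
    public static void main(String[] args) {
        // U+1F600 (emoji): one Unicode code point, two UTF-16 code units
        String s = "\uD83D\uDE00";
        System.out.println(s.length());                      // 2 - counts char units
        System.out.println(s.codePointCount(0, s.length())); // 1 - counts code points
        System.out.println((int) s.charAt(0) == 0xD83D);     // true - a high surrogate
    }
}
```

Any change to String's representation would have to keep length() and charAt() meaning "UTF-16 code units" forever, which is the compatibility trap.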
Only for Latin-1 text. There is still no UTF-8 support (it's even called out as a non-goal in the JEP: "It is not a goal to use alternate encodings such as UTF-8 in the internal representation of strings.")
I don't think it's a big deal for Java because it's always easy to transfer in from and out to UTF-8. Very few Java programs use UTF-16 as a persistence format, and Java-native applications can directly marshal strings around as they are a first-class datatype.
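For illustration, a minimal round-trip sketch (hypothetical class name): converting a valid Java string out to UTF-8 and back is lossless, even though the two forms count lengths differently.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        // "héllo" plus a supplementary code point: ASCII, Latin-1, and beyond-BMP
        String original = "h\u00E9llo \uD83C\uDF0D";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);   // out to UTF-8
        String decoded = new String(utf8, StandardCharsets.UTF_8); // back in
        System.out.println(decoded.equals(original)); // true - lossless round trip
        // The two lengths measure different things:
        System.out.println(original.length()); // 8 UTF-16 code units
        System.out.println(utf8.length);       // 11 UTF-8 bytes
    }
}
```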
You’re right! I’m surprised I didn’t know that. It looks like it can also be UCS-2, going by the spec:
> A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it is presumed to be the UTF-16 encoding form.
> A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.
That's not really possible, as strings are defined in terms of char and guarantee O(1) access to UTF-16 code units. They might try to switch to "indexed UTF-8" (as PyPy did in the Python ecosystem, whereas "CPython proper" refused to switch to UTF-8 with the Python 3 upheaval and went with the death trap that is PEP 393 instead).
However, it's not quite unequivocal. Windows still uses UTF-16 in the kernel (or actually an array of 16-bit integers, but UTF-16 is a very strong convention). Setting the code page will often let the Win32 API perform the conversion back and forth instead of your application doing it.
AFAICT, it's not only "internal representation". .NET strings are defined as a sequence of UTF-16 units, including the definition of the Char type representing a single UTF-16 code unit. I can't imagine how such a change could be implemented (other than changing the internal representation but converting on all accesses which would be nonsense, I think).
Basically WTF-16 is any sequence of 16-bit integers, and is thus a superset of UTF-16 (because UTF-16 doesn't allow certain combinations of integers, mainly surrogate code points that exist outside of surrogate pairs).
Then WTF-8 is what you get if you naively transform invalid UTF-16 into UTF-8. It is a superset of UTF-8.
This is very useful when dealing with platforms like Java and JavaScript that treat strings as sequences of 16-bit code units, even though not all such strings are valid UTF-16.
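To make that concrete, a hypothetical Java sketch: a string containing a lone surrogate is a perfectly legal Java value (WTF-16 territory) but not valid UTF-16, so a standard UTF-8 round trip is lossy — which is exactly the case WTF-8 is designed to preserve.

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogate {
    public static void main(String[] args) {
        // A lone high surrogate: valid as a Java String, invalid as UTF-16 text.
        String wtf16 = "a\uD800b";
        byte[] utf8 = wtf16.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        // Standard UTF-8 cannot represent the unpaired surrogate, so the
        // encoder substitutes a replacement and the round trip is lossy.
        System.out.println(back.equals(wtf16)); // false
    }
}
```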
> Basically WTF-16 is any sequence of 16-bit integers, and is thus a superset of UTF-16 (because UTF-16 doesn't allow certain combinations of integers, mainly surrogate code points that exist outside of surrogate pairs).
If WTF-16 means the ability in potentia to store and return invalid UTF-16 without signalling errors, I don't know that there's any actual UTF-16 system out there, with the possible exception of… HFS+, maybe?
That's good news. Last time I looked, more than a decade ago admittedly, that bug was WONTFIX.
In fact I was so surprised I just wrote a test program. They have fixed it!
It was the dumbest bug I ever saw in Windows. It was special-case code in the console output code path of the user-mode part of WriteFile. It only existed to make UTF-8 work, and it didn't even do that.
Ah, that's surprising, Microsoft was very stubbornly not doing that for at least a decade and a half.
In fact, the FAQ in TFA (questions 9 and 20) mentions that there are still problems with CP_UTF8 (65001). Is the article out of date? Can someone respond to those statements?
The article is outdated; it's from 2012. Not only did they fix the problems, but in Windows 10 1803 they also added an option to globally and permanently set both the OEM and ANSI(!) code pages to 65001.
It can be enabled by checking "Beta: Use Unicode UTF-8 for worldwide language support" checkbox in region settings.
They now recommend using the UTF-8 "code page" in new code.