Java 18 is UTF-8 by Default (thejavaguy.org)
52 points by ivanche on April 14, 2022 | hide | past | favorite | 32 comments


I love how they are ready to change the behavior of existing code and require an explicit opt-in to a compat mode for the old behavior. I wish this way of fixing things properly while having opt-in compat was more common. Too often we see aversion towards breaking changes for no obvious reason. If you don't want things to break then don't upgrade things.
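For reference, the switch in question is JEP 400's `file.encoding=COMPAT`; a minimal probe of the new defaults (assuming a Java 18+ runtime) might look like:

```java
import java.nio.charset.Charset;

public class EncodingProbe {
    public static void main(String[] args) {
        // UTF-8 on every platform in Java 18+, unless the JVM was started
        // with -Dfile.encoding=COMPAT to restore the pre-18 behavior:
        System.out.println(Charset.defaultCharset());
        // JEP 400 also exposes the old locale-derived charset for comparison
        // (this system property was added in JDK 17):
        System.out.println(System.getProperty("native.encoding"));
    }
}
```

Running the same program with `-Dfile.encoding=COMPAT` makes the first line match the second again.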


Agree. Java also has a great history of providing off-ramps for legacy code. When JDK modules came out, they allowed disabling of the system. This served an important purpose: it allowed older codebases to upgrade to newer JDK versions that had important security fixes while buying time for those apps to age out of existence. Always a pleasure working with the JDK.


I have a lot of little programs that are more or less finished: "does what it needs to do and there are no bugs". Some are public, some are not. I wrote them once – sometimes years ago – and they still work fine today.

I really like it that code I write today still works in five or ten years without mucking about to "keep up to date". "Don't upgrade things" isn't as easy as you say: how do you get an old version of Ruby? How do you get an old browser? Do you even want that?

Opt-in should be easy and painless, probably even the default for new projects (through some build file or whatnot; I'm not very familiar with Java), but I think making existing projects work has a lot of value.


An upgrade shouldn’t be a downgrade. But too often it is. It breaks expectations and ruins trust. Soon, users learn, and in response turn off all upgrades by default. Success?

If an upgrade breaks things then it should be made exceedingly explicit (not only in text, but in a dialog button) that you accept that your existing stuff will break, and you should be provided with measures and (at least description of) steps to overcome it.


I'm curious whether the internal representation will stay UTF-16. Someone smarter than I am could probably say whether there's an advantage to using a fixed-length charset for internal purposes.


UTF-16 isn't fixed length. It's sort of an extension of UCS-16, which was fixed length, but couldn't give you more than 64k codepoints. Internal representation is... complicated. java.lang.String uses UTF-16, but .class files use UTF-8.
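To make the variable-length point concrete, here's a quick check (plain Java, any recent JDK): a code point outside the Basic Multilingual Plane occupies two UTF-16 code units in a `java.lang.String`.

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD835\uDD41"; // U+1D541 MATHEMATICAL DOUBLE-STRUCK CAPITAL J
        // One code point to Unicode...
        System.out.println(s.codePointCount(0, s.length())); // 1
        // ...but two UTF-16 code units (a surrogate pair) to String.
        System.out.println(s.length()); // 2
    }
}
```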


Nitpick: UCS counts the bytes, not the bits. So UTF-16, but limited to the BMP, is UCS-2, and UTF-32 is UCS-4.


Oooops. You're absolutely right.


And since Java 9 there is some compression going on.


The Java internal representation previously had a lookup table for every Nth char (say every 128th, which is a nice round number), and if the offsets increased by exactly 128 between every 128 chars it could index directly. For most text this means constant-time access, instead of a linear search through the 128 chars.

I don't think the benefits against a utf-8 system with a LUT would be very big, but as a retrofitted system (which it was) it is not bad. Pretty compact with mostly constant time access.
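A minimal sketch of the indexing scheme described above, applied to UTF-8 (hypothetical class and method names; an illustration of the idea, not the actual JDK implementation): store the byte offset of every 128th character, take a constant-time shortcut when a block turns out to be pure ASCII, and otherwise scan forward through at most 127 characters.

```java
import java.nio.charset.StandardCharsets;

public class IndexedUtf8 {
    private static final int BLOCK = 128;
    private final byte[] bytes;    // UTF-8 encoded text
    private final int[] offsets;   // byte offset of character BLOCK * k

    public IndexedUtf8(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
        int chars = s.codePointCount(0, s.length());
        this.offsets = new int[Math.max(1, (chars + BLOCK - 1) / BLOCK)];
        int charIdx = 0;
        for (int byteIdx = 0; byteIdx < bytes.length; byteIdx += utf8Len(bytes[byteIdx])) {
            if (charIdx % BLOCK == 0) offsets[charIdx / BLOCK] = byteIdx;
            charIdx++;
        }
    }

    /** Code point of the i-th character (counted in code points). */
    public int codePointAt(int i) {
        int block = i / BLOCK, off = offsets[block];
        // Fast path: 128 characters in exactly 128 bytes means pure ASCII,
        // so the byte position is plain arithmetic.
        if (block + 1 < offsets.length && offsets[block + 1] - off == BLOCK) {
            return bytes[off + i % BLOCK];
        }
        // Slow path: step forward through at most 127 characters.
        for (int k = 0; k < i % BLOCK; k++) off += utf8Len(bytes[off]);
        return new String(bytes, off, utf8Len(bytes[off]),
                          StandardCharsets.UTF_8).codePointAt(0);
    }

    private static int utf8Len(byte b) {
        int u = b & 0xFF;              // character length from the leading byte
        if (u < 0x80) return 1;        // 0xxxxxxx
        if (u < 0xE0) return 2;        // 110xxxxx
        if (u < 0xF0) return 3;        // 1110xxxx
        return 4;                      // 11110xxx
    }
}
```

For mostly-ASCII text almost every block hits the fast path, which is where the "mostly constant time" property comes from.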


UTF-16 is not fixed length, it has surrogate pairs.


The internal representation hasn't been pure UTF-16 for quite some time now.

https://openjdk.java.net/jeps/254


Now your turn, Windows.


Windows is already Unicode and has been since Windows 2000. The internal representation however is UTF-16.

This is in contrast to the article you are commenting on, where Java hasn’t had a consistent default representation in Unicode until now.


Yeah but that's not the same. UTF-8 is vastly more efficient for European alphabets (and imo, ends up being a wash for non-European ones), while not pandering to the seductive lie of UTF-16 that 1 character/rune == 2 bytes. It really shows up in memory usage of large pieces of text, speed of regex etc.
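The size difference is easy to see for a typical European string (a small illustration; the UTF-16 count assumes a little-endian encoding without a BOM):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSize {
    public static void main(String[] args) {
        String s = "Größenwahn"; // 10 letters, two of them non-ASCII
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 12
        System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length); // 20
    }
}
```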

Unfortunately the NT codebase is old enough to have made the wrong choice regarding this.

I wonder if such a transition would be possible for .NET as well, because while technically you don't get access to the raw strings easily, there are tons of APIs that allow you to grab a pointer to the underlying memory in strings, which would immediately break if such a transition was made.


This is one of the scenarios where Java being more high level than .NET wins out.

In addition to what is being discussed, string compression and deduplication have existed for several releases.


.NET hasn't been so "low level" that it can't consider a different string implementation since "Core". There have been multiple explorations and discussions and considerations of moving .NET to using UTF-8 string representations internally (rather than UTF-16): https://github.com/dotnet/runtime/issues/6612

It would potentially slow down platform invokes (C/C++ DLLs) and COM calls, especially on Windows, but that's not a "low level" concern just as JNI's existence doesn't imply that Java is "low level".


People tend to forget CLR was developed to support all languages, including C and C++.

As such, it is possible to access details in unsafe code via MSIL that are not accessible on the JVM side.

So naturally, while nothing prevents them from eventually introducing such a change, it might come with breaking changes to code that, although legal, plays with such low-level details.

As for the JNI, that is outside the scope of what I am discussing, which is what is possible with bytecodes alone.


It doesn't have anything to do with C/C++ support: .NET strings have never been null-terminated C-style strings; they've always been length-encoded, garbage-collected strings and have always needed to be marshalled/converted to/from C-style strings no matter how "raw" the bytecode you get is.

It is possible to do unsafe pointer manipulation in the CLR, but using that for string manipulation was always a bad idea (too easy to fall into bad UCS-2 assumptions that don't apply to UTF-16), and in modern .NET not only is it the wrong approach but Span<T> and Rune APIs provide a much friendlier and better way to do it without unsafe pointer manipulation.

You are correct that there is probably more bad code in the wild using unsafe CLR pointer access for string manipulation than should exist in .NET or ever existed in the JVM, but .NET has made breaking changes to unsafe code before. The .NET Framework 1 to 2 transition involved quite a few; it may be distant in memory, but modern .NET transitions have been closer to that 1-to-2 transition than not. Gone are the days of .NET Framework 2 to 4.x, when they were afraid to make low-level breaking changes because the Framework was bundled with Windows.


It's certainly possible in .net, and has been seriously proposed.

https://github.com/dotnet/runtime/issues/933


https://github.com/dotnet/runtime/issues/6612 is the more directly comparable proposal.


Yeah, but this is talking about introducing a separate, UTF-8 string type. .NET already has some support for UTF-8; for example, ASP.NET and the new JSON APIs use UTF-8 directly, however the API is just gnarly.


This is not the same as Java, because there the change is transparent, with no need for additional types.


I'm not sure about compression, but deduplication has been in .NET since forever.


Yeah? Then explain this (Windows 10, latest)

https://imgur.com/gDGaQA2


Compatibility: http://archives.miloush.net/michkap/archive/2005/09/17/46994.... (I’m not sure if the mention of it being implemented on the font level is up to date since locales have changed considerably on the way from the intensely buggy MUI packs on XP SP2 to seamless system-wide locale switching on 10.)

Windows NT (the first implementation of Win32 to be released) was intended to be Unicode even before it was renamed from NT OS/2, and none of the native NT APIs accept strings in anything other than UTF-16.


It's US/English Windows, with a Japanese setting for all NON-unicode text. No reason whatsoever why that should affect how device manager renders code paths.

https://imgur.com/teJx2m9


Hmm. No, this actually makes sense (given the justification for the wart in the link above).

Once upon a time, there used to be a thing called a “US English Windows”, an actual separate build of the system; my understanding is that in most cases the differences between regional builds were pretty predictable (messages, help, version resources, maybe extra fonts, stuff like that), but for example right-to-left, Indic, or Far Eastern versions could include more significant changes. In any case you could not actually turn one localized version into a different one (and wasn’t that a heap of fun for third-party localization testing).

There isn’t anymore. It seems the localization infrastructure revamp finally ended in Windows 8[1], but in any case in 10 you can install and set a new default language (and associated settings), delete the old one, and the system will be more or less as if it was in the new language in the first place.

So, regardless of what was printed on the box, you have two major settings, system language[2] and code page[3], which you can set at your leisure. Those do not in principle have to be related (though if you want non-Unicode applications to work you should probably make sure the code page can display the language).

The Japanese code page is odd in that it is ASCII-incompatible: code point 5C is not backslash U+005C but yen sign U+00A5 (as these will be numbered in Unicode in 1991, twenty-two years after JIS X 0201 defines it), and the backslash is not there at all (in the inimitable ISO manner, the ability to do this is officially a feature). The historical (I would guess even DOS) solution is to have 5C be the path separator whatever it actually is. The Japanese just learn to deal with it, and the rest of the world pretends the backslash is always present. (The eagle-eyed have noticed that in the 1998 anime Serial Experiments Lain a throwaway fragment of C code spells line feed as ¥n.)

Now along comes NT with (then-fixed-width) Unicode everywhere including file names and declares the path separator to be U+005C. It needs some degree of DOS and Win16 compatibility, and furthermore it’s positively a resource hog that only runs convincingly on a minimum of perhaps 16 MB of RAM—not the casual home user kind of hardware and money. So Windows 93 (ultimately 95) is released, and Win32 develops a split personality of paired “ANSI” and Unicode APIs, the former being universal and the latter NT-only with very rare exceptions.

All of this means that we cannot just map 5C in the Japanese codepage to a yen sign like sane people; not only will people’s data be mangled in some cases, non-Unicode applications (DOS, Win16, or Win32) need the path separator to be 5C while Unicode-only NT internals absolutely require it to be U+005C. The solution chosen, as we can see, is to make 5C map to U+005C and mean backslash whatever the language, but display it as a yen sign if we are Japanese.
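The JDK's own code page 932 charset reflects that choice; a small check (assuming the JDK build ships the `windows-31j` charset, which standard builds do):

```java
import java.nio.charset.Charset;

public class Cp932Demo {
    public static void main(String[] args) {
        Charset cp932 = Charset.forName("windows-31j"); // Microsoft code page 932
        // Byte 0x5C decodes to U+005C backslash, not U+00A5 yen sign:
        System.out.println(new String(new byte[]{0x5C}, cp932)); // \
        // and the backslash round-trips back to byte 0x5C (92 decimal):
        System.out.println("\\".getBytes(cp932)[0]); // 92
    }
}
```

The yen-sign rendering is purely a display convention layered on top of this mapping.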

What does it mean to be Japanese? Japanese messages in the UI? No, a Japanese code page, because that is what got us into this mess to begin with. The UI language does not matter.

Hypothetically, we could perhaps draw 5C as a yen sign when a non-Unicode program puts one on the screen but translate it to a backslash when it travels through the ANSI-Unicode border into the file APIs, without affecting Unicode applications at all, but I would not want to be one to explain to users why their paths sometimes look one way and sometimes another, or to document the resulting behaviour of FunctionA thunks to FunctionW implementations. (Right now it is “sprinkle with MultiByteToWideChar and its opposite in the obvious places” for the most part.) The current solution is the saner one, I think, as hacky as it is.

[1] http://archives.miloush.net/michkap/archive/2012/10/26/10362...

[2] Which usually drags with it things like formatting, collation, and whatever a .NET “culture” is, but you can set those separately if you want to (en-US messages with ISO 8601 dates FTW).

[3] Triples of (ANSI, OEM, Mac) codepages, actually, but if you know what those are and how to handle the distinction, then, first, my condolences; second, I doubt I can tell you anything new here. (AreFileApisANSI *shudder*)


Are you suggesting that UTF-16 is not a Unicode encoding?


Both that — and the opposite — in different sentences.


.NET StreamReader and StreamWriter already default to UTF-8. The only thing that is changing here is the default encoding when Java serializes a String to bytes and deserializes a String from bytes, which will both now match .NET. The .NET String constructor that takes a byte array still has the problem that Java has now with using the system default charset.
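Concretely, the change is in the no-argument charset APIs (assuming a Java 18+ runtime that wasn't started with `-Dfile.encoding=COMPAT`):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // These previously depended on the OS locale; on Java 18+ the
        // default charset is UTF-8 everywhere:
        System.out.println(Charset.defaultCharset());
        byte[] b = "naïve".getBytes(StandardCharsets.UTF_8);
        System.out.println(b.length); // 6 — ï takes two bytes in UTF-8
        // On Java 18+, "naïve".getBytes() with no argument now produces
        // the same six bytes on every platform.
    }
}
```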


Hopefully the POSIX libcs and compilers are next. Runtime locale lookups are a nightmare and nobody uses them anymore.



