
Normalisation is expressly done with the composition of version 3.1 for compatibility: see <https://www.unicode.org/reports/tr15/#Versioning>. If that’s what HFS+ does, then “proprietary variant” is wrong. And if not, I’m curious what it does differently.

(On the use of version 3.1, note that in practice version 3.2 is used, correcting one typo: see <https://www.unicode.org/versions/corrigendum3.html>.)

I find a few references to it being slightly different, but not one of them actually says what’s different; Wikipedia is the only one with a citation (<https://en.wikipedia.org/wiki/HFS_Plus>: “and normalized to a form very nearly the same as Unicode Normalization Form D (NFD)[12]”), and that citation says it’s UAX #15 NFD, no deviations. One library that handles HFS+ differently switches to UCD 3.2.0 for HFS+ <https://github.com/ksze/filename-sanitizer/blob/e990e963dc5b...>, but my impression from UAX #15 is that this should be superfluous, not actually changing anything. (Why is UCD 3.2.0 still around there? Probably because IDNA 2003 needs it: <https://bugs.python.org/issue42157#msg379674>.)
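One way to check whether switching to UCD 3.2.0 changes anything for NFD is to compare the two databases directly. This is only a rough sketch: `nfd_differences` is a name I made up, and it relies on Python's `unicodedata.ucd_3_2_0` object, which offers the same `normalize` method against the old data. It says nothing about what HFS+ itself does.

```python
import unicodedata


def nfd_differences(first=0, last=0x10FFFF):
    """Return code points whose NFD differs between the current UCD and UCD 3.2.0.

    Hypothetical helper for illustration; it only compares Python's two
    bundled Unicode databases, one code point at a time.
    """
    old = unicodedata.ucd_3_2_0
    diffs = []
    for cp in range(first, last + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters, skip them
        ch = chr(cp)
        if unicodedata.normalize("NFD", ch) != old.normalize("NFD", ch):
            diffs.append(cp)
    return diffs
```

Calling `nfd_differences()` with no arguments scans the whole code space. Whatever it returns would be the set of single code points where the choice of database matters for NFD.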

Update: https://developer.apple.com/library/archive/technotes/tn/tn1... has actual technical information, but the table in question doesn’t show Unicode version changes like they claim it does, so I dunno. Looks like maybe from macOS 10.3 it’s exactly UAX #15, but 8.1–10.2 was a precursor? I’m fuzzy on where the normalisation actually happens, anyway.



The `filename-sanitizer` library you have linked has the following comment.

                # FIXME: improve HFS+ handling, because it does not use the standard NFD. It's
                # close, but it's not exactly the same thing.
                'hfs+': (255, 'characters', 'utf-16', 'NFD'),
I wonder what that means...


The technote linked by the parent has a note saying

> The characters with codes in the range u+2000 through u+2FFF are punctuation, symbols, dingbats, arrows, box drawing, etc. The u+24xx block, for example, has single characters for things like "(a)". The characters in this range are not fully decomposed; they are left unchanged in HFS Plus strings. This allows strings in Mac OS encodings to be converted to Unicode and back without loss of information. This is not unnatural since a user would not necessarily expect a dingbat "(a)" to be equivalent to the three character sequence "(", "a", ")" in a file name.

> The characters in the range u+F900 through u+FAFF are CJK compatibility ideographs, and are not decomposed in HFS Plus strings.

The bit about the u+24xx block is misleading: the decompositions of the characters I looked at there (such as ⒜) are compatibility decompositions, not canonical ones, so standard NFD leaves them unchanged anyway. The CJK compatibility ideographs, however, are a working example. U+F902 (車) has a canonical decomposition to U+8ECA (車), so every standard normalization form replaces it, but the technote says these are not decomposed.
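Here is a small Python sketch of the difference, using only the standard `unicodedata` module. `is_exempt` and `hfs_like_nfd` are hypothetical names of mine, and the two exempt ranges are taken straight from the quoted note. This is an approximation for illustration, not Apple's actual algorithm.

```python
import unicodedata
from itertools import groupby


def is_exempt(ch):
    """True for the ranges the quoted technote says are left undecomposed."""
    cp = ord(ch)
    return 0x2000 <= cp <= 0x2FFF or 0xF900 <= cp <= 0xFAFF


def hfs_like_nfd(s):
    """Apply standard NFD to everything except runs of exempt characters.

    Assumption: exempt characters simply pass through unchanged and split
    the string into independently normalized runs.
    """
    parts = []
    for exempt, run in groupby(s, key=is_exempt):
        text = "".join(run)
        parts.append(text if exempt else unicodedata.normalize("NFD", text))
    return "".join(parts)


# U+249C has only a compatibility decomposition, so NFD keeps it as-is.
print(unicodedata.normalize("NFD", "\u249c") == "\u249c")
# NFKD is the form that expands it to "(a)".
print(unicodedata.normalize("NFKD", "\u249c") == "(a)")
# U+F902 has a canonical decomposition, so standard NFD replaces it...
print(unicodedata.normalize("NFD", "\uf902") == "\u8eca")
# ...while the technote-style variant leaves it alone.
print(hfs_like_nfd("\uf902") == "\uf902")
```

So the u+24xx case is a no-op under plain NFD either way, while the u+F900 through u+FAFF range is where the two would really diverge.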



