
Why isn’t the answer just “Don’t unicode normalise the file name”?

I thought the generally recommended way to deal with file names is to treat them as a block of bytes (to the extent that e.g. Rust has an entirely separate string type for OS-provided strings), or just to allow direct encoding/decoding but not normalisation or alteration.



Well, precisely because if you don't normalize the filenames, ö ≠ ö. You could have two files with different filenames, `göteborg.txt` and `göteborg.txt`, and they are different files with different filenames.

Or you could have one file `göteborg.txt`, and when you try to ask for it as `göteborg.txt`, the system tells you "no file by that name".

Unicode normalization is the solution to this. And the unicode normalization algorithms are pretty good. The bug in this case is that the system did not apply unicode normalization consistently. It required a non-default config option to be turned on to do so? I don't really understand what's going on here, but it sounds like a bug in the system to me that this would be a non-default config option.

Dealing with the entire universe of human language is inherently complicated. But unicode gives us some actually pretty marvelous tools for doing it consistently and reasonably. But you still have to use them, and use them right, and as with all software, bugs are possible.

But I don't think you get fewer crazy edge cases by not normalizing at all. (In some cases you can even get security concerns, think about usernames and the risk of `jöhn` and `jöhn` being two different users...). I know that this is the choice some traditional/legacy OSs/file systems make, in order to keep pre-unicode-hegemony backwards compat. It has problems as well. I think the right choice for any greenfield possibilities is consistent unicode normalization, so `göteborg.txt` and `göteborg.txt` can't be two different files with two different filenames.

[btw I tried to actually use the two common different forms of ö in this text; I don't believe HN normalizes them so they should remain.]
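A quick illustration of the problem, as a Python sketch using the standard unicodedata module:

```python
import unicodedata

nfc = "g\u00F6teborg.txt"    # "ö" as one precomposed code point (NFC form)
nfd = "go\u0308teborg.txt"   # "o" followed by a combining diaeresis (NFD form)

# The two names render identically but compare unequal code-point-for-code-point:
print(nfc == nfd)                                 # False

# Normalizing both sides to the same form makes them compare equal:
print(unicodedata.normalize("NFC", nfd) == nfc)   # True
```

Without that normalization step, whether the two names collide or miss depends entirely on which form each piece of software happened to produce.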


It looks like instead of the config option switching everything to use the same normalization it keeps a second copy of the name in a database to compare to. What a horrible kludge, I wonder how they even got into this situation of using different normalization in different parts of the system?


That seems an odd choice indeed, because even if you do have different normalizations in different parts of the system, you don't need to keep multiple copies -- you just need to apply the right normalization in the right place. All of the unicode normalization algorithms are both idempotent and of course completely deterministic. If you apply NFD to any legal input, you get the same thing every time -- there's no need to store an NFC copy to compare against NFC input when everything else is NFD; you can just normalize the input to NFD and compare it to what you have!
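In Python terms (a sketch, not the system's actual code), normalizing at the comparison site looks like this:

```python
import unicodedata

# What the system stores internally, say in NFD:
stored = unicodedata.normalize("NFD", "göteborg.txt")

# NFC input arriving from some other part of the system:
incoming = "g\u00F6teborg.txt"

# No second stored copy needed: normalize at the point of comparison.
print(unicodedata.normalize("NFD", incoming) == stored)   # True

# Idempotent: normalizing an already-normalized string is a no-op.
print(unicodedata.normalize("NFD", stored) == stored)     # True
```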

Unless it was meant to be for performance?


In terms of what filenames are, neither Windows nor Linux (I don't know for sure about macOS, but I doubt it) actually guarantees you any sort of characters.

Linux filenames are a sequence of non-zero bytes, excluding the '/' separator (they might be ASCII, or at least UTF-8, they might be an old 8-bit charset, but they also might just be arbitrary non-zero bytes), and Windows file names are a sequence of non-zero 16-bit unsigned integers, which you could think of as UTF-16 code units but they don't promise to encode valid UTF-16.

Probably the files have human readable names, but, maybe not. If you're accepting command line file names it's not crazy to insist on human readable (thus, Unicode) names, but if you process arbitrary input files you didn't create, particularly files you just found by looking around on disks unsupervised - you need to accept that utter gibberish is inevitable sooner or later and you must cope with that successfully.

Rust's OSStr variants match this reality.
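Python's analogue is the surrogateescape round trip behind os.fsencode/os.fsdecode. A sketch of how an arbitrary non-UTF-8 POSIX filename survives the detour through str (the behavior shown is the Linux one; Windows stores names as 16-bit units instead):

```python
import os

# A perfectly legal Linux filename that is not valid UTF-8:
raw = b"report-\xff.txt"

# fsdecode maps the undecodable byte to a lone surrogate instead of failing,
# and fsencode turns it back into the original byte -- a lossless round trip:
name = os.fsdecode(raw)
assert os.fsencode(name) == raw
```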


This is what I found quite refreshing about Rust — instead of choosing one of the following:

  A) The programmer is an almighty god who knows everything; we just expose him to the raw thing
  
  B) The programmer is an immature toddler who cannot be trusted, so we handle things for them
What Rust does is more along the lines of "you might already know this, but anyway, here is a reminder that you, the programmer, need to make a decision about this".


macOS is interesting: some APIs normalize filenames while others don't. And it causes some very interesting bugs.

One example: when you submit a file in Safari, it doesn't normalize the file name, while file.name in JS does.


Filenames in HFS+ filesystem (an old filesystem used by Mac OS X) are normalized with a proprietary variant of NFD - this is a filesystem feature. APFS removed this feature.


By “proprietary variant” you mean “publicly documented variant” which IIRC is just the normalization tables frozen in time from an early version of Unicode (the idea being that updating your OS shouldn’t change the rules about what filenames are valid).

As for APFS, it ~~doesn’t~~didn’t normalize but I believe it still requires UTF-8. And the OS will normalize filenames at a higher level. EDIT: they added native normalization. At least for iOS; I didn’t dig enough to check if macOS is doing native normalizing or is just normalization-insensitive.


Normalisation is expressly done with the composition of version 3.1 for compatibility: see <https://www.unicode.org/reports/tr15/#Versioning>. If that’s what HFS+ does, then “proprietary variant” is wrong. And if not, I’m curious what it does differently.

(On the use of version 3.1, note that in practice version 3.2 is used, correcting one typo: see <https://www.unicode.org/versions/corrigendum3.html>.)

I find a few references to it being slightly different, but not one of them actually says what’s different; Wikipedia is the only one with a citation (<https://en.wikipedia.org/wiki/HFS_Plus>: “and normalized to a form very nearly the same as Unicode Normalization Form D (NFD)[12]”), and that citation says it’s UAX #15 NFD, no deviations. One library that handles HFS+ differently switches to UCD 3.2.0 for HFS+ <https://github.com/ksze/filename-sanitizer/blob/e990e963dc5b...>, but my impression from UAX #15 is that this should be superfluous, not actually changing anything. (Why is UCD 3.2.0 still around there? Probably because IDNA 2003 needs it: <https://bugs.python.org/issue42157#msg379674>.)

Update: https://developer.apple.com/library/archive/technotes/tn/tn1... has actual technical information, but the table in question doesn’t show Unicode version changes like they claim it does, so I dunno. Looks like maybe from macOS 10.3 it’s exactly UAX #15, but 8.1–10.2 was a precursor? I’m fuzzy on where the normalisation actually happens, anyway.


The `filename-sanitizer` library you have linked has the following comment.

                # FIXME: improve HFS+ handling, because it does not use the standard NFD. It's
                # close, but it's not exactly the same thing.
                'hfs+': (255, 'characters', 'utf-16', 'NFD'),
I wonder what that means...


The technote linked by the parent has a note saying

> The characters with codes in the range u+2000 through u+2FFF are punctuation, symbols, dingbats, arrows, box drawing, etc. The u+24xx block, for example, has single characters for things like "(a)". The characters in this range are not fully decomposed; they are left unchanged in HFS Plus strings. This allows strings in Mac OS encodings to be converted to Unicode and back without loss of information. This is not unnatural since a user would not necessarily expect a dingbat "(a)" to be equivalent to the three character sequence "(", "a", ")" in a file name.

> The characters in the range u+F900 through u+FAFF are CJK compatibility ideographs, and are not decomposed in HFS Plus strings.

The bit about the u+24xx block is misleading: the decompositions of the characters I looked at there (such as ⒜) are compatibility decompositions, which NFD leaves alone anyway. However, the CJK compatibility ideographs are a working example. U+F902 (車) decomposes to U+8ECA (車) regardless of normalization form, but the technote says these must not be decomposed.
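This is easy to check with Python's unicodedata:

```python
import unicodedata

# ⒜ (U+249C) has only a *compatibility* decomposition, so plain NFD
# already leaves it untouched -- no HFS+ special case needed:
assert unicodedata.normalize("NFD", "\u249C") == "\u249C"
assert unicodedata.normalize("NFKD", "\u249C") == "(a)"

# U+F902 has a *canonical* decomposition to U+8ECA, so standard NFD
# rewrites it -- this is where HFS+ deviates by leaving it alone:
assert unicodedata.normalize("NFD", "\uF902") == "\u8ECA"
```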


ZFS can support normalization also:

    $ echo test > $'\xc3\xb6'
    $ cat $'\x6f\xcc\x88'
    cat: ö: No such file or directory

    $ zfs create -o normalization=formD pool/dataset
    $ echo test > $'\xc3\xb6'
    $ cat $'\x6f\xcc\x88'
    test


>APFS removed this feature.

And then brought it back. It normalizes now.


Sure, but at some point you might want to create a file (frequently from user input) or filter files using some user-provided query string -- the kind of use cases that unicode normalization was invented for. So the whole "opaque blob of bytes" filesystem handling is nice if all you want is to not silently corrupt files, but it very obviously doesn't even cover 10% of normal use cases. Rust isn't being super smart; it just has its hands thrown up in the air.


The most common desktop file systems are case-insensitive, which complicates the picture.


Still, it looks like the right thing to do is to let the filesystem do the filesystem's job. The filesystem should be normalizing unicode and enforcing the case-insensitivity and whatnot, but just the filesystem. Wrappers around it, like whatever Nextcloud is doing, should be treating the filenames as a dumb pile of bytes.


I'm not sure this problem even has a "right" solution.

> Wrappers around it like whatever Nextcloud is doing should be treating the filenames as a dumb pile of bytes.

What do you do when the input isn't a dumb pile of bytes, but actual text? (Like from a text box the user typed into?)


Maintain a table that maps the original file name to a randomly generated one that doesn't hit these gotchas.


I'm afraid I don't follow. Who maintains this table and who consumes it? What if they're different entities? How do you prevent it from going out of sync with the file system when the user renames a file? Are you inventing your own file system here? How do you deal with existing file systems?


I assumed that you have a system where file management/synchronization happens strictly through a web interface, and files are not changed or renamed outside this system's knowledge. Under these preconditions, having such a mapping table frees the users from having to abide by whatever restrictions the underlying file system places on valid file names.
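A minimal sketch of that design (all names here are made up): the web layer owns a mapping from user-visible names to opaque on-disk names, so the underlying filesystem never sees the user's string at all.

```python
import uuid

table: dict[str, str] = {}   # user-visible name -> opaque on-disk name

def disk_name_for(display_name: str) -> str:
    """Return a stable, filesystem-safe name for a user-visible one."""
    if display_name not in table:
        table[display_name] = uuid.uuid4().hex   # hex is safe on any filesystem
    return table[display_name]

a = disk_name_for("göteborg.txt")
b = disk_name_for("göteborg.txt")   # same key -> same on-disk name
assert a == b
```

Of course the web layer still has to decide whether two Unicode forms of the same visible name count as the same key, which is the normalization question all over again.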


Oh I was talking about the general case from a programming standpoint. What do you do on a typical local filesystem?

The point I'm trying to get at being, you need to worry about the representation at multiple layers, not just at the bottom FS layer.


And place the files in chunks, and... Wait I think we're getting close to reinventing block storage again ;)


Case insensitivity is a braindead behavior. If desired it should be a fallback path selecting the best match, not the first resort.


The opposite; case insensitivity is what human brains do, we read word WORD Word and woRD as the same thing, it's computer case-sensitive matching which is "brainless". Computers not aligning with what humans do is annoying and frustrating; they should be tools for us, not us for them. There's no way two people would write ö ö and have readers think they were different because one was written in oil-based ink and one in water-based ink, or whatever compares with behind the scenes implementation details like combining form vs single character.

I have just been arguing the same thing in far too much detail in this thread: https://news.ycombinator.com/item?id=29722019


Case insensitivity and "what human brains do" becomes incredibly complicated outside of English. There are also many other things which human brains recognize as the same thing but would be unreasonable to implement in filesystems.

In Japanese, くるま, クルマ, and 車 are all the same word (the first two are the phonetic spelling "kuruma", the latter is the Chinese character). However, in order to know that 車 is read くるま you need to be a native Japanese speaker (or have a dictionary) -- should filesystems have dictionaries to match what a human would think? Search engines that support Japanese have to handle this to some degree, but I humbly suggest that implementing Google Search's language handling code in a filesystem would be an ill-advised decision.

If you wanted to implement the most minimal version of this you would map between katakana and hiragana, but that means you'll need to do this for other languages. For instance, Serbian. Serbian uses two scripts (both of which have upper and lower case forms) and any native Serbian speaker would see "tuđa ljuta paprika" and "туђа љута паприка" as the same text (note that lj became љ). Should that also be automatically translated in the filesystem?

In German, capitalisation is not reversible. ß becomes SS when capitalised but will be lowercased as ss. (There is now a capital version -- ẞ -- but from what I gather it's not widely used.)

Even in English you have British and American spellings of a given word -- native speakers would recognise them as the same thing, but it would not be reasonable to expect a filesystem to map them to the same thing. Initialisms can have multiple representations (N.S.A. vs NSA). And you also have cases where capitalisation actually does distinguish words (May vs may, PRISM vs prism, CAT vs cat, etc). What about fullwidth and halfwidth latin characters (Ｈｅｌｌｏ vs Hello)? Arguably those are even more identical than upper and lower case.

For all of the above reasons, case insensitivity is something which most systems will only ever implement for English and a few other European languages, meaning that it's more of a wart than a fully-working feature. If the argument really is "well, a human would recognise these two names as the same thing, so the filesystem should too" then why are none of the other examples given above handled? If it's too difficult to do correctly (which is my view) then why support any of this in the first place? However, everything should be normalised (NFC or NFD depending on your usecase).


There are a couple arguments against case-insensitive filesystems I think are strong. The first is simply compatibility with existing case-sensitive systems. The second is that case is locale-dependent, so a pair of names could be equivalent or not depending on the device's locale.

I don't think I've seen any good argument against normalization, though.


> word WORD Word and woRD as the same thing

I don't know about anyone else, but I read WORD as someone yelling, Word as designating/specifying a "word" with some importance, and woRD as the mocking Spongebob meme. I absolutely don't read "case insensitive" and I don't think filesystems should either.


You read DOG as someone yelling ‘dog’, not as a different word to ‘dog’. And Dog as a significant dog, not a significant something else.

Imagine if you could only search for ‘dog’ if you had to specify whether the author yelled it or not before you could find it.


It sounds like you're saying that cases should matter in some ways but not in others, which I take no issue with.


Case can have information in it, like color and underline and boldface and italics can carry information. I think it would be clever if Google let me colour my search text and then only found text which was rendered in the same colour, but terrible if colouring my search text was mandatory and it then only found pages with text in the same colour. Likewise terrible if your code editor searched only for code with syntax highlighting matching the colours you typed in the search box.

Dog in bold, italics, red, green, uppercase, lowercase, initialcaps, smallcaps, are all the same word. What "the same" means has fuzzy boundaries and sometimes needs very precise specification, but I personally want the default to be the fuzzy convenient and the hyper-literal to be available as a fallback.

[I notice that I used 'color' and 'colour' here. My native language is UK English and programming languages and much of the internet use US English. I'm not sure if I would want `vim colour.txt` to open `color.txt`. Probably not. PowerShell 7 has a suggestions feature for "you typed a command which wasn't found, here are the most similar command names:" - mentioned in https://github.com/PowerShell/PowerShell/issues/10546 ]


> Dog in bold, italics, red, green, uppercase, lowercase, initialcaps, smallcaps, are all the same word.

But that is a feature of the word Dog. Try the German words maßen (limited amounts) and massen (large amounts), which historically share the same upper case rendering MASSEN. Now someone versed in German could switch to the alternative MASZEN or use the rather modern upper case version of ß. However, the default naive (and most of the time correct) conversion between cases loses a significant amount of information.
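Python's string methods demonstrate the asymmetry directly:

```python
# Uppercasing ß is not reversible by lowercasing:
assert "maßen".upper() == "MASSEN"    # ß -> SS
assert "MASSEN".lower() == "massen"   # SS -> ss; the ß is gone

# casefold() is the Unicode operation meant for caseless *matching*:
assert "maßen".casefold() == "massen"
assert "maßen".casefold() == "MASSEN".casefold()
```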


Have to agree. It's also usually only about 10 lines of code to support both insensitive and sensitive searching for those who can't read English that way.


You've done the same thing here as your other comment. Here suggesting that people "can't read English" and in your other comment suggesting that people "can't get their head around capslock and don't deserve support".

What about people who CAN read English that way, but think having to match case when searching or referencing text hinders more than it helps?


Then the search tool should support not being case sensitive. I understand the efficiency of case insensitive search (otherwise Google's empire wouldn't exist). But having it enforced as the source of all truth is just broken.


> The opposite; case insensitivity is what human brains do, we read word WORD Word and woRD as the same thing, it's computer case-sensitive matching which is "brainless".

Only if different case-variants do not have meaning. When two words that differ only in case have different meaning, we distinguish them (e.g. "moon" and "Moon").


Assuming you mean the distinction between moons in general and the Earth's Moon (Luna), if I wrote "neil armstrong was the first human to walk on the moon" would you think I meant anything other than Neil Armstrong walking on The Moon?

Meaning doesn't go when the case changes in anything like the way meaning goes when the letters change. "neil armstrong was the first human to walk on the roof" is a very different sentence; you can't get anything like that difference with just case changes. If I spoke it, you wouldn't be able to tell if I spoke the correct case or not. Would you want school children searching for "one small step for man, one giant leap for mankind" and Google says "no results" because they used a lowercase m in mankind? Would you want a TV quiz show asking "What is Europa?" and a contestant answers "a moon of jupiter" and the host asks "do you mean moon with a capital m or lowercase m?" before they decide whether the answer is correct?


WORD, Word WoRD....

Sorry to say I tend to use case sensitivity as a filter for me offering support to other developers. I'm not willing to find time for people who can't get their head around "turn on/off caps lock". You don't do it in professional writeups or applications (and I hope not in a CV) so don't pollute my filesystems or codebases with that madness.


I’m not talking about caps lock. I can get my head around case sensitivity, I can use it, it’s worse, and I don’t want to have to use it any more than I want to use filesystem permissions in octal even though I can. Having tools take chmod u+r is easier and doesn’t change the filesystem at all.


Sorry, not sure I see the point here other than that computers provide a human representation of binary data?

If the mapping is non-trivial then, unless you're careful, you end up breaking basic consistency between input and stored data -- hence the weird issues with mangling the unicode chars. If the mapping is trivial there's almost nothing to discuss. If the mapping is many-to-many you're going to have a bad time unless you're consistent with your use of the maps. Then the fun is broken mappings where you get data loss due to incorrect many-to-one and one-to-many mappings.

There are times when caps matter, i.e. code and filesystems are human-readable so should not be arbitrary, but searching these, for instance, makes sense to be insensitive when needed (perhaps even by default).


So you’re fine with ~/Downloads and ~/downloads coexisting as entirely separate directories? And John.McCauley@yahoo.fr and john.mccauley@yahoo.fr being attributed to two different people ;)


First one: yes, though good UI should prevent it from happening unless the user really intended it (for example I have ~/Documents symlinked into Dropbox, so ~/documents could be local-only documents)

Second one: no, emails are not filenames, and more generally distinguishability is more important for identifiers. In cases where identifiers like emails need to be mapped to filenames, like caches, they should be normalized.
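For example, a hypothetical cache layer might canonicalize the identifier before deriving a filename, so both Unicode forms and any capitalization of jöhn land on the same entry (a sketch; the function name is made up):

```python
import hashlib
import unicodedata

def cache_filename(identifier: str) -> str:
    # Normalize to NFC and casefold, then hash: the on-disk name is
    # pure hex, so no filesystem quirk can ever disagree about it.
    canonical = unicodedata.normalize("NFC", identifier).casefold()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# NFD vs NFC form of "jöhn", different capitalization -- one cache entry:
assert cache_filename("Jo\u0308hn@example.com") == cache_filename("j\u00F6hn@example.com")
```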


> So you’re fine with ~/Downloads and ~/downloads coexisting as entirely separate directories?

Case (in)sensitivity for filenames is a non-issue in my experience. Never had problems with either convention. As for emails, I do think insensitivity was the right choice.


The RFC states that email addresses are case sensitive.

The local-part of a mailbox MUST BE treated as case sensitive.

Section 2.4 RFC 2821, https://www.ietf.org/rfc/rfc2821.txt


Ah interesting. I guess the case insensitivity (for incoming email) is a decision of the popular services then, like Gmail's decision to consider johndoe equivalent to john.doe.


My guess would be that the local part of an email address would usually map to a directory on case sensitive filesystems...


can we just say no to capital letters? (or lowercase?)

do capital letters have a good enough usage case to justify their continued existence?


You are free to stop using capital letters, but good luck getting everyone to go along. Capitals have been around for centuries (they’re older than the printing press) and aren’t going anywhere.


The lower-case letters in Greek/Latin/Cyrillic are the new additions, initially we only had what is now called upper-case.


Fun fact: The Apple Ⅱ and Ⅱ+ originally only did upper-case, and it was very popular to add a Shift Key / lower-case mod via one of the gamepad buttons: https://web.archive.org/web/20010212094858/http://home.swbel...


That works for programmers, but not for users. There could be several files with the same name, but with different encodings. Worse, depending on how your terminal encodes user input, some of them might not be typable.


From the user's perspective I don't want any normalisation at all. It's fine as long as you only have one file system, but as soon as you get multiple file systems with conflicting rules (which includes transferring files to other people) it becomes hell. Unfortunately we are stuck with that hell.


Falls over on the fact that I don’t want to be able to write these two files in the same dir. If I write file ö1.txt and ö1.txt, then I want to be warned that the file exists, even if the encoding is different, when I use two different apps but try to write the same file.

The same applies for a.txt and A.txt on case insensitive file systems (as someone pointed out the most common desktop file systems are).



