Hacker News

> Q: What do you think about Byte Order Marks? A: According to the Unicode Standard (v6.2, p.30): "Use of a BOM is neither required nor recommended for UTF-8". [...] Using BOMs would require all existing code to be aware of them, even in simple scenarios as file concatenation. This is unacceptable.

Then your site "UTF-8 everywhere" is misnamed, because standards-following UTF-8 can have a BOM. It's neither required nor recommended, but it is allowed, so you might see one, and if you follow the standard you have to deal with it. It's not a matter of "this would require all existing code to handle them" - that's not hypothetical, that's the current world: to be standards-compliant, all existing code already needs to be aware of them. Much of it isn't, which means it's broken. Declaring the BOM "unacceptable" is meaningless, except to say you're rejecting the standard and doing something incompatible and broken because it's easier.

Which is a position one can take and defend, but it's not a good position for a site claiming to be pushing for people to follow the standard. What it is, is yet another non-standard ad-hoc variant defined by what some subset of tools the authors use can/can't handle in April 2020.

> "the UTF-8 BOM exists only to manifest that this is a UTF-8 stream"

Throwing the word "only" in there doesn't make it go away. It exists as a standards-compliant way to distinguish UTF-8 from ASCII, not recommended but not forbidden.

> "A: Are you serious about not supporting all of Unicode in your software design? And, if you are going to support it anyway, how does the fact that non-BMP characters are rare practically change anything"

Well, in the same way, how does the fact that UTF-8+BOM is rare practically change anything? At some level you're either pushing for everyone to follow the standard even when it's inconvenient, because that makes life better for everyone overall - as you do with surrogate pairs and indexing - or you're creating another ad-hoc, incompatible variation of UTF-8 which you prefer to the standard, and trying to strong-arm everyone else into using it with threats of being incompatible with all the code that already does it wrong.

Being wary of Chesterton's Fence, presumably there's some company or system which got UTF-8+BOM added to the standard because they wanted it, or needed it.



100% agree.

> using BOMs would require all existing code to be aware of them, even in simple scenarios as file concatenation

Absolutely! Any app that writes UTF files can (and probably should) avoid writing a BOM. But any program that reads UTF files must handle one. A lot of apps write UTF-8 with a BOM by default - Visual Studio, for example.
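A minimal sketch of what "handle a BOM" means on the read side, in Python (the helper name is mine; Python's built-in "utf-8-sig" codec does the same thing):

```python
# Tolerate an optional leading BOM when decoding UTF-8 bytes.
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8

def decode_utf8(data: bytes) -> str:
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]  # drop the BOM, keep the rest
    return data.decode("utf-8")
```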

You can NOT concatenate two UTF-8 streams and expect that the resulting stream is also a valid UTF-8 stream. NO tool should assume that, ever.


> You can NOT concatenate two UTF-8 streams and expect that the resulting stream is also a valid UTF-8 stream.

Actually you can; the ability to concatenate UTF-8 streams is intentionally part of the design of UTF-8. The BOM is an ordinary Unicode code point and can occur in the middle of a valid UTF-8 stream, where it should be treated as either a zero-width no-break space or an unsupported character (which only affects rendering). So concatenating two UTF-8 streams with leading BOMs still results in a valid UTF-8 stream, albeit with an extra zero-width space in the middle.
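You can check this in a few lines of Python:

```python
# Two UTF-8 streams, each with a leading BOM (U+FEFF, encoded as EF BB BF).
a = "\ufeffhello".encode("utf-8")
b = "\ufeffworld".encode("utf-8")

combined = a + b
# The concatenation still decodes cleanly; the second BOM survives as an
# ordinary U+FEFF code point in the middle of the text.
text = combined.decode("utf-8")
assert text == "\ufeffhello\ufeffworld"
```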

The bigger problem with the BOM is that it breaks transparent compatibility with ASCII. Absent a leading BOM, a UTF-8 stream containing only code points 0-127 is binary-identical to an ASCII-encoded text stream and can be handled with tools that are not UTF-8 aware. This was an explicit design consideration for both Unicode and UTF-8. Add the BOM, however, and your file is no longer plain ASCII, which can lead to syntax errors or other issues that are difficult to diagnose because the BOM is invisible in UTF-8 aware text editors.
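For example, in Python:

```python
# Pure-ASCII text encodes to the exact same bytes in UTF-8 and ASCII.
ascii_text = "plain text"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Prepend a BOM and the bytes are no longer valid ASCII: the BOM's
# first byte, 0xEF, is outside the 0-127 range.
bom_bytes = ("\ufeff" + ascii_text).encode("utf-8")
try:
    bom_bytes.decode("ascii")
except UnicodeDecodeError:
    pass  # an ASCII-only tool would choke here
```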

I think the BOM was a mistake (along with the variable-length multi-byte encodings it was created to support), but unfortunately at this point we're stuck with it. (The BOM is actually prohibited in the encoding forms with an explicit byte order, like UTF-16BE; it would have been really nice if the same policy had been applied to UTF-8, where byte order is irrelevant.) The best we can do is recommend that new programs omit the BOM when outputting UTF-8, and, on input, skip a BOM at the beginning and convert U+FEFF to U+2060 WORD JOINER anywhere else.
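That input-side recommendation is a few lines of Python (the function name is mine):

```python
def normalize_bom(text: str) -> str:
    # Drop a leading BOM; replace any interior U+FEFF with U+2060 WORD
    # JOINER, which carries the same "don't break here" semantics and is
    # the character Unicode recommends for that purpose.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.replace("\ufeff", "\u2060")
```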


Interesting, I thought a BOM-in-the-middle was invalid. I know apps are even more likely to choke on that than a leading BOM though.

In any case, you need to handle it in every app that claims to read UTF. The loss of compatibility is indeed the biggest problem and I agree the BOM should be omitted when possible, but that doesn’t change that it’s part of the spec and millions of UTF files have a BOM.

Even if 100% of all apps stopped using a BOM today you couldn’t ignore it in a parser.


Downvoting doesn't make the BOM stop being part of the standard either, btw.

Yes, supporting the BOM on arbitrary UTF-8 streams varies between difficult and impossible, but then push to get it removed from the standard, or state that you don't support the standard. Don't pretend you support the standard while ignoring the bits you don't like; that's dishonest and unhelpful.



