[dupe] How the Unicode Committee Broke the Apostrophe (tedclancy.wordpress.com)
35 points by Ovid on June 14, 2015 | hide | past | favorite | 15 comments


Makes a real effort to completely gloss over a very common English use of apostrophes. From the article:

> Consider any English word with an apostrophe, e.g. “don’t”. The word “don’t” is a single word. It is not the word “don” juxtaposed against the word “t”. The apostrophe is part of the word, which, in Unicode-speak, means it’s a modifier letter, not a punctuation mark, regardless of what colloquial English calls it.

> According to the Unicode character database, U+2019 is a punctuation mark (General Category = Pf), while U+02BC is a modifier letter (General Category = Lm). Since English apostrophes are part of the words they’re in, they are modifier letters, and hence should be represented by U+02BC, not U+2019.

> (It would be different if we were talking about French. In French, I think it makes more sense to consider «L’Homme» as two words, or «jusqu’ici» as two words. But that’s a conversation for another time. Right now I’m talking about English.)
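(For the curious: the article's factual claim about the General Categories checks out, and is easy to verify with Python's standard `unicodedata` module.)

```python
import unicodedata

# General Category of the codepoints the article contrasts
print(unicodedata.category('\u2019'))  # Pf -- Final_Punctuation (RIGHT SINGLE QUOTATION MARK)
print(unicodedata.category('\u02bc'))  # Lm -- Modifier_Letter (MODIFIER LETTER APOSTROPHE)
print(unicodedata.category("'"))       # Po -- Other_Punctuation (ASCII APOSTROPHE, U+0027)
```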

OK, I've considered an English word: "man's". In the sentence "That man's pants are on fire!", this is usually considered a single word, the genitive case of "man" (personally, I'm not a huge fan of that approach, since the "genitive" 's attaches to phrases, not to words, but it is the mainstream position in linguistics).

In the sentence "That man's about to jump", on the other hand, the "word" "man's" is two words joined by an apostrophe, exactly as in French "l'homme". These clitics aren't exactly rare in English. The author shows some linguistic training in the comments to his piece, but never once mentions clitics, and fails to address them when another commenter brings them up.

Use U+0027 for English apostrophes. ;p


I agree with U+0027.

IMHO the fact that all these new Unicode characters look very similar to existing ones, maybe even pixel-identical in particular fonts, is a source of extremely unpleasant surprises.

https://en.wikipedia.org/wiki/IDN_homograph_attack

https://en.wikipedia.org/wiki/Unicode_equivalence


Depends what you mean by "word". "Man's" is not the headword you'll find in a dictionary, but it would be separated by a space in a sentence. I think this latter case is what matters here. "Man' s" doesn't make sense.


Is a clitic distinct from a normal contraction?

I am not up on linguistics, so I'm just curious, but I'd class "don't" and the "man's" of "the man's about to jump" the same way: as just contractions.


What's a "normal contraction"?

The mainstream view is that English "don't" is just a word, the negative form of "do". The apostrophe is a historical accident. The author of this piece gives one argument for this view in the comments, pointing out that "don't" can appear in contexts where "do not" is ungrammatical.

A "clitic" in linguistics refers to an item which is (1) a word in the sense of having its own dictionary entry (which I might call "at the lexical level"), but (2) not a word at the phonological level -- clitics depend for their pronunciation on the word(s) (usually just one word) next to them. So, for example, the 's of "the man's about to jump" is lexically a form of the verb is, but it's been reduced down to zero syllables. The indefinite article (a/an) is another English clitic, and you can observe its pronunciation changing according to the word that follows it in the sentence pair:

1. A cow trampled me.

2. An elephant trampled me.

Traditionally, the definite article ("the") is also a clitic, with one pronunciation before consonants and a different pronunciation before vowels. The before-consonants pronunciation is in the process of becoming universal.

Languages differ in whether clitics are written together with the words they attach to phonologically or not. Ancient Greek clitics are traditionally separated with orthographic spaces (we know they're still clitics because they affect the placement of word accents). Latin clitics are written as part of the same word: "felis canisque" (="the cat and the dog", where -que means "and"). English uses both approaches.

SUMMARY, about the specific example you chose: "don't" and the "man's" of "the man's about to jump" are not in the same class, because "don't" is just a word with no internal structure, while that "man's" is two words realized in speech as a single syllable. That "man's" might be thought of as a "normal contraction", a term with no meaning that I know of, but linguistically it is a full word ("man") with a clitic ("'s") attached. However, clitics in general are not necessarily zero syllables long.


In an ideal world, yes, this is how it would work. But in practice it is not. The vast, vast majority of documents using non-straight quotes use ’ (U+2019[1], the Windows-1252 \x92 right curly quote that Microsoft Word <3s) for apostrophes. There's not much that can be done about that.

Unicode has to strike a balance between what's most "correct" and how the real world actually uses it.

[1] I was looking at that codepoint and thought it must be wrong. It's too big a number for a Latin-1 codepoint. Aren't the first 256 characters of Unicode just Latin-1? Well, exactly. They're Latin-1, rather than Windows-1252, which is where the now-infamous curly “smart quotes” come from. The two encodings are easily confused, because they're mostly the same. The difference is that Microsoft replaced the rarely used C1 control codes in the 0x80–0x9F range (who needs those, really? ASCII had too many already) with more useful new printable characters.
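(The mixup is easy to demonstrate in Python: byte 0x92 is the curly right quote under Windows-1252, but an invisible C1 control character under Latin-1.)

```python
# The same byte, decoded under the two commonly confused encodings:
print(b'\x92'.decode('cp1252'))   # '\u2019' -- the curly right single quote
print(b'\x92'.decode('latin-1'))  # '\x92'  -- an invisible C1 control character
```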


"Unicode has to strike a balance between what's most 'correct' and how the real world actually uses it."

That train left Unicode station a very long time ago. They have chosen correctness over convenience too many times to switch tactics now.


To clarify: I think it'd be better if the Unicode Consortium changed the properties of U+2019 than changed which character is the canonical representation of an apostrophe, given that you won't be able to change most existing documents.

If you make apostrophes a different character, how would you make sure apostrophes and end quotes aren't confused? Unless you're a Unicode fanatic, you probably won't manually edit sequences of hexadecimal codepoints.


Sadly, Unicode is a clusterfuck. But can anything be done about it? Or should we just be happy we for once have managed to get a decent adoption of something interoperable?


Unicode is not a clusterfuck. Overall, it is a very comprehensive, well-thought-out standard that has improved the situation for interoperable internationalization dramatically over the hundreds of separate encodings that preceded it.

The clusterfuck is mostly in the essential complexity of the problem; the world's languages and writing systems are quite complex and varied, and all of the writing systems and punctuation conventions were designed for reading, hand-writing and hand-typesetting by people who understood the language in question, not automatic typesetting and processing by a general purpose computer.

The complexity of the problem was increased by the necessity of providing migration paths from legacy encodings to Unicode and back again; without such a guarantee, bootstrapping the world into using Unicode would have been a much more difficult proposition, but that constraint also means that many oddities of legacy encodings have had to be preserved in Unicode in order to be able to preserve that round-trip mapping.

Unicode, and sister projects like the Common Locale Data Repository, are doing an admirable job of navigating and standardizing this complex problem.

There are definitely aspects of the Unicode process where they have gotten it wrong; UCS-2/UTF-16 is one of them, and in hindsight it is apparent that UTF-8 is superior in pretty much every way. Having a variable-width encoding in which most of the world's text nonetheless fits in a fixed width, and which has endianness issues that necessitate a non-textual byte order mark to disambiguate, has caused a number of problems and incompatibilities. There may be a few other points of legitimate criticism, like some aspects of Han unification.
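(A quick sketch of the endianness problem: the same character produces different byte sequences in the two UTF-16 byte orders, so a BOM is needed to tell them apart; UTF-8 has no such ambiguity.)

```python
import codecs

# The same character in the two UTF-16 byte orders -- without a BOM,
# a reader cannot tell which one it has received:
print('A'.encode('utf-16-be'))  # b'\x00A'
print('A'.encode('utf-16-le'))  # b'A\x00'

# The BOM (U+FEFF) is prepended to disambiguate:
print(codecs.BOM_UTF16_LE)      # b'\xff\xfe'

# UTF-8 is byte-order-free, so no BOM is needed:
print('A'.encode('utf-8'))      # b'A'
```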

But on the whole, outside of a few problems like those, the "clusterfuck" is caused not by Unicode, but simply by the essential complexity of the problem involved. Language and text are simply difficult things to model in a computer.


> The complexity of the problem was increased by the necessity of providing migration paths from legacy encodings to Unicode and back again; without such a guarantee, bootstrapping the world into using Unicode would have been a much more difficult proposition, but that constraint also means that many oddities of legacy encodings have had to be preserved in Unicode in order to be able to preserve that round-trip mapping.

While I agree that preserving round-trip integrity was essential for Unicode's success, I'm not sure the approach taken to achieve it was the best one. I would have preferred the complexity tradeoff to be shifted onto software converting between Unicode and legacy encodings, by having more complex mapping tables and a cleaner codepoint space.

I also think that Unicode Consortium should have been more aggressive in segregating (and discouraging the general use of) legacy compatibility features/codepoints and the stuff that is actually supposed to be used. My personal pet peeve is precomposed characters.

On a more general note, I sometimes wonder if it would have been beneficial to have separate layers in Unicode and to focus more on providing generic primitives. As a simple example, it is mighty convenient that I can type 2³ = 8 in plain text, but arguably it would be even nicer if instead of a special 'SUPERSCRIPT THREE' codepoint there were a generic superscript modifier codepoint that could be combined with any character.

Speaking of superscripts, they demonstrate well one aspect that I dislike in Unicode: the way it has absorbed legacy encodings verbatim. The numeric superscripts (e.g. ⁰ ¹ ² ³ ⁴ ⁵ ⁶) happen to have an inconsistent look on my machine, because the superscripts for 1, 2, and 3 are from Latin-1 while the rest are in their own block.
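(The split is visible in the codepoints themselves: superscript 1, 2, and 3 sit in the Latin-1 Supplement block, inherited from ISO 8859-1, while the rest live in the Superscripts and Subscripts block at U+2070.)

```python
# Superscript digits and their codepoints: note 1-3 are outliers
# in the Latin-1 Supplement (U+00B9, U+00B2, U+00B3).
for ch in '⁰¹²³⁴':
    print(ch, hex(ord(ch)))
# ⁰ 0x2070
# ¹ 0xb9
# ² 0xb2
# ³ 0xb3
# ⁴ 0x2074
```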


Iʼm sold, I think... (U+02bc isn't really intended for such use, but until there's a proper alternative, other than U+2019 or U+0027, I'm using it)[1].

    (in ~/.XCompose)
    include "%L"
    <Multi_key> <apostrophe> <minus>                : "ʼ"   U02BC   # MODIFIER LETTER APOSTROPHE
[1] A potential problem with U+0027 is that the low-ASCII ' and " have uses for demarcating things (like attribute values in HTML, most popularly), so if you're editing anything that uses ' for markup, you can't search and replace based on ' anymore.
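(One rough workaround, sketched in Python: only replace apostrophes that sit between two word characters, leaving markup delimiters alone. This heuristic is hypothetical and imperfect — it misses trailing possessives like "dogs'".)

```python
import re

# Replace only word-internal apostrophes with U+02BC,
# leaving quote/markup delimiters untouched.
text = "don't touch 'quoted' attributes"
fixed = re.sub(r"(?<=\w)'(?=\w)", '\u02bc', text)
print(fixed)  # donʼt touch 'quoted' attributes
```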


"Using U+2019 is inconsistent with the rest of the standard" I agree with the article, just my negativity is such I would say the correct statement is more like: "Using U+2019 is inconsistent with good use, making it consistent with the rest of the mess that is the standard."



I think there are a few things wrong with this argument. I'm unconvinced by the argument that this should actually be considered a modifier letter.

  Consider any English word with an apostrophe, e.g. 
  “don’t”. The word “don’t” is a single word. It is not the 
  word “don” juxtaposed against the word “t”. The 
  apostrophe is part of the word, which, in Unicode-speak, 
  means it’s a modifier letter, not a punctuation mark, 
  regardless of what colloquial English calls it.
The definition of a modifier letter is (http://www.unicode.org/versions/Unicode7.0.0/ch07.pdf#G15832):

  Modifier letters, in the sense used in the Unicode 
  Standard, are letters or symbols that are typically 
  written adjacent to other letters and which modify their 
  usage in some way. They are not formally combining marks 
  (gc=Mn or gc=Mc) and do not graphically combine with the 
  base letter that they modify. They are base characters in 
  their own right. The sense in which they modify other 
  letters is more a matter of their semantics in usage; they 
  often tend to function as if they were diacritics, 
  indicating a change in pronunciation of a letter, or 
  otherwise distinguishing a letter’s use. Typically this 
  diacritic modification applies to the character preceding 
  the modifier letter, but modifier letters may sometimes 
  modify a following character. Occasionally a modifier 
  letter may simply stand alone representing its own sound.
Punctuation, on the other hand, is (http://www.unicode.org/faq/punctuation_symbols.html):

  Punctuation marks are standardized marks or signs used to 
  clarify the meaning and separate structural units of text.
Based on these definitions, the apostrophe seen in contractions and possessives is definitely punctuation, not a modifier letter. Modifier letters indicate some effect on sound or pronunciation, either modifying an adjacent letter or having a sound on their own. U+02BC (MODIFIER LETTER APOSTROPHE) is such an example, being used to indicate a glottal stop.

Apostrophes used in contractions and possessives, however, have no effect on pronunciation; instead, just as in the definition of punctuation, they are used to "clarify the meaning and separate structural units of text."

  But we shouldn’t be perpetuating this problem. When a 
  programmer is writing a regex that can match text in 
  Chinese, Arabic, or any other human language supported by 
  Unicode, they shouldn’t have to add an exception for 
  English.
Thinking that it's possible to do text processing in a language or writing system neutral way is a fallacy. Unicode simply provides an encoding that allows all of these writing systems in a single document, plus a number of algorithms that are designed to be fairly reasonable across the entire encoding, but which cannot be correct for all languages and writing systems without specific tailoring.

Many writing systems do not use spaces between words. Any form of word segmentation for these writing systems will necessarily be language specific, generally involving dictionaries. Using a regex like \w+ on Chinese or Thai text is fairly meaningless, as it will generally match an entire sentence at a time, rather than actually matching a single word.
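(On the English side of the same point, the choice of apostrophe really does decide whether `\w+` sees a contraction as one word or two, since Python's `\w` follows the Unicode categories — it includes modifier letters (Lm) but not final punctuation (Pf).)

```python
import re

# U+02BC (Lm) counts as a word character; U+2019 (Pf) does not.
print(re.findall(r'\w+', "don\u02bct"))  # ['donʼt']     -- one word
print(re.findall(r'\w+', "don\u2019t"))  # ['don', 't']  -- split in two
```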

  For godsake, apostrophes are not closing quotation marks!
No, they are not. However, they also aren't modifier letters. If you wanted to provide a distinction for the purposes mentioned here, you would probably need to add a new, distinct punctuation character "curly apostrophe" or something of the sort (since the ASCII range apostrophe can't be reused due to its overloaded meaning). However, even if you did that, you would still need to deal with all of the legacy documents which use ASCII apostrophe and closing quotation marks; you wouldn't actually be able to simplify the implementation by making the assumptions that a closing quotation mark was always actually closing a quotation.

Now having three different characters that looked identical (the modifier letter apostrophe, the closing quotation mark, and the punctuation apostrophe) would additionally add to confusion.

Even if you didn't introduce a new character, and instead used the modifier letter apostrophe as a punctuation apostrophe, you would still have all of the problems with legacy documents; it would take years for this change to make its way through all of the various word processing programs and text editors, and even after it had, there would still be existing documents using the old conventions, etc.

In short, text processing is hard, because text conventions were designed for human readers who know the language, not computers trying to process text in a language-independent way, and they were designed either through handwriting or manual typesetting, not keyboard entry. You are never going to achieve a perfect text processing model that can handle all of the world's languages simply by using particular global Unicode properties of characters and applying a simple algorithm or categorization on them. A lot of text processing will need to be contextual, language (and locale) specific, and involve dictionaries.

I don't think that switching from the punctuation closing quote character to the modifier letter apostrophe for the punctuation apostrophe is likely to help much; and the confusion caused by nearly 20 years of documents that follow the existing conventions, and the resulting need to support both conventions, is likely to make the situation much worse, not better.



