Unicode: Good, Bad, and Ugly (2011)

danso · on Nov 11, 2015

What a great, useful read...maybe its text density doesn't follow best practices for a slidedeck but something about dividing up all those code examples across slides made this the most engaging multi-language writeup about Unicode. I always think I know how complicated it is but I think the OP set me to a new level of realized ignorance.

Though I also like that my perceived difficulties with Unicode when moving from Ruby to Python are not just imaginary or out of ignorance, but seem to be actual differences/flaws in implementation. Also did not know about the `regex` module for Python, which aims to replace the standard `re` (and was just updated this week): http://pypi.python.org/pypi/regex

Veedrac · on Nov 11, 2015

regex is one of the most underused packages IMHO, even for people using straight ASCII. It even has fuzzy matching!

re · on Nov 11, 2015

These slides are from 2011, so some things have changed for the better in the interim. For example, Python now uses a flexible internal string representation that ensures that characters in a string are always whole code points, addressing the "inherent brokenness" called out in the slides: https://www.python.org/dev/peps/pep-0393/
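A minimal sketch of what that change means in practice, assuming Python 3.3 or later (where PEP 393 applies). The variable name `p` is only for illustration:

```python
# On Python 3.3+ a str is a sequence of whole code points, even for
# characters outside the Basic Multilingual Plane (BMP).
p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"

print(len(p))                      # counts code points, not UTF-16 code units
print(hex(ord(p[0])))              # indexing yields the full code point
print(len(p.encode("utf-16-le")))  # in UTF-16 this needs a surrogate pair (4 bytes)
```

On an old "narrow" Python 2 build the same string would report a length of 2, which is the behaviour the slides complain about.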

Avernar · on Nov 11, 2015

For what reason other than perhaps font rendering code do you need to index code points in a string? Everything you think you need code points for (character counting, truncation, etc) you actually need grapheme clusters.

I wish Python went with UTF-8 instead of their multi width internal representation.
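To illustrate the gap between code points and grapheme clusters, here is a rough sketch using only the standard library. The helper `rough_clusters` is hypothetical and only glues combining marks onto the preceding base character; proper segmentation follows UAX #29 and covers far more cases (emoji sequences, Hangul jamo, and so on):

```python
import unicodedata

def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper: an approximation only, not full UAX #29 segmentation.
    """
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(s))                  # number of code points
print(len(rough_clusters(s)))  # number of (approximate) user-perceived characters
print(repr(s[:1]))             # truncating by code point drops the accent
```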

Veedrac · on Nov 11, 2015

The problem IMHO is more about not regressing the large body of already-written code that assumes O(1) indexing, and less a problem of principle.

Avernar · on Nov 11, 2015

Most code works perfectly fine indexing bytes in a UTF-8 string. Anything that looks for stuff in the ASCII range such as parsers would not need to be changed.
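A small sketch of why that holds, assuming well-formed UTF-8 input: every byte of a multi-byte sequence has its high bit set, so an ASCII delimiter can never appear inside an encoded non-ASCII character. The sample text is made up for illustration:

```python
# Sample text with Latin-1, CJK, and non-BMP characters separated by commas.
text = "na\u00efve,\u65e5\u672c\u8a9e,\U0001D4AB"
raw = text.encode("utf-8")

# Split on the ASCII comma at the byte level, then decode each field.
fields = raw.split(b",")
decoded = [f.decode("utf-8") for f in fields]
print(decoded)

# Every byte of every non-ASCII character is >= 0x80, so none can equal b",".
non_ascii_bytes = [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]
print(all(b >= 0x80 for b in non_ascii_bytes))
```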

Python 3 already required all code to be looked over because of the string literal change so it wouldn't be much different.

Joeri · on Nov 11, 2015

PHP now also has intl, which provides access to the gamut of functionality of libicu, so the php part of the deck needs a near-total rewrite.

gkoberger · on Nov 11, 2015

Broken on mobile and hard to navigate on desktop. Here's an easier version: http://output.jsbin.com/qewonisinu/1

devy · on Nov 11, 2015

In my own experience, the biggest Unicode implementation failure was MySQL's choice of a 3-byte, BMP-only UTF-8 instead of full 4-byte Unicode support.[1] In retrospect, the decision to go with a broken subset implementation has caused more trouble, confusion, and incompatibility than its claimed benefits of speed/performance/simplification were worth. A simple Google search shows that almost everyone recommends avoiding `utf8` and using `utf8mb4` instead.[2][3][4]

As a result, if your MySQL databases/tables/columns are using `utf8` instead of `utf8mb4` (a MySQL invention) charset, you cannot store / retrieve emoji characters properly.
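A quick way to see the 3-byte limit from the application side, sketched in Python. This does not talk to MySQL at all; `fits_three_byte_utf8` is a hypothetical helper that only checks whether every code point is in the BMP, which is the restriction described above:

```python
def fits_three_byte_utf8(text):
    """Hypothetical check: True if every code point encodes to <= 3 UTF-8 bytes."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)

bmp_char = "\u65e5"     # a CJK character inside the BMP
emoji = "\U0001F600"    # GRINNING FACE, outside the BMP

print(len(bmp_char.encode("utf-8")))  # bytes needed for the BMP character
print(len(emoji.encode("utf-8")))     # bytes needed for the emoji
print(fits_three_byte_utf8("caf\u00e9 " + bmp_char))
print(fits_three_byte_utf8("hi " + emoji))
```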

[1] https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8...

[2] https://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode...

[3] https://mathiasbynens.be/notes/mysql-utf8mb4

[4] https://www.drupal.org/node/1314214

Ulti · on Nov 11, 2015

Perl 6 has some of the best unicode support out there as default behaviour. Even supports all unicode digits as numeric literals. So:

    1 + ൭ == 8
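For comparison, a hedged Python sketch, assuming the digit in the snippet above is MALAYALAM DIGIT SEVEN (U+0D6D). Python does not accept such digits as numeric literals in source code, but `int()` and `unicodedata` understand them:

```python
import unicodedata

seven = "\N{MALAYALAM DIGIT SEVEN}"  # assumed to be the digit used above

print(unicodedata.category(seven))  # general category of the character
print(unicodedata.digit(seven))     # its digit value
print(1 + int(seven) == 8)          # int() parses Unicode decimal digits
```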

shiro · on Nov 11, 2015

Impressive. But I wonder if I want to see something like this in the code:

    ٦1٥٠3 + ٤६੬៩ - ৭۹੧

Demiurge · on Nov 11, 2015

I wish this was updated. I thought I'd check some things in python, and slide 37 about Python not treating characters as smallest units does not apply anymore. This seems valid:

  Python 3.5.0 (default, Sep 22 2015, 12:32:59) 
  [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.72)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import re
  >>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
  >>> print(g)
  ᾲ
  >>> print(re.search(r'\w', g))
  <_sre.SRE_Match object; span=(0, 1), match='ᾲ'>
  >>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
  >>> print(p)
  𝒫
  >>> print(re.search(r'\w', p))
  <_sre.SRE_Match object; span=(0, 1), match='𝒫'>
  >>> print(re.search(r'..', p))
  None
  >>> print(re.search(r'.', p))
  <_sre.SRE_Match object; span=(0, 1), match='𝒫'>
  >>>

MindTwister · on Nov 11, 2015

Reads better if you install the fonts recommended at the end: Alfios and Symbola (http://users.teilar.gr/~g1951d/) and Everson Mono, available here: http://www.evertype.com/emono/

tempodox · on Nov 11, 2015

That presentation is evil. It doesn't behave as a web page but like some stupid Keynote document.

ygra · on Nov 11, 2015

https://en.wikipedia.org/wiki/S5_%28file_format%29

You can also click on the Ø in the lower-right to remove all that and view it as a normal web page.

dfc · on Nov 11, 2015

Can someone put 2011 in the title?

lhecker · on Nov 11, 2015

I guess the author must have been pretty happy when Swift came out, with all of its glorious Unicode support...

forrestthewoods · on Nov 11, 2015

Website doesn't work on iPad

scrollaway · on Nov 11, 2015

Given that the site was around before the ipad was a thing, isn't it more fair to say "iPad doesn't work with this site"?

TuringTest · on Nov 11, 2015

> Given that the site was around before the ipad was a thing, isn't it more fair to say "iPad doesn't work with this site"?

No. Because the slides have been published to the world wide web, they should follow WWW standards, and then they could be viewed on any future device that followed those. Those are the expectations of the publishing platform, which this presentation doesn't follow.

The site instead chooses to break browser compatibility, and therefore many standard actions are impossible:

- Navigating back and forth between slides.

- Selecting text (such as the URL in the first slide, that I had to type in the address bar instead of being able to copy/paste it).

- Deep-linking a URL to any intermediate page (there's a workaround to this, but it's not obvious).

Apparently, its limitations also include not being able to see this simple content in any browser, like the native iOS one.

lqdc13 · on Nov 11, 2015

I very much dislike presentations that keep the slide history in the browser history.

Now, to go back to the original page you were on before you went to the site, you have to search through your history or click 100x times on the back button.

I also never found a use for deep-linking to an intermediate page.

TuringTest · on Nov 11, 2015

>Now, to go back to the original page you were on before you went to the site, you have to search through your history or click 100x times on the back button.

Useful trick: if you long-press on the back button in most browsers, it shows a list that allows you to jump several pages back.

However, I was not asking for keeping the slide history in the browser history - the presentation doesn't even have back/forward buttons to get back to the previous slide while staying in the page.

> I also never found a use for deep-linking to an intermediate page.

It's used for referencing the content of the slide, so that people coming from an external site will see exactly the page that you're talking about.

I hate it when someone quotes some content in a slideshow and links you to the first page, forcing you to guess which part of the file contains the content they're talking about.

forgotmypassw · on Nov 11, 2015

>- Navigating back and forth between slides.

That has nothing to do with browser compatibility, it simply changes the element visibility.

>- Selecting text (such as the URL in the first slide, that I had to type in the address bar instead of being able to copy/paste it).

Works just fine if you press ^C before releasing the mouse button.

>- Deep-linking a URL to any intermediate page (there's a workaround to this, but it's not obvious).

True but again unrelated to browser compatibility.

Although you're right, it's extremely annoying and as stupid as single page apps and infinite scrolling websites.

micampe · on Nov 11, 2015

All these years I’ve been told that web is better than native because it’s cross platform and future proof.

Tomte · on Nov 11, 2015

Do you expect the website's author to read your comment here?

If not, what's the use? Mail him if it bothers you!

forrestthewoods · on Nov 11, 2015

It's not uncommon for authors to read HN comments on their content. It also serves as a warning to fellow HN users.

tenfingers · on Nov 11, 2015

This is a fallacy on all link-aggregator sites: "Post a comment: assume author reads". The author doesn't read unless he's also the poster, or when he gets the chance some days later when the link is already off the front page.

If you want your comment to be heard by the author, contact the author!