Unicode: Good, Bad, and Ugly (2011) (azabani.com)
77 points by Tomte on Nov 11, 2015 | 27 comments


What a great, useful read... maybe its text density doesn't follow best practices for a slide deck, but something about dividing up all those code examples across slides made this the most engaging multi-language writeup about Unicode I've read. I always think I know how complicated it is, but I think the OP set me to a new level of realized ignorance.

Though I also like that my perceived difficulties with Unicode when moving from Ruby to Python are not just imaginary or out of ignorance, but seem to be actual differences/flaws in implementation. Also did not know about the `regex` module for Python, which aims to replace the standard `re` (and was just updated this week): http://pypi.python.org/pypi/regex


regex is one of the most underused packages IMHO, even for people using straight ASCII. It even has fuzzy matching!


These slides are from 2011, so some things have changed for the better in the interim. For example, Python now uses a flexible internal string representation that ensures that characters in a string are always whole code points, addressing the "inherent brokenness" called out in the slides: https://www.python.org/dev/peps/pep-0393/
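A minimal sketch of what that looks like in practice, assuming CPython 3.3 or later (where PEP 393 landed). The variable names are just for illustration:

```python
# A character outside the BMP: one code point, but two UTF-16 code units.
p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"

code_points = len(p)                           # length is counted in code points
utf16_units = len(p.encode("utf-16-le")) // 2  # a surrogate pair when encoded as UTF-16
first = p[0]                                   # indexing returns the whole character

print(code_points, utf16_units, first == p)
```

On older "narrow" builds of Python 2, `len(p)` would have reported the two surrogates instead.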


For what reason other than perhaps font rendering code do you need to index code points in a string? Everything you think you need code points for (character counting, truncation, etc) you actually need grapheme clusters.
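To illustrate the gap between code points and what a user sees as one character, here is a rough sketch. The helper `approx_grapheme_count` is hypothetical and only folds combining marks into the preceding character; it is not the full UAX #29 grapheme cluster algorithm (it ignores emoji ZWJ sequences, Hangul jamo, and so on):

```python
import unicodedata

def approx_grapheme_count(text: str) -> int:
    """Count characters whose combining class is 0, so combining marks are
    folded into the preceding character. A rough approximation only."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

s = "e\u0301"        # 'e' + COMBINING ACUTE ACCENT, rendered as a single character

print(len(s))                    # code points
print(approx_grapheme_count(s))  # user-perceived characters (approximately)
print(s[:1])                     # truncating by code point drops the accent
```

Truncating or counting by code point therefore still gives the wrong answer for text like this, even with whole-code-point strings.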

I wish Python had gone with UTF-8 instead of its multi-width internal representation.


IMHO the problem is less one of principle and more about not regressing the large body of already-written code that assumes O(1) indexing.
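For a sense of why indexing stops being O(1) with a UTF-8 representation, here is a minimal sketch. `code_point_at` is a hypothetical helper, not anything from CPython, and it assumes the input is valid UTF-8:

```python
def code_point_at(data: bytes, index: int) -> str:
    """Return the index-th code point of valid UTF-8 bytes via a linear scan."""
    if index < 0:
        raise IndexError(index)
    seen = -1      # number of code points started so far, minus one
    start = None   # byte offset where the requested code point begins
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # not a continuation byte: a new code point starts here
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")

data = "aé𝒫z".encode("utf-8")
print(code_point_at(data, 2))
```

Every lookup has to walk the bytes from the start (or from some cached offset), which is the regression that existing `s[i]`-heavy code would hit.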


Most code works perfectly fine indexing bytes in a UTF-8 string. Anything that looks for stuff in the ASCII range such as parsers would not need to be changed.

Python 3 already required all code to be looked over because of the string literal change so it wouldn't be much different.
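A small sketch of the parser point: in UTF-8, every byte of a multi-byte sequence has its high bit set, so searching for an ASCII delimiter at the byte level can never match inside a non-ASCII character. The sample data is made up:

```python
# Split a UTF-8 encoded line on an ASCII comma without decoding it first.
line = "naïve,日本語,𝒫".encode("utf-8")

fields = line.split(b",")                      # byte-level split on an ASCII delimiter
decoded = [f.decode("utf-8") for f in fields]  # each field is still valid UTF-8

# Every byte of a non-ASCII character is >= 0x80, so it cannot collide with ASCII.
high_bytes_only = all(b >= 0x80 for b in "日本語".encode("utf-8"))

print(decoded, high_bytes_only)
```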


PHP now also has intl, which provides access to the gamut of functionality of libicu, so the PHP part of the deck needs a near-total rewrite.


Broken on mobile and hard to navigate on desktop. Here's an easier version: http://output.jsbin.com/qewonisinu/1


In my own experience, the biggest Unicode implementation failure was MySQL's choice of 3-byte, BMP-only UTF-8 instead of full 4-byte Unicode support.[1] In retrospect, the decision to go with a broken subset implementation has caused more trouble, confusion, and incompatibility than the claimed benefits in speed, performance, and simplicity were worth. A simple Google search shows that almost everyone recommends avoiding `utf8` and using `utf8mb4` instead. [2][3][4]

As a result, if your MySQL databases/tables/columns use the `utf8` charset instead of `utf8mb4` (a MySQL invention), you cannot store or retrieve emoji characters properly.

[1] https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8...

[2] https://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode...

[3] https://mathiasbynens.be/notes/mysql-utf8mb4

[4] https://www.drupal.org/node/1314214
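A quick way to see why emoji are affected: anything outside the BMP takes 4 bytes in UTF-8, which is one more than MySQL's `utf8` allows per character. The helper name `needs_utf8mb4` is hypothetical, just a sketch for checking strings before they hit the database:

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any code point lies outside the BMP (4 bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)

euro = "\u20ac"       # EURO SIGN, inside the BMP
grin = "\U0001F600"   # GRINNING FACE emoji, outside the BMP

print(len(euro.encode("utf-8")), needs_utf8mb4(euro))
print(len(grin.encode("utf-8")), needs_utf8mb4(grin))
```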


Perl 6 has some of the best Unicode support out there as default behaviour. It even supports all Unicode digits as numeric literals. So:

    1 + ൭ == 8
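For comparison, a small Python sketch: Python does not accept these digits in source-code literals, but `int()` and `unicodedata.digit()` do understand Unicode decimal digits in strings. The variable names are illustrative only:

```python
import unicodedata

malayalam_seven = "\u0d6d"  # MALAYALAM DIGIT SEVEN, the digit in the Perl 6 example
total = 1 + int(malayalam_seven)

# Digits from different scripts can even be mixed within one string.
mixed = int("\u0661\u0968")  # ARABIC-INDIC DIGIT ONE + DEVANAGARI DIGIT TWO

print(total, mixed, unicodedata.digit(malayalam_seven))
```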


Impressive. But I wonder if I want to see something like this in the code:

    ٦1٥٠3 + ٤६੬៩ - ৭۹੧


I wish this were updated. I thought I'd check some things in Python, and slide 37, about Python not treating characters as the smallest units, no longer applies. This seems to work:

  Python 3.5.0 (default, Sep 22 2015, 12:32:59) 
  [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.72)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import re
  >>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
  >>> print(g)
  ᾲ
  >>> print(re.search(r'\w', g))
  <_sre.SRE_Match object; span=(0, 1), match='ᾲ'>
  >>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
  >>> print(p)
  𝒫
  >>> print(re.search(r'\w', p))
  <_sre.SRE_Match object; span=(0, 1), match='𝒫'>
  >>> print(re.search(r'..', p))
  None
  >>> print(re.search(r'.', p))
  <_sre.SRE_Match object; span=(0, 1), match='𝒫'>
  >>>


Reads better if you install the fonts recommended at the end, Alfios and Symbola (http://users.teilar.gr/~g1951d/), and the font Everson Mono, available here: http://www.evertype.com/emono/


That presentation is evil. It doesn't behave as a web page but like some stupid Keynote document.


https://en.wikipedia.org/wiki/S5_%28file_format%29

You can also click on the Ø in the lower-right to remove all that and view it as a normal web page.


Can someone put 2011 in the title?


I guess the author must have been pretty happy when Swift came out, with all of its glorious Unicode support...


Website doesn't work on iPad


Given that the site was around before the iPad was a thing, isn't it more fair to say "iPad doesn't work with this site"?


> Given that the site was around before the iPad was a thing, isn't it more fair to say "iPad doesn't work with this site"?

No. Because the slides have been published to the World Wide Web, they should follow WWW standards, and then they could be viewed on any future device that follows those standards. Those are the expectations of the publishing platform, which this presentation doesn't meet.

The site instead chooses to break browser compatibility, and therefore many standard actions are impossible:

- Navigating back and forth between slides.

- Selecting text (such as the URL in the first slide, that I had to type in the address bar instead of being able to copy/paste it).

- Deep-linking a URL to any intermediate page (there's a workaround to this, but it's not obvious).

Apparently, its limitations also include not being able to see this simple content in any browser, like the native iOS one.


I very much dislike presentations that keep the slide history in the browser history.

Now, to go back to the original page you were on before you went to the site, you have to search through your history or click 100 times on the back button.

I also never found a use for deep-linking to an intermediate page.


>Now, to go back to the original page you were on before you went to the site, you have to search through your history or click 100 times on the back button.

Useful trick: if you long-press on the back button in most browsers, it shows a list that allows you to jump several pages back.

However, I was not asking for keeping the slide history in the browser history - the presentation doesn't even have back/forward buttons to get back to the previous slide while staying in the page.

> I also never found a use for deep-linking to an intermediate page.

It's used for referencing the content of the slide, so that people coming from an external site will see exactly the page that you're talking about.

I hate it when someone quotes some content in a slideshow and links you to the first page, forcing you to guess which part of the file contains the content they're talking about.



>- Navigating back and forth between slides.

That has nothing to do with browser compatibility; it simply changes the element visibility.

>- Selecting text (such as the URL in the first slide, that I had to type in the address bar instead of being able to copy/paste it).

Works just fine if you press ^C before releasing the mouse button.

>- Deep-linking a URL to any intermediate page (there's a workaround to this, but it's not obvious).

True but again unrelated to browser compatibility.

Although you're right, it's extremely annoying and as stupid as single page apps and infinite scrolling websites.


All these years I’ve been told that web is better than native because it’s cross platform and future proof.


Do you expect the website's author to read your comment here?

If not, what's the use? Mail him if it bothers you!


It's not uncommon for authors to read HN comments on their content. It also serves as a warning to fellow HN users.


This is a fallacy on all link-aggregator sites: "Post a comment: assume author reads". The author doesn't read unless he's also the poster, or until he gets the chance some days later, when the link is already off the front page.

If you want your comment to be heard by the author, contact the author!



