What a great, useful read... maybe its text density doesn't follow best practices for a slide deck, but something about dividing up all those code examples across slides made this the most engaging multi-language writeup about Unicode I've read. I always think I know how complicated it is, but I think the OP set me to a new level of realized ignorance.
Though I also like that my perceived difficulties with Unicode when moving from Ruby to Python are not just imaginary or out of ignorance, but seem to be actual differences/flaws in implementation. Also did not know about the `regex` module for Python, which aims to replace the standard `re` (and was just updated this week): http://pypi.python.org/pypi/regex
These slides are from 2011, so some things have changed for the better in the interim. For example, Python now uses a flexible internal string representation that ensures that characters in a string are always whole code points, addressing the "inherent brokenness" called out in the slides: https://www.python.org/dev/peps/pep-0393/
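A quick way to see the difference, assuming Python 3.3 or later (where PEP 393 applies); the variable names here are just for illustration:

```python
# U+1D4AB MATHEMATICAL SCRIPT CAPITAL P lies outside the BMP.
p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"

# With PEP 393 a str is a sequence of whole code points, so this
# astral character has length 1 and indexes as a single unit.
code_point_count = len(p)
first_code_point = ord(p[0])

# UTF-16 still needs a surrogate pair (two 16-bit units) to encode it.
utf16_units = len(p.encode("utf-16-le")) // 2

print(code_point_count, hex(first_code_point), utf16_units)  # 1 0x1d4ab 2
```

On an older "narrow" build the first number would have been 2, which is the brokenness the slides complain about.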
For what reason, other than perhaps font rendering code, do you need to index code points in a string? Everything you think you need code points for (character counting, truncation, etc.) you actually need grapheme clusters for.
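To make the code point vs. grapheme cluster gap concrete, here is a rough sketch. `approx_graphemes` is a hypothetical helper that only glues combining marks onto the preceding base character; it is not the full segmentation algorithm from UAX #29 and will get emoji ZWJ sequences, Hangul jamo, and flags wrong:

```python
import unicodedata

def approx_graphemes(text):
    """Group each base character with the combining marks that follow it.

    Approximation only: real grapheme segmentation follows UAX #29.
    """
    clusters = []
    for ch in text:
        # A nonzero canonical combining class means "attaches to the previous char".
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT, then "a"

print(len(s))                    # 3 code points
print(len(approx_graphemes(s)))  # 2 user-perceived characters
print(s[:1] == "e")              # True: code point truncation drops the accent
```

Truncating by code point index silently turns "é" into "e" here, which is exactly the kind of bug that indexing by cluster avoids.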
I wish Python had gone with UTF-8 instead of its multi-width internal representation.
Most code works perfectly fine indexing bytes in a UTF-8 string. Anything that looks for stuff in the ASCII range such as parsers would not need to be changed.
Python 3 already required all code to be looked over because of the string literal change, so it wouldn't be much different.
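A small sketch of the byte-indexing point, using Python `bytes` to stand in for a UTF-8 string. It relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so an ASCII delimiter can never show up inside a character. The `key=value` input format is made up for the example:

```python
# Hypothetical input: ASCII delimiters around non-ASCII values.
line = "name=José,city=Zürich,mood=😀".encode("utf-8")

fields = {}
for pair in line.split(b","):               # split on a raw ASCII byte
    key, _, value = pair.partition(b"=")    # still working on bytes
    fields[key.decode("ascii")] = value.decode("utf-8")

# No byte of a multi-byte sequence falls in the ASCII range.
non_ascii_bytes = "é😀".encode("utf-8")
all_high = all(b >= 0x80 for b in non_ascii_bytes)

print(fields)
print(all_high)  # True
```

The parser never decodes anything until it has already found the field boundaries, and the values still come out intact.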
In my own experience, the biggest Unicode implementation failure was MySQL's choice of a 3-byte, BMP-only UTF-8 instead of full 4-byte Unicode support.[1] In retrospect, the decision to ship a broken subset has caused more trouble, confusion, and incompatibility than the claimed benefits of speed, performance, and simplicity were worth. A simple Google search shows that almost everyone recommends avoiding `utf8` and using `utf8mb4` instead.[2][3][4]
As a result, if your MySQL databases/tables/columns use the `utf8` charset instead of `utf8mb4` (a MySQL invention), you cannot store or retrieve emoji characters properly.
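The byte arithmetic behind this can be checked outside MySQL. The sketch below is plain Python, not a MySQL API; `fits_in_3_byte_utf8` is a hypothetical helper that just tests whether every character is in the BMP, which is the range a 3-byte UTF-8 encoding can cover:

```python
def fits_in_3_byte_utf8(text):
    """True if every code point is in the BMP (U+0000..U+FFFF),
    i.e. encodes to at most 3 bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)

greek = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"  # U+1FB2
emoji = "\N{GRINNING FACE}"                                          # U+1F600

print(len(greek.encode("utf-8")), fits_in_3_byte_utf8(greek))  # 3 True
print(len(emoji.encode("utf-8")), fits_in_3_byte_utf8(emoji))  # 4 False
```

Anything that prints `False` here is a string that needs `utf8mb4`.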
I wish this was updated. I thought I'd check some things in Python, and slide 37, about Python not treating characters as the smallest units, does not apply anymore. This seems valid:
Python 3.5.0 (default, Sep 22 2015, 12:32:59)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.72)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
>>> print(g)
ᾲ
>>> print(re.search(r'\w', g))
<_sre.SRE_Match object; span=(0, 1), match='ᾲ'>
>>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
>>> print(p)
𝒫
>>> print(re.search(r'\w', p))
<_sre.SRE_Match object; span=(0, 1), match='𝒫'>
>>> print(re.search(r'..', p))
None
>>> print(re.search(r'.', p))
<_sre.SRE_Match object; span=(0, 1), match='𝒫'>
>>>
> Given that the site was around before the ipad was a thing, isn't it more fair to say "iPad doesn't work with this site"?
No. Because the slides have been published to the World Wide Web, they should follow web standards, and then they could be viewed on any future device that follows those standards. Those are the expectations of the publishing platform, which this presentation doesn't meet.
The site instead chooses to break browser compatibility, and therefore many standard actions are impossible:
- Navigating back and forth between slides.
- Selecting text (such as the URL in the first slide, that I had to type in the address bar instead of being able to copy/paste it).
- Deep-linking a URL to any intermediate page (there's a workaround to this, but it's not obvious).
Apparently, its limitations also include not being able to see this simple content in any browser, like the native iOS one.
I very much dislike presentations that keep the slide history in the browser history.
Now, to go back to the original page you were on before you went to the site, you have to search through your history or click the back button 100 times.
I also never found a use for deep-linking to an intermediate page.
> Now, to go back to the original page you were on before you went to the site, you have to search through your history or click the back button 100 times.
Useful trick: if you long-press on the back button in most browsers, it shows a list that allows you to jump several pages back.
However, I was not asking for keeping the slide history in the browser history - the presentation doesn't even have back/forward buttons to get back to the previous slide while staying in the page.
> I also never found a use for deep-linking to an intermediate page.
It's used for referencing the content of the slide, so that people coming from an external site will see exactly the page that you're talking about.
I hate it when someone quotes some content in a slideshow and links you to the first page, forcing you to guess which part of the file contains the content they're talking about.
This is a fallacy on all link-aggregator sites: "Post a comment: assume author reads". The author doesn't read unless he's also the poster, or when he gets the chance some days later when the link is already off the front page.
If you want your comment to be heard by the author, contact the author!