
I do not agree that restricting it to UTF-8 (or to Unicode in general) is a fair and reasonable design decision. UTF-8 may be a reasonable choice if Unicode is somehow required anyway, especially if the program is also expected to deal with ASCII (though you should avoid requiring Unicode if you can). Regardless of that, the number of code points is not usually relevant, and substring operations indexed by code points are not usually necessary either; the number of bytes is usually more important. Some programs should not need to know about the character encoding at all, or should need only limited consideration of what they do with it.
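To illustrate the last point, here is a minimal sketch of a program that never needs to know the encoding. The helper name `split_fields` and the tab-separated record layout are made up for this example. It assumes only that the separator byte cannot occur inside a multi-byte sequence, which holds for a tab in both UTF-8 and Shift-JIS.

```python
def split_fields(record: bytes, sep: bytes = b"\t") -> list:
    """Split a record on a separator byte without decoding it.

    Hypothetical helper: it works on raw bytes only, so it never has to
    be told which character encoding the fields are in.
    """
    return record.split(sep)


name = "日本語"
for encoding in ("utf-8", "shift_jis"):
    record = name.encode(encoding) + b"\t" + b"42"
    fields = split_fields(record)
    # The byte length of the first field differs between encodings,
    # but the split itself is the same, and no code points were counted.
    print(encoding, len(fields[0]), fields[1])
```

The only length the program touches here is the byte length, which is also what it would need for allocating buffers or writing the fields back out.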

(One reason you might care about the number of code points is that you are converting UTF-8 to UTF-32 (or Shift-JIS to TRON-32 or whatever else) and you want to allocate the memory ahead of time. The number of characters (which is not the same as the number of code points in the case of Unicode, although for other character sets it might be) is probably not important: if you want to display the text, you will care about the display width according to the font, and if you are doing editing, then where one character starts and ends is going to be more significant than how many characters there are. If you are counting and indexing by code points a lot (even though, as I say, that should not usually be necessary), then you might use UTF-32 instead of UTF-8.)
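A rough sketch of the preallocation case, in Python only for brevity. The function names `count_code_points` and `utf8_to_utf32` are made up for this example, and it assumes well-formed UTF-8 input and a little-endian UTF-32 output with no byte order mark. The count relies on the fact that every code point in well-formed UTF-8 has exactly one byte that is not a continuation byte (`10xxxxxx`).

```python
def count_code_points(utf8: bytes) -> int:
    """Count code points by counting the bytes that are not continuation bytes.

    Only meaningful for well-formed UTF-8; no validation is done here.
    """
    return sum(1 for b in utf8 if (b & 0xC0) != 0x80)


def utf8_to_utf32(utf8: bytes) -> bytearray:
    """Convert UTF-8 to little-endian UTF-32 using one up-front allocation."""
    n = count_code_points(utf8)
    out = bytearray(4 * n)  # 4 bytes per code point, sized before converting

    # decode() raises UnicodeDecodeError on malformed input, so the count
    # above is only relied upon for input that is actually valid.
    text = utf8.decode("utf-8")
    for i, ch in enumerate(text):
        out[4 * i:4 * i + 4] = ord(ch).to_bytes(4, "little")
    return out


sample = "aé€😀".encode("utf-8")
print(len(sample), count_code_points(sample), len(utf8_to_utf32(sample)))
```

Note that the count is only used to size the buffer. Nothing in the conversion indexes the UTF-8 input by code point.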

(It is also my opinion that Unicode is not a good character set.)




