Fun with Glibc and the Ctype.h Functions (rachelbythebay.com)
72 points by picture on Oct 1, 2021 | 49 comments


Here's what the C standard says about character handling functions:

> In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

So this is just a case of glibc being optimized in a way that's really unforgiving if you commit that particular UB.


No, this is a case of glibc trying to support localization of ctype in spite of the fact that it can't be localized to anything other than English in UTF-8 locales, anything other than Latin scripts in ISO-8859-* locales, or English in C/POSIX or EBCDIC locales. And then on top of that trying to be fast.

I'd give up on supporting localization for ctype.

This makes me think, too, "never use ctype, just hardcode my own that assumes ASCII".


I don't know about never, but try hard not to use anything whose behavior is influenced by a call to setlocale.


Is that even possible? This setlocale stuff is everywhere. Even basic functions like printf are affected. Reminds me of this incredible commit message:

https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...

I found it easier to get rid of libc and write freestanding C instead. Linux system calls have none of these problems. This locale bullshit is nowhere to be found. No global state anywhere. No thread-local errno. No stupid stuff like EOF. Writing C became fun again.


OpenBSD man page (https://man.openbsd.org/isalnum.3):

“OpenBSD always uses the C locale for these functions, ignoring the global locale, the thread-specific locale, and the locale argument.”

and https://man.openbsd.org/setlocale.3:

“On OpenBSD, the only useful value for the category is LC_CTYPE. It sets the locale used for character encoding, character classification, and case conversion. For compatibility with natural language support in packages(7), all other categories — LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME — can be set and retrieved, too, but their values are ignored by the OpenBSD C library. A category of LC_ALL sets the entire locale generically, which is strongly discouraged for security reasons in portable programs.”

Probably non-standard, but from what I read here, also a defensible choice.


That's interesting! Non-standard but it doesn't matter. That's the right thing to do no matter what some piece of paper says.


Although printf is affected, not every aspect of printf is affected.

IIRC, one nasty area is floating-point numbers, when you're in a locale that doesn't use . as the decimal point, but your program's requirements are such that . is the decimal point.

Like, oh, you're writing a programming language which specifies that.

Like, you know, C does for floating-point constants.
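One common workaround, sketched here with a hypothetical helper name (format_double_c is not a real API), is to format normally and then patch out whatever localeconv() reports as the decimal point:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a double and force '.' as the decimal
   separator regardless of the current LC_NUMERIC setting, by replacing
   whatever localeconv() reports as the decimal point. */
static void format_double_c(char *buf, size_t n, double x)
{
    snprintf(buf, n, "%g", x);
    const char *dp = localeconv()->decimal_point;
    if (strcmp(dp, ".") != 0) {
        char *p = strstr(buf, dp);
        if (p) {
            size_t dplen = strlen(dp);
            *p = '.';
            /* shift the tail left over the rest of the separator */
            memmove(p + 1, p + dplen, strlen(p + dplen) + 1);
        }
    }
}
```

This still calls into the locale machinery, of course; the only way to truly escape it is to write your own float formatter.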


> floating-point numbers, when you're in the locale that doesn't have . as the decimal point

I live in such a country. It's subtle but constant pain even for average users. Even the Windows calculator screws this up: it only accepts either . or , as input instead of both.

> Like, oh, you're writing a programming language which specifies that.

Yeah. The commit message I linked has one such example.

> This is still less bad than that time when libquivi fucked up OpenGL rendering

> calling a libquvi function would load some proxy abstraction library, which in turn loaded a KDE plugin

> which in turn called setlocale() because Qt does this

> made the mpv GLSL shader generation code emit "," instead of "." for numbers

> and of course only for users who had that KDE plugin installed, and lived in a part of the world where "." is not used as decimal separator

Just imagine debugging this insanity.


> Even the Windows calculator screws this up: it only accepts either . or , as input instead of both.

It does, however, always accept the . key on the numeric keypad, so many users won't really notice the discrepancy.


Yeah, as a Frenchman it bit me a lot of times. Now I just set LC_ALL=C.UTF-8; French translations of English software are shitty most of the time anyway, and in particular for dev tools.


If you had to take away just one thing from this glorious rant, it would be: globals are evil.


The man pages used to warn that isalnum, ispunct, etc. were only defined if isascii was true.

That seems to have disappeared - which supports your view.


Here are some branchless/constant-time versions of those functions that don't rely on locale: https://git.zx2c4.com/wireguard-tools/tree/src/ctype.h
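In the same spirit, a minimal locale-free ASCII classifier can be written with unsigned-subtraction range checks; my_isalnum below is a sketch of the technique, not the WireGuard code:

```c
/* Sketch of a locale-independent, ASCII-only isalnum: the unsigned
   subtraction wraps negative and out-of-range inputs to huge values,
   so each comparison is a single range check, and the bitwise OR
   avoids short-circuit branches. my_isalnum is a made-up name. */
static int my_isalnum(int c)
{
    unsigned int u = (unsigned int)c;
    /* (u | 32u) folds 'A'-'Z' onto 'a'-'z' in ASCII */
    return ((u - '0') < 10u) | (((u | 32u) - 'a') < 26u);
}
```

Unlike glibc's version, this returns 0 (rather than invoking UB) for any int you throw at it, including negatives.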


I like the suffix in 0x80001FU


This also applies to C++ <locale> functions, like std::isspace.

Another fun one: With FD_CLR, FD_ISSET, FD_SET you can corrupt memory by merely passing a socket descriptor that is not 0..1024. Pass a negative integer for some undefined behavior as well (shift by negative value occurs here [1])

[1] https://github.com/lattera/glibc/blob/895ef79e04a953cac14938...
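The FD_* macros index a fixed-size bit array with no bounds checking, so a defensive wrapper is cheap insurance; a sketch (safe_fd_set is a made-up name, not a standard API):

```c
#include <sys/select.h>

/* fd_set is a fixed-size bit array of FD_SETSIZE bits; the FD_* macros
   do no bounds checking, so guard before touching it. safe_fd_set is a
   hypothetical wrapper for illustration. */
static int safe_fd_set(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;          /* refuse out-of-range descriptors */
    FD_SET(fd, set);
    return 0;
}
```

(Or use poll/epoll/kqueue and avoid the fixed-size fd_set entirely.)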


IMHO the more interesting oddity about the functions declared in <ctype.h> is that they work with unsigned char, which means that they have undefined behavior if you pass a negative char value (other than EOF, which is typically -1).

This means that if you have a char value (say, an element of a string), you need to cast it to unsigned char before passing it to any of the is*() functions.
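A minimal sketch of that rule in practice; count_digits is a hypothetical helper, the point is the (unsigned char) cast:

```c
#include <ctype.h>

/* Each string element is plain char, which may be signed; cast to
   unsigned char before the ctype call so bytes above 0x7F don't
   become negative arguments (undefined behavior). */
static int count_digits(const char *s)
{
    int n = 0;
    for (; *s != '\0'; s++)
        if (isdigit((unsigned char)*s))
            n++;
    return n;
}
```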


The rant behind her post (https://drewdevault.com/2020/09/25/A-story-of-two-libcs.html ), which has had some circulation, really shows its author’s limited perspective.

glibc needs to solve two hard problems: be very fast and run on innumerable systems. Some of that conditional stuff is because all the world is not Linux or BSD; some of the macrology is there to make sure such handling is performed everywhere needed, and of course the preprocessor is the closest a language like C can get to metaprogramming.

I was in the code as glibc started to exist (we paid for a lot of it) and it looked like musl: very straightforward.


The definition of the ctype functions as working on unsigned char values and EOF + CHAR_BIT being 8 everywhere now basically means that there isn't much locale-specificity to the ctype functions: they can be made to work with ASCII, ISO-8859-*, and... EBCDIC, but not UTF-8 in general (just ASCII) or any Unicode encoding (idk, maybe they can be made to be locale-specific for Shift-JIS, but only for ASCII in Shift-JIS).

And... yes, glibc does have support for EBCDIC, which is probably ultimately why it has these run-time indirections in its ctype. There's no other reason to have run-time indirections for ctype functions given the limitation of unsigned char values + EOF. That means this code can be simplified a great deal.

Anyways, yes, Drew DeVault's rant misses glibc's need to support EBCDIC, but glibc is exactly like this for every little thing -- an unmaintainable mess. There has to be a better way to produce a fast C library w/o being such a mess on the inside.


They obviously can be made to work for Unicode; iswhatever(x) can just report the whatever property for code points U+0000 through U+00FF, its documented range. If you want to know about higher-valued code points, use iswwhatever, the wide version.


No not really. wchar_t is not Unicode. It's whatever the current locale's codeset and encoding demand. No, just say no to all of this.

Instead:

  - use UTF-8 locales
  - use a Unicode library for all things Unicode
  - make your own ctype for when it's Just ASCII


I don't know who you think said wchar_t is Unicode, but it wasn't me.

In a given locale, you can almost certainly regard wchar_t as being a continuation of the range of unsigned char. If you want to know whether the value UCHAR_MAX + 1 is alpha-numeric, you can't pass that to isalnum, but if that value is in the range of wint_t, you can pass it to iswalnum.

For values 0 to UCHAR_MAX, it would be surprising if isalnum and iswalnum produced different results.


But that is ASCII. That part of Unicode is a literal copy of ASCII. In any case, just putting an "if out of range, return 0" clause wouldn't hurt performance noticeably given all the indirection already present. If used in a loop, the CPU will predict that branch perfectly every time if your data is correct. There is no reason to just crash.


ASCII goes to 7F, not to FF; it is a 7 bit character code.

Therefore, for instance, isspace(0xA0) might usefully report true if we are in a Unicode locale, otherwise not.

The 0x80-0xFF values are also used in 8 bit extensions over ASCII, like the ISO 8859-1 and ISO 8859-15 character sets. E.g. 0xE0 is à in ISO 8859-1 (which is, of course, the same as the Unicode U+00E0 but logically distinct).

A totally different 8 bit extension is KOI-8.

The point is valid that if you don't support any "weird" extensions to ASCII (just ISO Latin) or non-ASCII 8 bit, then there isn't much of a need for run-time table indirection. The cases that may arise can be handled ad hoc, along the lines of "if we are in an ASCII locale, then report false above 7F, otherwise go through the combined Latin/Unicode table".


I do get what you're saying, but musl also has to live in many different worlds. Take the endianness handling glibc does in the post you linked, for example. Musl runs on a bunch of different big and little endian router boxes and other unusual use cases. While I haven't tested, I'm guessing that their much simpler isalnum() works fine on all of them.

Musl does have a lot less legacy to contend with, and musl is often much slower than glibc, so your point stands, of course.


musl's "isalpha" is trivially wrong, for instance it wouldn't support "ç" (0xe7) or "ß" (0xdf) in ISO 8859-1 which are both alphabetic characters which fit in an unsigned char.


Those both return 0 for isalpha() on glibc for me, with or without export LC_CTYPE=iso_8859_1

Is there some other setup I'd need to do to see it work in glibc?


Most likely you need to build the locale on your system (uncomment the relevant line in /etc/locale.gen and run sudo locale-gen).

here

  #include <ctype.h>
  #include <locale.h>
  #include <stdio.h>

  int main(void)
  {
    setlocale(LC_CTYPE, "fr_FR.iso88591");
    /* cast: 'ç' may be a negative char value where char is signed */
    if (isalpha((unsigned char)'ç'))
      printf("ok\n");
  }
prints ok (with the file in the correct encoding)


ctype is trivially non-localizable to locales with codesets larger than sizeof(unsigned char) anyways. Maybe the problem here is POSIX.


Oh yes, no code written in 2021 should use that mess. But glibc aims for some level of POSIX compatibility; hard to blame them for at least trying to make it work.


Hmm, well, I mean, if ctype can't work for any interesting non-ASCII (and non-EBCDIC) cases (no one should still be using ISO-8859 locales...)... maybe stop trying so hard?


isalpha() works with the "C" locale unless you first call setlocale().

For example, on my system isalpha(0xe7) is true if I first call setlocale(LC_ALL, "en_US.iso88591").


well, yes, in "normal" C programs you're supposed to fetch the locale from the user's env vars (with setlocale (LC_ALL, ""))


> While I haven't tested, I'm guessing that their much simpler isalnum() works fine on all of them.

isalnum works fine on both; it only veers off when you get into UB, which is UB.

If you define “works fine” as “gives correct answers even in UB” then musl's is completely broken, since it only gives correct answers for English in ASCII.


It can't give correct answers for anything other than English in UTF-8 locales.

It can't give correct answers for any non-Latin scripts in any locales.

The problem is ctype and POSIX.

Given that, making ctype only work for ASCII (and maybe EBCDIC if you're really unlucky, which glibc is) is basically sufficient.


> Given that, making ctype only work for ASCII (and maybe EBCDIC if you're really unlucky, which glibc is) is basically sufficient.

Of course, but then why complain that glibc doesn't work outside of ASCII? That is the point.


But that's the issue: glibc's ctype does work beyond ASCII: it works for ISO-8859-*. It's pointless because that's pretty much all it can work for.


Nothing in POSIX mandates that isalnum has to be implemented using UB. It is entirely permissible to return 0 for values out of range. Or -1. Or anything else. POSIX is not part of the C language, if a function's behaviour is not defined for a given input, that's not UB, that's a license for the implementation to do whatever is natural. Which can be crashing embarrassingly, but why would you do that.


> Nothing in POSIX mandates that isalnum has to be implemented using UB.

That is a misunderstanding on your part. Calling isalnum with out-of-range values is UB. Per-spec: "If the argument has any other value, the behavior is undefined.".

> It is entirely permissible to return 0 for values out of range. Or -1. Or anything else.

Of course it is, it's UB, there's nothing it can't do.

> POSIX is not part of the C language, if a function's behaviour is not defined for a given input, that's not UB

So a behaviour which is not defined is not an undefined behaviour. Sure. Whatever.

> that's a license for the implementation to do whatever is natural. Which can be crashing embarrassingly, but why would you do that.

Why wouldn't you? A table-based implementation is flexible, convenient and efficient, and you don't care what happens outside of the function's bounds.


> Okay, fine, my bad. My code is wrong. I apparently cannot just hand a UCS-32 codepoint to isalnum and expect it to tell me if it’s between 0x30-0x39, 0x41-0x5A, or 0x61-0x7A.

Yikes. If you have wide characters, you want iswalnum, or else preprocessing: (ch <= UCHAR_MAX) ? iswhatever(ch) : 0, assuming positive ch.
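The clamp pattern above, wrapped up as a function (safe_isalnum is an illustrative name, not a standard one):

```c
#include <ctype.h>
#include <limits.h>

/* Out-of-range code points simply classify as false instead of
   indexing past glibc's table; only 0..UCHAR_MAX (and EOF) are
   legal isalnum arguments. */
static int safe_isalnum(long ch)
{
    return (ch >= 0 && ch <= UCHAR_MAX) ? (isalnum((int)ch) != 0) : 0;
}
```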


There’s nothing more off-putting to me than a rant based on a wrong premise. All that hot air and insults and in the end he was just wrong, by his own admission.


> all the world is not Linux or BSD

since when does glibc run on bsd


Didn't glibc exist before Linux? Surely it would have been running on BSD then.


It did! The glibc 1.09 README lists a number of supported configurations, including a couple with BSD, and has no mention of Linux.

https://sourceware.org/git/?p=glibc.git;a=blob;f=README;h=b9...


Well for one, Debian GNU/kFreeBSD.


Looks like the array lookup isn't exclusive to glibc.

Illumos: https://github.com/illumos/illumos-gate/blob/9ecd05bdc59e4a1...

...although there is a "sensible" version at:

https://github.com/illumos/illumos-gate/blob/9ecd05bdc59e4a1...

FreeBSD: You have to chase it through "__sbistype" to "__sbmaskrune".

https://cgit.freebsd.org/src/tree/lib/libc/locale/isctype.c

https://cgit.freebsd.org/src/tree/include/_ctype.h


Illumos also has ridiculously huge (but sanely-behaving, thankfully) string functions, because somebody met jump tables once and thought they were cool.

https://github.com/illumos/illumos-gate/blob/master/usr/src/... e.g.


A table-based implementation, aside from the convenience of being declarative, is also natural if not outright necessary for locales support.


Ran it on 32-bit ARM, 64-bit ARM, 32-bit x86, and 64-bit x86. All had different results, but all were the same until index 549, which is greater than the maximum value for unsigned char (255).


What is supposed to be the problem? I can't see mentioned anywhere in the articles or comments what's actually meant to go wrong. Mine didn't crash (after 30 seconds or so) and I couldn't see a problem with the output.



