
Can you elaborate? Why not?


A character could be 1 byte long, in which case the language cannot properly handle Unicode; it could be 4 bytes long, in which case a lot of space is wasted storing text and extended grapheme clusters still aren't handled properly; or a character could be of arbitrary length, at which point strings no longer have a flat representation in memory. None of these are good. The exact properties of a string can really only be encoded efficiently with a flat, linearly accessed data type.


1-byte characters (i.e. what k's typically have) handle ASCII just fine, for which doing reversing/splitting/uppercase/lowercase/iteration/etc is actually meaningful (stock symbols, stringified dates, identifiers, etc).

And if you have to handle arbitrary-language user input, there are basically no operations you can/should actually do anyway. Uppercasing/lowercasing? Doesn't make sense for CJK languages. Reversing? Completely meaningless. Trimming to the first N chars for some visual display/summary/preview? Even grapheme clusters won't help you avoid a character with ten thousand combining components, and you'll have to do language-specific logic to avoid cutting in the middle of a word for languages where the display of a prefix of a word may change depending on later letters! And forget about spaces meaning anything.

Basically the only string ops I can think of that make sense for non-ASCII generally would be splitting/joining on newlines and escaping for JSON/HTML or whatever, which'll work completely fine on a byte list anyway.
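A quick sketch of the two claims above, in Python rather than k for concreteness: per-char operations are meaningful on ASCII byte lists, and splitting/joining on an ASCII delimiter stays safe even for UTF-8 bytes, because a multi-byte UTF-8 sequence never contains an ASCII byte.

```python
# ASCII data as a byte list: reversing and case conversion are meaningful.
ticker = b"AAPL"
assert ticker[::-1] == b"LPAA"
assert ticker.lower() == b"aapl"

# UTF-8 bytes: splitting/joining on newlines works fine, since 0x0A
# never occurs inside a multi-byte UTF-8 sequence.
utf8_text = "héllo\nwörld".encode("utf-8")
lines = utf8_text.split(b"\n")
assert len(lines) == 2
assert b"\n".join(lines) == utf8_text
```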

There's perhaps some middle-ground of doing things for a specific set of languages, but even for such you won't care about the storage format anyways, as what matters for you is just whether operations you use (presumably using some library; and even if you write a manual uppercase for French specifically or whatever, you'd notice if you implemented it wrongly) do the thing they should.

So a list of byte chars is just fine for anything one would actually do, providing optimal access to ASCII, and not actually making things worse for non-ASCII.


Not true at all! Extended grapheme clusters are defined by Unicode for a reason and include relevant combining marks following a letter[1]. The point more generally is that a programming language shouldn't preferentially choose one character definition over another. The decision of whether to iterate by bytes, points, or clusters is a significant one which the language shouldn't force upon users. For many common operations, bytes are a sufficient representation, but then one must be precise about encoding. A list of UTF-8 bytes is very easy to deal with, but the bytes of a UTF-16 string are highly problematic: inserting a single byte at the start of such a string would destroy its entire content. There is no situation where "give me the characters of this string" is a sufficiently precise statement, so it should not be made available by programming languages. Likewise, the idea of indexing a string is not well defined at all. The only consistent interface for accessing strings requires users to specify both encoding and separation, and this can only be done performantly in the general case with a linear scan.
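The UTF-16 fragility mentioned above can be sketched in a few lines of Python (used here only for illustration): prepending a single byte to UTF-8-encoded bytes leaves the rest of the string intact, while the same prepend to UTF-16 bytes shifts every subsequent code unit out of alignment.

```python
s = "hello"
u8 = s.encode("utf-8")
u16 = s.encode("utf-16-le")

# UTF-8: the rest of the string survives a one-byte prepend.
assert (b"!" + u8).decode("utf-8") == "!hello"

# UTF-16: every later 16-bit code unit is now misaligned, garbling
# the whole string (errors="replace" avoids an outright exception).
garbled = (b"!" + u16).decode("utf-16-le", errors="replace")
assert garbled != "!hello"
```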

[1] http://unicode.org/reports/tr29/


I think it's worth considering that application development and GUIs really aren't K's thing. For those, yes, you want to be pretty careful about the concept of a "character", but (as I understand it) in K you're more interested in analyzing numerical data, and string handling is just for slapping together a display or parsing user commands or something. So a method that lets the programmer use array operations that they already have to know instead of learning three different non-array ways to work with strings is practical. Remember potential users are already very concerned about the difficulty of learning the language!


I meant the combining mark point as a thing you would want to cut off; a 50-char chopped-off "summary" of a thing should not include a character with ten thousand combining marks, ever. Of course it'd be preferable to cut before rather than in the middle, but certainly not after, which is what you'd get by taking the first 50 extended grapheme clusters, with the 20000-byte glyph counting as one. Point being, you still just want to use a library that has properly thought out the question. And that applies to most (all?) sane fully-Unicode-aware operations.

Places where ASCII-only is a known expectation and there are meaningful per-char operations are plenty; that's what using a list of bytes provides. Indeed you'd probably want to use another abstraction if you have non-ASCII. And for such you could use something to do the form of iteration or operation you want just fine, even if the input/output is a list of byte-chars representing plain UTF-8.


Well in that case, the way you get a 50 char summary is by iterating grapheme clusters, then counting up to 50 points and discarding the broken cluster. It's quite trivial if the language exposes an interface for iterating both clusters and points, and without such an interface the problem is much harder to notice. Hence why the language shouldn't prefer clusters to points or points to clusters. It should expose all relevant representations without prejudice.
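The "count points, discard the broken cluster" idea can be sketched as follows. This is a minimal Python approximation: full UAX #29 cluster segmentation needs a library (e.g. the third-party `regex` module's `\X`), so here cluster boundaries are approximated with `unicodedata.combining`, which handles combining marks but not ZWJ emoji sequences and the like.

```python
import unicodedata

def truncate_points(s: str, limit: int = 50) -> str:
    """Keep at most `limit` code points, then discard any cluster
    broken at the cut. Approximation only: a cluster is taken to be
    a base character plus trailing combining marks."""
    cut = s[:limit]
    # If the next point is a combining mark, the last cluster was split:
    # strip its trailing marks, then drop the base character itself.
    if limit < len(s) and unicodedata.combining(s[limit]):
        while cut and unicodedata.combining(cut[-1]):
            cut = cut[:-1]
        cut = cut[:-1]
    return cut

# One base char with 60 combining marks: the whole glyph is discarded
# rather than contributing 50 points to the "summary".
s = "e" + "\u0301" * 60 + "abc"
assert truncate_points(s) == ""
assert truncate_points("abc") == "abc"
```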

Even if ASCII is appropriate in some situation, this should be stated within the program. Requiring people to be explicit about the data they produce and consume is important and useful. A user might decide that UTF-16 best serves their needs (or be working on the Windows platform), in which case code that works with strings as linear sequences will be able to operate on their strings without issue. Code which assumes a UTF-8 byte representation will require the entire string to be allocated and converted, then reallocated and converted back. Huge overhead and potential incompatibility for no reason.


> It's quite trivial if the language exposes an interface for iterating both clusters and points, and without such an interface the problem is much harder to notice

I assure you, 99% of people won't handle this correctly even if given a cluster-based interface (if they even bother using it). And this still doesn't handle the question of cutting words in the middle of some languages resulting in broken display of the non-cut part (or languages without space-based word boundaries to cut on). So the preferred thing is still to use a library.

I don't think anyone in k would use UTF-16 via a character list with 2 chars per code unit; an integer list would work much nicer for that (and most k interpreters should be capable of storing such with 16-bit ints; there's still some preference for UTF-8 char lists, not least because they get pretty-printed as strings); and you'd probably have to convert at some I/O boundary anyway. Never mind the world being basically all-in on UTF-8.
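For concreteness, here's what "UTF-16 as an integer list of code units" looks like, sketched in Python since I can't speak for any particular k implementation: each element is one 16-bit code unit, rather than two separate byte "chars".

```python
import struct

s = "héllo"
raw = s.encode("utf-16-le")            # 2 bytes per code unit
n = len(raw) // 2
units = list(struct.unpack(f"<{n}H", raw))  # flat list of 16-bit ints

assert units == [0x68, 0xE9, 0x6C, 0x6C, 0x6F]
assert all(u <= 0xFFFF for u in units)  # fits in 16-bit integers
```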

Even if you have a string type that's capable of being backed by either UTF-8 or UTF-16, you'll still need conversions between those at some points; you'd want the Windows API calls to have a "str.asNullTerminatedUTF16Bytes()" or whatnot (lest a UTF-8-encoded string make its way there), which you can trivially have an equivalent of for a byte list. And I highly doubt the overhead of conversion would matter anywhere you need a UTF-16-only Windows API.
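The byte-list equivalent of that hypothetical "asNullTerminatedUTF16Bytes()" helper (the name is the comment's invention, not a real API) is a one-liner; Python stands in for k here: decode the UTF-8 bytes, re-encode as UTF-16-LE (what the Windows "W" APIs expect), and append the 16-bit null terminator.

```python
def as_null_terminated_utf16(utf8_bytes: bytes) -> bytes:
    """Convert a UTF-8 byte list to null-terminated UTF-16-LE bytes,
    as one would pass to a wide-character Windows API."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le") + b"\x00\x00"

assert as_null_terminated_utf16(b"hi") == b"h\x00i\x00\x00\x00"
```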

I doubt all of those fancy operations you'll be doing will have optimized impls for all formats internally either, so there's internal conversions too. If anything, I'd imagine that having a unified internal representation would end up better, forcing the user to push the conversions to the I/O boundaries and allowing focus on optimizing for a single type, instead of going back-and-forth internally or wasting time on multiple impls.


Python defaults to UTF-8 for source and I/O. A Python string is iterable, and it is generally reasonable to describe any iterable as a vector (at least in terms of the API). The result of such iteration might not be a character in any formal sense, but it's a reasonable description nonetheless.
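To make that concrete: iterating a Python str yields code points (as 1-char strings), not bytes and not grapheme clusters, so the elements are indeed not "characters" in any formal sense.

```python
s = "e\u0301"   # 'e' plus combining acute: renders as one glyph

assert list(s) == ["e", "\u0301"]    # iteration yields two code points
assert len(s) == 2                    # length counts points, not glyphs
assert len(s.encode("utf-8")) == 3    # and the UTF-8 form is 3 bytes
```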

I'm really not seeing the issue here.



