Loving the progression here. Tomorrow, someone’s going to reduce the boot times of macOS by 90% by the same principle. A week from now, someone will prove P=NP because all the problems we thought were NP were just running strlen() on the whole input.
And maybe, in a decade or so, the man pages for these functions will list their algorithmic complexity! That was the most interesting takeaway from this article, for me at least. I have only seen one or two libraries that actually list this in their documentation.
It's easy to forget that the original C standards were largely codifying existing practice, during an era when using gets() [1] was existing practice. The world wasn't quite ready for Ada, I guess. Best-laid plans of mice and men, etc. etc.
Also, keep an eye out for "amortized" complexity. This does have a legitimately rigorous definition, but for latency-bound paths it can practically amount to "O(whatever), except for the particular invocations that are far, far worse under unspecified conditions".
It's also easy to forget that C was competing mainly with assembly, while C++ competed with managed languages. The early C programmer ethos, especially among library authors, was much more along the lines of "look at the generated object code if you want to know what it's doing" while modern practice leans more towards "read the documentation for complexity guarantees". I'm not saying that worse documentation leads to better programmers, but I'm not not saying that either. Practices change, standards change.
Good documentation and inspecting the compiled bytecode are both good ways of finding out about performance characteristics of certain features. The problem starts when people rely on assumptions ("sscanf should be fast because it's widely used") or performance folklore ("localizing every function you'll ever use makes your Lua code faster"), because those tend to either be completely wrong or lack very important context.
I live in js land, and the barrier between “folklore” and “documentation” is extremely thin. Especially since V8 may introduce changes at any time that affect performance characteristics of js.
I’d respond with “well if performance matters it shouldn’t be in js” except for all the shite being written in js these days, with js being the hammer that makes everything else look like a nail.
You can write very fast JS code. When carefully written, it can have Java-like performance [2]. It's just very hard in practice, since most of the ecosystem is optimized for developer productivity.
When performance matters, write your own code and carefully benchmark everything. You can see this working for TypeScript and VSCode [3].
It makes me chuckle when hash maps are stated to have O(1) insertion. Which is true with respect to the number of items in the map, assuming the map doesn't need resizing and there isn't a hash collision... but it's generally not true with respect to the key length. (I think in most implementations a single insert is worst-case O(l·n), where l is the length of the key and n is the number of inserted items, assuming the hash function is O(l) -- the _amortised_ cost would be O(l).)
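To make that concrete, a typical content-based string hash has to walk every byte of the key before a bucket can even be chosen. A minimal sketch (FNV-1a here, purely as an illustration, not any particular map's implementation):

    #include <stddef.h>
    #include <stdint.h>

    /* Hashing a key of length l costs O(l) before the "O(1)" insert
       even starts; the bucket would then be h % capacity. */
    static uint64_t fnv1a(const char *key, size_t len) {
        uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= (unsigned char)key[i];
            h *= 1099511628211ULL;              /* FNV-1a prime */
        }
        return h;
    }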
I wrote my own version of a part of a very popular Java scientific tool, and my version runs about 50 times faster. Their mistake? They had a hashCode() implementation on the objects they were using as keys for HashMaps that iterated through all of the voluminous content of that object. And there was no point - they could have used IdentityHashMaps instead with the same result. I pointed this out to them, and they still haven't fixed it.
I'm guessing GP means the complexity guarantee sidesteps the complexity of the hashing function. It probably doesn't matter all that much in the typical case -- I'm guessing 80-90% of hash map use is with very short strings.
And the analysis of hashmaps is not such a well-written guarantee -- as you resize, you need a bigger hash function output to reach all possible buckets. A bigger hash function output, assuming you have to keep the avalanche effect to keep output well-scrambled, requires more computations.
Short strings, long strings; they're going to use the same key length. Calculating the key may take longer for the long string, if you're basing the hash on the contents of the string[1], but the key won't end up being a different size. The md5 of a 3-byte string is 16 bytes and the md5 of a 40GB string is also 16 bytes.
[1] Not typical. e.g. Java takes the hash key of an object to be its address in memory, which doesn't require looking at the contents.
> Calculating the key may take longer for the long string
Right, that’s exactly what they are warning about.
> Not typical. e.g. Java takes the hash key of an object to be its address in memory
No, that’s just the base implementation in Object (and arguably it was a bad idea). All useful “value type” classes will override it with a real hash of the content, including String.
There are some cases in Java where you do want to use IDs instead of values as your map keys, but they’re rare.
> All useful “value type” classes will override it with a real hash of the content
Well, this is necessary for a lot of sensible things you'd want to do with non-numeric value types as hash keys...
> including String
...except String is something of an intermediate case. There are loads of use cases where what you're really using is a set of constant strings, not variables that contain arbitrary character data. In that case, you should intern the strings, resulting in non-"value type" keywords where the only thing you care about for equality is whether two keywords do or don't have the same machine address.
I don't actually know how Java handles this, but I had the vague idea that two equal String literals will in fact share their machine address. And String is specifically set up to accommodate this; Strings are immutable, so in theory it could easily be the case that any two equal Strings must share their machine address, even if you got them from user input.
Java does intern string literals and constants, but you can’t rely on reference equality unless you intern every string you create at runtime by formatting or decoding, and it isn’t specified whether that creates strong references that will never be GC’d.
Yes, Strings are immutable, so they only calculate their hashCode once, then cache it. However, you need to explicitly intern them with String.intern() if you want to avoid multiple copies of the same String.
> Strings are immutable, so in theory it could easily be the case that any two equal Strings must share their machine address, even if you got them from user input.
Hey, and now you have two problems: String hashing and finding all strings which are equal to each other in memory
Well, no, the whole point of this discussion is that solving the second problem means the first problem never comes up.
And this isn't exactly some exotic approach; how often do you think people write Hashes in Ruby where the keys they use are all symbols? It's so common that there's dedicated syntax for it.
It's as old as Lisp, but there's a reason symbols exist separately from strings - they're used differently. Strings are frequently transformed, symbols almost never are. Strings are frequently taken from end-user input, symbols very rarely. Strings are sometimes very large; symbol names are almost universally very short.
The problem is, interning is an expensive operation. It means adding to an ever-growing database of strings, but first checking whether the string is already there. You don't want to do that every time you change case or flip a letter in a string, or use it to access a hash table. I'm not saying it can't be done, but I honestly have no idea how to implement sane, generic, automatic interning of strings. I feel more comfortable having a symbol type, and control over turning strings into symbols.
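To make the cost concrete, every explicit intern() call has to do something like the following. This is a deliberately naive sketch with made-up names (real interners use a resizable hash set), but the "search the ever-growing table, insert if absent" shape is the same:

    #include <string.h>

    #define MAX_INTERNED 4096                      /* fixed cap, sketch only */
    static char  *interned[MAX_INTERNED];
    static size_t ninterned;

    const char *intern(const char *s) {
        for (size_t i = 0; i < ninterned; i++)     /* already interned? */
            if (strcmp(interned[i], s) == 0)
                return interned[i];                /* return the canonical copy */
        if (ninterned == MAX_INTERNED)
            return NULL;                           /* table full (sketch only) */
        return interned[ninterned++] = strdup(s);  /* strdup: POSIX/C23 */
    }

The payoff is that equality checks collapse to pointer comparisons afterwards -- intern(a) == intern(b) exactly when strcmp(a, b) == 0 -- but you pay that lookup on every intern.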
I definitely agree that uninterned strings are important. All I'm really trying to say down here is that there are many cases where you have a hash table which uses strings as keys (as an implementation detail), when (conceptually) it wants to be using symbols.
(And on a less fundamental level, the particular Java String class is less string-like and more symbol-like than most string types, and this appears to have been done intentionally.)
> Everything is O(1) if N is constant, including log(N), N^2, 2^N, N!, etc.
Not even close. 2^k is not O(1) by virtue of N being constant. Only 2^N.
This has been covered above. It is more common to consider the complexity of hash table operations in terms of the number of operations, or the size of the table; the size of the key is very often constant. These are different variables; the constant size of the key does not trivialize the complexity of inserting N items each with a constant key size.
Here, the relevant key is the output of the hash function though -- that's what you need to increase in order to ensure you can reach all buckets. And that (k) must increase with the table size. So it is not constant and depends on n (table size).
I remember a proof in CLRS which first developed a function that was bounded above by 5 for all conceivable input ("a very quickly-growing function and its very slowly-growing inverse"), and then substituted the constant 4 or 5 into a complexity calculation in place of that function, giving a result which was "only" correct for all conceivable input.
The same approach applies to key length requirements for hash tables with arbitrarily large backing stores. They do not grow as slowly as the CLRS log* function, but they grow so slowly that there are easily identifiable sharp limits on how large they can be -- an easy example is that a hash table cannot use more memory than the hardware offers no matter how the software is written. A backing store with 1TB of addressable bytes cannot need the key to be more than 40 bits long.
On a different note, by "table size" in my earlier comment I meant to refer to the number of entries in the table, not the capacity of the backing store. It seems like you might be using the same word for a different concept?
> The same approach applies to key length requirements for hash tables with arbitrarily large backing stores. They do not grow as slowly as the CLRS log* function, but they grow so slowly that there are easily identifiable sharp limits on how large they can be -- an easy example is that a hash table cannot use more memory than the hardware offers no matter how the software is written. A backing store with 1TB of addressable bytes cannot need the key to be more than 40 bits long.
So? That's still putting a bound on table size, which makes it in-practice constant, but doesn't make the algorithm O(1), because you can never get such a result by bounding n, for the reasons the GGP gave -- that's cheating.
Your complexity bound has to be written on the assumption that n (the number of elements to store in the hashtable) increases without bound. Assuming you will never use more than Y bytes of data is not valid.
> On a different note, by "table size" in my earlier comment I meant to refer to the number of entries in the table, not the capacity of the backing store. It seems like you might be using the same word for a different concept?
No, I was using table size exactly as you, to mean the number of elements stored. Is there a reason my comments only made sense under a different definition? If not, be charitable. (And avoid using obscure terms.)
> No, I was using table size exactly as you, to mean the number of elements stored. Is there a reason my comments only made sense under a different definition? If not, be charitable. (And avoid using obscure terms.)
I interpreted your comment to refer to the size of the backing store, because that is fundamentally what a hash key needs to be able to address.
I didn't mean to say that, if you were using it that way, you were doing anything wrong, only that there appeared to be a mismatch.
> I interpreted your comment to refer to the size of the backing store, because that is fundamentally what a hash key needs to be able to address.
Under the assumption (upthread) of constant resizing as elements are added, the distinction is irrelevant. The more elements you have in the table, the more elements you need to address, and the more possible outputs your hash function needs to have.
And the needed size of the backing store scales with the number of elements you want to store anyway.
> I didn't mean to say that, if you were using it that way, you were doing anything wrong, only that there appeared to be a mismatch.
Why bring up something like that if it doesn't translate into something relevant to the discussion e.g. to show my point to be in error?
Incidentally, the person replying to you in that thread incorrectly stated that comparison is O(logN) on the number of bits. The most common comparison function, lexicographic comparison, is actually O(1) average case given random inputs of arbitrary length.
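A sketch of why: the comparison bails out at the first differing position, and for uniformly random bytes that happens almost immediately (just over one byte examined on average), no matter how long the inputs are. Illustrative code, names made up:

    #include <stddef.h>

    /* Lexicographic compare: early exit at the first differing byte. */
    int lex_cmp(const unsigned char *a, const unsigned char *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (a[i] != b[i])
                return a[i] < b[i] ? -1 : 1;   /* for random data, usually i == 0 */
        return 0;                              /* equal up to length n */
    }

The worst case is of course still O(n), when the inputs share a long common prefix.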
But, isn't the key length a constant and we are back to O(1)? Ok, in theory you could exhaust all possible keys of a certain length and proceed with longer keys. It would give us what? O(ln(n))?
His point is, if you use Moby Dick as the key, it's going to take longer to hash that than a three letter string. Hashing isn't O(1) if the key has variable size.
...I fully plan to use "O(whatever)". Not sure for what.
But, yes. (Naive) quicksort's average complexity being O(n log n), but O(n^2) on already-sorted data, is all I ever needed to learn to take away that lesson. When sorting already-sorted data is worse than sorting randomized data, it's a quick realization that "amortized cost" = "read the fine print".
Or triple, or quadruple. Or even (IIRC) grow by 50% (though I'd need to sit down and do the actual math on that). But doubling a number is cheap and more conservative than quadrupling (the next "cheap" multiplier).
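For reference, the usual doubling scheme in sketch form (illustrative only; vec/vec_push are made-up names). Across n appends the copies total fewer than 2n element moves, so appends are O(1) amortized, but the unlucky append that triggers the resize pays O(n) by itself -- the fine print mentioned upthread:

    #include <stdlib.h>

    typedef struct { int *data; size_t len, cap; } vec;

    int vec_push(vec *v, int x) {                /* start from a zeroed vec */
        if (v->len == v->cap) {                  /* the occasional O(n) step */
            size_t ncap = v->cap ? v->cap * 2 : 8;   /* or *3/2 for 1.5x growth */
            int *p = realloc(v->data, ncap * sizeof *v->data);
            if (!p) return -1;
            v->data = p;
            v->cap = ncap;
        }
        v->data[v->len++] = x;                   /* the common O(1) step */
        return 0;
    }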
Also, already sorted data... in reverse order. If it's already sorted in the right order, quicksort takes linear time. This is an important difference - data you use might indeed often be appropriately sorted, but in practice will seldom be sorted in reverse order.
On the contrary: very common UI pattern to have a data grid that sorts by a particular column when you click the header, then reverses that sort order when you click the header again. So for a user to sort by date, descending, they click the header, causing an ascending sort, then click it again, causing a descending one.
Often such a grid will be quite well abstracted from its data source - it might be executing a remote query to return data in the new order every time - but I bet there are some examples out there that are backed by a local dataset and carry out an actual sort operation when you hit the header... and fall into a quicksort worst case if the user clicks the same header twice in a row.
Yes; random pivot selection is nlogn (unless you are very, very, statistically impossibly, unlucky. Or using very short arrays where it doesn't matter anyway).
But I'm pretty sure data sorted in either direction (i.e. 'reversed' or not, ascending or descending), with the pivot taken from either end, is O(n^2). It doesn't have to be reversed: everything still unsorted always ends up on one side of the pivot, and each recursive step does one fewer comparison than the step before, so you get (N-1) + (N-2) + ... + 1 comparisons regardless of which way the array is sorted -- N(N-1)/2 in total (Gauss's formula, starting at N-1 since that's the number of comparisons in the first step), which is O(N^2). There is no case where it's linear time, unless you first scan the array to select the first out-of-place element as the pivot (which may be a reasonable optimization, but can be applied regardless of which way the array is sorted).
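To make the degenerate case concrete, here's the naive variant in question in sketch form (Lomuto partition, last element as pivot; purely illustrative). On input sorted in either direction, one recursive call always gets n-1 elements:

    static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    /* Naive quicksort: Lomuto partition, last element as pivot. */
    void quicksort(int *a, int n) {
        if (n < 2) return;
        int pivot = a[n - 1], i = 0;
        for (int j = 0; j < n - 1; j++)        /* n - 1 comparisons per level */
            if (a[j] < pivot)
                swap_int(&a[i++], &a[j]);
        swap_int(&a[i], &a[n - 1]);            /* pivot into its final slot */
        quicksort(a, i);                       /* sorted input (either way):   */
        quicksort(a + i + 1, n - i - 1);       /* one of these gets n - 1 items */
    }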
In the standard there are things like "exactly N operations", but I'm not seeing anything for `istream`. There's an explanation of how things should work, and I imagine you can derive complexity from it, but I think `istream` is a bit special since you're talking about a wrapper for (potentially) an arbitrary input source.
> Note that some implementations of sscanf involve a call to strlen, which makes their runtime linear on the length of the entire string. This means that if sscanf is called in a loop to repeatedly parse values from the front of a string, your code might run in quadratic time
Good. I'm so happy they put it there. It's a little thing, but such little things - documenting corner cases - can have great benefits.
I have a bad memory for all but most frequently used standard library calls, so I regularly end up refreshing my memory from cppreference.com, and I tend to instinctively scan any notes/remarks sections, as there's often critical information there. So now I can be sure I'll be reminded of this the next time I need to use scanf family.
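For concreteness, the shape of code that the note warns about is roughly this (a sketch with made-up names, not any particular program's code):

    #include <stdio.h>

    /* Each sscanf call may run strlen() over the whole remaining buffer,
       so parsing k floats out of an n-byte string can cost O(k * n). */
    int parse_floats(const char *buf, float *out, int max) {
        int n = 0, used = 0;
        while (n < max && sscanf(buf, "%f%n", &out[n], &used) == 1) {
            buf += used;    /* advance past the parsed value...          */
            n++;            /* ...but the next call may still scan the   */
        }                   /*    entire rest of the string first.       */
        return n;
    }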
I don't know if it is required to, but there doesn't really seem to be an upper bound on what glibc's scanf will eat for a %f (e.g. a gigabyte of zeroes followed by "1.5" will still be parsed as 1.5), so for that implementation there certainly isn't a trivial upper bound on the amount of input read and processed for a %f, as you might expect there to be.
Yet another reason to not stringify floats. Just use hexfloats (but beware of C++ standard bugs involving >>) or binary.
Unfortunately "gigabytes of numerical data, but formatted as a text file" is commonplace. For some reason HDF5 is far less popular than it ought to be.
But why do strlen() at all? And why are all platforms (Linux, Windows, MacOS) seemingly doing that?
I think you're right that there is no upper bound, but it shouldn't be necessary to do a full strlen() if you're instead scanning incrementally. You could go char by char until the pattern '%f' is fulfilled and then return. That would solve the issue at its root -- and who knows how many programs would suddenly get faster...
I'm reading this source for the first time, but I guess that, to avoid breaking anything, one could introduce a new type of FILE* string buffer -- say in 'strops_incr.c' -- that works incrementally, reading one char at a time from the underlying string and skipping the strlen()...
It would be awesomely cool if GTA Online ended up loading faster under Wine than on Windows :-)
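In the meantime, callers can sidestep the whole thing with strtof/strtod, which consume characters incrementally and report exactly where they stopped -- no strlen() needed. A rough sketch (made-up names; separators other than whitespace would need to be skipped manually):

    #include <stdlib.h>

    int parse_floats(const char *buf, float *out, int max) {
        char *end;
        int n = 0;
        while (n < max) {
            float v = strtof(buf, &end);   /* stops right after the number */
            if (end == buf)
                break;                     /* no number found; done */
            out[n++] = v;
            buf = end;                     /* continue from where it stopped */
        }
        return n;
    }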
How would knowing that it's O(n) help you? It needs to read all the characters of the float. The problem is that it needlessly reads characters even after the float.
You're joking, but now I'm thinking about the XML we parse at work and the library we're using to do it. We parse a lot of it, but I've always had this vague feeling that it takes a bit too long (given the codebase is C++).
The XML library we use is rather well-known, so if someone found a bug like this there, I'd expect a general performance improvement across the board in the entire industry. The Efficient Market Hypothesis tells me it's unlikely the library has this problem, but then again, that's what I thought about AAA video games, and then the GTA Online thing came out.
Any sufficiently-complex library code likely has plenty of problems, often unavoidably so (e.g. trade-offs between best performance and edge cases). Whether they have been found or not is a function of many, many factors.
> Efficient Market Hypothesis
I've lived long enough to be very sceptical about that sort of thing. Markets tend to be efficient in aggregate, maybe, but on the single case they can fail quite brutally. Look at how "dramatic" bugs are overlooked even in critical pieces of infrastructure like openssl, for years and years; maybe it happens less for openssl than most equivalent libraries, but it still happens.
Also, once the "market" for standards moves on, network effects make it very hard to have any meaningful competition. I mean, who writes XML parsers nowadays? Whichever XML lib was winning when JSON "happened" is now likely to stay in control of that particular segment; and the likelihood that top developers will keep reviewing it, falls off a cliff. Sprinkle a bit of cargo-cultism on top, and "efficient markets" become almost a cruel joke.
There's a variant / corollary of the Efficient Market Hypothesis here, though.
Let's say the GP's XML library has The GTA Bug, i.e. it uses a quadratic-performance loop when parsing. The bug will go undiscovered until some consumer of the library a) sees enough performance impact to care, b) has the expertise to profile their application and find that the library is at fault, and c) reports the problem back to the library owner so that it can be fixed. That combination might be unlikely for any single consumer, but since only one consumer needs to have all those properties, the chance of the bug staying hidden shrinks as the number of library users grows.
It's possible. In an application I was in charge of, I personally cut the time spent reading a huge XML file at startup by at least 10x, by dropping the library dependency and writing custom code. Having a lot of experience with that kind of code and with performance issues, it was quite a fast change with no negative effects.
The backstory was simple: up to some point the amount of data stored was reasonably small. Then the amount grew significantly (a few orders of magnitude), and the startup times became very unpleasant.
There's a lot going on when loading huge XML files. As an example, don't forget all the possible Unicode conversions, and all the allocations of elements in the handling code that exist just to be discarded, etc.
I don't suggest everybody do it "just because", but if some specific use is known to have very specific assumptions, is on the hot path and really dominates (profile first!), and it's known that only a small subset of all XML possibilities will ever be used, then avoiding the heavy libraries can be justified. For example, in that specific case I knew the XML was practically always read and written only by the application, or by somebody who knew what they were doing -- not something that somebody random would regularly provide from the outside in some arbitrary form -- and I knew my change wouldn't break anything for years to come, because that part of the application was not the "hot spot" of expected future changes.
So it was a win-win. Immensely faster application startup, which is something that improved everybody's work, while preserving the "readability" of that file for the infrequent manual editing or control (and easy diff).
I'm reminded of a 2008 article, Why is D/Tango so fast at parsing XML? [0]
One of the main factors seems to be that a lot of XML parser libraries, even the high-profile ones, did a lot of unnecessary copy operations. D's language features made it easy and safe to avoid unnecessary copying.
If you have a lot of nesting in the XML, and it is formatted for human reading (i.e. indented), you may want to consider not doing that. We had a project where we were creating human-readable versions of the XML (mostly for developer convenience) and then parsing it. When we stopped adding all the extra white space the parsing speed increased a couple of orders of magnitude. (The downside was we no longer had built in coffee breaks in our development process.)
That's interesting. I can't think of a mechanism why this would give so much of a performance boost, though - rejecting extra whitespace should be just a matter of a simple forward scan against a small set of characters, shouldn't it?
(Or maybe in your case something was running strlen() a lot during parsing, and just the difference in file size explains the boost?)
What about parsing that XML upfront, serialising to some binary format (e.g. CBOR, maybe with nlohmann's JSON library, or Cap'n Proto) and shipping the binary file?
Would be cool if we could do that, but as things stand, enough various people want to occasionally look at these files, in environments where they can't just install specialized tooling and are using notepad.exe (or Notepad++ if it's already available), that we keep it text.
I like binary formats, but we can't afford the increased complexity around supporting a custom binary format, so I'm not pushing for changes here.
I did investigate replacing our pile of XML files with an SQLite database, which would give us a fast and efficient format and let us use existing SQLite database viewers, or hit the file with trivial scripts, so we'd have no complexity supporting a dedicated tool. However, the data model we use would need such a large overhaul (and related retraining) that we tabled this proposal for now.
Seriously the most I have laughed in like 6 months. Which probably says a lot more about me than this joke. I know that jokes aren't really welcome on HN, and I generally really like this policy. But just had to mention this was just ... what I needed to read right now.
IMO, while I really don't come to HN to find dial-a-joke, or joke-of-the-day, I think some humor is essential in modern life.
Since we're talking about Matt Keeter, you will find he has a great sense of humor if you read his website or interact with him. Some of his jokes are ROTFL funny, but subtle.
>> So many years ago when I first took over the iOS string functions, I found that like 30% of boot time in debug builds was in strstr. <<
>> Needless to say, we did not fix that issue by writing a more efficient strstr. Removed the parser and then removed strstr from the environment where it had been in use =) <<
> ...Tomorrow, someone’s going to reduce the boot times of macOS by 90% by the same principle.
My 2019 MacBook often pauses when I connect the charging cable. Sometimes it just seizes, requiring a hard bounce.
Clearly there's a contended lock buried deep. Something non-obvious.
I'm certain everything these days has dozens of hidden quadratics and contended locks.
Which is one of the reasons I'm excited for stuff like structured concurrency (Java's Project Loom) and retoolings like systemd becoming the norm.
Ages ago I worked on a kitchen-sink app that had a very finicky startup. Any breeze would break it. Much consternation from mgmt. Apparently if we only clapped louder, Tinkerbell would fly. I couldn't take it any more. LARPing as a bulldozer, I replumbed the innards, changing from something like initd to something more like systemd, with some lazy loading for good measure. Voila!
Back to GTA. The failure here is the product owner didn't specify a max load time, and then hold the team to it. Devs will mostly do the work that's expected of them. If load time wasn't measured (and managed), no one is going to bother with expunging sscanfs.
> My 2019 MacBook often pauses when I connect the charging cable. Sometimes it just seizes, requiring a hard bounce.
Yesterday my MBP kernel panicked because my keyboard was low on battery and the bluetooth connection kept dropping. There's something weird with MacOS where peripherals seem to really not be well isolated from the core OS runtime.
Oh peripherals on newer Macs are somehow very hit or miss. I have a very difficult time with external monitors, especially from sleep. My MBP 16" would just loop between initializing and failing to initialize, until I unplug, wait, and re-plug again. Or I have to press the `Extend` option instead of the `Mirror` option that I use. The older 2015 MBP would just connect fine.