> "Not surprisingly, [...] global counters [...] are poorly distributed as well."

This is true, strictly speaking.
I mean, yes, they are not uniformly distributed, but that was never the requirement. As the article itself states, the desired property is that "the values for distinct objects are more or less distinct". With a global counter, you get maximally distinct hash codes. More distinct than any of the other approaches (and not less than any user-implemented function), at least until 2^31 object allocations.
Yes, after 2^31 objects you will get repeated values, but that is trivially the case for any pattern of assigning 31 bit hash codes to distinct objects (and any of the pseudo-random approaches will get birthday-paradoxed much sooner, and much harder). The only case where this could matter is in an adversarial scenario where someone is trying to DoS your hash map with crafted inputs. But according to the article itself, it would take 4+ minutes (120 ns * 2^31) of only allocating objects for each global counter wrapping. If an adversary can reliably achieve that already, what's the point in slowing down a hash map by an epsilon every four minutes?
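To put a number on "much sooner": with uniformly random 31-bit hash codes the first repeat is expected after roughly sqrt(pi/2 * 2^31) ≈ 58,000 objects, not 2^31. A minimal sketch of that (class and method names are mine, not from the article):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class BirthdayDemo {
    // Count how many uniformly random 31-bit hash codes we can draw
    // before the first repeat.
    static int firstCollision(long seed) {
        Random rng = new Random(seed);
        Set<Integer> seen = new HashSet<>();
        int count = 0;
        while (seen.add(rng.nextInt() >>> 1)) { // add() is false on a repeat
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Theory predicts about sqrt(pi/2 * 2^31) ~= 58,000 draws;
        // a global counter would not repeat until 2^31 allocations.
        System.out.println("first collision after " + firstCollision(42) + " codes");
    }
}
```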
> the values for distinct objects are more or less distinct
I think the author's words understate the requirement of a good distribution of hash codes. As far as I understand, ideally the hash codes for different objects should be as distinct as practically possible, so that they are often put into separate buckets of a hash table.
Consecutively allocated objects will have almost all bits of their addresses equal.
One neat trick could be to use a linear congruential generator to iterate through the 32-bit integers without repetition, as long as you choose the right parameters. Fortunately, these are relatively simple to pick [1] when we're dealing with powers of two.
This would give you fairly easy-to-generate IDs which still have a period of 2^32 but where subsequent allocations have an ID that shares fewer bits on average with its predecessor.
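A sketch of what such a generator could look like. The constants here are the well-known Numerical Recipes pair; any pair satisfying the full-period (Hull–Dobell) conditions for modulus 2^32 would do, and Java's wrapping int arithmetic gives you the modulus for free:

```java
public class LcgIds {
    // Full-period conditions for modulus 2^32: C is odd and A - 1 is
    // divisible by 4. Java int overflow wraps mod 2^32 automatically.
    private static final int A = 1664525;
    private static final int C = 1013904223;
    private int state = 0;

    // Returns the next ID; the sequence visits every 32-bit value
    // exactly once before repeating.
    int nextId() {
        state = state * A + C;
        return state;
    }

    public static void main(String[] args) {
        LcgIds gen = new LcgIds();
        int prev = gen.nextId();
        long differingBits = 0;
        for (int i = 0; i < 1000; i++) {
            int id = gen.nextId();
            differingBits += Integer.bitCount(prev ^ id);
            prev = id;
        }
        // Consecutive IDs differ in roughly half of their 32 bits, unlike
        // consecutive counter values, which usually differ in only a few.
        System.out.println("avg differing bits: " + differingBits / 1000.0);
    }
}
```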
I wouldn't trust any hash table where a hash function yielding consecutive integers would inherently lead to bad behavior. Can you tell me a popular hash-to-bucket reduction that performs badly for this case?
I will concede that there are plenty of schemes where insufficient entropy in lower bits causes problems. Combining those with the global counter hash and e.g. only inserting every 64th allocated object could be a failure case indeed. But this is still simple to defend against in the reduction scheme.
I think it's about the naive "rep_array[hashcode(obj) % bucket_size] = object", which is all too common.
If your rep_array is doing linear probing and/or Robin Hood hashing, then incremental hashcodes (such as 1, 2, 3, 4, 5...) are a bad thing. Especially if you're doing both inserts and removals: this sort of incremental pattern would lead to many "runs" where linear probing would perform poorly.
Of course, it isn't very hard to do rep_array[(hashcode(obj) * large_constant_odd_number) % bucket_size] instead and get a good distribution. But the question is whether or not people know about those kinds of steps.
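A sketch of that kind of step, using the stride-64 failure case from above. One detail worth hedging: with a power-of-two bucket count, multiplying and then keeping the low bits does not rescue this pattern (a multiple of 64 times anything is still a multiple of 64), so the variant shown here, often called Fibonacci hashing, takes the top bits after the multiply instead. Names are mine, not the commenter's:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntUnaryOperator;

public class BucketSpread {
    static final int BUCKETS = 64;       // power-of-two table size
    static final int PHI = 0x9E3779B9;   // ~2^32 / golden ratio, odd

    // Naive reduction: keep the low bits. Equivalent to h % BUCKETS
    // for non-negative h when BUCKETS is a power of two.
    static int naiveBucket(int h) {
        return h & (BUCKETS - 1);
    }

    // Fibonacci hashing: multiply by a large odd constant, then take
    // the TOP log2(BUCKETS) bits, where the multiply actually mixes.
    static int fibBucket(int h) {
        return (h * PHI) >>> (32 - 6);   // 6 = log2(64)
    }

    // Insert only every 64th counter value and count buckets touched.
    static int distinctBuckets(IntUnaryOperator reduce) {
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            used.add(reduce.applyAsInt(i * 64));
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println("naive: " + distinctBuckets(BucketSpread::naiveBucket)
                + " buckets, fibonacci: " + distinctBuckets(BucketSpread::fibBucket));
    }
}
```

With the naive reduction, every stride-64 key lands in bucket 0; the multiply-and-shift variant spreads the same keys across most of the table.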
Does this also happen in practice when the crafted inputs arrive at minimum 4 minutes apart, on the precondition that those minutes are filled with allocation spinning finely controlled by the attacker?
People are obsessed with that SipHash paper, and with the fact that Java hashcodes do not implement the SipHash mitigation, but I don't know if any of it matters. People are obsessed with thinking it matters, though.