I think it was a mistake to put hashCode (and to an extent, equals) on all objects by default. It bloats objects, and because of it, even classes that will never be put in a hash table end up providing their own hashCode implementations, causing problems like the one mentioned in the article. There should have been a separate, dedicated "Hashable" interface. Strangely, C# shares the same caveat. Is there any benefit to making objects hashable by default?
I don't know about bloat per se, but I think putting `equals` on all objects is a design flaw. This implicitly assumes that every object has a single implementation that works for every case, which is frequently false. Some objects have no semantically meaningful concept of "equals". Others have situation-specific definitions.
A better design (which C# has taken some steps towards) is to define interfaces for Equatable and Hashable and require objects to implement methods that return them. Then a class can return them or not, or have multiple implementations of each if needed. Or users of the objects can easily define their own custom implementations.
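A rough Java sketch of what the comment is proposing (all names here are hypothetical, not a real API): equality and hashing live in separate strategy objects that a class can expose, instead of in a universal equals/hashCode.

```java
import java.util.Objects;

// Hypothetical strategy interfaces: equality and hashing are opt-in,
// supplied by the class (or by users of the class) rather than inherited
// from a universal Object.equals/Object.hashCode.
interface Equater<T> {
    boolean areEqual(T a, T b);
}

interface Hasher<T> {
    int hash(T value);
}

class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    // One strategy: structural (by-value) equality and hashing.
    static final Equater<Point> BY_VALUE =
        (a, b) -> a.x == b.x && a.y == b.y;
    static final Hasher<Point> VALUE_HASH =
        p -> Objects.hash(p.x, p.y);

    // A second strategy for a different situation: reference identity.
    static final Equater<Point> BY_IDENTITY = (a, b) -> a == b;
}

class EquaterDemo {
    public static void main(String[] args) {
        Point a = new Point(1, 2), b = new Point(1, 2);
        System.out.println(Point.BY_VALUE.areEqual(a, b));    // true
        System.out.println(Point.BY_IDENTITY.areEqual(a, b)); // false
    }
}
```

A class with no meaningful notion of equality would simply expose no strategies, and a caller with a situation-specific definition can pass its own `Equater` without touching the class.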
The vast majority do, though. This was a design decision: be useful 85% of the time and allow graceful degradation in the edge cases (you can always just return false from equals() unconditionally).
> Equatable and Hashable ... have multiple implementations of each if needed
I don't believe this would fix the problem, because the reason you'd want multiple implementations is that you'd want to use different ones under different circumstances, which may only be determined at runtime anyway.
It's definitely convenient and served Java well in the sense that it always was a fairly easy language to get started with. Other languages make similar choices for the same reason; it trades off good enough against perfect. It wasn't a mistake but a deliberate choice. Also important to realize that this choice was made in the early nineties, 30 years ago.
But you are right that it's not free. Recent and upcoming changes to Java and the JVM are providing solutions in the form of e.g. value classes, records, etc.
Additionally, HotSpot does lots of clever stuff under the hood where it makes sense. Finally, if you know what you are doing, Java provides plenty of ways to optimize things.
Moreover, the Hashable interface really needs to take a `HashState` object that can be fed blocks of memory. Then you could plug in arbitrary algorithms rather than having each object hard-coded to use some half-baked implementation that's usually wrong, and definitely wrong for anything that needs cryptography/DoS protection in the hash table. The reason is that Hash(m1 + m2 + m3) is what you want: it's computationally optimal and lets you plug in cryptographic hashing algorithms, whereas the scheme Java bakes into the language forces you to write Hash(Hash(m1) + Hash(m2) + Hash(m3)).
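A minimal sketch of that `HashState` idea in Java (names and interfaces are hypothetical, not an existing API): objects stream their fields into a pluggable state object, so the whole value is hashed in one pass and the algorithm (FNV, SipHash, a keyed DoS-resistant hash, ...) is swappable.

```java
// Hypothetical streaming-hash API: Hash(m1 || m2 || m3) in one pass,
// with the algorithm chosen by whoever constructs the HashState.
interface HashState {
    void update(int value);
    long finish();
}

interface Hashable {
    void feedInto(HashState state);
}

// One possible algorithm behind the interface: 64-bit FNV-1a.
// A crypto/keyed hash could implement HashState just as well.
class Fnv1aState implements HashState {
    private long h = 0xcbf29ce484222325L;
    private void step(byte b) { h ^= (b & 0xff); h *= 0x100000001b3L; }
    public void update(int v) {
        for (int i = 0; i < 4; i++) step((byte) (v >>> (8 * i)));
    }
    public long finish() { return h; }
}

class Pair implements Hashable {
    final int a, b;
    Pair(int a, int b) { this.a = a; this.b = b; }
    // The object only describes its fields; it never picks the algorithm.
    public void feedInto(HashState s) { s.update(a); s.update(b); }
}

class HashStateDemo {
    public static void main(String[] args) {
        HashState s1 = new Fnv1aState(), s2 = new Fnv1aState();
        new Pair(1, 2).feedInto(s1);
        new Pair(1, 2).feedInto(s2);
        System.out.println(s1.finish() == s2.finish()); // true
    }
}
```

Contrast with Java's built-in contract, where `Pair` would have to collapse each field to an `int` and combine those 32-bit intermediate hashes, which both loses entropy and rules out keyed algorithms.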
Given the algorithm used, I don't really understand what bloat you are referring to. Compared to assembly? I guess maybe. But this is a managed language with near-C++ performance. Small price to pay.
Also, there's utility in hashCode/equals far beyond the collections framework: having an identity in a toString(), for instance, is of great value.
Yes, seems it was just for hash tables. From the JDK 1.0 doc:
"""
Returns a hash code value for the object. This method is supported for the benefit of hashtables such as those provided by java.util.Hashtable.

The general contract of hashCode is:

- Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer. This integer need not remain consistent from one execution of an application to another execution of the same application.
- If two objects are equal according to the equals method, then calling the hashCode method on each of the two objects must produce the same integer result.
"""
hashCode is also related to object comparison (see the last line) and this was 2 Java releases before `Comparable` existed.
I'm not sure what you mean by bloat. Adding methods doesn't really affect individual object size, just class size. And actual code weight rarely impacts anything in practice.
In the case of Java/C#, hashCode and equals are optional to override anyway. So if you don't override them, there's no extra space consumed.
The storage of the identity hash code takes ~4 bytes in the object header, and that is independent of whether the method is overridden or not. It's explained in the article.
But assuming the other bits in those 4 bytes are required and there isn't some other place where you could sneak them in, those ~4 bytes would still be allocated (for those other bits) and would only take less memory under some compression scheme.
And chances are that even if you did get by without that bit (or found another place for it), in many cases you'd spend far more memory on workarounds for stuff where the creator was too stingy to implement "IHashable" but you actually need it.
Well, in Virgil, a class without a superclass is a new root class, with no assumed methods. To use an object of that class as a key in a hashtable, one provides a hash function to the hashtable constructor. (The hash function can be a class method or a top-level function; with generics, a hashtable key can be any type, including primitives.) Of course it depends on how you write your code, but most of my objects never end up as keys in a hashtable, so they need neither a hash function nor storage for an identity hashcode.
So the smallest object header for a Virgil object is 4 bytes: it's just a type ID that is used for dynamic casts and as an index into a table for GC scanning. Arrays have two 4 byte header fields: the type ID and a 32-bit length. That's considerably more memory efficient than 2 or 3 64-bit header words for Java objects and arrays, respectively.
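The constructor-supplied hash function described above can be approximated in Java (this is a toy sketch with hypothetical names, not Virgil's actual library): the table owns the hashing and equality strategies, so key classes carry nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.ToIntFunction;

// Toy separate-chaining table whose hash and equality functions are
// constructor arguments, so keys need no built-in hashCode/equals.
class ExplicitHashTable<K, V> {
    private final ToIntFunction<K> hashFn;
    private final BiPredicate<K, K> eq;
    private final List<List<Object[]>> buckets = new ArrayList<>();

    ExplicitHashTable(ToIntFunction<K> hashFn, BiPredicate<K, K> eq) {
        this.hashFn = hashFn;
        this.eq = eq;
        for (int i = 0; i < 16; i++) buckets.add(new ArrayList<>());
    }

    private List<Object[]> bucket(K key) {
        return buckets.get(Math.floorMod(hashFn.applyAsInt(key), buckets.size()));
    }

    @SuppressWarnings("unchecked")
    V get(K key) {
        for (Object[] entry : bucket(key))
            if (eq.test((K) entry[0], key)) return (V) entry[1];
        return null;
    }

    @SuppressWarnings("unchecked")
    void put(K key, V value) {
        List<Object[]> b = bucket(key);
        for (Object[] entry : b)
            if (eq.test((K) entry[0], key)) { entry[1] = value; return; }
        b.add(new Object[] { key, value });
    }
}

class ExplicitHashTableDemo {
    public static void main(String[] args) {
        // Deliberately weak hash (string length) to show the strategy is
        // caller-chosen; "one" and "two" collide and still work.
        ExplicitHashTable<String, Integer> t =
            new ExplicitHashTable<>(s -> s.length(), String::equals);
        t.put("one", 1);
        t.put("two", 2);
        System.out.println(t.get("one")); // 1
    }
}
```

In a language designed around this, objects that never act as hash keys pay no header space for an identity hash, which is the point the comment is making about Virgil's 4-byte headers.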
You generally need to reserve at least a pointer-sized block of memory in a Java object's header because any object can be used as a lock with the synchronized keyword. Very few objects are ever used as a lock, so JVMs reuse those bytes for other things, like the hash.
Of course, that does imply the synchronized keyword has a fairly substantial memory cost.
I feel that if your application is that sensitive to this level of "weight", you may be better off either using something lower level (like C?) or using direct byte array buffers and encoding your data that way rather than as objects.
Not very many applications have this sort of requirement, tbh. Real-time applications and, perhaps, huge-scale data applications.
I agree. The built-in equals/hashCode is never what I want to use. Every time the built-in version was used in my code, it was by mistake and had to be fixed by providing a proper implementation. This functionality is a source of subtle bugs.
I believe C# shares a few caveats with Java because they wanted to make it easy to port Java codebases to it. Making objects hashable by default and having arrays be covariant, despite early plans to support generics, both aid in that.
Did you read the article? The JVM reserves ~4 bytes in the object header of every object to store the identity hash code, regardless of whether you override .hashCode() or not.
If you have an object with only one field, it still consumes at least 3 words of memory: 2 for the header and 1 for the field. That's a 200% memory overhead. Of course that's extreme, but I've seen estimates that as much as 40% of the heap in large Java applications is just object headers. You probably need at least 1 header word in any scheme, but 2 is wasteful IMHO.
There's a way to save memory by sacrificing .hashCode() runtime performance (on the assumption that most objects never get hashed): keep a separate lookup table for hash codes, at the cost of an indirect lookup on every .hashCode() call. I don't recall which VM used that technique. There are also fun tricks like rewriting/optimizing the object header during a copying/compacting garbage collection.
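The side-table idea can be illustrated at the user level in Java (a conceptual sketch only: a real VM would key the table on object addresses and fix it up after the GC moves objects, whereas `IdentityHashMap` here merely stands in for "identity-keyed lookup"):

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Sketch of lazily assigned identity hash codes kept outside the object.
// Only objects that are actually hashed ever get an entry, so most objects
// pay no per-object header space for a hash code -- but every lookup is an
// extra indirection, and the table itself costs memory in the worst case.
class HashCodeSideTable {
    private final Map<Object, Integer> codes = new IdentityHashMap<>();
    private int next = 1;

    synchronized int identityHash(Object o) {
        return codes.computeIfAbsent(o, k -> next++); // stable once assigned
    }
}

class SideTableDemo {
    public static void main(String[] args) {
        HashCodeSideTable table = new HashCodeSideTable();
        Object a = new Object(), b = new Object();
        int ha = table.identityHash(a);
        System.out.println(ha == table.identityHash(a)); // stable: true
        System.out.println(ha == table.identityHash(b)); // distinct: false
    }
}
```

This makes the trade-off concrete: header bytes are saved on the (common) never-hashed objects, at the price of slower .hashCode() and unbounded table growth if many objects do get hashed, which is the worst case the comment below mentions.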
Yeah, I know that trick. I implemented an identity hashmap[1] inside of V8 that uses that, because JS objects don't have an identity hash code[2]. It uses the object address as the basis of the hash code, which means the table needs to be reorganized when the GC moves objects, which is done with a post-GC hook. The identity maps in V8 are generally short-lived.
I am not aware of production JVMs that use this, because in the worst case they can use a lot more memory if a lot of objects end up needing hashcodes.