I think it was a mistake to put hashCode (and to an extent, equals) on all objects by default. It bloats objects, and because of it, even classes that will never be put in a hash table end up providing their own hashCode implementations, causing problems like the one mentioned in the article. There should have been a separate, dedicated "Hashable" interface. Strangely, C# shares the same caveat. Is there any benefit to making objects hashable by default?
I don't know about bloat per se, but I think putting `equals` on all objects is a design flaw. This implicitly assumes that every object has a single implementation that works for every case, which is frequently false. Some objects have no semantically meaningful concept of "equals". Others have situation-specific definitions.
A better design (which C# has taken some steps towards) is to define interfaces for Equatable and Hashable and require objects to implement methods that return them. Then a class can return them or not, or have multiple implementations of each if needed. Or users of the objects can easily define their own custom implementations.
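A rough Java sketch of what the comment is proposing (all names here are hypothetical, not a real API): equality and hashing live in separate strategy objects that a class can expose, instead of in a universal equals/hashCode.

```java
import java.util.Objects;

// Hypothetical strategy interfaces: equality and hashing are opt-in,
// supplied by the class (or by users of the class) rather than inherited
// from a universal Object.equals/Object.hashCode.
interface Equater<T> {
    boolean areEqual(T a, T b);
}

interface Hasher<T> {
    int hash(T value);
}

class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    // One strategy: structural (by-value) equality and hashing.
    static final Equater<Point> BY_VALUE =
        (a, b) -> a.x == b.x && a.y == b.y;
    static final Hasher<Point> VALUE_HASH =
        p -> Objects.hash(p.x, p.y);

    // A second strategy for a different situation: reference identity.
    static final Equater<Point> BY_IDENTITY = (a, b) -> a == b;
}

class EquaterDemo {
    public static void main(String[] args) {
        Point a = new Point(1, 2), b = new Point(1, 2);
        System.out.println(Point.BY_VALUE.areEqual(a, b));    // true
        System.out.println(Point.BY_IDENTITY.areEqual(a, b)); // false
    }
}
```

A class with no meaningful notion of equality would simply expose no strategies, and a caller with a situation-specific definition can pass its own `Equater` without touching the class.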
The vast majority do, though. This was a design decision: be useful 85% of the time and allow graceful degradation in the edge cases (you can always just return false from equals() unconditionally).
> Equatable and Hashable ... have multiple implementations of each if needed
I don't believe this would fix the problem, because the reason you'd want multiple implementations is that you'd want to use different ones under different circumstances, which may only be determined at runtime anyway.
It's definitely convenient and served Java well in the sense that it always was a fairly easy language to get started with. Other languages make similar choices for the same reason; it trades off good enough against perfect. It wasn't a mistake but a deliberate choice. Also important to realize that this choice was made in the early nineties, 30 years ago.
But you are right that it's not free. Recent and upcoming changes to Java and the JVM are providing solutions in the form of e.g. value classes, records, etc.
Additionally, HotSpot does lots of clever stuff under the hood where it makes sense. Finally, if you know what you are doing, Java provides plenty of ways to optimize things.
Moreover, the Hashable interface really needs to take a `HashState` object that can be fed blocks of memory. Then you could plug in arbitrary algorithms rather than having each object hard-coded to use some half-baked implementation that's usually wrong, and definitely wrong for anything that needs cryptography/DoS protection in the hash table. The reason is that Hash(m1 + m2 + m3) is what you want: it's computationally optimal and lets you plug in cryptographic hashing algorithms, whereas the scheme Java bakes into the language forces you to write Hash(Hash(m1) + Hash(m2) + Hash(m3)).
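A minimal sketch of that `HashState` idea in Java (names and interfaces are hypothetical, not an existing API): objects stream their fields into a pluggable state object, so the whole value is hashed in one pass and the algorithm (FNV, SipHash, a keyed DoS-resistant hash, ...) is swappable.

```java
// Hypothetical streaming-hash API: Hash(m1 || m2 || m3) in one pass,
// with the algorithm chosen by whoever constructs the HashState.
interface HashState {
    void update(int value);
    long finish();
}

interface Hashable {
    void feedInto(HashState state);
}

// One possible algorithm behind the interface: 64-bit FNV-1a.
// A crypto/keyed hash could implement HashState just as well.
class Fnv1aState implements HashState {
    private long h = 0xcbf29ce484222325L;
    private void step(byte b) { h ^= (b & 0xff); h *= 0x100000001b3L; }
    public void update(int v) {
        for (int i = 0; i < 4; i++) step((byte) (v >>> (8 * i)));
    }
    public long finish() { return h; }
}

class Pair implements Hashable {
    final int a, b;
    Pair(int a, int b) { this.a = a; this.b = b; }
    // The object only describes its fields; it never picks the algorithm.
    public void feedInto(HashState s) { s.update(a); s.update(b); }
}

class HashStateDemo {
    public static void main(String[] args) {
        HashState s1 = new Fnv1aState(), s2 = new Fnv1aState();
        new Pair(1, 2).feedInto(s1);
        new Pair(1, 2).feedInto(s2);
        System.out.println(s1.finish() == s2.finish()); // true
    }
}
```

Contrast with Java's built-in contract, where `Pair` would have to collapse each field to an `int` and combine those 32-bit intermediate hashes, which both loses entropy and rules out keyed algorithms.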
Given the algorithm used, I don't really understand what bloat you are referring to. Compared to assembly? I guess maybe. But this is a managed language with near-C++ performance. Small price to pay.
Also, there's utility in hashCode/equals far beyond the collections framework: having an identity in a toString(), for instance, is of great value.
Yes, seems it was just for hash tables. From the JDK 1.0 doc:
"""
Returns a hash code value for the object. This method is supported for the benefit of hashtables such as those provided by java.util.Hashtable.

The general contract of hashCode is:

- Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer. This integer need not remain consistent from one execution of an application to another execution of the same application.
- If two objects are equal according to the equals method, then calling the hashCode method on each of the two objects must produce the same integer result.
"""
hashCode is also related to object comparison (see the last line) and this was 2 Java releases before `Comparable` existed.
I'm not sure what you mean by bloat. Adding methods doesn't really affect individual object size, just class size. And actual code weight rarely impacts anything in practice.
In the case of Java/C#, hashCode and equals are optional to override anyway. So if you don't override them, there's no extra space consumed.
The storage of the identity hash code takes ~4 bytes in the object header, and that is independent of whether the method is overridden or not. It's explained in the article.
But assuming the other bits in those 4 bytes are required and there isn't some other place where you could sneak them in, those ~4 bytes would still be allocated (for those other bits) and would only take less memory under some compression scheme.
And chances are that even if you did get by without that bit (or found another place for it), in many cases you'd spend far more memory on workarounds for stuff where the creator was too stingy to implement "IHashable" but you actually need it.
Well, in Virgil, a class without a superclass is a new root class, with no assumed methods. To use an object of that class as a key in a hashtable, one provides a hash function to the hashtable constructor. (The hash function can be a class method or a top-level function; with generics, a hashtable key can be any type, including primitives.) Of course it depends on how you write your code, but most of my objects never end up as keys in a hashtable, so they need neither a hash function nor storage for an identity hashcode.
So the smallest object header for a Virgil object is 4 bytes: it's just a type ID that is used for dynamic casts and as an index into a table for GC scanning. Arrays have two 4 byte header fields: the type ID and a 32-bit length. That's considerably more memory efficient than 2 or 3 64-bit header words for Java objects and arrays, respectively.
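The constructor-supplied hash function described above can be approximated in Java (this is a toy sketch with hypothetical names, not Virgil's actual library): the table owns the hashing and equality strategies, so key classes carry nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.ToIntFunction;

// Toy separate-chaining table whose hash and equality functions are
// constructor arguments, so keys need no built-in hashCode/equals.
class ExplicitHashTable<K, V> {
    private final ToIntFunction<K> hashFn;
    private final BiPredicate<K, K> eq;
    private final List<List<Object[]>> buckets = new ArrayList<>();

    ExplicitHashTable(ToIntFunction<K> hashFn, BiPredicate<K, K> eq) {
        this.hashFn = hashFn;
        this.eq = eq;
        for (int i = 0; i < 16; i++) buckets.add(new ArrayList<>());
    }

    private List<Object[]> bucket(K key) {
        return buckets.get(Math.floorMod(hashFn.applyAsInt(key), buckets.size()));
    }

    @SuppressWarnings("unchecked")
    V get(K key) {
        for (Object[] entry : bucket(key))
            if (eq.test((K) entry[0], key)) return (V) entry[1];
        return null;
    }

    @SuppressWarnings("unchecked")
    void put(K key, V value) {
        List<Object[]> b = bucket(key);
        for (Object[] entry : b)
            if (eq.test((K) entry[0], key)) { entry[1] = value; return; }
        b.add(new Object[] { key, value });
    }
}

class ExplicitHashTableDemo {
    public static void main(String[] args) {
        // Deliberately weak hash (string length) to show the strategy is
        // caller-chosen; "one" and "two" collide and still work.
        ExplicitHashTable<String, Integer> t =
            new ExplicitHashTable<>(s -> s.length(), String::equals);
        t.put("one", 1);
        t.put("two", 2);
        System.out.println(t.get("one")); // 1
    }
}
```

In a language designed around this, objects that never act as hash keys pay no header space for an identity hash, which is the point the comment is making about Virgil's 4-byte headers.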
You generally need to reserve at least a pointer-sized block of memory in a Java object's header because any object can be used as a lock with the synchronized keyword. Very few objects are ever used as a lock, so JVMs reuse those bytes for other things, like the hash.
Of course, that does imply the synchronized keyword has a fairly substantial memory cost.
I feel that if your application is that sensitive to this level of "weight", you may be better off either using something lower level (like C?) or using direct byte array buffers and encoding your data that way rather than as objects.
Not very many applications have this sort of requirement, tbh. Real-time applications and, perhaps, huge-scale data applications.
I agree. The built-in equals/hashCode is never what I want to use. Every time the built-in version was used in my code, it was by mistake and had to be fixed by providing a proper implementation. This functionality is a source of subtle bugs.
I believe C# shares a few caveats with Java because they wanted to make it easy to port Java codebases to it. Making objects hashable by default and having arrays be covariant, despite early plans to support generics, both aid in that.
Did you read the article? The JVM reserves ~4 bytes in the object header of every object to store the identity hash code, regardless of whether you override .hashCode() or not.
If you have an object with only one field, it still consumes at least 3 words of memory: 2 for the header and 1 for the field. That's a 200% memory overhead. Of course that's extreme, but I've seen estimates that as much as 40% of the heap in large Java applications is just object headers. You probably need at least 1 header word in any scheme, but 2 is wasteful IMHO.
There's a way to save memory by sacrificing .hashCode() runtime performance (on the assumption that most objects never get hashed): keep a separate lookup table for hash codes, at the cost of an indirect lookup on every .hashCode() call. I don't recall which VM used that technique. There are also fun tricks like rewriting/optimizing the object header during a copying/compacting garbage collection.
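The side-table idea can be illustrated at the user level in Java (a conceptual sketch only: a real VM would key the table on object addresses and fix it up after the GC moves objects, whereas `IdentityHashMap` here merely stands in for "identity-keyed lookup"):

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Sketch of lazily assigned identity hash codes kept outside the object.
// Only objects that are actually hashed ever get an entry, so most objects
// pay no per-object header space for a hash code -- but every lookup is an
// extra indirection, and the table itself costs memory in the worst case.
class HashCodeSideTable {
    private final Map<Object, Integer> codes = new IdentityHashMap<>();
    private int next = 1;

    synchronized int identityHash(Object o) {
        return codes.computeIfAbsent(o, k -> next++); // stable once assigned
    }
}

class SideTableDemo {
    public static void main(String[] args) {
        HashCodeSideTable table = new HashCodeSideTable();
        Object a = new Object(), b = new Object();
        int ha = table.identityHash(a);
        System.out.println(ha == table.identityHash(a)); // stable: true
        System.out.println(ha == table.identityHash(b)); // distinct: false
    }
}
```

This makes the trade-off concrete: header bytes are saved on the (common) never-hashed objects, at the price of slower .hashCode() and unbounded table growth if many objects do get hashed, which is the worst case the comment below mentions.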
Yeah, I know that trick. I implemented an identity hashmap[1] inside of V8 that uses that, because JS objects don't have an identity hash code[2]. It uses the object address as the basis of the hash code, which means the table needs to be reorganized when the GC moves objects, which is done with a post-GC hook. The identity maps in V8 are generally short-lived.
I am not aware of production JVMs that use this, because in the worst case they can use a lot more memory if a lot of objects end up needing hashcodes.