
It seems that the performance was dominated by memory management, so the comparison is not between the languages per se, but between their current garbage collectors, and respectively the reference counting implementation for C++.


Yeah, further down in the article they track down the gap between cpp and Go/Java to the ref-counting deallocation work. I'm not a cpp expert, but it seems surprising to me that GC would beat ref-counting in any scenario.


That might seem counter-intuitive, but tracing GC is the fastest way to manage memory in the general case (arbitrary lifetimes, graph shapes, and allocation patterns).

It permits really fast allocation and defragmentation, and is sometimes the only way to manage memory (e.g. cyclic data structures for which reachability cannot be known directly from the program text).
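The fast-allocation point can be sketched concretely. In a copying/generational collector the nursery is just a bump pointer, so an allocation is one add and one bounds check (this is an illustrative toy, not any particular runtime):

```cpp
#include <cstddef>
#include <vector>

// Toy bump-pointer nursery: the reason allocation under a copying or
// generational tracing GC can cost just a couple of instructions.
class Nursery {
    std::vector<std::byte> buffer_;
    std::size_t top_ = 0;
public:
    explicit Nursery(std::size_t bytes) : buffer_(bytes) {}
    // Bump the pointer; nullptr means "nursery full" -- a real runtime
    // would evacuate live objects and reset top_ to zero here.
    void* allocate(std::size_t n) {
        if (top_ + n > buffer_.size()) return nullptr;
        void* p = buffer_.data() + top_;
        top_ += n;
        return p;
    }
};
```

Freeing is implicit: objects that die before the nursery fills are never touched again, which is why short-lived garbage is essentially free under this scheme.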

Reference-counted GC, on the other hand, is notoriously slow and incomplete, and exhibits lots of bad performance patterns (which can be somewhat mitigated by borrowing elements from tracing GCs).

The drawback is that tracing GCs are very hard to implement and to tune, and sometimes not all of the benefits are available at the same time. And because they are so counterintuitive, programmers in general have a hard time understanding their runtime behavior... Even GC experts struggle (the skills required range from low-level hardware to control theory; otherwise performance at scale is unpredictable).


In C/C++ we have the possibility to avoid dynamic allocation altogether and to use system features like memory mapping. If we use C++ the same way as Java (allocating everything dynamically), it's not too surprising that the result is not (much) faster than Java.


Their original implementation was in Common Lisp. Common Lisp also lets you use mmap (in fact it's not that uncommon to do so if you have a large amount of mostly static data) to manage your memory manually.

They clearly wanted automatic memory management, so the C++ implementation is reasonable. A fairer comparison might have used the MPS or Boehm GC instead of refcounting, but I suspect the results would have been similar.


> the C++ implementation is reasonable

Fig. 4 shows that deallocation, presumably of objects allocated during the first phase, takes half as long as the first phase itself. Unless the nature of the problem solved by the program requires you to have a ton of objects with shared mutability (the use case for shared_ptr), and requires you to make a lot of allocations up-front without knowing if you will need them, the memory thrashing occurring does not seem reasonable.


> Common Lisp also lets you use mmap

But not without allocating dynamic memory and copying data.

> They clearly wanted automatic memory management

Most likely because of some misconceptions.

> so the C++ implementation is reasonable.

How so?

> but I suspect the results would have been similar

Don't forget the data sets to be filtered, sorted, and analyzed are up to 200 GB.


> > Common Lisp also lets you use mmap

> But not without allocating dynamic memory and copying data.

Sure it does. In SBCL you can force a stack allocation (though rarely does it improve performance), and very short-lived values do not leave registers in any case.

> > They clearly wanted automatic memory management

> Most likely because of some misconceptions.

There are both good and bad reasons to want automatic memory management. At least one good reason is that it would decrease the porting effort by keeping the code similar.

> > so the C++ implementation is reasonable.

> How so?

Using reference counting is a reasonable way to get automatic memory management in C++.
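A minimal sketch of that approach, assuming the port's records can be shared from many places the way Lisp objects are (the `Record` type is made up for illustration, not taken from the paper):

```cpp
#include <memory>
#include <vector>

// Hypothetical record type: shared_ptr gives Lisp-like automatic
// lifetime for values that may be referenced from many places at once.
struct Record {
    int payload = 0;
    std::vector<std::shared_ptr<Record>> children;
};
```

Dropping the last shared_ptr to a root then frees the whole structure transitively, which is exactly the deallocation chain the article measures.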

> > but I suspect the results would have been similar

> Don't forget the data sets to be filtered, sorted, and analyzed are up to 200 GB.

Which is going to be rough on any automatic memory management system, and that makes using a language with a better ecosystem of automatic memory management the more performant choice.


> > Don't forget the data sets to be filtered, sorted, and analyzed are up to 200 GB.

> Which is going to be rough on any automatic memory management system

You could equally say that it makes the case for actually designing memory allocation strategy (which only C++ really supports) that much more important.

You always see this in Java programs for large data analysis. They pick Java because of the memory management and the tooling. But it's just SO slow, and after optimisation the only thing that stubbornly remains up there in the profiler data is memory and GC. And what do they do?

A global object of the following form:

  class DataStore {
    float[] theFloatsWeNeed = new float[constHowMany];
    int[] theIntsWeNeed = new int[anotherConst];
  }
You get the idea. Because this avoids memory allocation in Java. And you use the flyweight pattern to pass data around. Or you fake pointer arithmetic in Java: you create your own pointers by specifying indexes and you (oh, the horror) use math on those indexes. Even then, just bounds-checking those indexes actually becomes a significant time sink (and then you disable that, which of course kills memory safety in Java, but you won't care).

The truth is you don't want memory management for large amounts of data. You don't want to allocate it, track it, or deallocate it at all. You leave it in its on-disk data format and never serialize/deserialize it at all. You want to mmap it into your program, operate on it, and then just close the mmap when you're done. C++ definitely has the best tools for this way of working.
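A rough sketch of that style of working, using POSIX mmap (the raw-floats file format and `sum_floats` are invented for illustration; error handling is trimmed):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Sum a file of raw native-endian floats in place: no malloc, no copy,
// no deserialization. The OS pages data in and out as needed.
double sum_floats(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0.0;
    struct stat st {};
    fstat(fd, &st);
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the fd is closed
    const float* data = static_cast<const float*>(base);
    double total = 0.0;
    for (std::size_t i = 0; i < static_cast<std::size_t>(st.st_size) / sizeof(float); ++i)
        total += data[i];
    munmap(base, st.st_size);
    return total;
}
```

The working set is whatever the kernel's page cache decides to keep resident, so the same code handles a 1 MB file and a 200 GB file.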


Yeah, right. The people who know this can apparently be counted on one hand when I look at the advertised publication and the discussion in this forum.


Common Lisp: It is permissible for an implementation to simply ignore such declarations. And you still have to copy.

Ref counting: only makes sense in a few special cases.

Avoiding dynamic memory management: have a look at mmap.


C++: It is permissible for an implementation to allocate every local variable on the heap.

Going back to my original point, you are suggesting a complete rearchitecture of their allocation system. That does not require switching languages to C++. If we are talking about working with 100s of GB of data, that's probably even the correct approach!

TFA does not, however, claim that they have a working set of 100s of GB of data. The data is 100s of GB at rest, but can be processed in chunks with a single pass. That, by itself, does not scream "mmap" to me. On top of that, the data is compressed at rest, so copying is inevitable.


So we are glad that we can use the data of the memory-mapped file directly and do not have to allocate or copy anything. But of course you may solve the problem badly if you prefer; after all, there are even "scientific" publications that do it that way, as the example shows.


It's permissible for a C compiler to emit shell scripts.


I guess I was wrong with this comment:

https://news.ycombinator.com/item?id=22959600

1. Reference counting is a form of GC; you could implement a JVM that used reference counting (though in order to be general a small amount of additional work is needed)

2. Reference counting causes extra work every time a reference appears or disappears. Tracing GCs amortize that cost across many allocations.

2.b. This is particularly hurtful to performance for short-lived objects, since most tracing GCs have zero GC overhead for them (the cost of a nursery collection under most implementations scales with the amount of live data in the nursery, so objects that appear and disappear within the span of a single nursery GC are freed at zero extra cost).

3. malloc cannot move allocated data, so many implementations have a lot of complexity to avoid heap fragmentation, which comes at a cost to both allocating and freeing data. Many GC'd languages allocate small objects with a single instruction in the typical case (just incrementing a pointer; the atypical case is when the nursery is full and a GC happens).

4. The JVM and Go both have a lot of effort put into their GC; the ref-counting implementation used by this test is probably a bit more naive. In particular, they talk about large delays when a chain of links causes many allocations to die at the same time. A less naive refcounting implementation would queue deleted objects and spread that work out over a larger time period.
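That less-naive scheme can be sketched as a deferred-free queue with a per-step work budget (all names here are hypothetical):

```cpp
#include <cstddef>
#include <deque>
#include <memory>

// Instead of letting one refcount hitting zero free a whole chain at
// once (the pause the article measures), dead objects go on a queue
// and only a bounded number are freed per step.
struct Node {
    std::unique_ptr<Node> next;  // a chain that would normally die transitively
};

std::deque<std::unique_ptr<Node>> free_queue;

// Called when an object's refcount reaches zero: defer, don't recurse.
void defer_free(std::unique_ptr<Node> n) {
    free_queue.push_back(std::move(n));
}

// Called periodically (e.g. on each allocation): free at most `budget`
// nodes; children are re-queued rather than destroyed recursively.
void free_some(std::size_t budget) {
    while (budget-- > 0 && !free_queue.empty()) {
        std::unique_ptr<Node> n = std::move(free_queue.front());
        free_queue.pop_front();
        if (n->next) defer_free(std::move(n->next));
        // n's destructor runs here, freeing exactly one node
    }
}
```

Each call does a bounded amount of deallocation work, so a long chain dying no longer produces one long pause, at the cost of dead memory lingering briefly on the queue.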


They're duals. Seminal paper:

https://www.researchgate.net/publication/221321424_A_unified...

Performance characteristics depend on where on that continuum your workload falls. For example, Erlang/BEAM uses a generational GC for most common heap objects, but refcounts large binary blobs. This is pretty much a perfect case for refcounting: new references are created infrequently, copying or moving is expensive, destruction is deterministic and happens immediately after the last reference disappears, and there are no pointers within the blob that would require tracing or cycle detection.

Similarly, UI components within a GUI are another good case for refcounting (and presumably why Apple continues to use it for Objective-C and Swift in Cocoa). New references happen only in non-performance-critical code, most data remains live across collections, and copying/moving existing data would (a) be slow and (b) invalidate any C pointers into the contained data.

Sounds like the particular problem domain described in this article is one where heap allocations are frequent, which makes generational GCs more appropriate. That's probably the case with the vast majority of computational algorithms, but there are definitely problem domains where refcounting continues to beat GC.


> you could implement a JVM that used reference counting

It would leak memory - reference counting cannot collect cycles (you need tracing GC for that, defeating the purpose of refcounting).


From right after what you quoted:

> (though in order to be general a small amount of additional work is needed)

Note also that there are two methods of cycle detection for a reference-counted GC that are not just a backup tracing GC:

1. Trial deletion (known since at least the mid 80s)

2. Various tracing systems that exploit extra information known to reference-counted systems, e.g. Levanoni/Petrank [1], who actually implemented a reference-counted GC for Java.

1: https://www.cs.technion.ac.il/~erez/Papers/refcount.pdf


Thanks, I didn't know that!


Refcounting can mess with caches in multithreaded scenarios, so the results aren't really surprising to me. It also generates unnecessary work to maintain counters on all writing threads, while a concurrent tracing GC batches all its actions on a separate thread.


"With reference counting, objects are recognized as obsolete due to their reference counts dropping to zero. Deallocation of these objects leads to transitive deallocations of other objects because of their reference counts transitively dropping to zero. Since this is an inherently sequential process, this leads to a similar significant pause as with a stop-the-world garbage collector."

There are three primary performance penalties for reference counting:

1. This, where dropping one link causes a deallocation chain.

2. A typical heap implementation will leave live objects scattered throughout memory, leading to cache issues.

3. Maintaining the reference counts may cause writes to memory, leading to other cache issues, plus the cost of atomic operations.


People often forget that memory allocation is very expensive; the C/C++ default allocator is very inefficient.



