
> .. durable execution, and is so new that most developers never have heard of it.

It's called checkpoint/restart, and was a feature of some early operating systems, mostly for programs whose run time exceeded the mean time between failures of the system.

Tandem's whole system concept was built around that.

Amusingly, Second Life, of all things, has durable execution of the little LSL programs that make in-world objects go. They're checkpointed every minute or two, and if a region crashes, they are restarted, stack, heap, and all. They survive machine crashes and ports to new hardware. Even migration from a dedicated data center to AWS. Some have been running for well over a decade. Internally, they are Mono programs.



There's a decent chance checkpoint/restart has been with us since the days of paper tape.

It's moderately amusing to see "save state somewhere" described as so new people haven't heard of it.
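For anyone who hasn't seen it spelled out, "save state somewhere" at the application level can be as small as the sketch below. This is a minimal illustration, not any particular system's mechanism: the names `save_checkpoint`, `load_checkpoint`, and `run`, the JSON file format, and the `crash_at` parameter (which only simulates a failure) are all made up for the example.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic on POSIX, so a crash never leaves a half-written file.
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next": 0, "total": 0}

def run(path, limit, every=100, crash_at=None):
    """Sum the integers 0..limit-1, checkpointing every `every` steps.

    crash_at is a hypothetical knob that raises an error at that step,
    standing in for a machine failure.
    """
    state = load_checkpoint(path)
    while state["next"] < limit:
        if crash_at is not None and state["next"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["next"]
        state["next"] += 1
        if state["next"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the first call to `run` dies partway through, calling it again with the same path resumes from the last checkpoint instead of from zero, and only the work done since that checkpoint is repeated.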


I've been in the industry long enough to see this pattern repeat so often that I've come to accept it as a sort of universal truth: nearly everything that people think is "new" is actually an echo and (hopefully) refinement of something that has come before.

I am hard-pressed to think of anything in the software world today that is actually, truly, new. We all stand on the shoulders of giants.


Checkpoint/Restore, I feel, is a bigger concept than just saving state. At the zeroth level it's a system that can correctly stop and serialize a running process (which, as criu https://github.com/checkpoint-restore/criu has shown, is a huge pain in the ass and still not perfect) in a way that can be initiated from within the process itself.

The 1st level more-work-but-easier way to do this is to build or use a heavily constrained VM/language you run from within your main application that doesn't allow for most of the hard problems to even exist.

I can't find any ready-made tools to do this that I wouldn't consider an endeavor. Emacs has to be the most famous application to utilize dump/restore state but it's not exactly turnkey.
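To make the constrained-VM route concrete, here is a rough sketch of the idea, assuming a toy stack machine invented for this example (the `TinyVM` class and its `push`/`add`/`mul`/`halt` instruction set are hypothetical, not taken from any existing tool). Because the VM cannot hold file handles, sockets, or threads, its entire state is plain data and a checkpoint is just a JSON dump.

```python
import json

class TinyVM:
    """A toy stack machine whose whole state is (program, pc, stack)."""

    def __init__(self, program, pc=0, stack=None):
        self.program = program
        self.pc = pc
        self.stack = list(stack or [])

    def step(self):
        """Execute one instruction; return False once the program has ended."""
        if self.pc >= len(self.program):
            return False
        op, *args = self.program[self.pc]
        self.pc += 1
        if op == "push":
            self.stack.append(args[0])
        elif op == "add":
            b, a = self.stack.pop(), self.stack.pop()
            self.stack.append(a + b)
        elif op == "mul":
            b, a = self.stack.pop(), self.stack.pop()
            self.stack.append(a * b)
        elif op == "halt":
            self.pc = len(self.program)
            return False
        else:
            raise ValueError(f"unknown instruction: {op}")
        return True

    def checkpoint(self):
        """Serialize the full VM state; safe to call between any two steps."""
        return json.dumps(
            {"program": self.program, "pc": self.pc, "stack": self.stack}
        )

    @classmethod
    def restore(cls, blob):
        """Rebuild a VM from a checkpoint, possibly in another process."""
        data = json.loads(blob)
        return cls(data["program"], data["pc"], data["stack"])

# Usage: compute (2 + 3) * 4, checkpointing partway through.
program = [["push", 2], ["push", 3], ["add"], ["push", 4], ["mul"]]
vm = TinyVM(program)
vm.step()
vm.step()
blob = vm.checkpoint()       # could be written to disk at this point
vm2 = TinyVM.restore(blob)   # e.g. after a crash or on a different machine
while vm2.step():
    pass
print(vm2.stack)             # [20]
```

The host application decides when to checkpoint, and it only ever does so between instructions, which is what sidesteps most of the problems a process-level tool has to solve.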


I remember there was one for unix back in the 90's, I think called Condor, that could be used to migrate long running processes to other machines. (I think the tricky part was restoring external connections like file handles.)

And before that, there were TeX and emacs, which do a memory dump and restore to save time loading their initial state. One of those, I think emacs, would tweak a core dump file to be a valid executable again.


The challenge is purely in how to make it perform well.

Dumping the whole memory was a lot more viable when that was only a few kilobytes.


Good point about Tandem, but it wasn't limited to them.

Checkpoint/restart has also long been a high-end capability found in HPC (High Performance Computing) systems. I recall it even made it onto some commercial UNIXes, although I don't remember whether it was bundled in or took the form of some additional layered software such as LSF (Load Sharing Facility) (example: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=admini... ).

There are some people who have tried to bring some checkpoint/restart to Linux. For example, see the CRIU project: https://criu.org/Checkpoint/Restore

I acknowledge that the original poster is talking about all this being done at the application level, not outside-the-app at the OS level...

For more references on the topic, see the Wikipedia entry for Application checkpointing: https://en.wikipedia.org/wiki/Application_checkpointing



