
It uses a mounted EBS (Elastic Block Store) volume, so all the checkpoints, data, etc. are already in persistent storage. The volume is simply re-attached to the next spot/on-demand instance after an interruption.
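For illustration, here is a minimal sketch of the resume-from-checkpoint pattern this relies on. It is not the project's actual code: the directory layout, the `latest.json` file name, and the `train` loop are all hypothetical, and the checkpoint directory is assumed to live on the re-attached volume.

```python
import json
import os
import tempfile

def save_checkpoint(ckpt_dir, state):
    """Write the state atomically so an interruption never leaves a half-written file."""
    os.makedirs(ckpt_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic when source and destination are on the same filesystem.
    os.replace(tmp_path, os.path.join(ckpt_dir, "latest.json"))

def load_checkpoint(ckpt_dir):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    path = os.path.join(ckpt_dir, "latest.json")
    if not os.path.exists(path):
        return {"epoch": 0, "loss": None}
    with open(path) as f:
        return json.load(f)

def train(ckpt_dir, total_epochs, stop_after=None):
    """Hypothetical training loop; stop_after simulates a spot interruption."""
    state = load_checkpoint(ckpt_dir)
    done_this_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated interruption
        epoch = state["epoch"] + 1
        state = {"epoch": epoch, "loss": 1.0 / epoch}  # placeholder for real work
        save_checkpoint(ckpt_dir, state)
        done_this_run += 1
    return state
```

Calling `train` again with the same directory after an interruption picks up at the epoch after the last saved one, which is why re-attaching the volume is enough.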


Cool, yeah, that makes total sense for ML, where you just need to run over epochs. It's less clear for other workloads.

After looking around, I'm thinking more about CRIU/docker suspend. The Google stars aligned and I found this: https://github.com/checkpoint-restore/criu-image-streamer + https://linuxplumbersconf.org/event/7/contributions/641/atta... which actually seems perfect. I wonder how fast it is.

Edit: Also, no GPU support AFAIK, but https://github.com/twosigma/fastfreeze looks really nice and turnkey. I wonder whether writing to a fast persistent disk would let me checkpoint a higher maximum amount of RAM than going over the network.

(Or, hacking on the checkpoint idea, have a daemon periodically 'checkpoint' other programs, so that even if a full checkpoint takes longer than 60 seconds, you can revert to the last checkpoint. It could even work like rsync, where only the changes are sent.)
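To make the "only send the changes" idea concrete, here is a rough sketch that diffs two snapshots block by block and ships only the blocks whose hashes changed. It is a simplification under stated assumptions: it compares fixed-offset blocks rather than using rsync's rolling checksum, it works on plain byte strings rather than real process memory, and it has nothing to do with how CRIU or fastfreeze work internally. All function names are made up for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # arbitrary block size chosen for the example

def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of the previous snapshot."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]

def make_delta(old_hashes, new_data, block_size=BLOCK_SIZE):
    """Collect only the blocks of new_data that differ from the old snapshot."""
    changed = {}
    for idx, start in enumerate(range(0, len(new_data), block_size)):
        block = new_data[start:start + block_size]
        is_new = idx >= len(old_hashes)
        if is_new or hashlib.sha256(block).digest() != old_hashes[idx]:
            changed[idx] = block
    return {"length": len(new_data), "blocks": changed}

def apply_delta(old_data, delta, block_size=BLOCK_SIZE):
    """Rebuild the new snapshot from the old one plus the changed blocks."""
    length = delta["length"]
    # Truncate or zero-pad the old snapshot to the new length, then patch it.
    out = bytearray(old_data[:length].ljust(length, b"\0"))
    for idx, block in delta["blocks"].items():
        start = idx * block_size
        out[start:start + len(block)] = block
    return bytes(out)
```

A daemon along these lines would keep the hashes of the last full snapshot and, on each tick, transfer only `delta["blocks"]`, so the data written in the final 60 seconds is limited to whatever changed since the previous checkpoint.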



