
I'm not sure what this provides over what GCP already offers. Four years ago I switched my company's ML training to GKE (Google Kubernetes Engine), running on a cluster of preemptible nodes.

All you need to do is schedule your jobs (just call the Kubernetes API and schedule your container to run). In the rare case the node gets preempted, your job will be restarted and restored by Kubernetes. Let your node pool scale to near zero when not in use and get billed by the _second_ of compute used.
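For anyone who hasn't done this, here is a rough sketch of what "schedule your container" can look like: a small Python helper that builds a `batch/v1` Job manifest pinned to preemptible nodes via the `cloud.google.com/gke-preemptible` node label. The helper name, job name, image, and command are made up for illustration, and this is not necessarily how the setup described above was configured.

```python
import json


def preemptible_job_manifest(name, image, command, backoff_limit=6):
    """Build a batch/v1 Job manifest that targets GKE preemptible nodes.

    Hypothetical helper for illustration; name, image and command are
    supplied by the caller.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # How many times the Job controller retries failed pods.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    # Restart the container in place if it exits non-zero.
                    "restartPolicy": "OnFailure",
                    # Only schedule onto nodes from a preemptible pool.
                    "nodeSelector": {
                        "cloud.google.com/gke-preemptible": "true"
                    },
                    "containers": [
                        {"name": "trainer", "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = preemptible_job_manifest(
        "train-run-1",                      # hypothetical job name
        "example.invalid/trainer:latest",   # hypothetical image
        ["python", "train.py"],             # hypothetical entrypoint
    )
    # kubectl accepts JSON as well as YAML, e.g. `kubectl apply -f job.json`.
    print(json.dumps(manifest, indent=2))
```

Note that the manifest only gets the pod rescheduled; resuming from where training stopped still depends on the job writing and reloading its own checkpoints.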



Reminder: this is Show HN, so please try to be constructive.

However, there are at least a couple of things that matter here that aren't covered by "just use a preemptible node pool":

* SpotML configures checkpoints (yes, this is easy, but see the next point)

* SpotML sends those checkpoints to a persistent volume (by default in GKE, you would not use a cluster-wide persistent volume claim, and instead only have a local ephemeral one, losing your checkpoint)

* SpotML seems to have logic around "retry on preemptible, and then switch to on-demand if needed" (you could do this on GKE by just having two pools, but it won't be as "directed")
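To make the three points above concrete, here is a minimal sketch of the general pattern: checkpoint to a path that survives the node, resume from it, and fall back to on-demand after a few preemptions. This is not SpotML's actual code. `Preempted`, the runner callables, and `max_spot_attempts` are all hypothetical names, and the checkpoint path is assumed to sit on a persistent volume.

```python
import json
import os
import tempfile


class Preempted(Exception):
    """Hypothetical signal that the node was reclaimed mid-run."""


def save_checkpoint(path, state):
    # Write to a temp file in the same directory, then rename, so a
    # preemption mid-write never leaves a truncated checkpoint behind.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Start from step 0 when no checkpoint exists yet.
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, runner):
    """Run the remaining steps, checkpointing after each one."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        runner(state["step"])  # may raise Preempted
        state["step"] += 1
        save_checkpoint(path, state)
    return state


def run_with_fallback(path, total_steps, spot_runner, on_demand_runner,
                      max_spot_attempts=3):
    """Try preemptible capacity a few times, then switch to on-demand."""
    for _ in range(max_spot_attempts):
        try:
            train(path, total_steps, spot_runner)
            return "preemptible"
        except Preempted:
            continue  # progress so far is already in the checkpoint
    train(path, total_steps, on_demand_runner)
    return "on-demand"
```

The key detail is that `path` has to live on storage that outlives the node; with a local ephemeral volume the resume step would silently start from zero every time.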


Looks like SpotML is a fork of https://github.com/nimbo-sh/nimbo and https://spotty.cloud/

This is a hustle to gauge interest (and collect emails) in a service that is a clone of nimbo.



