Hey, thanks for the comment. Could you talk about what you see as an alternative... but assuming that k8s (most likely a cloud-managed flavor like EKS, etc.) is what native devops is based on?
What I'm seeing is that ML/data engineering is diverging from devops reality and building its own orchestration layers, which is impractical except at the largest orgs.
I've yet to find something that fits in with Kubernetes, which is why it seems everyone here is using fully managed solutions like SageMaker.
I do think managed solutions like Fargate and SageMaker are good choices. Some providers, notably GCP, have no offerings that seriously match them (Cloud Run has too many limitations).
Kubernetes is very poor for workload orchestration for machine learning. It’s ok for simple RPC-like services in which each isolated pod just makes stateless calls to a prediction function and reports a result and a score.
But it’s very poor for stateful combinations of ML systems, like task queues, or for robustness in multi-container pod designs with cooperating services. And it is especially bad for one-shot task execution. Operating Airflow or Luigi on k8s is horrendous, which is why nearly every ML org I’ve seen ends up writing its own wrappers around the native k8s Job and CronJob resources.
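To make that concrete, here is a minimal sketch of what those in-house wrappers tend to look like. The function name, defaults, and registry URL are my own invention for illustration, not from any real library; the point is that the wrapper hides Job boilerplate (restart policy, retries, cleanup) so ML code only supplies an image and a command.

```python
# Hypothetical thin wrapper around the k8s batch/v1 Job API.
# Builds a plain manifest dict you could hand to kubectl or a client library.

def make_ml_job(name: str, image: str, command: list[str],
                retries: int = 3, ttl_seconds: int = 3600) -> dict:
    """Build a batch/v1 Job manifest for a one-shot ML task."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": {"app": "ml-task"}},
        "spec": {
            # Retry failed pods a bounded number of times...
            "backoffLimit": retries,
            # ...and garbage-collect the Job object after it finishes.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": command,
                    }],
                },
            },
        },
    }

# Example: a one-shot training run.
job = make_ml_job("train-resnet", "registry.example.com/train:latest",
                  ["python", "train.py"])
```

Every org's version of this differs in exactly the ways that matter (logging sidecars, node selectors for GPUs, secrets), which is why it keeps getting rewritten in-house.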
Kubeflow can be thought of as an attempt to do this in a single, standard manner, but the problem is that there are too many variables in play. No org can live with the limitations or menu choices that Kubeflow enforces, because they have to fit the system into their company’s unique observability framework, unique networking framework, unique RBAC / IAM policies, etc. etc.
I recommend leveraging a managed cloud solution that takes all that stuff out of the internal-datacenter model of operations, moving it off of k8s, and only using systems you have end-to-end control over (e.g., do your own logging and alerting through vendors and cloud services; don’t rely on SRE teams to hand you a solution, because it almost surely will not work for machine learning workloads).
If you cannot do that because of organizational policy, then create your own operators and custom resources in k8s and write wrappers around your main workload patterns; do not try to wedge your workloads into something like Kubeflow, TFX / TF Serving, MLflow, etc. You may have occasional workloads that use some of these, but make sure you have wrapped a custom “service boundary” around them at a higher level of abstraction, otherwise you are hamstrung by their deep-seated limitations.
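For the custom-resource route, the service boundary can be as simple as a high-level manifest that your own operator reconciles into Jobs, volumes, RBAC, and logging that match org policy. This sketch is purely illustrative: the "MLTask" kind, the API group, and every field are invented to show the shape of the idea, not a real API.

```python
# Illustrative only: a hypothetical custom resource that an in-house
# k8s operator would watch and reconcile. Group/kind/fields are invented.

def make_mltask(name: str, framework: str, entrypoint: str,
                gpus: int = 0) -> dict:
    """Build a manifest for a hypothetical mlops.example.com/v1 MLTask."""
    return {
        "apiVersion": "mlops.example.com/v1",
        "kind": "MLTask",
        "metadata": {"name": name},
        "spec": {
            # The operator owns the translation from these high-level
            # fields into concrete Jobs, pod specs, and policies, so
            # workloads never depend on any one tool's limitations.
            "framework": framework,
            "entrypoint": entrypoint,
            "resources": {"gpus": gpus},
        },
    }

# Example: a nightly retraining task the operator would expand.
task = make_mltask("nightly-retrain", "pytorch", "train.py", gpus=2)
```

The payoff is that if you later swap Kubeflow for something else underneath, only the operator changes; the MLTask interface your teams program against stays put.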