
CPU/Memory: scale horizontally when needed. Monitor cost.

Disk: essentially limitless. If a VM's disk runs out, the node will crash and a new node will start. The service should keep running on the other nodes in the meantime. If restarts happen too often, throughput/error rate will suffer or cost will rise.

The modern mentality is "let it crash" [0]. Software has bugs. Design the system so it can crash, and scale it depending on need.

[0] https://medium.com/@vamsimokari/erlang-let-it-crash-philosop...



Letting an Erlang process crash means crashing a process that holds a small collection of resources, maybe a single TCP connection and a few kilobytes of local state. It does not necessarily scale beyond that. Pretty much by definition, if you've got something that can run out of disk, then when it runs out of disk and you nuke it, you're taking out a lot more than a single connection and a few kilobytes of state.

And "let it crash" is hardly "My service is invincible!" I'm sure any serious Erlang deployment has seen what I've seen as well, which is a sufficiently buggy process that crashes often enough that the system never gets into a stable state of handling requests. "Let it crash" is hardly license to write crap code and let the supervisor pick up the pieces. It's a great backstop, it's a shitty foundation.


I built a distributed system once on two principles: 1. let it crash, 2. make everything deterministic. Obviously, this resulted in crashes being either invisible and transient (good) or an infinite crash loop (bad).

I haven't used Erlang, but my impression is that it's probably the same experience there?


The way it builds on immutability means it naturally leans in that direction, but the fact that it tends to be used heavily for networking undoes a lot of that, because network communication is by its nature not deterministic in the sense you mean.

In my case, IIRC, it was something to the effect of: a lot of our old clients out in the field were connecting with a field that crashed the handler. Enough of these were connecting that the supervisor was flipping out and restarting things (even working things) because it thought there was a systemic issue. (I mean, it was right, though I had it configured to be a bit too aggressive in its response.) The fact that I could let things crash didn't rescue the fact that my system, even if I fixed that config, would strain its resources constantly doing TLS negotiations and then immediately crashing out what were supposed to be long-term connections.
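
For anyone unfamiliar, that "aggressiveness" is the supervisor's restart intensity. A minimal sketch in Elixir (module names and the specific numbers are made up for illustration):

    defmodule Demo.Worker do
      use GenServer
      def start_link(arg), do: GenServer.start_link(__MODULE__, arg)
      @impl true
      def init(arg), do: {:ok, arg}
    end

    defmodule Demo.Sup do
      use Supervisor
      def start_link(arg), do: Supervisor.start_link(__MODULE__, arg)

      @impl true
      def init(_arg) do
        # If children crash more than max_restarts times within max_seconds,
        # the supervisor assumes a systemic problem and exits itself,
        # escalating the failure up the tree -- taking "even working things"
        # down with it, exactly as described above.
        Supervisor.init([Demo.Worker],
          strategy: :one_for_one,
          max_restarts: 3,
          max_seconds: 5
        )
      end
    end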

Obviously, the problem was in the test suite; we were able to roll back and address it, and ultimately this was a blip, not a catastrophe. I just cite it as a case where "let it crash" didn't save me, because the thing that was crashing was just too broken. You still have to write good code in Erlang. It may survive not having "great" code, but it's not a license not to care.


"Let it crash" is anti-fragile. Either you understand that, or you end up trying to build super-robust and super-expensive applications.


Using Taleb's nomenclature, let it crash is not anti-fragile at run time. Erlang does not get progressively better at holding your code together the longer it crashes. It is only resilient and robust. Which is ahead of a lot of environments, but that's all.

Many software development processes considered as a whole are anti-fragile... mine certainly is, which is a great deal of why I love automated testing so much (I often phrase it as "it gives me monotonic forward progress" but "it gives me antifragility" is a reasonably close spin on it too)... but that's not unique to Erlang nor does Erlang have anything special to help with it as everything Erlang has for robustness is focused at run time. You can have anti-fragile development processes with any language. (The fact that I successfully left Erlang for another language entirely is also testament to that. I had to replace Erlang's robustness but I didn't have to replace its anti-fragility, since it didn't have it particularly.)


Anti-fragility is just a fancy name for "ability to learn". Erlang's error-handling philosophy enables learning by keeping things simple and transparent. It's easy to see that some component keeps failing; it doesn't bring your whole app down, and you can look into it and improve it. Adding tonnes of third-party machinery may be robust or even resilient, but if it makes things more opaque or demands bigger teams of deeper specialists, it precludes learning, and thus is not "anti-fragile". You can keep your ability to learn healthy without Erlang, and you can use Erlang without learning much over time.

It's not the tool, it's how you use it.


This is an adequate philosophy for, like, a CRUD app, some freemium SaaS, social media, etc.: stuff with millions of users and billions of sessions.

However, there are industries applying these lessons in HPC / data analytics / things that touch money live, operating at scales of tens to maybe hundreds of users. That is, places where downtime is far more costly in both dollars and reputation.

I'm also intrigued by the constant cloud refrain of "stuff crashes all the time so just expect it to", coming from a background where I have apps that run without a crash for 6 months at a time, or essentially until the next release.

I'm all for scaling, recovery, etc. I just fail to understand why it is desirable for this to be an OR rather than an AND.

What if stuff was highly recoverable and scalable but also.. we just didn't run out of disk needlessly?


> I'm also intrigued by the constant cloud refrain of "stuff crashes all the time so just expect it to", coming from a background where I have apps that run without a crash for 6 months at a time, or essentially until the next release.

IMHO, those aren't mutually exclusive. Your app code should be robust enough to run 6+ months at a time, and the "stuff crashes all the time so just expect it to" attitude should be reserved for stuff outside your control, like hardware failures.


Right, which is why I think brushing aside actually monitoring basic hardware stats that are leading indicators of error rates / API issues / etc makes no sense.


How is that better than a simple monitor/alert for low disk space? That low disk space is likely caused by an application storing too much cumulative data in log files or temporary caches etc., and is often easy enough to fix. And many applications out there simply don't need the level of scalability and extra robustness where you can still expect decent levels of service in the immediate aftermath of a node going down. Certainly, from my experience, it's less work (and cost) to put measures in place that minimise the chances of a fatal crash than it is to ensure the whole environment functions smoothly even if parts of it crash regularly. I'd also note we can be grateful that the developers of OSes, web servers, VMs and database servers don't subscribe to "let it crash"!
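
For what it's worth, the BEAM itself ships with exactly that kind of simple monitor: OTP's os_mon application includes disksup, which polls disk usage and raises a disk_almost_full alarm past a configurable threshold. A rough sketch of polling it by hand from Elixir (the 90% threshold and the IO.puts stand-in for a real pager are illustrative):

    # os_mon ships with Erlang/OTP; disksup polls mounted disks periodically.
    {:ok, _} = :application.ensure_all_started(:os_mon)

    # get_disk_data/0 returns {mount_point, total_kbytes, used_percent} tuples.
    for {mount, _total_kb, used_pct} <- :disksup.get_disk_data(),
        used_pct > 90 do
      IO.puts("ALERT: #{mount} is #{used_pct}% full")  # stand-in for a real alert
    end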


It looks like you misunderstood the article. "Let it crash" in the BEAM VM world pertains to a single green thread / fiber (confusingly called "process" in Erlang).

It pertains to e.g. a single database connection, a single HTTP request, etc. If something crashes there, your APM reports it and Erlang's unique runtime continues unfazed. It's a property of the BEAM VM that no other runtime possesses.

"Let it crash" is in fact "break your app's runtime logic to many small pieces each of which is independent and an error in any single one does not impact the others".

Scaling an Erlang node is very rarely the solution unless you literally run out of system resources.
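
To make the isolation concrete, here's a small Elixir snippet you can paste into iex (the message send is just there to prove the caller survived):

    # An unlinked process crashes alone: the runtime logs an error report
    # and every other process carries on.
    parent = self()
    spawn(fn -> raise "boom" end)
    spawn(fn -> send(parent, :still_alive) end)

    receive do
      :still_alive -> IO.puts("caller and siblings unaffected")
    end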


I understood the article just fine. The "Let It Crash" philosophy is scale invariant.

Please read the last three paragraphs in [0]: "a well-designed application is a hierarchy of supervisor and worker processes" and "It handles what makes sense, and allows itself to terminate with an error ("crash") in other cases."
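
The "handles what makes sense" half of that is visible in idiomatic code: you pattern-match only the cases you expect and deliberately write no catch-all. A hypothetical Elixir sketch (module and field names invented):

    defmodule Billing do
      # Handle what makes sense...
      def handle_event(%{"id" => id, "amount" => amount}) when is_integer(amount) do
        {:ok, {id, amount}}
      end

      # ...and no catch-all clause: any other shape raises FunctionClauseError,
      # the owning process crashes, and its supervisor decides whether and how
      # to restart it.
    end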

I've personally designed and co-implemented mission critical real-time logistics systems which dealt with tens of thousands of events per second, with hundreds of different microservices deployed on a cluster of 14 heavy nodes. Highly complex logic. At first we were baby-sitting the system, until it became resilient by itself. Stuff crashed all the time. Functional logic was still intact. Then we had true silence for months on our pager alerts.

I call it anti-fragile and Taleb is right. You can't make a system resilient if you don't allow it to fail.

[0] https://wiki.c2.com/?LetItCrash


Is that so different to how Java+Tomcat or .NET+IIS work? A crash handling one request generally can't/doesn't affect the ability to handle other requests. Unfortunately it does often mean you have limited control over how the end-user perceives that one failed request.


It is only the same in the visible results and in your APM, and nowhere else. The stacks you mention -- and many others -- use much more system resources per request than the BEAM VM. I have personally achieved 5000 req/s on a 160 EUR refurbished laptop with a pretty anemic Celeron J CPU, on my local network, by bombarding an Elixir/Phoenix web app (Elixir is built on top of Erlang, if you did not know) -- and that's without even trying to use a cache.

Re: error handling, the Elixir apps I coded and maintained never had a problem. Everything was 100% transparent, which is another huge plus.

In general, CGI and PHP had the right idea from the start, but the actual implementations were (maybe still are? no idea) subpar.

Erlang's runtime is of course nowhere near what you will get with Rust and very careful programming with the tokio async runtime, but it's the best out there in the land of the dynamic languages. I've tried 10+ languages, which is of course not technically exhaustive, but I was finally satisfied with making super parallel _and_ resilient workloads when I tried Elixir.

For a lot of tasks I can just code in the Elixir REPL and crank out a solution in an hour, including copy-pasting from the REPL into an actual program with tests. The same task took me a week in Rust, just saying (though I am still not a senior in Rust; that's absolutely a factor as well), and about 3/4 of a day in Golang.

The only other small-friction, no-BS language and ecosystem that came kinda close for me is Golang. I like that one a lot, but you have to be very careful when you do true parallelism, or you have to reach for a good number of libraries to avoid several sadly very common footguns.



