Using TLA+ to Model Cascading Failures

tluyben2 · on April 11, 2019

I have been modelling parts of our systems in TLA+ for years now and it really helps a lot working out complex scenarios.

Leace · on April 11, 2019

I'd love to read about that in detail. Are there any publicly available documents you could share?

Twirrim · on April 11, 2019

Anecdotally, the team I was in in AWS needed to build a complicated component, one that, should it get it wrong, would be disastrous for the service.

They'd estimated about 4-6 months for a two person team, made from some of the best engineers in the service, focused entirely on it to get it written, tested and out to production.

They decided to use TLA+ to model, despite neither engineer having used it before. They lost about a week to getting up and running with it (one engineer's only complaint was how tied in to Eclipse it was), and then spent the res of the month working on and modelling the whole task.

It found problems. A whole bunch of them. The fixed the model until finally TLA+ gave them an all clear.

Then came the coding. Well... that didn't take very long at all. The TLA+ model effectively outlined all the code and methods for them. The actual programming ended up being almost a cookie cutter simple code.

In total, the new and complicated component went from drawing board to tested and ready for production in about 2 months. Despite having had to learn TLA+, it ended up taking less time than if they'd not written the TLA+ models in the first place.

nickpsecurity · on April 11, 2019

Ive read a lot of industrial use of formal methods. If the method is easy, then almost all of them say what you said about the coding phase. I'll add that competing methods have code generators: Abstract, State Machines and B Method. ASM's look a lot like PlusCal with usability for about anything. Asmeta is one of tools with code generator. I think most doing it for B commercially use Atelier B and ProB. ProB does TLA+, too.

Now, far as TLA+-like stuff, tell them to check out SPIN model checker. It's seen more commercial use than most tools. Outperformed TLA+ in one paper but dont know if that extrapolates. Port some of your stuff to it to see what happens.

baq · on April 11, 2019

Lamport claims on his website that Amazon uses TLA+ and has used it for quite a while now. Saying that to confirm plausibility of this story.

Twirrim · on April 12, 2019

Amazon/AWS has published a few white papers about their use of TLA+ and other Formal Methods:http://lamport.azurewebsites.net/tla/formal-methods-amazon.p....

James Hamilton is a fan of TLA+, and talks about its use in Amazon: https://perspectives.mvdirona.com/2014/07/challenges-in-desi...

Tomte · on April 11, 2019

https://lamport.azurewebsites.net/tla/formal-methods-amazon....

Ixiaus · on April 11, 2019

Hillel Wayne[1] has written a bunch about TLA+ for lay programmers like me. I've found his blog posts pretty good, I also bought his book Practical TLA+ which I like. He also wrote the free web book Learn TLA+[2].

[1]: https://www.hillelwayne.com/post/using-formal-methods/ [2]: https://learntla.com/introduction/

lintuxvi · on April 11, 2019

He also runs workshops for teams wanting to incorporate this into their toolset.

tluyben2 · on April 11, 2019

I am hoping to find the time to write something, preferably a (free) book; I think I might find the time this summer for that.

Tomte · on April 11, 2019

When you decide to start, collect mail adresses of interested readers. Don't just assume people will stumble upon your book, seek out readers long before the book is finished.

I recently told this on Twitter to an Academic who had just finished writing a book on Cryptography in middle-age Venice (how cool is that?), but with publication still many months off.

She had never thought about collecting people's mail addresses, even though dozens (if not hundreds) of people replied on Twitter that they'd be interested in reading the book.

detaro · on April 11, 2019

> a book on Cryptography in middle-age Venice (how cool is that?), but with publication still many months off.

Sounds interesting, link?

Tomte · on April 11, 2019

Erm, not Middle Ages, Rennaissance, of course!

https://www.amazon.de/Venices-Secret-Service-Intelligence-Re...

victor106 · on April 11, 2019

Anyone here can shed some light on how to incorporate tla+ into an agile process?

pron · on April 11, 2019

Augmenting Agile with Formal Methods: https://www.hillelwayne.com/post/augmenting-agile/

usgroup · on April 11, 2019

Can anyone advise on whether it’s worth modelling general critical systems in TLA+ where they are not distributed?

In broad strokes, how does one make sure that your real system indeed behaves like your model?

pron · on April 11, 2019

> Can anyone advise on whether it’s worth modelling general critical systems in TLA+ where they are not distributed?

Absolutely, and not just critical systems. Any system that is either complex or may have some non-obvious subtleties can benefit from a specification.

> In broad strokes, how does one make sure that your real system indeed behaves like your model?

In general, TLA+ can be used to specify very large and very complex systems. It is currently infeasible to mechanically verify with absolute certainty that systems of such size conform to a specification, regardless of tools used. The only systems that can be verified to such an extreme extent (called end-to-end verification, namely interesting global properties are verified all the way down to the code, and even to the machine-code level) are very, very small (no more than about 10KLOC), and even then require a tremendous amount of effort by experts.

Specifications should be relatively short and clear. They are therefore useful for stating your assumptions about the system, and then checking the consequences of those assumptions. Whether the assumptions are accurate, approximate or wrong can then be verified by inspection -- this is certainly feasible and commonly done in practice. There are also relatively cheap mechanical ways to check that the system conforms to the spec, but not with absolute certainty. One is called trace checking, and its possible use with TLA+ is described here: https://pron.github.io/files/Trace.pdf

apta · on April 11, 2019

How common is it to use TLA+ on existing/pre-built systems, to retrospectively try to find issues or bugs in them?

pron · on April 11, 2019

It's hard to answer because using TLA+ in general is far from common. But among users, I would say that most cases are where they have an existing system. That was certainly how I first used TLA+ (when I realized there could be deep design issues with a system I worked on; TLA+ showed me that there indeed were two serious problems, and it also verified that my fix was correct).

hwayne · on April 11, 2019

In my experience this is how a lot of people practice learning TLA+: taking systems they already know and applying TLA+ to them. The advantage of this is you already know the system pretty well and can use that to check you modeled it correctly.

Also, it finds serious bugs scarily often. I've regularly received messages from people like "Yeah I tried modeling the system and it turns out we need to throw out six months of work due to a requirement bug".

I wrote more about this here: https://www.hillelwayne.com/post/using-formal-methods/

tunesmith · on April 11, 2019

It seems TLA+ would be a really great fit for modeling actor systems, has anyone found good examples of this, for instance for Akka or Erlang?

polskibus · on April 11, 2019

Akka has testkit and multi node testkit which you can use to test such systems. If you need to explore state space, you can use data driven and or combinatorial testing attributes from *unit libraries.

pron · on April 12, 2019

TLA+ and any kind of testing serve different purposes. I don't know how good of an analogy this is, but tests are more like strength tests on a structure, while a TLA+ is like a blueprint.

pron · on April 12, 2019

It's very easy to specify such systems in TLA+. I think this becomes apparent as you learn it.