I agree with the speaker -- Andy Gross, Principal Architect at Basho, makers of Riak. As he explains in this talk, we're building ever more applications in and for distributed computing environments, but we lack well-documented, standardized, reliable, OS-level tools for:
* managing consensus among many nodes (as he says, "where is my libPaxos?");
* testing distributed code (for example, to see how the system behaves under unusual/unexpected loads or scheduling scenarios); and
* devops (for example, monitoring and managing throughput across all nodes to avoid 'TCP incast'-type problems).
I agree with Andy that having a pluggable consensus service in the operating system would be an interesting avenue for reducing complexity for application developers.
However, I am much more familiar with those algorithms than I am with kernel development. The benefits are clear, but I'm not convinced this is a good idea. It might be best to implement them as user-land services for now, until there is a better sense of how these things should work.
The problem with Paxos, for example, is that it is actually a family of protocols (the single-decree core is known as Synod). The reason for its complexity is that it tries to present a total order of all messages to all replicas, despite operating in an environment where the network cannot guarantee message ordering and processes can come and go. It works well, but the implementations that make it work are notoriously difficult to describe.
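To make the flavor of that complexity concrete, here is a minimal sketch of just the acceptor side of single-decree Synod (the prepare/accept phases), in Python. This is an illustration only, not a usable implementation: the class and method names are my own, and everything hard about Paxos (quorums, retries, leader election, persistence across crashes) is left out.

```python
class Acceptor:
    """Acceptor role in single-decree Synod (sketch, not production code)."""

    def __init__(self):
        self.promised = -1    # highest ballot number we have promised
        self.accepted = None  # (ballot, value) we last accepted, if any

    def prepare(self, ballot):
        # Phase 1: promise not to accept anything below `ballot`,
        # and report any value we already accepted so the proposer
        # can adopt it instead of its own.
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", ballot, self.accepted)
        return ("reject", self.promised)

    def accept(self, ballot, value):
        # Phase 2: accept the value unless we promised a higher ballot.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return ("accepted", ballot)
        return ("reject", self.promised)
```

Even this toy version hints at the difficulty: correctness depends on every acceptor durably remembering `promised` and `accepted` across failures, and the real protocol layers quorum counting and re-proposal logic on top of these two handlers.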
Conversely there are algorithms derived from a model of eventual-consistency where assumptions about the order of network delivery are not made and restrictions are instead placed on the programs (as far as I've seen). As long as you write your state transition functions with certain identities and use appropriate data structures to represent your state you can guarantee a partial ordering for messages and an eventually-consistent state (the only assumption is that all messages are delivered to the replica at some point).
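A tiny example of the kind of "identities" being described: if your merge function is commutative, associative, and idempotent, replicas converge no matter what order (or how many times) messages arrive. A grow-only set is the simplest data structure with this property. This sketch uses names of my own invention, purely for illustration:

```python
class GSet:
    """Grow-only set: state can only accumulate, never shrink."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def add(self, x):
        return GSet(self.items | {x})

    def merge(self, other):
        # Set union is commutative, associative, and idempotent,
        # so delivery order and duplicate delivery don't matter.
        return GSet(self.items | other.items)
```

The restriction the author mentions is visible here: you give up the ability to remove elements (that would break idempotent merging) in exchange for order-insensitivity.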
(Interestingly I've seen a bit of research into tuning the bounds of when things will be consistent in eventually-consistent systems).
The reason I think that building these services into the kernel could be a bad idea is that they are still not battle-hardened and they are not well understood yet. I am not a kernel developer but I imagine it would be rather difficult to design and support a "consistent" API that would allow users to plug in their desired algorithm. Thoughts?
> As long as you write your state transition functions with certain identities and use appropriate data structures to represent your state you can guarantee a partial ordering for messages and an eventually-consistent state (the only assumption is that all messages are delivered to the replica at some point).
CRDTs are one such approach. Commutativity solves the problem of ordering in a very different way, and turns the consensus problem into a "set membership" rather than a "membership and order" problem. CRDTs have some really nice properties, and are a very active area of research.
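As a concrete illustration of the point above, here is a sketch of a G-Counter, one of the simplest CRDTs: each node increments only its own slot, and merging takes the per-node maximum, so merges commute and replicas converge. Names are my own; a real implementation (e.g. Riak's counters) handles node identity and garbage collection more carefully:

```python
class GCounter:
    """Grow-only counter CRDT: one monotonic slot per node."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, node):
        # Each node only ever bumps its own entry.
        c = dict(self.counts)
        c[node] = c.get(node, 0) + 1
        return GCounter(c)

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-node max is commutative, associative, and idempotent:
        # the "set membership" framing mentioned above.
        keys = set(self.counts) | set(other.counts)
        return GCounter({k: max(self.counts.get(k, 0),
                                other.counts.get(k, 0)) for k in keys})
```

Because merge order doesn't matter, two replicas that have seen different subsets of increments agree on the total once they exchange state.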
We already have well-documented, reliable tools for managing where a job is processed, testing various aspects of a distributed job, and monitoring them. This is because distributed computing (and distributed systems by extension) has been used for decades by research institutes, governments, and corporations. But most people on HN don't know about these tools because they weren't developed on GitHub by Silicon Valley start-ups. (Sorry if that sounds condescending.)
What does OS-level mean, by the way? In the kernel? A userland tool installed by default by most Unix-like operating systems? Something else?
As an aside, he seems to be asserting that you need to implement Paxos to have consensus, or that you need consensus for a distributed system to work. But your application should determine your system's design, not the other way around. For example, part of the impetus behind Paxos is asynchronous algorithms. But what critical distributed systems do you know of that are asynchronous? If someone's life were on the line, would you rather a synchronous or an asynchronous process be used? How about a billion-dollar financial transaction?
Some more fun aspects of Paxos: asynchronous distributed systems cannot have fixed timing, and therefore cannot be reliable real-time systems. And Paxos depends on the absence of Byzantine events, which we all know is ridiculous in real-world applications; Byzantine events happen all the time! From attacks by blackhats, to admins misconfiguring services, to servers literally catching on fire (causing unexpected corruption of data before the CRC wrapper is applied), shit happens.
But truly synchronous systems are a bitch and a half to build and maintain, so most people settle for partially-synchronous systems, and then call them either synchronous or asynchronous depending on what sounds better to their customers. Anyway, it's not realistic to just focus on Paxos.
It's what I call "Hacker News Driven Development": When faced with a distributed systems problem, the temptation is to pile more immature code on top of it.