
There are so many trade-offs within this statement that I felt it deserved some color:

* Spinning up an event-based architecture is prone to the same issues GP describes. For example, what if you spin up a 'pub' side without the corresponding 'sub' side?

* Event-based architecture does not inherently give you better observability of the whole system. I would argue it's actually worse to begin with because of the decoupled nature of the services: you have to build tooling to monitor traces across services or cross-system health.

* Any well-designed system will solve these problems. A monolith with good tooling and monitoring will be more easily understood than a poorly designed event-based system.

* I think metrics and monitoring are really the key to knowing about a system, regardless of the architecture.

Basically, I think you can remove your first sentence and leave the last to make a truer statement. A great system should have design considerations and tools to make it understood at various scopes by various people of various skills. I think people have falsely conflated event-based architecture with system health because adopting it usually means rewriting a system almost from scratch, and a rewrite tends to produce better tooling simply because tooling is something you're actively thinking about at the time.



This. Event-based systems do not satisfy the "passive" part of active/passive dist-sys design - ergo, they are not fault tolerant by themselves, and most machinery designed around making them quasi-fault-tolerant tends to be a bigger headache in the long run (see Apache Kafka).

I've had to explain this to a dozen teams so far in my career and most of them go ahead with the design anyway, regretting it not even 6 months later.


What's so bad about Kafka? I've only ever used mature Kafka systems that were already in place, so I don't know what the teething issues are from an operational or development perspective.


Kafka persists events locally, which when mishandled can cause synchronization issues. If an event-based system has to cold-restart, it becomes difficult if not impossible to determine which events must be carried out again in order to restart processes that were in progress when the system went down.

This is a characteristic of all event-based systems, but persistence-enabled event systems (such as Kafka) make it even harder because now there are events already "in flight" that have to be taken into account. Event-based systems that do not have persistence (and thus are simply message queues used as a transport mechanism) have a strong guarantee that _no_ events will be in-flight on a cold-start, and thus you have an easier time figuring out the current overall state of the system in order to make such decisions.

The only other way around this is to make every possible consumer of the event-based system strongly idempotent, which (in most of the problem spaces I've worked in) is a pipe dream; a large portion certainly can be idempotent, but it's very hard to have a completely idempotent system. Keep in mind, anything with the element of time tends not to be idempotent, and since event systems inherently have the element of time made available to them (queueing), idempotency becomes even harder with event-based systems.
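To illustrate what "idempotent consumer" means in practice, here's a minimal sketch. The event shape and the `handle_payment` side effect are hypothetical; the point is just that replays of the same event id become no-ops (and note that the dedup set itself would need to be persisted, which reintroduces the persistence problem):

```python
# Sketch of an idempotent consumer: dedup by event id so that
# redelivered events become no-ops. handle_payment is a stand-in
# for whatever real side effect the consumer performs.

processed: set[str] = set()  # in production this must itself be persisted


def handle_payment(event: dict) -> None:
    """Hypothetical side effect."""
    pass


def consume(event: dict) -> bool:
    """Process an event at most once; return True if work was done."""
    if event["id"] in processed:
        return False  # duplicate delivery: skip
    handle_payment(event)
    processed.add(event["id"])
    return True


# Replaying the same event is harmless:
e = {"id": "evt-1", "amount": 10}
assert consume(e) is True   # first delivery does the work
assert consume(e) is False  # redelivery is a no-op
```

The catch, as above, is any handler whose effect depends on *when* it runs (e.g. "charge interest since the last event"): replaying it later is not the same operation, so dedup by id alone doesn't make it idempotent.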

A rule of thumb when I am designing systems is that a data point should only have one point of persistence ("persistence" here means having a lifetime that extends beyond the uptime of the system itself). Perhaps you have multiple databases, but those databases should not have redundant (overlapping) points of information. This is the same spirit of "source of truth", but that term tends to imply a single source of truth, which isn't inherently necessary (though in many cases, very much desirable).

Kafka, and message queues or caches like it (e.g. Redis with persistence turned on), breaks this guarantee - if the persistence isn't perfectly synchronized, then you have, essentially, two points of persistence for the same piece of information, which can (and does) cause synchronization issues, leading you into the famously treacherous territory of cache invalidation problems.
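To make the "two points of persistence" failure concrete, here's a toy sketch. Two dicts stand in for a database and a persisted cache that are supposed to mirror each other; a crash between the two writes leaves durable, conflicting state after a restart:

```python
# Toy illustration of two persistence points drifting apart.
# db and persisted_cache are stand-ins for any two durable stores
# that hold redundant copies of the same data point.

db: dict[str, int] = {}
persisted_cache: dict[str, int] = {}


def write(key: str, value: int, crash_between: bool = False) -> None:
    db[key] = value
    if crash_between:
        return  # simulated power loss before the cache write lands
    persisted_cache[key] = value


write("balance", 100)                       # both stores agree: 100
write("balance", 50, crash_between=True)    # crash mid-write

# After "reboot" both stores survived (they're persistent) but disagree,
# and nothing in either store tells you which one is right:
assert db["balance"] == 50
assert persisted_cache["balance"] == 100
```

With a non-persistent cache the reboot would simply wipe the stale copy; persistence is exactly what lets the disagreement outlive the crash.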

As with most technologies, you can reduce your usage of them to a point that they will work for you with reasonable guarantees - at which point, however, you're probably better off using a simpler technology altogether.


>Basically, I think you can remove your first sentence and leave the last to make a truer statement.

+1, not sure how event-driven arch would have mitigated here. A well-designed system (with redundancy where necessary, better tooling, monitoring, scalability, etc.) would.


>For example what if you spin up a 'pub' side without the corresponding 'sub' side?

Then the events will wait in a queue and be consumed when the sub side is ready.


So the sub side boots up, says it is ready, accepts an event, then power goes down.

You reboot the sub side. The event never gets processed because pub already sent it and recorded this fact.

Or, the sub side doesn't reboot, the pub side does. The pub side accepts an event for publishing, sends it to the sub side, and promptly loses power.

The pub side reboots and either it resends the event and the sub side receives the event twice (because the pub side didn't record that it had already sent it before power was lost), or it doesn't resend the event and the sub side never receives it (because the power loss killed the network link while the packet was on its way out).

If you think you can make these and other corner cases go away with a simple bit of acknowledging here and there, good luck!

The corner cases can be solved, but it's not half as simple as "wait in a queue and be consumed when ready".
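To make the two failure modes above concrete, here's a toy in-memory simulation (no real broker; a deque stands in for the pub side's record of unsent events). Acknowledging before processing loses the event on a crash; acknowledging after processing delivers it twice:

```python
from collections import deque


def deliver(events: list[str], ack_first: bool) -> list[str]:
    """Drain a toy broker, crashing exactly once at the worst moment."""
    broker = deque(events)
    seen: list[str] = []
    crashed = False
    while broker:
        event = broker[0]
        if ack_first:
            broker.popleft()        # broker records the event as sent (ack)
            if not crashed:
                crashed = True      # power loss before processing:
                continue            # the event is gone for good
            seen.append(event)      # process
        else:
            seen.append(event)      # process
            if not crashed:
                crashed = True      # power loss before the ack:
                continue            # the broker will resend it
            broker.popleft()        # ack
    return seen


assert deliver(["a", "b"], ack_first=True) == ["b"]             # "a" lost
assert deliver(["a", "b"], ack_first=False) == ["a", "a", "b"]  # "a" duplicated
```

This is the classic at-most-once vs. at-least-once trade-off: with an unreliable link and crashes you get to pick which failure mode you prefer, and exactly-once delivery requires idempotency or transactional machinery on top, not just "a simple bit of acknowledging here and there".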


I think that's just unfair. A synchronous system finishes a task, tries to change state, the power goes out, and the system is in an inconsistent state. "If you think you can fix this with a couple of write-ahead logs and a consistency checker, good luck!" That's how you sound.


> If you think you can fix this with a couple of write-ahead logs and a consistency checker, good luck! That's how you sound.

I'm literally saying it can't be fixed with a simple solution, in the parent comment and other comments, so I'm not sure where you get the idea that I'm saying it can.

Dealing with inconsistent states from failures in a distributed system is solvable but it's not simple unfortunately. It's not even simple to describe why.


I don't think you are reading the response correctly. Software is affected badly by power outages or sigkills whether you have queues or not.


Yea, that’s kinda the point of going event-based...



