I've seen the guts of a few major financial organizations, and there are some common themes regarding their infrastructures.
The one that really stands out to me as an engineer is that the whole system, in most cases, is tied together by a fragile arrangement of 100+ different vendors' systems & middleware, each tailored to fit specific audit items that cropped up over the years.
Individually, all of these components come with high-availability assurances up and down their contracts, but combine all these durable components haphazardly and you get emergent properties that no one person or vendor can account for comprehensively.
When the article says a full reset entails killing the power and restarting, this is my actual experience. These complex leviathans have to be brought up in a special snowflake sequence or your infra state machine gets fucked up and you have to start all over. When dependency chains are 10+ systems long and part of a complex web of other dependency chains it starts to get hopeless pretty quickly.
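That "special snowflake sequence" is really just a topological ordering over the dependency graph, which is why it gets hopeless as chains grow. A toy sketch with Python's stdlib (the system names are made up for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be up before it.
boot_deps = {
    "core-network": [],
    "auth": ["core-network"],
    "message-bus": ["core-network"],
    "ledger-db": ["auth"],
    "payments": ["ledger-db", "message-bus"],
}

# static_order() yields a valid bring-up sequence, or raises CycleError
# if the dependency web has a cycle (at which point no sequence exists).
order = list(TopologicalSorter(boot_deps).static_order())
print(order)
```

The catch in real infrastructures is that nobody has this map written down; it lives in tribal knowledge spread across 100+ vendors, and any cycle means there is no clean bring-up order at all.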
I’ve worked in a lot of banks, and other similar organisations, and the truth is that big enterprise just sucks at compliance. You can approach compliance frameworks in two ways: you can put a lot of effort into designing your infrastructure and compliance-testing methodology (I would suggest Amazon as a canonical example of how to do this well) so that it performs well and meets all the requirements, or you can take out the sledgehammer, implement a new control for every occasion, and create a cumbersome bureaucracy.
The good design approach is obviously superior in many ways. But the downside of it is that you have to trust the competency of a lot of different business units to maintain it. A cumbersome bureaucracy on the other hand ensures that incompetent/lazy/low-initiative internal actors can’t impact your compliance. If you fail, at least you fail in a compliant and auditor-approved way.
That said, a lot of the failures I’ve seen in organisations like this stem from siloed expertise. People don’t know that much about the systems outside their remit, so they make changes that impact connected systems in ways they failed to imagine. As an example, I have seen 3 separate banks have non-trivial service disruptions stem from the same independently made mistake: someone enabling debug logging on the SIP phones. The traffic DoSes their networks, and all of a sudden core network infrastructure starts to die. Afterwards they send the right reports off to the right people, make the correct adjustments to the bureaucracy, and proceed with their compliance intact.
There is also a 3rd way to approach compliance, via negligence. But the more you are in the regulatory spotlight, the less of an option that is.
> that were each tailored to fit specific audit items that cropped up over the years.
Or worse, "compliance" line items that some tool or some company identified in their cookie-cutter processes. As long as that line item goes away, no one really cares what the long-term implications are.
Yup. This is why you're seeing rapid adoption of event-based architectures. You can more confidently surmise the larger system state in a well-designed system.
There are so many trade-offs within this statement that I felt it deserved some color -
* Spinning up an event-based architecture is prone to the same issues GP describes. For example, what if you spin up a 'pub' side without the corresponding 'sub' side?
* Event-based architecture does not inherently give you better observability of the whole system. I would argue it's actually worse to begin with, because of the decoupled nature of the services: you have to build tooling to monitor traces across services or cross-system health.
* Any 'well-designed' system will solve these problems. A monolith with good tooling and monitoring will be more easily understood than a poorly designed event-based system.
* I think metrics and monitoring are really the key to understanding a system, regardless of the architecture.
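To make the first bullet concrete, here's a minimal sketch of a hypothetical in-memory broker (not any real library) where publishing with no subscriber "succeeds" and the event just vanishes unless you explicitly monitor for it:

```python
from collections import defaultdict

class TinyBroker:
    """Toy pub/sub broker: publish succeeds whether or not anyone subscribed."""
    def __init__(self):
        self.subs = defaultdict(list)
        self.dropped = []  # events nobody was listening for

    def subscribe(self, topic, handler):
        self.subs[topic].append(handler)

    def publish(self, topic, event):
        handlers = self.subs[topic]
        if not handlers:
            # Nothing errors out; the event silently goes nowhere.
            self.dropped.append((topic, event))
        for h in handlers:
            h(event)

broker = TinyBroker()
broker.publish("trades", {"id": 1})  # 'sub' side never spun up: silently dropped
print(len(broker.dropped))  # 1
```

The decoupling that makes pub/sub attractive is exactly what makes this failure invisible without dedicated tooling.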
Basically, I think you can remove your first sentence and keep the last to make a truer statement. A great system should have design considerations and tools that make it understandable at various scopes, by various people, of various skill levels. I think people have falsely conflated event-based architecture with system health because adopting it usually means rewriting a system almost from scratch, which tends to produce better tooling, since observability is something you're actively thinking about during the rewrite.
This. Event-based systems do not satisfy the "passive" part of active/passive dist-sys design - ergo, they are not fault tolerant by themselves, and most of the machinery designed to make them quasi-fault-tolerant tends to be a bigger headache in the long run (see Apache Kafka).
I've had to explain this to a dozen teams so far in my career and most of them go ahead with the design anyway, regretting it not even 6 months later.
What's so bad about Kafka? I've only ever used mature Kafka systems that were already in place, so I don't know what the teething issues are from an operational or development perspective.
Kafka persists events locally, which when mishandled can cause synchronization issues. If an event-based system has to cold-restart, it becomes difficult if not impossible to determine which events must be carried out again in order to restart processes that were in progress when the system went down.
This is a characteristic of all event-based systems, but persistence-enabled event systems (such as Kafka) make it even harder, because now there are events already "in flight" that have to be taken into account. Event-based systems without persistence (i.e. message queues used purely as a transport mechanism) have a strong guarantee that _no_ events will be in flight on a cold start, so you have an easier time figuring out the overall state of the system in order to make such decisions.
The only other way around this is to make every possible consumer of the event-based system strongly idempotent, which (in most of the problem spaces I've worked in) is a pipe dream; a large portion certainly can be idempotent, but it's very hard to have a completely idempotent system. Keep in mind that anything with an element of time tends not to be idempotent, and since event systems inherently have the element of time made available to them (queueing), idempotency becomes even harder with event-based systems.
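For the avoidance of doubt, "strongly idempotent" here means something like this sketch: a consumer that dedupes by event id so redelivered events are applied exactly once. Note the catch: the `seen` set is itself state that must survive restarts, which circles right back to the persistence problem.

```python
class IdempotentConsumer:
    """Toy consumer that dedupes by event id, so replays are applied once."""
    def __init__(self):
        self.seen = set()   # must itself be durably persisted in a real system
        self.balance = 0

    def handle(self, event):
        if event["id"] in self.seen:
            return  # already applied; a redelivery is harmless
        self.seen.add(event["id"])
        self.balance += event["amount"]

c = IdempotentConsumer()
for e in [{"id": "a", "amount": 10},
          {"id": "a", "amount": 10},   # redelivered after a cold restart
          {"id": "b", "amount": 5}]:
    c.handle(e)
print(c.balance)  # 15, not 25
```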
A rule of thumb when I am designing systems is that a data point should only have one point of persistence ("persistence" here means having a lifetime that extends beyond the uptime of the system itself). Perhaps you have multiple databases, but those databases should not have redundant (overlapping) points of information. This is the same spirit of "source of truth", but that term tends to imply a single source of truth, which isn't inherently necessary (though in many cases, very much desirable).
Kafka, and message queues or caches like it (e.g. Redis with persistence turned on), break this guarantee - if the persistence isn't perfectly synchronized, you have, essentially, two points of persistence for the same piece of information, which can (and does) cause synchronization issues, leading you into the famously treacherous territory of cache invalidation problems.
As with most technologies, you can reduce your usage of them to a point that they will work for you with reasonable guarantees - at which point, however, you're probably better off using a simpler technology altogether.
>Basically, I think you can remove your first sentence and leave the last to make a truer statement.
+1, not sure how event-driven arch would have mitigated here. A well-designed system (with redundancy where necessary, better tooling, monitoring, scalability, etc.) would.
So the sub side boots up, says it is ready, accepts an event, then power goes down.
You reboot the sub side. The event never gets processed because pub already sent it and recorded this fact.
Or, the sub side doesn't reboot, but the pub side does. The pub side accepts an event for publishing, sends it to the sub side, and promptly loses power.
The pub side reboots and either it resends the event and the sub side receives the event twice (because the pub side didn't record that it had already sent it before power was lost), or it doesn't resend the event and the sub side never receives it (because the power loss killed the network link while the packet was on its way out).
If you think you can make these and other corner cases go away with a simple bit of acknowledging here and there, good luck!
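The second corner case (resend-after-crash) fits in a few lines. In this sketch the delivery and the ack both succeed, but the pub crashes before durably recording the ack, so on restart the only safe move is to resend:

```python
# At-least-once delivery: the pub resends until it has durably recorded an
# ack. If that record is lost to a crash, the sub sees the event twice.
class Sub:
    def __init__(self):
        self.received = []
    def deliver(self, event):
        self.received.append(event)
        return "ack"

sub = Sub()
event = {"id": 1}

sub.deliver(event)        # delivered; pub crashes before persisting the ack
sub.deliver(event)        # pub restarts and, to be safe, resends

print(len(sub.received))  # 2 - the sub must dedupe, or it applies the event twice
```

Choosing not to resend instead just flips you into the other corner case, where the event may never arrive at all: that's the at-least-once vs at-most-once trade-off, and a "simple bit of acknowledging" doesn't escape it.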
The corner cases can be solved, but it's not half as simple as "wait in a queue and be consumed when ready".
I think that's just unfair. A synchronous system finishes a task, tries to change state, the power goes out, and the system is in an inconsistent state. If you think you can fix this with a couple of write-ahead logs and a consistency checker, good luck! That's how you sound.
> If you think you can fix this with a couple of write-ahead logs and a consistency checker, good luck! That's how you sound.
I'm literally saying it can't be fixed with a simple solution, in the parent comment and other comments, so I'm not sure where you get the idea that I'm saying it can.
Dealing with inconsistent states from failures in a distributed system is solvable but it's not simple unfortunately. It's not even simple to describe why.
All 100 services should have end-to-end integration testing, and any change made to that chain of tooling should have to run through a massive integration test. If anything fails, the change is no longer acceptable.
You can comment this on any outage that ever happens. "Why didn't they have a test for that?"
The answer is that tests are never perfect. If you want to create an integration environment that mimics prod, you have to fork an entire parallel universe into your integ environment to run the test. Anything else will diverge from the reality of the future.
Even if every vendor's service or hardware had integration tests, that doesn't mean that the integration tests covered every case. It doesn't mean there's not an emergent property of two systems behaving in a slightly unexpected way that turns into a catastrophic result.
It's not necessarily even possible to have two copies of some of the systems; who knows how expensive a given vendor's hardware box is.
It's definitely not possible to exactly mimic future traffic. Perhaps it works in the test environment, but then in the prod environment the requests are different, so it fails.
Hardware errors happen, and integ testing those is difficult to say the least.