This was a good read. But the question that I kept waiting to be asked was "What do I need to monitor?"
In my opinion, the way to avoid many of the problems and complexities that the author lists is to start with the goals. What are the business/mission goals? What is the workload (not the monitoring tool) trying to accomplish? What does the thing do?
Once you have that, you can go about selecting the right tool for the job. The author notes that there's no consensus on what logs are for, and that the useful signal in metrics gets buried in noise. But when you start with a goal and then determine a KPI that measures it, you can ascertain whether a log, metric, trace, or some combination thereof reflects that KPI. And you only need to create the ones needed to measure it. The scope of the problem becomes far more manageable.
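To make that concrete, here's a toy sketch in Python. Everything in it (the goal, the KPI name, the signal, the target) is invented for illustration, not taken from the article:

    # Hypothetical sketch: goal -> KPI -> the one signal needed to measure it.
    # All names and thresholds here are made up.
    from dataclasses import dataclass

    @dataclass
    class Kpi:
        goal: str    # the business goal this KPI measures
        name: str    # e.g. "checkout_success_rate"
        signal: str  # the single metric/log/trace that feeds it
        target: float  # what "healthy" means

    kpis = [
        Kpi(goal="customers can complete a purchase",
            name="checkout_success_rate",
            signal="metric: checkout_failures / checkout_attempts",
            target=0.995),
    ]
    # Anything that doesn't feed a KPI doesn't need to be built yet.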
It may take a couple of tries to get right, and a user may find that they need to add more over time. But the monitoring doesn't need to change often unless the deliverables or architecture of the workload do. And creating those signals should be part of the development process.
By starting with goals and working from there, monitoring becomes a manageable task. The myriad other logs and metrics you generate can be safely stored and kept quiet until they're needed to troubleshoot or review an incident. That falls under observability.
Observability is something you have.
Monitoring is something you do.
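A tiny sketch of that split, assuming a Python service (the names are made up): everything gets recorded somewhere cheap and quiet, but only the KPI check is allowed to page anyone.

    # Observability: record it all. Monitoring: alert on the KPI only.
    # Names and the target value are invented for illustration.
    import logging

    logging.basicConfig(filename="app.log", level=logging.DEBUG)
    log = logging.getLogger("app")

    log.debug("cache miss for user 42")  # stored, silent, there when you need it

    def maybe_alert(checkout_success_rate: float, target: float = 0.995) -> None:
        if checkout_success_rate < target:
            log.critical("KPI breach: checkout_success_rate=%.4f",
                         checkout_success_rate)
            # this is the only place that should page the on-call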
I haven't addressed the difficulties the author describes in building various traceability mechanisms, parsing all of these logs from various platforms, or some of the other technical challenges he states. There are usually solutions for these (usually...), some easier than others. But by narrowing the scope to what's needed to achieve your given insights, the problem set becomes only what's required for that aim, rather than trying to build for every conceivable scenario.
I think it's worth looking at those decisions as wanting to be made both top-down -and- bottom-up.
Top-down as in "start from what makes the business a viable business and then analyze downwards from there" which gets you things like user-facing site/API availability, performance, error rates etc. (and maybe things like per-user usage rates depending on what you're doing and what your business model is).
Bottom-up as in "start from what makes the infrastructure exist at all and work up from there" which gets you things like disk usage, RAM, CPU, network link saturation - all the low level stuff that won't affect your top-down metrics until it does, at which point everything will catch fire at once.
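A rough sketch of the two ends in Python (the thresholds are arbitrary examples, and psutil is just one convenient way to read the host numbers):

    # Top-down and bottom-up checks, sketched. Thresholds are made up.
    import psutil  # third-party; pip install psutil

    def top_down_ok(requests_total: int, requests_failed: int) -> bool:
        # Top-down: does the user-facing error rate stay under 1%?
        if requests_total == 0:
            return True
        return requests_failed / requests_total < 0.01

    def bottom_up_ok() -> bool:
        # Bottom-up: is the box itself about to catch fire?
        return (psutil.cpu_percent(interval=1) < 90
                and psutil.virtual_memory().percent < 90
                and psutil.disk_usage("/").percent < 90)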
They'll hopefully meet somewhere in the middle in a way that makes sense. You could perhaps argue for per-internal-service monitoring as a sort of middle-outwards approach, but I suspect the highest- and lowest-level checks are the most useful ones to start with. From there you extend as you get a feel for which situations cause those to fire, and start monitoring the mid-range of the '5 whys' rather than just 1 and 5.
(I'm not sure I've made this as clear as I wanted but such is the peril of waxing philosophical about things)
I did this stuff professionally for about a decade... the short version is: consider the nodes of the graph. Stuff breaks because the front end can't reach the back end, the back end can't take a message off the queue, the database stopped accepting connections, etc. I.e., most apps of any size or complexity rely on other systems or infrastructure components, and these become the things that break, time out, don't scale, and so on. That's a good starting point.
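Something like this, as a minimal sketch; the hosts, ports, and edge names are placeholders for whatever your dependency graph actually looks like:

    # Sketch: treat each edge of the dependency graph as a check.
    # Hosts/ports below are placeholders for your own backend, queue, DB.
    import socket

    EDGES = {
        "frontend -> backend": ("backend.internal", 8080),
        "backend -> queue":    ("queue.internal", 5672),
        "backend -> database": ("db.internal", 5432),
    }

    def edge_up(host: str, port: int, timeout: float = 2.0) -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:  # refused, unreachable, or timed out
            return False

    for edge, (host, port) in EDGES.items():
        print(edge, "OK" if edge_up(host, port) else "DOWN")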