This was a good read. But the question that I kept waiting to be asked was "What do I need to monitor?"
In my opinion, the way to avoid many of the problems and complexities that the author lists is to start with the goals. What are the business/mission goals? What is the workload (not the monitoring tool) trying to accomplish? What does the thing do?
Once you have that, you can go about selecting the right tool for the job. The author notes that there's no consensus on what logs are for, and that the useful signal in metrics gets buried in noise. But when you start with a goal and then determine a KPI that measures it, you can ascertain whether a log, metric, trace, or some combination thereof reflects that KPI. And you only need to create the ones needed to measure it. The scope of the problem becomes far more manageable.
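To make that concrete, here's a toy sketch in Python. Everything in it (the goal, the KPI name, the signal, the target) is invented for illustration, not taken from the article:

    # Hypothetical sketch: goal -> KPI -> the one signal needed to measure it.
    # All names and thresholds here are made up.
    from dataclasses import dataclass

    @dataclass
    class Kpi:
        goal: str    # the business goal this KPI measures
        name: str    # e.g. "checkout_success_rate"
        signal: str  # the single metric/log/trace that feeds it
        target: float  # what "healthy" means

    kpis = [
        Kpi(goal="customers can complete a purchase",
            name="checkout_success_rate",
            signal="metric: checkout_failures / checkout_attempts",
            target=0.995),
    ]
    # Anything that doesn't feed a KPI doesn't need to be built yet.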
It may take a couple of tries to get right, and a user may find that they need to add more over time. But the monitoring doesn't need to change often unless the deliverables or architecture of the workload do. And creating those signals should be part of the development process.
By starting with goals and working from there, monitoring becomes a manageable task. The myriad other logs and metrics you generate can be safely stored and kept quiet until they're needed to troubleshoot or review an incident. That falls under observability.
Observability is something you have.
Monitoring is something you do.
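A tiny sketch of that split, assuming a Python service (the names are made up): everything gets recorded somewhere cheap and quiet, but only the KPI check is allowed to page anyone.

    # Observability: record it all. Monitoring: alert on the KPI only.
    # Names and the target value are invented for illustration.
    import logging

    logging.basicConfig(filename="app.log", level=logging.DEBUG)
    log = logging.getLogger("app")

    log.debug("cache miss for user 42")  # stored, silent, there when you need it

    def maybe_alert(checkout_success_rate: float, target: float = 0.995) -> None:
        if checkout_success_rate < target:
            log.critical("KPI breach: checkout_success_rate=%.4f",
                         checkout_success_rate)
            # this is the only place that should page the on-call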
I haven't addressed the difficulties the author describes in building various traceability mechanisms, parsing all of these logs from various platforms, or some of the other technical challenges he states. There are usually solutions for these (usually...), some easier than others. But by narrowing the scope to what's needed to achieve your given insights, the problem set becomes only what's required for that aim, rather than trying to build for every conceivable scenario.
I think it's worth looking at those decisions as wanting to be made both top-down -and- bottom-up.
Top-down as in "start from what makes the business a viable business and then analyze downwards from there" which gets you things like user-facing site/API availability, performance, error rates etc. (and maybe things like per-user usage rates depending on what you're doing and what your business model is).
Bottom-up as in "start from what makes the infrastructure exist at all and work up from there" which gets you things like disk usage, RAM, CPU, network link saturation - all the low level stuff that won't affect your top-down metrics until it does, at which point everything will catch fire at once.
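A rough sketch of the two ends in Python (the thresholds are arbitrary examples, and psutil is just one convenient way to read the host numbers):

    # Top-down and bottom-up checks, sketched. Thresholds are made up.
    import psutil  # third-party; pip install psutil

    def top_down_ok(requests_total: int, requests_failed: int) -> bool:
        # Top-down: does the user-facing error rate stay under 1%?
        if requests_total == 0:
            return True
        return requests_failed / requests_total < 0.01

    def bottom_up_ok() -> bool:
        # Bottom-up: is the box itself about to catch fire?
        return (psutil.cpu_percent(interval=1) < 90
                and psutil.virtual_memory().percent < 90
                and psutil.disk_usage("/").percent < 90)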
They'll hopefully meet somewhere in the middle in a way that makes sense. You could perhaps argue for per-internal-service monitoring as a sort of middle-outwards approach, but I suspect the highest- and lowest-level checks are the most useful ones to start with. From there you extend as you get a feel for which situations cause those to fire, and start monitoring the mid-range of the '5 whys' rather than just 1 and 5.
(I'm not sure I've made this as clear as I wanted but such is the peril of waxing philosophical about things)
I did this stuff professionally for about a decade... the short version is: consider the nodes of the graph. Stuff breaks because the front end can't reach the back end, the back end can't take a message off the queue, the database stopped accepting connections, etc. I.e., most apps of any size or complexity rely on other systems or infrastructure components, and these become the things that break, time out, don't scale, and so on. That's a good starting point.
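Something like this, as a minimal sketch; the hosts, ports, and edge names are placeholders for whatever your dependency graph actually looks like:

    # Sketch: treat each edge of the dependency graph as a check.
    # Hosts/ports below are placeholders for your own backend, queue, DB.
    import socket

    EDGES = {
        "frontend -> backend": ("backend.internal", 8080),
        "backend -> queue":    ("queue.internal", 5672),
        "backend -> database": ("db.internal", 5432),
    }

    def edge_up(host: str, port: int, timeout: float = 2.0) -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:  # refused, unreachable, or timed out
            return False

    for edge, (host, port) in EDGES.items():
        print(edge, "OK" if edge_up(host, port) else "DOWN")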