I work for an acquired startup that tried to solve this problem.
It’s been around 8 years and we’re up to 50 or so people. I’d say we are okay at it.
We haven’t gotten fundamentally better over time recently, it’s more like there is some asymptote of how much you can really tell with a certain amount of insight into the systems between source and destination.
The only real progress we’ve made has been integrating with more and more sources of information about the state of the network.
Yes, and they partially have. Browsers are great at telling you where the chain has failed/ been cut, though some error messages seem to be intentionally uninformative as provided information would be meaningless to your average user.
That said, from an enthusiast perspective, running traceroute to the nearest google service (1e100.net for example) will already give you a huge tip on where things went wrong.
I regularly run `mtr 1.1` for monitoring network condition. One of its display modes gives you a 3D view: x-axis is time, y-axis is the hops, and each cell’s colour and character indicates how long the ping took (or if it got no response). This is frequently very valuable at identifying where a problem is, which is generally one of these three: between computer and router, router and ISP, ISP and public internet. It can show also where packet loss or latency jumps are occurring, and patterns where something goes wrong for a few seconds so that you can determine where the problem is (this is where the time axis is crucial).
One thing that becomes apparent when you monitor diverse ISPs and endpoints this way is the inconsistency: in a normally-functioning situation, although most hops will have 0% loss, some will have absolutely any value from 0%–100%. The network I’m on at present has ten hops from _gateway to one.one.one.one; hop five is 100% loss, hop six varies around 40–50% loss, hop seven is about 60–62% loss, the rest are all 0% loss. It does host name lookup as well which can be a little bit useful for figuring out what’s probably local, probably ISP and probably public internet, but the boundaries are often a bit fuzzy.
1.1: short spelling of 1.0.0.1, the second address for Cloudflare’s 1.1.1.1 DNS server.
You can switch between the display modes with the d key, or start in this mode with MTR_OPTIONS=--displaymode=2 in the environment (which is how I do it, as it’s almost always what I want; if it weren’t, I’d probably make some kind of alias for `mtr --displaymode=2 1.1` instead).
> some will have absolutely any value from 0%–100%.
Seeing packet loss in mtr is not entirely indicative of the health of the host.
Some public servers filter out ICMP all together, and others add a firewall traffic shaping limit to the number of pings they reply to. You might be seeing that.