> Why should the functioning of a browser be dependent on telemetry working?
It isn't. The bug was in the networking stack, and it just happened to be triggered by a GCP change which affected the telemetry service. Firefox having telemetry has nothing to do with the issue here.
That's not quite right. A single socket thread handles all the requests, and telemetry is multiplexed with user traffic. If telemetry differs in some way from other network traffic, then it's always possible for it to cause problems with user traffic.
Telemetry is different to user traffic - it's less important! - but of course any in-process QoS would still create a point of interaction with user traffic.
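To make that fate-sharing concrete, here's a toy Rust sketch (all names invented; nothing like Necko's actual event-driven C++ socket thread) of how a single thread servicing every request lets a bug on the telemetry path stall user traffic queued behind it:

    use std::sync::mpsc;
    use std::thread;
    use std::time::Duration;

    // Two kinds of traffic sharing one socket thread. Entirely a toy.
    enum Request {
        UserPage(&'static str),
        Telemetry,
    }

    fn main() {
        let (tx, rx) = mpsc::channel::<Request>();

        // A single thread services every request in order, so the two
        // classes of traffic share fate.
        let socket_thread = thread::spawn(move || {
            for req in rx {
                match req {
                    Request::UserPage(url) => println!("fetched {url}"),
                    // Simulate the telemetry path hitting a bug (here: a
                    // hang). Every user request queued behind it stalls.
                    Request::Telemetry => thread::sleep(Duration::from_secs(60)),
                }
            }
        });

        tx.send(Request::Telemetry).unwrap();
        // This user request waits out the full minute behind telemetry.
        tx.send(Request::UserPage("https://example.com")).unwrap();
        drop(tx);
        socket_thread.join().unwrap();
    }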
So you're saying that Firefox did not in fact have an outage due to a change in their telemetry servers? That's not what the article said.
I understand that you mean to say that it isn't intended for networking to be taken down by telemetry. That's nonetheless what happened, and it could have been prevented by treating telemetry as a different class of traffic (not collocating it with normal requests), or by not having it, as others point out.
So you're saying telemetry should be handled as a separate process that has nothing to do with the rest of the browser, and treated like a hostile service? Because that's the only way you'd have avoided this.
It's natural for all the network stuff that goes on inside a browser to share code. You can say what you want about telemetry (I'm not a huge fan, personally), but this was a dumb bug, and it is completely unreasonable to expect some kind of adversarial design "just in case a freak bug triggers on telemetry network requests".
> So you're saying telemetry should be handled as a separate process that has nothing to do with the rest of the browser, and treated like a hostile service? [... T]his was a dumb bug and it is completely unreasonable to expect some kind of adversarial design "just in case a freak bug triggers on telemetry network requests".
I absolutely agree that this is a dumb bug having little to nothing to do with telemetry. It is not even the first case-sensitivity HTTP/3 bug I’m personally encountering in the course of completely casual use[1]. Probably not the last, either; those joints ain’t gonna oil themselves.
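To illustrate the class of bug: HTTP/3 (RFC 9114) requires lowercase field names on the wire, so an encoder that trusts its callers to have lowercased already is one capitalized Content-Length away from trouble. A minimal Rust sketch of the defensive version (the helper name is invented, not the actual Necko/viaduct API):

    // Invented helper, not the actual Necko/viaduct API. HTTP/3
    // (RFC 9114) requires lowercase field names, so normalize at the
    // encoding boundary instead of trusting callers to do it.
    fn encode_field_name(name: &str) -> String {
        name.to_ascii_lowercase()
    }

    fn main() {
        // The article notes viaduct adds a "Content-Length" header;
        // normalized here, the capitalization becomes harmless.
        assert_eq!(encode_field_name("Content-Length"), "content-length");
    }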
At the same time, you know what? I’m glad you suggested this, because I certainly didn’t think of it. Yes, in an ideal world, telemetry absolutely should be a separate process (or thread, or at least not share an event loop—a separate “hang domain”, a vat[2] if you want). And so should everything else off the critical path.
I’m not saying Firefox is bad for doing it differently. I’m saying it’s silly that Firefox is forced to play OS to such an extent because the actual one isn’t up to its demands.
They're saying what is clearly explained in the article:
“This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.”
Yes, but the fact that telemetry is in place was the cause of the issue.
> So you're saying that Firefox did not in fact have an outage due to a change in their telemetry servers?
Not the telemetry code. Not the fact that it "could" happen elsewhere. But rather the fact that telemetry was in place, and that in this instance the outage happened because of it.
Not that it matters that much. Regardless of the particular cause, a browser failing to work because of something changing externally is crazy (at least to me), no matter how you look at it.
How do you reach that conclusion? From the article:
> It just so happens that Telemetry is currently the only Rust-based component in Firefox Desktop that uses the [viaduct/Necko] network stack and adds a Content-Length header. This is why users who disabled Telemetry would see this problem resolved ...
The article contradicts your conclusion. If Firefox did not have telemetry, the bug would have had no impact, and users would not have suffered an outage.
> ...even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.
And then the article contradicts itself and agrees with you using some heavy-duty doublethink. Sure, if there were hypothetically other Rust services using the buggy network stack, they'd also have hit the bug: BUT THERE ARE NONE. The bug was in code which is only running because it's used by the telemetry services, so even though it might be in a different semantic layer it's the fault of the browser trying to send telemetry.
As a user, I place very low (often negative) importance on the tools I use collecting telemetry data, or on protecting DRM content, or on checking licensing status. They should focus on doing the job I'm trying to do with them on my computer, serving the uses of the user, rather than doing something that someone else wants them to do. Sure, I understand that debugging and quality monitoring are easier with logs and maybe with telemetry, so I can understand using a few resources in the background to serve some of that data, but it must never get in the way of actual work getting done.
> The article contradicts your conclusion. If Firefox did not have telemetry, the bug would have had no impact, and users would not have suffered an outage.
This is your mistake: as explained in the article, it could have affected any component. Telemetry happened to hit it first but anything using HTTP/3 with that path would have been affected.
“This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.”
> ...as explained in the article, it could have affected any component. Telemetry happened to hit it first but anything using HTTP/3 with that path would have been affected.
Is this really relevant, though? To the users who were unable to use their browsers normally it doesn't matter that this problem could have occurred elsewhere as well, but rather that it did occur here in particular.
If particular sites broke, then that could be debugged separately, but as it stands even people who'd be perfectly fine browsing regular HTTP/1.1 or HTTP/2 sites were impacted, not due to opening a site that they wanted to visit themselves, but due to some background piece of functionality.
That's not to say that I think there shouldn't be telemetry in place, just that the poster is correct in saying that this wouldn't be such a high-visibility issue if there were no telemetry in place and thus no HTTP/3 apart from sites the user visits.
The comment I was replying to was worded in a way that tried to attribute blame to the telemetry service. As shown in this thread, there's a certain ideological position which welcomes any attack on telemetry, and I think that's a distraction from the technical discussion about how Mozilla could better have avoided a bug in their networking libraries. Recognizing this as a bug in the network stack first triggered by Telemetry makes it clear that this is not the place to have the millionth iteration of flamewars about that service but rather questions like the design of that network loop or not having a test suite covering the intersection of those particular libraries.
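For what such an intersection test might look like, here's a hedged sketch reusing a toy lowercasing encoder (all names invented; Mozilla's actual test infrastructure will look nothing like this):

    // Hypothetical intersection test: drive the HTTP/3 field-name
    // encoder with exactly the headers the Rust telemetry client
    // emits. The header list and all names are illustrative.
    fn encode_field_name(name: &str) -> String {
        name.to_ascii_lowercase()
    }

    #[cfg(test)]
    mod viaduct_meets_http3 {
        use super::encode_field_name;

        #[test]
        fn telemetry_headers_are_legal_http3_field_names() {
            // Per the article, viaduct adds a capitalized Content-Length.
            for header in ["Content-Length", "Content-Type", "User-Agent"] {
                let encoded = encode_field_name(header);
                // RFC 9114: field names must be lowercase on the wire.
                assert!(encoded.chars().all(|c| !c.is_ascii_uppercase()));
            }
        }
    }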
> Recognizing this as a bug in the network stack first triggered by Telemetry makes it clear that this is not the place to have the millionth iteration of flamewars about that service but rather questions like the design of that network loop or not having a test suite covering the intersection of those particular libraries.
Surely one could adopt a "shared nothing" approach, or something close to it - a separate process for the telemetry functionality which only reads things from either shared memory or from the disk, where the main browser processes could put what's relevant/needed for it.
If a browser process fails to work with HTTP/3, I don't think the entire OS would suddenly find itself without any network connectivity. For example, a Nextcloud client would still continue working and synchronizing files. If there were some critical bug in curl, surely that wouldn't necessarily bring down web browsers like Chromium, either!
Why couldn't telemetry be implemented in a similarly decoupled way and thus eliminate the possibility of the "core browser" breaking due to something like this? Let the telemetry break in all the ways you're not aware of, but let the browser continue working until it hits similar circumstances (if it ever does; HTTP/3 isn't all that common yet).
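A minimal sketch of that shared-nothing shape, assuming a hypothetical on-disk spool (the paths and names are invented, not Firefox's actual layout):

    use std::fs;
    use std::path::Path;

    // Illustrative spool location, not Firefox's actual layout.
    const SPOOL: &str = "/tmp/telemetry-spool";

    // Browser side: touches only the disk. If the uploader hangs or
    // crashes, browsing is unaffected.
    fn enqueue_event(name: &str, payload: &str) -> std::io::Result<()> {
        fs::create_dir_all(SPOOL)?;
        fs::write(Path::new(SPOOL).join(name), payload)
    }

    // Uploader side: would run as a completely separate process with
    // its own network stack, draining the spool on its own schedule.
    fn drain_spool() -> std::io::Result<()> {
        for entry in fs::read_dir(SPOOL)? {
            let path = entry?.path();
            // upload(&path)?; // the only place network I/O would live
            fs::remove_file(path)?;
        }
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        enqueue_event("startup.json", r#"{"event":"startup"}"#)?;
        drain_spool()
    }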
I don't care much for flame wars or "going full Stallman", but surely there is an argument to be made about increasing resiliency against situations like this one. Claiming that the current implementation of this telemetry is blameless doesn't feel adequate.
> I can understand using a few resources in the background to serve some of that data, but it must never get in the way of actual work getting done.
Which is exactly how the code was intended to work. Firefox did not design their software to hang in the event of telemetry losing internet access.
I don't know Firefox's internal architecture or its development process; what follows is pure conjecture.
Their intention seems to be to slowly migrate the codebase from C++ to Rust. That telemetry is so far the only function to rely on their new Rust networking library, viaduct (and thus trigger the bug), could be because they wanted to use their least important functionality as a test bed. In which case, if there weren't any telemetry, a different piece of code would have been migrated to Rust first and triggered this same bug. Without the telemetry, it would presumably have taken them longer to realise that things had broken, let alone resolve it.