I'd claim that no other company is criticized as harshly as Mozilla around here. The amount of blame that is assigned to the Firefox team is staggering.
To me, this is a perfectly valid write-up with a good lessons learned. They have written it in a very diplomatic way, but to me, it is absolutely clear that Google screwed up here. How can you make such a change to a default behavior of critical infrastructure unannounced? That's just reckless towards your customers, and solidifies my belief to stay away from GCP.
If they had properly announced the change, even if the Firefox team hadn't tested it beforehand, at least the DevOps team would have put two and two together and switched back to HTTP/2, and the outage would have lasted maybe 10 minutes. Instead, they frantically went through their git log to see what in the code base might have triggered this bug. Everyone who has been in such a position knows how incredibly stressful this is. I'd be absolutely livid at Google in their position. That it took two hours to fix this is clearly their fault.
> I'd claim that no other company is criticized as harshly as Mozilla around here. The amount of blame that is assigned to the Firefox team is staggering.
They set themselves a higher standard by marketing as the good guys who fight for the user, and then made any number of moves that said users viewed as not being in their interests. Of course they get more blame. Like, Chrome has issues, but they're issues in line with being made by an adtech company; we might be unhappy at Google breaking adblockers (https://www.eff.org/deeplinks/2021/12/chrome-users-beware-ma...), but it's not out of character. Mozilla can say "More power to you. Mozilla puts people before profit, creating products, technologies and programs that make the internet healthier for everyone." (https://www.mozilla.org/en-US/) or they can, say, make Google the default engine ($), bake in a proprietary service (Pocket), rip out features (RIP compact theme), overrule user autonomy (Want to install an extension? Better upload it to Mozilla to get signed so they permit you to run it on your own computer!), ship a marketing extension through the "experiments" feature (https://blog.mozilla.org/en/products/firefox/update-looking-...).... but not both. Either empower the user, or don't, but don't pretend to empower the user while ripping away their control.
Yep, you are correct. Each of those decisions was made over the protests of a vocal but relatively small group of users.
You can't please all people all of the time, and I agree the Pocket integration and the Looking Glass add-on were mistakes, but the other items were directly related to the sustainability of the project ($, eng cycles) or to user safety.
You can disagree with them as much as you like, but Firefox continues to support the ultimate in user control by releasing their product as open source. Roll your own build that doesn't require those features, sideload your add-ons, and/or fork the product.
The average Firefox user has far more control over the browser than Chrome, Edge, or Safari users do, and has the flexibility to switch to one of many Firefox forks that have the same beef as you.
Since the first thing that group protested was telemetry, I don't know how we could possibly know that it's a "vocal but relatively small group of users". In general, though, "you can't please everyone, and not that many people objected" isn't really a compelling argument; the criticism is still valid, and people being unwilling to make the effort to make a fuss, fork, find workarounds, or switch browsers doesn't mean that they're okay with it. For that matter, there's not a lot of feedback in general; how many people objected, and how many said they were in favor, compared to the overwhelming majority who never said anything?
> You can disagree with them as much as you like, but Firefox continues to support the ultimate in user control by releasing their product as open source. Roll your own build that doesn't require those features, sideload your add-ons, and/or fork the product.
By that standard Chrome is a paragon of user control. Firefox as it actually exists, the thing that Mozilla offers users to download, claims to care about user empowerment while constantly reducing users' power.
In fairness, it's hard to tell what's Firefox throwing away the thing that made them special vs Google abusing its monopoly position to push its way into the browser market.
I agree that Google is at fault here for failing Firefox. But Firefox is guilty of failing its users. Why should the functioning of a browser be dependent on telemetry working? It sounds like if there is high enough latency in their telemetry, or if requests for telemetry start failing, it's possible for that to disrupt use of the network stack entirely. They have a massive design flaw, and they didn't even mention that in the article. Maybe they have good reasons for designing a single point of failure that relies on a cloud provider, but it's not clear what those might be, since they don't address it.
>> Why should the functioning of a browser be dependent on telemetry working?
That was my thought after reading the start of it. Like "Oh no, Firefox has fallen into that void where their need for telemetry trumps users". Another product falling down at doing its primary function. But after reading the entire report that's just not fair at all. A bug relating to telemetry and their network stack caused failure in that networking code which affected everything. That is entirely different than software depending on telemetry to function properly. It wasn't by design that failing to phone home broke the software, it really was just a bug - a fairly obscure one. Sounds like if someone wanted they could just as easily blame the use of Rust in Firefox since some of the code involved was written in Rust. But that's not a fair or accurate conclusion either.
> Why should the functioning of a browser be dependent on telemetry working?
It isn't. The bug was in the networking stack, and it just happened to be triggered by a GCP change which affected the telemetry service. Firefox having telemetry has nothing to do with the issue here.
That's not quite right. A single socket thread does all the requests and telemetry is multiplexed with user traffic. If telemetry is different in some way to other network traffic, then it's always possible for it to cause problems with user traffic.
Telemetry is different to user traffic - it's less important! - but of course any in-process QoS would still create a point of interaction with user traffic.
So you're saying that Firefox did not in fact have an outage due to a change in their telemetry servers? That's not what the article said.
I understand that you mean to say that it isn't intended for networking to be taken down by telemetry. That's nonetheless what happened, and it could have been prevented by treating telemetry as a different class of traffic (not collocating it with normal requests), or by not having it, as others point out.
So you're saying telemetry should be handled as a separate process that has nothing to do with the rest of the browser, and treated like a hostile service? Because that's the only way you'd have avoided this.
It's natural for all the network stuff that goes on inside a browser to share code. You can say what you want about telemetry (I'm not a huge fan, personally), but this was a dumb bug and it is completely unreasonable to expect some kind of adversarial design "just in case a freak bug triggers on telemetry network requests".
> So you're saying telemetry should be handled as a separate process that has nothing to do with the rest of the browser, and treated like a hostile service? [... T]his was a dumb bug and it is completely unreasonable to expect some kind of adversarial design "just in case a freak bug triggers on telemetry network requests".
I absolutely agree that this is a dumb bug having little to nothing to do with telemetry. It is not even the first case-sensitivity HTTP/3 bug I've personally encountered in the course of completely casual use[1]. Probably not the last, either; those joints ain't gonna oil themselves.
At the same time, you know what? I’m glad you suggested this, because I certainly didn’t think of it. Yes, in an ideal world, telemetry absolutely should be a separate process (or thread, or at least not share an event loop—a separate “hang domain”, a vat[2] if you want). And so should everything else off the critical path.
I’m not saying Firefox is bad for doing it differently. I’m saying it’s silly that Firefox is forced to play OS to such an extent because the actual one isn’t up to its demands.
They're saying what is clearly explained in the article:
“This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.”
Yes, but the fact that telemetry is in place was the cause of the issue.
> So you're saying that Firefox did not on fact have an outage due to a change in their telemetry servers?
Not the telemetry code. Not the fact that it "could" happen elsewhere. But rather the fact that it was in place and in this instance happened because of it.
Not that it matters that much. Regardless of the particular cause, a browser failing to work because of something changing externally is crazy (at least to me), no matter how you look at it.
How do you reach that conclusion? From the article:
> It just so happens that Telemetry is currently the only Rust-based component in Firefox Desktop that uses the [viaduct/Necko] network stack and adds a Content-Length header. This is why users who disabled Telemetry would see this problem resolved ...
The article contradicts your conclusion. If Firefox did not have telemetry, the bug would have had no impact, and users would not have suffered an outage.
> ...even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.
And then the article contradicts itself and agrees with you using some heavy-duty doublethink. Sure, if there were hypothetically other Rust services using the buggy network stack, they'd also have hit the bug: BUT THERE ARE NONE. The bug was in code which is only running because it's used by the telemetry services, so even though it might be in a different semantic layer it's the fault of the browser trying to send telemetry.
As a user, I place very low (often negative) importance on the tools I use collecting telemetry data, or on protecting DRM content, or on checking licensing status. They should focus on doing the job I'm trying to do with them on my computer, serving the uses of the user, rather than doing something that someone else wants them to do. Sure, I understand that debugging and quality monitoring are easier with logs and maybe with telemetry, so I can understand using a few resources in the background to serve some of that data, but it must never get in the way of actual work getting done.
> The article contradicts your conclusion. If Firefox did not have telemetry, the bug would have had no impact, and users would not have suffered an outage.
This is your mistake: as explained in the article, it could have affected any component. Telemetry happened to hit it first but anything using HTTP/3 with that path would have been affected.
“This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.”
> ...as explained in the article, it could have affected any component. Telemetry happened to hit it first but anything using HTTP/3 with that path would have been affected.
Is this really relevant, though? To the users who were unable to use their browsers normally it doesn't matter that this problem could have occurred elsewhere as well, but rather that it did occur here in particular.
If particular sites would break, then that could be debugged separately, but as it stands even people who'd be perfectly fine with browsing regular HTTP/1.1 or HTTP/2 sites were also now impacted, not even due to opening a site that they wanted to visit themselves, but rather some background piece of functionality.
That's not to say that I think there shouldn't be telemetry in place, just that the poster is correct in saying that this wouldn't be such a high-visibility issue if there were no telemetry in place, and thus no HTTP/3 apart from sites the user visits.
The comment I was replying to was worded in a way which was trying to attribute blame to the telemetry service. As shown in this thread, there's a certain ideological position which welcomes any attacks on telemetry, and I think that's a distraction from the technical discussion about how Mozilla could better have avoided a bug in their networking libraries. Recognizing this as a bug in the network stack first triggered by Telemetry makes it clear that this is not the place to have the millionth iteration of flamewars about that service, but rather questions like the design of that network loop or not having a test suite covering the intersection of those particular libraries.
> Recognizing this as a bug in the network stack first triggered by Telemetry makes it clear that this is not the place to have the millionth iteration of flamewars about that service, but rather questions like the design of that network loop or not having a test suite covering the intersection of those particular libraries.
Surely one could adopt a "shared nothing" approach, or something close to it - a separate process for the telemetry functionality which only reads things from either shared memory or from the disk, where the main browser processes could put what's relevant/needed for it.
If a browser process fails to work with HTTP/3, I don't think the entire OS would suddenly find itself not having any network connectivity. For example, a Nextcloud client would still continue working and synchronizing files. If there was some critical bug in curl, surely that wouldn't necessarily bring down web browsers, like Chromium, either!
Why couldn't telemetry be implemented in a similarly decoupled way and thus eliminate the possibility of the "core browser" breaking due to something like this? Let the telemetry break in all the ways you're not aware of but let the browser continue working until it hits similar circumstances (if it at all will, HTTP/3 isn't all that common yet).
I don't care much for flame wars or "going full Stallman", but surely there is an argument to be made about increasing resiliency against situations like this one. Claiming that the current implementation of this telemetry is blameless doesn't feel adequate.
> I can understand using a few resources in the background to serve some of that data, but it must never get in the way of actual work getting done.
Which is exactly how the code was intended to work. Firefox did not design their software to hang in the event of telemetry losing internet access.
I don't know firefox's internal architecture or its development, what follows is pure conjecture.
Their intention seems to be to slowly migrate the codebase from C++ to Rust. That telemetry is the only function so far to rely on their new Rust networking library viaduct (and thus trigger the bug) could be because they wanted to use their least important functionality as a test bed. In which case, if there weren't any telemetry, a different piece of code would have been migrated to Rust first and triggered this same bug. Without the telemetry, it would presumably have taken them longer to realise that things had broken, let alone resolve it.
Firefox also said that this switch to default was an unannounced change. But a quick Google shows that it was announced
> In the coming weeks, we’ll bring HTTP/3 to more users when it's enabled by default for all Cloud CDN and HTTPS Load Balancing customers: you won't need to lift a finger for your end users to start enjoying improved performance.
In their blog on June 22, 2021 [1]. It probably should have been its own standalone message sent to users (a "this should be a no-op" email), but to claim that it was unannounced is misleading.
That’s half a year earlier and it’s described as an opt-in change until the very end, where it’s mentioned as a default changing in a few weeks. That’s far different from what, say, AWS does proactively sending email and SNS notifications with a time range and usually listing the affected instances.
You expect everyone to read the Google Cloud blog? The distinction between "unannounced" and "not usefully announced" isn't a meaningful one. If they did not specifically make their affected customers aware of the change and when it would actually happen, it was effectively unannounced. And it caused a major outage for at least one of their customers.
This afternoon I tried to clone a git repo which, in the morning, was highlighted as containing a useful example to start from in the work I had targeted next.
The clone failed with a mysterious error. After some minutes I checked the accompanying web site. The web site failed too, but, on refresh, this time I got a holding page explaining that the service was down. So I check the overall ticket system, and I find a change ticket, for the git system, saying there is planned maintenance, at 8am for one hour. Unadvertised because hey, it's 8am, most people aren't at work at 8am and this is a regular (Wednesday 8am) maintenance slot.
And I scroll down and I find that nobody remembered to actually do the task. They wrote it up, submitted, got it OK'd and then, eh, never did it. By the time the people who were supposed to do it were reminded it was 9am already. So, astoundingly, the service owner OK'd just doing it after lunch instead.
That failed 8am change was actually a re-run, of a re-run, of a re-run, of an upgrade that keeps failing and definitely takes over an hour to complete.
So instead of "It's fine to do this when nobody is at work and it's low risk" suddenly "It's fine to do this for 2 hours in the middle of the working day, though it'll probably fail and we have no roll back plan".
That's pretty shoddy. Glad to know an "Enterprise" cloud offering is hardly better.
> They have a massive design flaw, and they didn't even mention that in the article.
From the article:
> This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.
Telemetry traffic is multiplexed with user traffic on a single socket thread, per article. That creates a single point of failure where telemetry can affect user traffic.
Of course all network access is shared for a machine so it's not possible to not have a single point of failure, but there are different ways of slicing up the access.
You're grasping at straws with this argument. That it shares a thread is a technicality. I'm sure the socket management is asynchronous and telemetry wouldn't normally affect normal traffic. This was an infinite loop bug. What if it had been a memory corruption bug instead, would you be saying that telemetry needs to be a separate process, not just a separate thread? The design was reasonable. Dumb bugs can happen anyway and cause things not to work as designed. That's what happened here.
What happened to the famed intelligence level of Hacker News? Almost every single person in this thread is blaming telemetry, while it was clear even when the bug was ongoing that it was unrelated to telemetry. I, for example, had telemetry disabled and still hit the bug through other traffic, and had to temporarily disable HTTP/3 from about:config.
> I'd claim that no other company is criticized as harshly as Mozilla around here. The amount of blame that is assigned to the Firefox team is staggering.
Mozilla has opened themselves up here, as they market as a privacy and user respecting alternative, so when they fail to live up to their own marketing people are more annoyed while they expect random startup #456 to not care about their users privacy and have telemetry out the wahoo.
(I used to work for Mozilla, and spent a few years on the team that at the time owned the telemetry component.)
The way Mozilla does telemetry is different from how most places do.
I think that the biggest issue with these discussions is that there always seems to be this assumption that there is only one way to do telemetry, it always contains super invasive PII, Mozilla's telemetry must do the same, and therefore Mozilla's telemetry is just as evil as anybody else's.
Mozilla is remarkably open about how its telemetry works, beyond just being open source. Maybe this is more a problem of that information not being surfaced well, I dunno. I get that some people are philosophically opposed to telemetry no matter what, but I have seen enough cases of, "Wow, I didn't know it worked that way, and I'm actually okay with this," to know that informed users are not universally opposed to it.
All network requests share at least the IP address, which is PII, and should only happen after obtaining informed consent unless they are required for the requested user action. Since telemetry would be pointless if it's the same for everyone, there will inevitably be more information that can ultimately be used to identify users. You can argue as much as you want that you are doing it "better" than others (there is always someone worse, and the software industry's disregard for user rights is well known) or that it is useful (many unethical actions can be useful, but the ends do not justify the means, especially when alternatives like bug reports often go ignored), but that does not change the fact that you are sharing PII without informed consent.
More importantly Mozilla knows that there are people who do not want them to upload this information yet they continue to do it anyway by default without ever asking for consent. Worse, Mozilla keeps adding new leaks that concerned users will have to watch out for and disable after each update. This is of course by no means a problem unique to Mozilla - the software industry as a whole has not yet learned that no means no - but it is also a Mozilla problem and as long as they want to use privacy to market their software they will rightfully receive the loudest criticism. Thankfully laws are beginning to catch up with the digital age and people will have better recourse than asking software vendors nicely to not be mistreated.
If you're running Firefox, you can go to `about:telemetry` and see what data is there. Note that some of that data might be populated even if you have telemetry turned off. Don't reach for your pitchfork quite yet: I assure you that the data isn't being sent.
I meant something documenting how telemetry works, though perhaps the source tree is the source for that?
> Don't reach for your pitchfork quite yet
No need to assume people are out of their minds or even critical. I am curious about how it's done on a technical level, with the old local ad system in mind (which I thought was a brilliant solution to Internet commerce and privacy). I've supported and contributed to Mozilla since before Firefox.
Yeah, please don't take the pitchfork thing personally, that was more intended for anybody reading that comment who immediately assumes the worst.
As for high level docs about how it works, I haven't been involved in quite a few years, so I'm not 100% sure about the best source, but this link looks like a good place to start:
> Instead, they frantically went through their git log to see what in the code base might have triggered this bug.
This seems like you're embellishing this part to tell a story? It's not supported by the linked post, and from the bugzilla bugs it seems like it was known almost immediately that the ESR builds were affected as well and so it almost had to be an external service, they just weren't sure which one at first.