It's positively galling how happy he is about such a terrible decision. How many...

TazeTSchnitzel · on March 29, 2015

"It's positively galling how happy the inventors of robots.txt were about such a terrible decision.

How many billions (trillions?) of wasted robots.txt requests have been made because they didn't bother to engineer this the right way and make it a <link> tag in the document?

I wish we could send him the bill for all the wasted power, all the wasted developer time (even ten seconds per developer for all 100 million+ sites out there adds up to quite a lot.) Then see how glad he was he checked this misfeature in."

near · on March 30, 2015

Oooooh I see now, you took what I said and changed one of the words. Totally missed that before wareya's comment.

robots.txt and favicon.ico share nothing in common. Browsers do not automatically request robots.txt every time you load a site. If you count a bot as one visitor, they make up maybe 0.01% of your traffic. It's also an explicit request, the bot actually wants that file. Users don't actually explicitly ask for favicon.ico, their browsers do it for them when they type in pagename.html. It's also not part of the page like the icon is. favicon.ico is like having javascript.js and style.css in your root directory auto-scanned as well.

Unlike favicon.ico, you couldn't just slap a line in your HTML here instead, it would already be too late for crawling. If we tried to put this into the headers for a bot's HEAD request, then people with only file drop access to hosting wouldn't be able to change the robot settings.

That said, I'm always open to considering better ways of doing things. Maybe we could have web servers optionally parse robots.txt locally, and bots could get the information from a special OPTIONS / HEAD request instead.

wareya · on March 30, 2015

Robots.txt is different. Without it, bots have no way of knowing whether to get any other data from the site. You would need "bots allowed" information in the HTTP handshake itself to prevent bots from accidentally hitting pages they shouldn't. This can already be Very Bad.

TazeTSchnitzel · on March 30, 2015

> You would need "bots allowed" information in the HTTP handshake itself to prevent bots from accidentally hitting pages they shouldn't.

Humans could also hit such pages. If your GET requests change state, there's no helping you.

wareya · on March 30, 2015

The whole point of robots.txt is that there are pages which people may hit that bots can't. What are you on?

samspenc · on March 30, 2015

Isn't the best solution to the problem this: browsers turn off automatic checks for /favicon.ico and only load the favicon if it exists in the HTML in the form of the official spec?

near · on March 30, 2015

Yes, that would be the perfect solution to this problem. No sarcasm. Unfortunately, I don't see it happening. Probably millions of sites now that don't have the <link> tags, and you'd have a torrent of complaints as a browser vendor if you tried this.

enraged_camel · on March 30, 2015

Hindsight is 20/20 my friend. Go easy on the guy.

near · on March 30, 2015

That's what bothers me. He's happy about it in 2013 (this article's two years old.) He has hindsight now and still thinks it was a good idea to request resources on a web server without being asked. No consideration at all to the cumulative effects of such an action =(

And apparently a lot of people here agree with him. Scary :/

enraged_camel · on March 30, 2015

You are misinterpreting it. He says:

"But now I look back & realize that we did the right thing. Seriously, how risky was this feature?"

By "the right thing" he's referring to checking in a relatively minor and risk-free feature late at night without going through the corporate bureaucracy at Microsoft.

Besides, I think you're making a big deal out of it. It's supposed to be an entertaining story about how favicon.ico came into existence, not a deep analysis of whether it was the correct way to implement it.

serf · on March 30, 2015

I wrote a big long thing, and made a mountain out of a molehill, but here's the condensed version :

Even though this article is just intended to be a cute quip about the origins of something we all take for granted, if I were a person who was negatively affected due to the author skipping 'corporate bureaucracy at Microsoft'( perhaps a victim of this kind of bug[1] ), i'd probably find the authors "how risky was this feature?" statement to be pretty irritating.

Is there any way to know that such a bug would have been included had they followed procedure? Of course not. But, we do know that those procedures and processes exist for that very reason : to ship a higher quality product

[1]: https://technet.microsoft.com/library/security/ms99-018

enraged_camel · on March 30, 2015

>>But, we do know that those procedures and processes exist for that very reason : to ship a higher quality product

Not always. It is possible for procedures and processes to exist simply because someone somewhere is trying to justify receiving a paycheck. This is especially true for large bureaucracies.

And sometimes, procedures and processes exist because the organization simply does not realize that they are outdated and unnecessary. "This is how we have always done it" is a common saying in large organizations that have been around for a long time.

Bottom line: don't put much faith in procedures and processes. Question everything.

near · on March 30, 2015

But it wasn't the right thing. That bureaucracy would have saved the world billions upon billions of useless HTTP GET requests that just return 404s. It was relatively minor to him only because he only thought about it on a small scale. But when you're talking about the web, you have to take the full scale into account.

And no, it's not really a crisis-level issue. The modern web is full of hundreds of little annoyances, it's pretty much the nature of the game. But I felt it was on-topic all the same.

But yeah, complaining about it here certainly isn't going to change anything; and I'll agree that the story behind it was interesting to hear even if I despise the result of it.

B-Con · on March 30, 2015

> But it wasn't the right thing

Given what they knew at the time it was the right thing to do. That was the point.

kitcar · on March 30, 2015

Although I can't stop the requests from being sent, a standard line I add into all my nginx confs to at least minimize my logfiles:

location = /favicon.ico { access_log off; log_not_found off; }

near · on March 30, 2015

Unfortunately, even by disabling the logging, browsers like Chrome will keep wasting your server's resources asking for it if you return a 404. I know it's not going to overwhelm anyone's servers alone, but if you ever get a massive traffic surge that starts causing some visitor requests to fail, you'll probably enjoy having one less socket request per visitor.

I add <link rel="icon" href="data:;base64,iVBORw0KGgo="> into my HTML head section. I don't recall offhand if it stops the initial request in all browsers, but it does at least prevent repeated requests.

Most people would just tell you to drop a blank favicon.ico if you really don't want a site icon (and I don't), but I'm not going to clutter up my root directory with junk (this, two Apple icon files, and whatever else people decide to do in the future.)