Kuro5hin ran a block-all in their robots.txt[0] so this is more historic interne...

k5mumble · on May 2, 2016

I wrote a k5 screen scraper and have 95% of k5 diaries; unfortunately, I don't have any stories. The date range for my archive is 2001-1-4 to 2015-7-22, for a total of 161,942 diaries: http://k5.semantic-db.org/diary-slurp/161942--archive-diarie...

Here is a summary of what I have in my archive: http://kr5ddit.com/post/759

Here are some tables showing which kuron had the highest number of posts over the lifetime of k5: http://kr5ddit.com/post/759

k5mumble · on May 2, 2016

BTW, rusty was certainly aware of my screen scraping project, and did not tell me to stop. And it would have been very obvious in the log files.

llimllib · on May 2, 2016

#200!

Some of my very first publicly useful code was a script to download and backup my diaries, so I have that extremely embarrassing reminder of my college years.

ocschwar · on May 2, 2016

Thank you!! Downloading to get my old stuff.

userbinator · on May 2, 2016

The robots.txt you see now is probably the one on the server "parking" the domain, and not that of the original site.

A "site:kuro5hin.org" query shows results, so I'm guessing the original one isn't a block-all.
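For anyone who wants to check what a given robots.txt actually says, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `is_block_all` and the two sample files are made up for illustration; they are not the real kuro5hin.org robots.txt, past or present. It only tests whether the site root is disallowed for a given user agent, which is the usual shape of a block-all.

```python
from urllib.robotparser import RobotFileParser


def is_block_all(robots_txt: str, agent: str = "*") -> bool:
    """Return True if this robots.txt text disallows the site root for `agent`.

    Hypothetical helper for illustration; works on the text of a robots.txt
    file, so no network access is needed.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


# Made-up sample files, not taken from any real site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

print(is_block_all(BLOCK_ALL))
print(is_block_all(ALLOW_ALL))
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()` to fetch the file, but that only tells you what the domain serves today, not what the original site served.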

ntoll · on May 2, 2016

Nothing via the wayback machine... :-(

http://web.archive.org/web/*/http://kuro5hin.org

All that lost content...

firloop · on May 2, 2016

IIRC, a (current) robots.txt will retroactively remove public access to the Archive's copies. RIP.

AnonymousPlanet · on May 2, 2016

This annoys me to no end. It basically means that you can retroactively censor content by buying a domain that was parked for 10 years and then removing the old content with one little robots.txt. All of this regardless of the wishes of the original creators.

vanderZwan · on May 2, 2016

Will it also delete the content?

userbinator · on May 2, 2016

I remember reading that they don't actually delete anything, even if served DMCA notices --- they just remove access to it, with the justification being that laws may change and one day allow access again. Makes sense for an archive, I guess.

vanderZwan · on May 2, 2016

Ah good, so there still is hope :)

josteink · on May 2, 2016

>The robots.txt you see now is probably the one on the server "parking" the domain, and not that of the original site.

Still enough for TWBM to kill all its archives. It's ridiculous :(