
Kuro5hin ran a block-all in their robots.txt[0], so this is more historic internet content that will disappear into the abyss.

Anyone know of a mirror or an effort to mirror and host the site?

[0] http://www.kuro5hin.org/robots.txt



I wrote a k5 screen scraper and have 95% of k5 diaries; unfortunately, I don't have any stories. The date range for my archive is 2001-1-4 to 2015-7-22, for a total of 161,942 diaries: http://k5.semantic-db.org/diary-slurp/161942--archive-diarie...

Here is a summary of what I have in my archive: http://kr5ddit.com/post/759

Here are some tables showing which kuron had the highest number of posts over the lifetime of k5: http://kr5ddit.com/post/759


BTW, rusty was certainly aware of my screen scraping project, and did not tell me to stop. And it would have been very obvious in the log files.


#200!

Some of my very first publicly useful code was a script to download and back up my diaries, so I have that extremely embarrassing reminder of my college years.


Thank you!! Downloading to get my old stuff.


The robots.txt you see now is probably the one on the server "parking" the domain, and not that of the original site.

A "site:kuro5hin.org" query shows results, so I'm guessing the original one isn't a block-all.
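
For anyone who wants to check this rather than guess, here is a minimal sketch using Python's standard urllib.robotparser. It only tests whether a given robots.txt body disallows the site root for a given user agent, which is roughly what "block-all" means in this thread. The helper name is_block_all and the sample robots.txt strings are my own illustrations, not the actual contents of the kuro5hin.org file.

```python
from urllib.robotparser import RobotFileParser


def is_block_all(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if this robots.txt text disallows the site root for user_agent.

    Hypothetical helper for illustration; it works on text you have already
    downloaded and makes no network requests.
    """
    parser = RobotFileParser()
    # parse() expects an iterable of lines, not one big string.
    parser.parse(robots_txt.splitlines())
    # If the root path cannot be fetched, treat the file as a block-all.
    return not parser.can_fetch(user_agent, "/")


# Illustrative inputs only, not copied from any real site.
blocking = "User-agent: *\nDisallow: /\n"
permissive = "User-agent: *\nDisallow:\n"

print(is_block_all(blocking))
print(is_block_all(permissive))
```

This tells you nothing about which server produced the file, so it cannot distinguish the original site's robots.txt from one served by a parking host.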


Nothing via the wayback machine... :-(

http://web.archive.org/web/*/http://kuro5hin.org

All that lost content...


IIRC, a (current) robots.txt will retroactively remove public access to the Archive. RIP.


This annoys me to no end. It basically means that you can retroactively censor content by buying a domain that was parked for 10 years and then remove the old content with one little robots.txt. All of this regardless of the wishes of the original creators.


Will it also delete the content?


I remember reading that they don't actually delete anything, even if served DMCA notices --- they just remove access to it, with the justification being that laws may change and one day allow access again. Makes sense for an archive, I guess.


Ah good, so there still is hope :)


>The robots.txt you see now is probably the one on the server "parking" the domain, and not that of the original site.

Still enough for TWBM to kill all its archives. It's ridiculous :(



