Does anyone know how easy it is to extract just the text/HTML from an Encarta CD/DVD? If it's easy, I'll go out and buy one right now.
For no particularly good reason, I have this urge to create a local plain-text or basic HTML version of one of the major encyclopedias (something a bit more concise than the full Wikipedia). I love the idea that I could keep a summary of most human knowledge in a 400 MB plain-text file on my hard drive.
You should check out my comment at https://news.ycombinator.com/item?id=20741101 for a summary of the current state of some of the existing projects that do some of this. Also check out the Wikipedia Vital Articles lists. Typical "vital" articles are large, about 100 KB each, so in 400 MB you could manage about 4000 articles, which is most of the ones in the fourth-level Vital Articles list.
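If you want to play with the numbers, the back-of-the-envelope estimate above can be sketched in a few lines of Python. This is only a rough sketch: the function name `articles_that_fit` is made up for illustration, the 100 KB average is the ballpark figure from this comment rather than a measured value, and decimal units (1 KB = 1000 bytes) are assumed.

```python
def articles_that_fit(budget_bytes: int, avg_article_bytes: int) -> int:
    """Return how many average-sized articles fit in the byte budget."""
    if avg_article_bytes <= 0:
        raise ValueError("average article size must be positive")
    # Integer division: a partial article does not count.
    return budget_bytes // avg_article_bytes


# Assumed figures: 400 MB budget, roughly 100 KB per article (decimal units).
BUDGET_BYTES = 400 * 1000 * 1000
AVG_ARTICLE_BYTES = 100 * 1000

print(articles_that_fit(BUDGET_BYTES, AVG_ARTICLE_BYTES))  # prints 4000
```

Swapping in a different average size (plain text with the markup stripped would likely be smaller) shows how sensitive the article count is to that one guess.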