Data sets released by Google (svonava.com)
170 points by supo on Sept 27, 2013 | 17 comments


If you want to play around with data, here's another good list of open/free datasets: http://bitly.com/bundles/hmason/1


Nice, thanks! I wish there were a cleaned-up repository of datasets like these, in a unified format, directly accessible to a public MapReduce engine like Elastic MapReduce on AWS.


Here are some other data hubs and search engines, with endless lists:

http://datahub.io/

http://blog.bigml.com/2013/02/28/data-data-data-thousands-of...

http://tm.durusau.net/?p=39312

http://dvn.iq.harvard.edu/dvn/

_____________

This subreddit seems like a decent place to ask questions:

http://www.reddit.com/r/datasets


Another one from Google, 1000 scanned books for OCR and other scanned document processing research: http://commondatastorage.googleapis.com/books/icdar2007/READ...



BitTorrent, please! Why does it cost so much? They grabbed our data for free, and they have plenty of free bandwidth. Even assuming they are greedy, they could at least offer it through BitTorrent. Shipping DVDs for that amount of data is ridiculous. I don't even have a DVD reader…

I can't afford to buy all that plus shipping to Europe, but I would like to play with the data for my NLP project.


I agree! I can't afford it either, but I would really love to play around with that data because I'm just beginning to learn about NLP. I also feel that it shouldn't have been priced, and shouldn't be on DVDs!


Here's another good one.

http://archive.ics.uci.edu/ml/


Here is a good one that is trying to aggregate cleaned data sets from across the web: http://cleandatahub.org/


no links...

Remember the days when people used to make links on the web because they weren't greedy with their pagerank?

At least Google left us some machine learning data sets after they took all the links. You just can't find them because nobody links to them.


I'm sorry for not making it more obvious, but each bullet point in the list ends with a link.


Fantastic links throughout this thread.

When playing with new programming languages, instead of a 'todo' list I always end up building an XKCD password generator. Interestingly enough, I've never found a frequency/comprehension list worth using to populate it for public consumption.
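
For anyone who hasn't built one, here is a minimal sketch of an XKCD-style passphrase generator in Python. The tiny `WORDS` list is a made-up placeholder (a real one would be loaded from whichever frequency list you end up trusting), and the function names are purely illustrative.

```python
import math
import secrets

# Hypothetical stand-in word list; a usable generator needs thousands of
# common words loaded from a real frequency list.
WORDS = ["correct", "horse", "battery", "staple",
         "orange", "window", "planet", "river"]


def xkcd_password(words, count=4, sep=" "):
    """Join `count` words chosen uniformly at random with a CSPRNG."""
    if not words:
        raise ValueError("word list must not be empty")
    return sep.join(secrets.choice(words) for _ in range(count))


def entropy_bits(words, count=4):
    """Entropy of a passphrase: count * log2(number of distinct words)."""
    return count * math.log2(len(set(words)))


if __name__ == "__main__":
    print(xkcd_password(WORDS))
    print(f"{entropy_bits(WORDS):.1f} bits")
```

The entropy helper shows why the word list matters: with only 8 words, four picks give just 12 bits, so the quality and size of the frequency list is really the whole game.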


Is there any data set that embodies human relationships with everyday objects?



The ML competition site Kaggle should also get a mention here. http://www.kaggle.com/competitions


Where is the Web1T dataset? Would you not consider it useful for Machine Learning?


I think this list only includes free datasets.



