Hacker News

I was looking for a big and novel dataset to work on as a personal project. Can devote a couple months after work to it. What are interesting things that could be done with this now as a non-scientist, but that could be useful to share too?


Provide examples of how words are used in context, and possibly help us rephrase text more idiomatically. Basically, answer: "how do people usually use this word?"

A bit like linguee but for scientists: (at the bottom of the page): https://www.linguee.fr/francais-anglais/search?source=auto&q...


I would be interested in seeing how different branding terms evolve in the literature; e.g., "machine learning" vs "artificial intelligence" vs "neural net", or "surrogate model" vs "digital twin" vs "response surface". There are a number of terms of art with substantial overlap, but which one gets used depends on the audience, which often includes grant providers. I suspect the popularity of these terms evolves according to which ones appeal most to funding agencies.
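One way to sketch the term-tracking idea, assuming you can pull (year, text) pairs out of the dump; the mini-corpus and term list below are invented placeholders:

```python
# Sketch: count occurrences of competing terms of art per publication year.
# The corpus here is made up; in practice you'd stream abstracts from the dump.
from collections import defaultdict

corpus = [
    (2015, "we fit a response surface to the simulation output"),
    (2021, "a digital twin of the plant is calibrated online"),
    (2022, "the digital twin surrogate model predicts wear"),
]
terms = ["response surface", "surrogate model", "digital twin"]

counts = defaultdict(lambda: defaultdict(int))  # term -> year -> count
for year, text in corpus:
    for term in terms:
        n = text.count(term)
        if n:
            counts[term][year] += n

print(dict(counts["digital twin"]))  # {2021: 1, 2022: 1}
```

Plotting each term's yearly counts (normalized by total papers per year) would make the branding shifts visible directly.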


An annoyingly broad answer, but: broadcast an opportunity in which you provide vetted access to the server/instance holding the data (eg, with standard user accounts, private home directories, and the data located read-only in a dir off of /). It would create a small storm of overhead ("please install ..." ad nauseam), but also provide interesting opportunities for open-ended creativity.
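The permission layout described above could look something like this (a sketch using a scratch prefix so it runs without root; in practice you'd use /data, /home, and real user accounts created with useradd):

```shell
# Illustration of the access model: a world-readable, read-only dataset dir
# plus private per-user home directories. Paths and usernames are invented.
PREFIX=$(mktemp -d)
mkdir -p "$PREFIX/data" "$PREFIX/home/alice"
chmod 0555 "$PREFIX/data"        # dataset: readable/traversable, not writable
chmod 0700 "$PREFIX/home/alice"  # private home directory
ls -ld "$PREFIX/data" "$PREFIX/home/alice"
```

On the real box you'd also want per-user disk quotas and maybe a shared scratch directory, since 7 TB of read-only data invites a lot of derived intermediate files.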

Also, while unusual, you could potentially extract value by lightly asking ;) others to (where viable/a good fit) at least have a go at helping you with problem-solving or busywork-type steps in your own research (however experimental/uncertain). Since everyone would be looking at the same dataset, this might bring interesting/unexpected ideas to the table (and maybe even shed light on opportunities for collaboration down the road). For people who are reasonably independent and self-directed but have no experience working with huge amounts of data, it would also be a cool chance to play around in a real-world-scale environment without the failure risks attached to, eg, fulfilling^Wfiguring out business requirements.

(Now I'm reminded of https://tilde.club/, which operates a shared Linux box that a (very) large number of people have collective access to. It's a stab in the dark (ie, the one reference I'm aware of), but maybe the admins there would have interesting insights about managing shared-access Linux boxes.)


I was looking for a big and novel dataset to work on as a personal project

As an aside, one interesting dataset that is legally and freely available, and decently sized (not as big as this, of course), is the United States Fire Administration's NFIRS[1] data.

Roughly speaking, NFIRS is a record of all fire calls run in the USA, including details like date, time, location, responding department, incident type, dollar value of damages, number of deaths, number of injuries, etc.

I say "roughly speaking" because strictly speaking participation in NFIRS is voluntary and not every department in the USA participates. If memory serves correctly, for example, I think FDNY does not. But as far as I know, a significant majority of the fire departments in the US do send their data to NFIRS and so it becomes a pretty interesting snapshot of fire activity across the entire country.

Edit: from the NFIRS page:

The NFIRS database comprises about 75% of all reported fires that occur annually.

[1]: https://www.usfa.fema.gov/nfirs/


Create a search engine and put it online. I'd love to search the index, but I can't download 7 terabytes of data to do so.
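A minimal version of that idea can be sketched with SQLite's built-in FTS5 full-text index; the two "paper" records here are made-up stand-ins for whatever schema the dump actually uses:

```python
# Tiny full-text search sketch using SQLite FTS5 (ships with Python's sqlite3).
# Records are invented; a real index would be built by streaming the dump.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE papers USING fts5(title, abstract)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [
        ("Deep learning for protein folding", "We train a neural network..."),
        ("Fire incident statistics", "An analysis of NFIRS reports..."),
    ],
)
# BM25-ranked full-text query across both columns
rows = conn.execute(
    "SELECT title FROM papers WHERE papers MATCH ? ORDER BY rank",
    ("neural",),
).fetchall()
print(rows)  # [('Deep learning for protein folding',)]
```

For 7 TB you'd obviously reach for a purpose-built engine instead, but the interface (index once, serve cheap queries over the web) is the same shape.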



