I was looking for a big and novel dataset to work on as a personal project. Can devote a couple months after work to it.
What are interesting things I could do with it as a non-scientist, but that could also be useful to share?
I would be interested in seeing how different branding terms evolve in the literature; e.g., "machine learning" vs "artificial intelligence" vs "neural net", or "surrogate model" vs "digital twin" vs "response surface". There are a number of terms of art that have substantial overlap, but which term ends up being used depends on the audience, which often includes grant providers. I suspect the popularity of these terms evolves according to what terms appeal the most to funding agencies.
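To make that concrete, here's a minimal sketch of that kind of trend analysis, assuming the corpus has been flattened into one record per paper with a publication year and full text (the file name and column names here are made up for illustration):

    import re
    from collections import defaultdict

    import pandas as pd

    # Hypothetical input: one row per paper.
    papers = pd.read_csv("papers.csv")  # assumed columns: year, text

    TERMS = ["machine learning", "artificial intelligence", "neural net",
             "surrogate model", "digital twin", "response surface"]

    counts = defaultdict(lambda: defaultdict(int))
    for _, row in papers.iterrows():
        text = str(row["text"]).lower()
        for term in TERMS:
            # Whole-phrase matches only; the trailing \b keeps "neural net"
            # from also firing on every "neural network".
            hits = re.findall(r"\b" + re.escape(term) + r"\b", text)
            counts[row["year"]][term] += len(hits)

    # Rows = years, columns = terms; normalize by papers per year so
    # growth of the corpus itself doesn't masquerade as a branding trend.
    trends = pd.DataFrame(counts).T.sort_index().fillna(0)
    trends = trends.div(papers.groupby("year").size(), axis=0)
    print(trends)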
An annoyingly broad answer, but: broadcast an opportunity in which you provide vetted access to the server/instance holding the data (eg, with standard user accounts, private home directories, and the data located read-only in a dir off of /). Would create a small storm of overhead ("please install ..." ad nauseam), but provide interesting opportunities for open-ended creativity.
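For what it's worth, the provisioning side of that setup could be as simple as something like the following (a rough sketch, assuming root on a Linux host; the data path and user names are hypothetical):

    import subprocess
    from pathlib import Path

    DATA_DIR = Path("/dataset")   # hypothetical location of the shared data
    USERS = ["alice", "bob"]      # hypothetical vetted participants

    # Data owned by root, readable/traversable by everyone, writable by no one else.
    subprocess.run(["chown", "-R", "root:root", str(DATA_DIR)], check=True)
    subprocess.run(["chmod", "-R", "u=rwX,go=rX", str(DATA_DIR)], check=True)

    for user in USERS:
        # Standard account with a private home directory.
        subprocess.run(["useradd", "--create-home", user], check=True)
        subprocess.run(["chmod", "700", f"/home/{user}"], check=True)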
Also, while unusual, you could potentially extract value by lightly imposing requests ;) on others to (where viable/a good fit) at least have a go at helping with problem-solving or busywork-type steps in the research you're doing (however experimental/uncertain). Since everyone would be looking at the same dataset, this may bring interesting/unexpected ideas to the table (and maybe even shed light on potential opportunities for collaboration down the road). For individuals who are reasonably independent and self-directed but have no experience working with huge amounts of data, this would also be a cool chance to play around in a real-world-scale environment without the failure risks attached to eg fulfilling^Wfiguring out business requirements etc.
(Now I'm reminded of https://tilde.club/, which operates a shared Linux box that a (very) large bunch of people have collective access to. It's a stab in the dark (ie, the one reference I'm aware of), but maybe the admins there would have interesting insight about managing shared-access Linux boxes.)
> I was looking for a big and novel dataset to work on as a personal project
As an aside, one interesting dataset that is out there, legally and freely available, and decent-sized (not as big as this, of course), is the United States Fire Administration's NFIRS[1] data.
Roughly speaking, NFIRS is a record of all fire calls run in the USA, including details like date, time, location, responding department, incident type, dollar value of damages, number of deaths, number of injuries, etc.
I say "roughly speaking" because strictly speaking participation in NFIRS is voluntary and not every department in the USA participates. If memory serves correctly, for example, I think FDNY does not. But as far as I know, a significant majority of the fire departments in the US do send their data to NFIRS and so it becomes a pretty interesting snapshot of fire activity across the entire country.
Edit: from the NFIRS page:
> The NFIRS database comprises about 75% of all reported fires that occur annually.
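For anyone curious what poking at it might look like: a quick sketch of summarizing incidents, assuming the flat files have been loaded into a CSV with columns along the lines described above (all names here are guesses; check the USFA documentation for the real layout):

    import pandas as pd

    # Hypothetical flattened extract; the real NFIRS releases ship as
    # delimited flat files, so the actual file/column names will differ.
    incidents = pd.read_csv("nfirs_incidents.csv", parse_dates=["incident_date"])

    # Calls, deaths, injuries, and dollar loss per incident type, by year.
    summary = (incidents
               .assign(year=incidents["incident_date"].dt.year)
               .groupby(["year", "incident_type"])
               .agg(calls=("incident_type", "size"),
                    deaths=("deaths", "sum"),
                    injuries=("injuries", "sum"),
                    dollar_loss=("dollar_loss", "sum")))
    print(summary.sort_values("calls", ascending=False).head(20))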