Out of curiosity, is accessing and working with large datasets a problem in your areas of work? I run a weather/climate site that makes some of this less painful, making datasets such as GFS or ERA5/ERA5-Land much faster to access. We have some enterprise clients who really value the time savings, but I also feel like everyone has their own data-processing setup and the problems are different for everyone.
There are a couple of issues I see with basic access and working with large datasets. Ease of access for typical users is also a valid issue.
First, we still mostly move the data to the computation when we should be moving the computation to the data. Moving the data works fine when the data is small, but if the data volumes are large (as sensor/geo data tends to be) then it can take an incredibly long time to move it. In many cases, more time is spent shoveling data over the network than actually doing the computation. This has become worse as storage density has increased: hundreds of TB per server is now ordinary.
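To make the gap concrete, here's a toy back-of-the-envelope sketch (all numbers are illustrative assumptions, not measurements from any real system) comparing shipping a whole dataset over the network versus shipping only a small query result computed in place:

```python
# Toy illustration with made-up but plausible numbers: how long does it take
# to move the data to the computation vs. moving the computation to the data?

DATASET_BYTES = 100e12        # assume a 100 TB gridded dataset
LINK_BYTES_PER_S = 10e9 / 8   # assume a 10 Gbit/s network link
RESULT_BYTES = 1e6            # assume the query's answer is a 1 MB aggregate

# Move the data to the computation: the whole dataset crosses the network.
move_data_seconds = DATASET_BYTES / LINK_BYTES_PER_S

# Move the computation to the data: only the small result crosses the network.
move_compute_seconds = RESULT_BYTES / LINK_BYTES_PER_S

print(f"shipping the data:   {move_data_seconds / 3600:.1f} hours of transfer")
print(f"shipping the result: {move_compute_seconds * 1000:.1f} ms of transfer")
```

Under these assumptions the transfer alone is roughly 22 hours one way and under a millisecond the other, before any computation even starts, which is the whole argument for in-place processing.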
Second, the data is rarely organized in a way that makes it efficient to extract arbitrary subsets. There is still a lot of what is essentially "grep at scale" going on. Again, not a problem if the data is small, but if I need a specific 50TB subset of a 10PB source, this becomes prohibitively slow. The data needs to be organized such that we can slice and dice it with high selectivity in place, much more like a proper database and less like a distributed filesystem. Because spatiotemporal analysis tends to involve iterative join-like operations, you want this to be as efficient as possible.
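The "high selectivity in place" idea boils down to chunked, indexed layouts (this is roughly what Zarr-style stores do). A minimal sketch, with purely illustrative chunk sizes and no real storage API, of how a bounding-box query maps to only the chunks it actually touches:

```python
# Sketch of chunk-level selectivity: given a (time, lat, lon) bounding box,
# work out which fixed-size chunks intersect it, so only those chunks are
# read instead of scanning the whole store. Chunk sizes are made up.

def chunks_for_query(t_range, lat_range, lon_range, chunk=(24, 10, 10)):
    """Return coordinates of every chunk the query's bounding box intersects.

    Ranges are inclusive (lo, hi) pairs in index space; `chunk` gives the
    chunk edge length along (time, lat, lon).
    """
    def span(lo, hi, size):
        return range(lo // size, hi // size + 1)

    return [
        (t, la, lo)
        for t in span(*t_range, chunk[0])
        for la in span(*lat_range, chunk[1])
        for lo in span(*lon_range, chunk[2])
    ]

# One day over a small region touches a single chunk...
hit = chunks_for_query(t_range=(0, 23), lat_range=(40, 45), lon_range=(100, 108))

# ...out of a hypothetical year-long, global store's worth of chunks.
total = (365 * 24 // 24) * (180 // 10) * (360 // 10)
print(f"read {len(hit)} of {total} chunks")
```

With a flat-file layout the same query degenerates into scanning everything ("grep at scale"); with a chunk index, the read amplification is bounded by how well the chunk shape matches typical query shapes.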
The other big problem is many of these data sources are too large for everyone to have their own copy. Or if they did have their own copy, it would be extraordinarily wasteful. This is adjacent to the first issue. EDIT: And herein is the likely business model.
Want to make sure you're familiar with the Pangeo community: www.pangeo.io
I don't think any of these challenges are "solved", but there's a groundswell of technology that is well-situated to make a big impact in these domains. The largest barriers that still remain are the ownership of engineering processes/workflow to transform larger gridded datasets to ARCO (analysis-ready, cloud-optimized) formats, as well as tooling to mediate between heterogeneous datasets (e.g. combining regular vs irregular or arbitrarily gridded data, such as land surveys or ZIP codes).
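One small, concrete piece of that mediation problem is aligning irregular point data with a regular grid. A rough sketch (grid resolution, points, and function names are all illustrative, not any real library's API) of snapping scattered observations onto regular grid cells:

```python
# Sketch: bucket irregular point observations (e.g. survey sites) into the
# regular lat/lon grid cells that contain them, as a first step toward
# combining them with gridded data. Resolution and points are made up.
from collections import defaultdict

CELL_DEG = 0.25  # assumed grid resolution in degrees

def containing_cell(lat, lon, cell=CELL_DEG):
    """Return the (row, col) of the grid cell containing the point."""
    return (int(lat // cell), int(lon // cell))

points = [(50.12, 8.31), (50.13, 8.33), (49.90, 8.70)]

per_cell = defaultdict(list)
for lat, lon in points:
    per_cell[containing_cell(lat, lon)].append((lat, lon))

# The first two points share a cell; the third lands in a different one,
# so downstream you'd aggregate two values for one cell and one for another.
print({cell: len(pts) for cell, pts in per_cell.items()})
```

Real tooling has to handle much messier cases (curvilinear grids, polygons like ZIP codes rather than points, area weighting), but the core operation is this kind of spatial bucketing.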
There are definitely players in the space working on these, but much is left to be done here.
+1 for Pangeo. We use these toolsets heavily (Xarray, Zarr, Dask) to run our service, which is essentially what you described: taking the larger gridded datasets to ARCO format. I think this is still a bit too heavy for casual Excel/GIS analysts, so we try to make it as simple as possible for them to get climate data in CSV or NetCDF format for their work.
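For the Excel/GIS end of that pipeline, the last step is usually just flattening a gridded field into long-format rows. A minimal stdlib-only sketch with made-up sample values (a real service would pull these from Zarr/NetCDF via Xarray):

```python
# Sketch: flatten a tiny 2x2 gridded temperature field into the long-format
# CSV (one row per grid point) that spreadsheet and GIS tools expect.
# The values are invented sample data.
import csv
import io

lats = [50.0, 50.1]
lons = [8.0, 8.1]
temps_c = [[11.2, 11.4],   # temps_c[i][j] is the value at (lats[i], lons[j])
           [10.9, 11.1]]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["latitude", "longitude", "temperature_c"])
for i, lat in enumerate(lats):
    for j, lon in enumerate(lons):
        writer.writerow([lat, lon, temps_c[i][j]])

print(buf.getvalue())
```

Long format is deliberately redundant compared to the gridded original, but it is what Excel pivot tables and most GIS "import points" workflows can consume directly.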
This sounds really interesting. I'd be really interested in working on these things; I think about and work with similar-ish systems a lot. But I'm not sure how to enter the related green-tech space while living in Europe. I'd love to try to build a product myself, but then I need a customer to try ideas with.
We are building data storage and processing infrastructure at a company that connects DERs (distributed energy resources). This involves storing large amounts of structured time-series data. This data is then further processed and used for all kinds of use cases, many of which were mentioned in the first half of the post's article (e.g. energy management systems, load management, etc.).
Any chance you guys provide a free API for the little guy? I would love to have access to climate data via a JSON REST API, specifically historical temperature and precipitation data at minimum.
I poked around a while back and wasn't able to find much of anything on the web. Maybe I missed it?
Certainly - take a look (https://oikolab.com) and let me know your use case. There is a free tier, but we've also given free access to quite a few researchers, non-profits and university students for their projects when they reached out to us.
Sounds good - can you sign up for the free tier and send me an email to support@oikolab.com? This comes to me so I'll know which account belongs to you and we can take it from there.