There's 0 chance they stop releasing open stuff. They are still way behind, so they have all the same incentives they had before to release models: brand recognition, devs engaged with their stack, slowing down competition, and so on. They will keep some models internal, or further train them on different datasets (internal?) for their own product use, but the "small" models (1-200ish B params) will be released as open weights for the foreseeable future.
The article you link is not using the CLT correctly.
The CLT gives a result about a recentered and rescaled version of the sum of iid variates. CLT does not give a result about the sum itself, and the article is invoking such a result in the “files” and “lakes” examples.
I’m aware that it can appear that CLT does say something about the sum itself. The normal distribution of the recentered/rescaled sum can be translated into a distribution pertaining to the sum itself, due to the closure of Normals under linear transformation. But the limiting arguments don’t work any more.
What I mean by that statement: in the CLT, the errors of the distributional approximation go to zero as N gets large. For the sum, of course the error will not go to zero - the sum itself is diverging as N grows, and so is its distribution. (The point of centering and rescaling is to establish a non-diverging limit distribution.)
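Here's a quick numerical sketch of that distinction (my own illustration, using unit-rate exponential summands, nothing from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep = 20_000  # Monte Carlo replications per N

# Sums of N iid Exponential(1) variates: E[S_N] = N, Var[S_N] = N.
for N in (10, 100, 1000):
    S = rng.exponential(1.0, size=(n_rep, N)).sum(axis=1)  # the raw sums S_N
    Z = (S - N) / np.sqrt(N)                                # recentered and rescaled

    # Z settles onto a fixed N(0,1) target (its skew shrinks toward 0),
    # while the raw sum S_N keeps drifting: its mean and spread grow without bound.
    skew_Z = np.mean(Z**3)  # Z has mean ~0 and sd ~1 by construction
    print(f"N={N:5d}  skew(Z)={skew_Z:+.3f}  mean(S)={S.mean():9.1f}  sd(S)={S.std():6.1f}")
```

The limit statement is about Z, not about S; there is no fixed distribution for S to converge to.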
So for instance, the third central moment of the Gaussian is zero. But the third central moment of a sum of N iid exponentials grows without bound as N grows (the sum is a gamma with shape parameter N). This third-moment divergence will happen for any base distribution with non-zero skew.
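Concretely, for unit-rate exponential summands X_i (standard gamma moment identities, not from the post):

```latex
S_N=\sum_{i=1}^{N}X_i \sim \mathrm{Gamma}(N,1), \qquad
\operatorname{Var}(S_N)=N, \qquad
\mu_3(S_N)=2N \;(\to\infty), \qquad
\operatorname{skew}(S_N)=\frac{\mu_3}{\sigma^3}=\frac{2}{\sqrt{N}} \;(\to 0).
```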
The above points out another fact about the CLT: it does not say anything about the tails of the limit distribution. Just about the core. So CLT does not help with large deviations or very low-probability events. This is another reason the post is mistaken, which you can see in the “files” example where it talks about the upper tail of the sum. The CLT does not apply there.
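And a quick check of the tail point, again with exponential summands (my own sketch; the exact gamma tail via scipy):

```python
import numpy as np
from scipy import stats

# S_N = sum of N iid Exponential(1) variates is exactly Gamma(N, 1).
# Compare its true upper tail with the CLT-style normal approximation N(N, N).
N = 100
for k in (2, 4, 6, 8):                   # threshold = mean + k standard deviations
    x = N + k * np.sqrt(N)
    exact = stats.gamma.sf(x, a=N)       # true P(S_N > x)
    approx = stats.norm.sf(k)            # normal approximation to the same tail
    print(f"k={k}: exact={exact:.2e}  normal={approx:.2e}  ratio={exact/approx:.1f}")
```

The approximation is fine near the bulk (k=2), but the ratio blows up as k grows, and the far tail is exactly the regime the "files" example is reasoning in.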
Postscript: looking at the lesswrong link referenced by the post above, you will notice that the “eyeball metric” density plots happen to be recentered and scaled so that they capture the mass of the density. This is the graphical counterpart of the algebraic scaling and centering needed in the CLT.
I passed through the highly-irreproducible eras described in the section you link, and that you summarize in your last paragraph. There was so much different FP hardware, and so many different niche compilers, that my takeaway became “you can’t rely on reproducibility across any hardware/os version/compiler/library difference”.
But your point is that irreproducibility at the level of interrupts or processor scheduling is not a thing on contemporary mainstream hardware. That’s important and I hadn’t realized that.
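For anyone who hasn't seen it bite: one mundane source of those cross-toolchain differences is that floating-point addition isn't associative, so anything that changes summation order (compiler flags, SIMD width, BLAS build, thread count) can change the low bits of a result. A tiny illustration:

```python
import random

# Sum the same numbers in two different orders; the results typically
# differ in the last few bits because float addition is not associative.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward = sum(xs)
backward = sum(reversed(xs))
print(forward == backward, forward - backward)
```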
The GRACE measurement of mass change is one of the more revolutionary advances in Earth science remote sensing in the last few decades. It has provided a unique and completely novel view of groundwater mass change. GRACE is the main reason we know so much about the massive groundwater loss in the Ogallala aquifer in the US Midwest, in the Central Valley in California, and in northern India. Water well data exists, but it is very sparse and idiosyncratic.
It’s also our main window into mass losses from the high-latitude ice sheets (Greenland, Antarctica). We have radar altimetry data from Antarctica, but because of glacial rebound and other effects, it’s not easy to translate height changes into mass changes. GRACE measures mass change directly.
Several authors of the cited study are on the science team. It is a JPL instrument.
The original GRACE pair used a radio link to measure separation and velocity, while the GRACE-FO follow-on uses a laser. I assume the much shorter wavelength of the laser provides a more accurate measurement. It’s possible that GRACE-FO has a slightly higher spatial resolution (I’ve worked with GRACE but not GRACE-FO); the horizontal resolution of GRACE is about 100 km, or about 1 degree.
From an inference perspective the measurement is very interesting. They pool about a month’s worth of observations of the distance and velocity of a pair of satellites, and do a Bayesian inversion to obtain a parameterized gravitational potential for that month. The map from gravitational potential to observation is known analytically, so it’s readily possible to get a spatial covariance for the gravitational potential, as well as the point estimate.
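For intuition, here's a minimal linear-Gaussian toy version of that kind of inversion (made-up sizes and a random forward map, nothing like the real JPL processing chain):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: a month of pooled observations y modeled as y = G @ m + noise,
# where m holds the coefficients of a parameterized potential and G is the
# known (here: random, purely illustrative) forward map.
n_obs, n_params = 500, 20
G = rng.normal(size=(n_obs, n_params))
m_true = rng.normal(size=n_params)
sigma = 0.1                                   # assumed observation noise std
y = G @ m_true + sigma * rng.normal(size=n_obs)

# With a Gaussian prior m ~ N(0, prior_cov) and Gaussian noise, the posterior
# is Gaussian too, so we get a point estimate *and* a full covariance in closed form.
prior_cov = np.eye(n_params)
post_cov = np.linalg.inv(np.linalg.inv(prior_cov) + (G.T @ G) / sigma**2)
post_mean = post_cov @ (G.T @ y) / sigma**2

print("rms error of posterior mean:", np.sqrt(np.mean((post_mean - m_true) ** 2)))
print("typical posterior std:      ", np.sqrt(np.diag(post_cov)).mean())
```

In this linear-Gaussian setting the posterior covariance comes essentially for free, which is the analogue of the spatial covariance mentioned above.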
I was also pleased to see large deviations, although the lecture notes don’t actually define what a large deviation is.
They do give an example of a Chernoff (exponential) bound for a sum of iid random variables. The bound of course has an exponential form - they just don’t call it a large deviation. So it’s a bit of a missed opportunity, given that the name is in the chapter title.
These bounds come up all over the place in CS, but especially lately in learning theory.
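For anyone curious, here's a minimal sketch of the Chernoff recipe mentioned above, with Bernoulli summands chosen for convenience (not necessarily the notes' exact example):

```python
import numpy as np
from scipy.stats import binom

# Chernoff bound for the upper tail of S = sum of n iid Bernoulli(p):
#   P(S >= a) <= min_{t>0} exp(-t*a) * E[exp(t*S)]
n, p, a = 1000, 0.5, 600

ts = np.linspace(1e-3, 2.0, 2000)
log_mgf = n * np.log1p(p * (np.exp(ts) - 1.0))   # log E[exp(t*S)] for Binomial(n, p)
log_bound = -ts * a + log_mgf
chernoff = np.exp(log_bound.min())               # optimize the exponent over t

exact = binom.sf(a - 1, n, p)                    # exact P(S >= a)
print(f"Chernoff bound: {chernoff:.3e}   exact tail: {exact:.3e}")
```

The bound is loose by a polynomial factor but captures the exponential decay rate, which is the large-deviations flavor.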
For reasons explained in the article, we are bad at estimating small probabilities.
Similarly, we are bad at estimating small proportions ("easily shave 2%"). What is being claimed in the parentheses is that there's a probability distribution over "how much costs are shaved" and that we can estimate where the bulk of its mass sits.
But we're not really good at making such estimates. Maybe there is some probability mass around 2%, but the bulk is around 0.5%. That looks like a small difference (just 1.5 percentage points!), but it's a factor of 4 in terms of savings.
So now we have a large number (annual spend), multiplied by a very uncertain number (cost shave, with poor experimental support), leading to a very uncertain outcome in terms of savings.
And it can be that, in reality, the costs of changing service turn out to overwhelm this outcome.
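To make that concrete, a toy Monte Carlo with made-up numbers (the spend, the 0.5% median, and the lognormal shape are all assumptions, not estimates):

```python
import numpy as np

rng = np.random.default_rng(1)

# A precisely known annual spend times a poorly known "fraction shaved"
# gives a very uncertain savings number.
annual_spend = 10_000_000

# Suppose our honest belief about the shave is lognormal, with its median near
# 0.5% and only a small tail reaching the 2% headline figure.
shave = rng.lognormal(mean=np.log(0.005), sigma=0.7, size=100_000)
savings = annual_spend * shave

print("median savings:      ", round(np.median(savings)))
print("central 90% interval:", np.round(np.percentile(savings, [5, 95])))
print("P(shave >= 2%):      ", (shave >= 0.02).mean())
```

The point is just how wide that interval is relative to the headline "2% of spend" number.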
When modern advertising is a spectrum of “lies, damn lies, and statistics,” I don’t blame folks for crying foul and demanding a baseline level of truth in advertising. When people start to trust but verify, some see that as a change to the status quo, and some of those who protest about it in those terms are trying to sell you something.
I believe the deep-ocean vents you mention are beside the point. The article is discussing the upwelling of cold, CO2-rich water in the Southern Ocean - not emissions from vents.
Also, it’s worth noting that the PNAS article does not mention CO2 per se, only upwelling. It’s the linked article, summarizing the press release, that draws the CO2 connection.
Besides the connections you mention, the PNAS article points out that this result shows that current models of ice/ocean interaction are not reproducing these observed trends.
Yeah - my guess is this was just a very roundabout solution for setting axis limits.
(For some reason, plt.bar was used instead of plt.plot, so the y axis would start at 0 by default, making all results look the same. But when the log scale is applied, the lower y limit becomes the data’s minimum. So, because the dynamic range is so low, the end result is visually identical to having just set y limits using the original linear scale).
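A minimal reproduction of what I mean, with made-up numbers in place of the post's data:

```python
import matplotlib.pyplot as plt

# Hypothetical benchmark results with a small dynamic range (not the post's data).
labels = ["A", "B", "C"]
vals = [101.2, 103.7, 102.4]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Linear bar chart: the y axis starts at 0, so the three bars look identical.
ax1.bar(labels, vals)
ax1.set_title("bar, linear y")

# Same bars on a log y axis: the axis no longer starts at 0, and (as described
# above) the lower limit ends up near the data minimum, so the differences show.
ax2.bar(labels, vals)
ax2.set_yscale("log")
ax2.set_title("bar, log y")

plt.tight_layout()
plt.show()
```

Which, as you say, ends up looking the same as just setting y limits on a linear axis.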
Anyhow, for anyone interested, the values for those 3 points are 2.0000 (exact), 1.9671 (trapezoid), and 1.9998 (Gaussian). The relative errors are 1.6% vs. 0.01%.
He is thinking about a random choice among the 20 edges branching out from each vertex.