
The study's abstract (http://science.sciencemag.org/content/351/6280/1433) says: "We found a significant effect in the same direction as in the original study for 11 replications (61%); on average, the replicated effect size is 66% of the original."

On average, the replicated effect size is only 66% of the original effect size. What explains this?



> We found a significant effect in the same direction as in the original study for 11 replications (61%); on average, the replicated effect size is 66% of the original.

EDIT: Similar results to the well-known psychology replication study of a few years ago:

Strictly on the basis of significance — a statistical measure of how likely it is that a result did not occur by chance — 35 of the studies held up, and 62 did not. (Three were excluded because their significance was not clear.) The overall “effect size,” a measure of the strength of a finding, dropped by about half across all of the studies. Yet very few of the redone studies contradicted the original ones; their results were simply weaker.

https://www.nytimes.com/2015/08/28/science/many-social-scien...


Assume you measure an effect and get two values: the estimated effect size, in standard deviations, and a p-value representing the probability that, if the true effect size were zero, you would have estimated an effect size at least as large as the one you did get.

In phase two, assume you publish a paper if the p-value is less than 0.05, regardless of the effect size. If p >= 0.05, you don't publish anything.

We are now done with the assumptions.

p-values get smaller as the effect size increases, and they get smaller as the sample size increases. There is a related concept called statistical power: roughly, for a given p-value threshold and a fixed sample size, there is a smallest effect size a study can reliably detect. Larger sample sizes mean more statistical power, which means the study can detect smaller effect sizes at the same p-value threshold.

Adding this all up, we can see that:

- if the true effect size is small;

- AND if the sample size is "small", defined relative to the true effect size;

- AND if we filter studies by whether they meet a p-value threshold ("reach statistical significance");

- THEN the only studies that can be published will find effect sizes that are too large. They do not have the power to detect the true effect size; the only possible results are "no effect" and "unrealistically large effect".

The quick summary is that a reduced effect size on replication is a strong indicator that the original finding was spurious. As replications continue they will trend toward the higher of the true effect size or the floor set by the available statistical power.
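
A quick simulation makes the mechanism concrete. This is a minimal sketch, not the study's actual analysis: it assumes a small true effect of 0.2 SD, small samples of n = 20 per group, and a p < 0.05 publication filter, all numbers picked purely for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_effect = 0.2    # assumed small true effect, in SD units
    n = 20               # assumed small per-group sample (underpowered)

    published = []
    for _ in range(10_000):
        treatment = rng.normal(true_effect, 1, n)
        control = rng.normal(0, 1, n)
        _, p = stats.ttest_ind(treatment, control)
        if p < 0.05:     # the publication filter
            published.append(treatment.mean() - control.mean())

    print("fraction published:   ", len(published) / 10_000)
    print("mean published effect:", np.mean(published))
    # Only a minority of runs clear the threshold, and those that do report
    # an average effect several times larger than the true 0.2 SD.

A later replication with a larger sample (or without the publication filter) would estimate something close to 0.2, i.e. a much smaller effect than the published original.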


Actually, any strength of effect is possible, from tiny but misreported as big, to huge but underreported. That is what it means for the studies to lack the statistical power to measure the effect size. Reproducing the results and averaging over all of them, with corrections for multiple comparisons and proper pooling (a meta-analysis), could then give a valid estimate of the effect size.

A p-value only tests for a non-null result, and even then only if it isn't circumvented.

It would be good to produce a funnel plot of the effect sizes reported in those underpowered studies. Perhaps the ones showing small (but non-null) effect sizes simply don't get published.
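
For what it's worth, the pooling step is mechanically simple once per-study estimates and standard errors are on the table. A minimal sketch of fixed-effect inverse-variance pooling, with made-up numbers standing in for the replication results:

    import numpy as np

    # hypothetical per-study effect estimates and standard errors
    effects = np.array([0.45, 0.10, 0.22, 0.05, 0.31])
    ses     = np.array([0.20, 0.12, 0.15, 0.10, 0.18])

    weights   = 1.0 / ses**2                      # inverse-variance weights
    pooled    = (weights * effects).sum() / weights.sum()
    pooled_se = (1.0 / weights.sum()) ** 0.5

    print(f"pooled effect: {pooled:.3f} +/- {1.96 * pooled_se:.3f} (95% CI)")
    # A funnel plot is just effect size against precision (1/se); if small
    # non-null effects go unpublished, the lower part of the funnel looks
    # visibly asymmetric.

(This ignores between-study heterogeneity; a random-effects model would be the more careful choice.)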


There's a simpler mathematical explanation. It's called regression toward the mean:

https://en.wikipedia.org/wiki/Regression_toward_the_mean


That is not actually an explanation; you can only apply regression to the mean if you know what the mean is. The explanation I give correctly predicts that replicated experiments will see their effect sizes decline. Saying "regression to the mean" does not.

(It is quite possible to interpret this as regression driven by the p-value threshold, but if you do that, you're relying on the explanation I gave.)


That's not true. If I take the top 5 students based on performance on a test and put them in a group, they will likely do worse the second time around. No need to know the mean.
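
Here's a sketch of that example, assuming each score is a stable ability plus independent test-to-test noise (all numbers invented):

    import numpy as np

    rng = np.random.default_rng(1)
    ability = rng.normal(70, 5, 100)            # each student's stable skill
    test1 = ability + rng.normal(0, 10, 100)    # noisy first test
    test2 = ability + rng.normal(0, 10, 100)    # noisy second test

    top5 = np.argsort(test1)[-5:]               # select the top 5 on test 1
    print("top-5 mean on test 1:", test1[top5].mean())
    print("same group on test 2:", test2[top5].mean())
    # The second mean is lower on average: part of the test-1 advantage was
    # luck, and luck doesn't repeat. No population mean is needed to see it.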



