This is a thoughtful and thought-provoking blog post. I think it's worth asking similar questions of computer science. I think you'll find some math-like patterns -- there's basically no chance Quicksort or any other fundamental algorithm paper will fail to replicate -- and some patterns which will fail to replicate, like in software engineering.
Some of the early results on pseudorandom generators and hash functions aren't holding up well, but I think that's just progress. We understand the problem a whole lot better than we did back then.
Perhaps more interesting is the literature on memory models. The original publications of the Java and C11 memory models had lots of flaws, which took many years to fix (and that process might not be totally done). I worry that there are a bunch of published results that are similarly flawed but just haven't gotten as much scrutiny.
There was that time when it was discovered that nearly all published binary searches and mergesorts were wrong. [1]
And yet, the concepts of binary search and merge sort are fine.
I think that's quite similar to the situation in math papers? Because math isn't executable, a math paper being "significantly" wrong would be like discovering that a program uses a fatally flawed algorithm and is trying to do the impossible. It can't be fixed.
Rather than "wrong", I would describe those implementations as "not fully general". They work perfectly when `n < some large number`, as opposed to `n < some large number * 2`. The latter is the best you can do with that function signature, but it's somewhat arbitrary: you could easily choose a 64-bit index and exceed all practical needs.
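For anyone who hasn't seen the bug being discussed: the overflow is in the midpoint computation. A minimal sketch in C (the function name and test values are my own, not from any of the papers in question):

```c
#include <assert.h>

/* Minimal sketch of the classic midpoint bug and its fix.
   The broken form, mid = (low + high) / 2, overflows a signed int
   once low + high exceeds INT_MAX -- which only happens for arrays
   of a billion-plus elements, hence "works perfectly when n is small".
   The form below is overflow-safe, since high - low never exceeds high. */
int binary_search(const int *a, int n, int key)
{
    int low = 0, high = n - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;  /* overflow-safe midpoint */
        if (a[mid] < key)
            low = mid + 1;
        else if (a[mid] > key)
            high = mid - 1;
        else
            return mid;
    }
    return -1;  /* key not present */
}
```

With a 32-bit `int`, the broken midpoint only misbehaves past roughly a billion elements, which is exactly the "not fully general" point: the code is correct on every input anyone actually fed it for decades.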
Many of these examples aren't really in C but in C-like pseudocode, and in that domain they are perfectly accurate. Even in C, the bit length of `int` is implementation-defined: it only needs to be 16 bits or more, and it could easily be wide enough to exceed the number of atoms in the known universe, in which case overflow is a practical impossibility.
I'd say that's an example of the difference between science (the authors don't need to show every detail of a practical implementation, and can assume an `int` with infinitely many bits) and engineering, where you do need to make such considerations.
From my experience in ML, I'd suspect that the "crisis" isn't that the research is false so much as that it's useless (algorithm x with parameter set w happens to work well on one particular dataset; conclusion: I have revolutionised the field).
This isn't unique to ML. A lot of research is about adding an epsilon to an existing paper, which probably doesn't interest anyone outside the small community working on that particular niche.
But does that mean there's a crisis? Maybe it's just a way to foster an environment that lets great ideas emerge.
I’d rather be in the world where we have too many papers tweaking the details of power posing and exactly measuring how much each contributes to the effect. At least we’d know the effect is real.
The parts of CS that are the most math-like (which include fundamental algorithms) don't have a replication crisis, but the ones that are the most social-science like probably do, or would. I would bet large sums of money that a lot of the literature on stuff like "does OOP lead to better code", "does IDE syntax highlighting lead to fewer bugs" etc. would fail to replicate if anyone bothered trying.
The thing is, the general sense I get is that people in CS already have so little confidence in these results that it's not even considered worth the time to try and refute them. Which doesn't exactly speak well of the field!
I worry about ML papers in particular. Models are closely guarded and often impractical to train independently, whether because of ownership of the training/test sets, the computing power required, or details left out of the paper. There's no way to mathematically prove any of it works, either. It's like social science done on organisms we've designed.
Some measurements are interesting and valuable without being replicable. For example, the number of online devices or the number of websites using wordpress. Take the same measurement at a later point in time and the results are different. Yet I wouldn't call those fields maths-like.
Research into this stuff is very young and so I think it's fair to be skeptical of the results. I'm hoping we'll eventually come up with more rigorous, reproducible results.
In competitive programming you could basically assume the pseudocode in a paper is not literally correct and requires some tweaking to work, despite a “proof” of its correctness. Particularly with string algorithms.
Rote-translating pseudocode into your target language isn't likely to pan out well.
So instead you run the pseudocode in your mind and develop an intuition for how it works, and that's the "replication" bit this post talks about with reviewing math papers.
But both the pseudocode and your code will likely have edge cases you didn't handle. That isn't a problem for math; it's the category of common, trivial, easily fixable proof errors that don't really affect the paper. It is a problem for machines that run the code literally.
Maybe a good compromise strategy for formal verification is to declare the insight of the algorithm (a recurrence relation or whatever) as an axiom, and then use the prover to whack the tricky edge cases.
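As a toy illustration of that strategy (all names here are hypothetical, sketched in Lean 4): take the algorithm's recurrence on faith as axioms over an opaque function, then make the prover grind out a boundary case mechanically.

```lean
-- Toy sketch of "axiomatize the insight, verify the edges".
-- `fastFib` stands in for some published algorithm; its recurrence is
-- assumed, and the prover is used only to check the edge cases.
opaque fastFib : Nat → Nat

axiom fastFib_zero : fastFib 0 = 0
axiom fastFib_one  : fastFib 1 = 1
axiom fastFib_rec  : ∀ n, fastFib (n + 2) = fastFib (n + 1) + fastFib n

-- An edge case discharged mechanically from the axioms:
example : fastFib 2 = 1 := by
  have h := fastFib_rec 0
  simp [fastFib_zero, fastFib_one] at h
  exact h
```

The trade-off is obvious: the recurrence itself goes unverified, but the fiddly index arithmetic, which is where the published pseudocode usually breaks, gets machine-checked.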
Yes, I'm convinced tons of published results are flawed! I've heard top researchers tell their students "don't spend too much time on the proofs, nobody reads them". And many CS papers don't get a lot of attention. But that's not necessarily bad: other researchers build on top of this work, and the results consolidate over time.
No, it's not. In that specific case, the supervisor thought the value of the paper didn't lie in the proofs, plus it was a rank-B conference. He would rather have his student working on a different paper than spending a week on the proofs.