Empirical evaluations break down quickly as the complexity of what is being measured increases, so your best bet in doing a study is to focus on one or a few features.
PLs are of course evaluated over time in the marketplace, like any other designed artifact.
I love that report and find it very inspiring (and it gives me an intuition that Haskell has an edge over the other languages), but the methodology is very disappointing:
- The requirements were mostly left to the interpretation of the implementors, who more or less decided the scope of their own programs.
- All results were self-reported. Even ruling out dishonesty, there are a lot of ways uncontrolled experimenters can report incorrect results.
- Many implementations weren't even runnable.
- No code was run by the reviewers, at all.
I would really love to see a more serious attempt at this experiment (probably with more modern languages).