> One could note the number of times that a 25% probability was quoted, over a long period, and compare this with the actual proportion of times that rain fell.
It still depends on many samples, or "over a long period" as your doc puts it.
You can't escape the fact that there are only one or two samples, no matter how much math you throw around.
> You can't escape the fact that there are only one or two samples, no matter how much math you throw around.
That depends on what question you're asking. "How well calibrated are the electoral predictions that FiveThirtyEight makes?" is a sensible question with a lot of data points. It speaks directly to the crowing about the one call being bad, and it's well suited to a scoring rule, which lets you compare people making predictions about the same things.
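For what it's worth, here's a rough sketch of what that could look like in Python. It assumes the Brier score as the scoring rule (one common choice, not necessarily what anyone in this thread has in mind), and the forecasts and outcomes are made up purely for illustration; `brier_score` and `calibration_table` are hypothetical helper names.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared difference between quoted probabilities and 0/1 outcomes.

    Lower is better; always saying 50% scores exactly 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equally sized, non-empty forecasts and outcomes")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def calibration_table(forecasts, outcomes):
    """Group forecasts by quoted probability.

    Returns {probability: (times quoted, observed frequency of the event)}.
    """
    buckets = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        buckets[round(p, 2)].append(o)
    return {p: (len(v), sum(v) / len(v)) for p, v in sorted(buckets.items())}


# Made-up data: 25% was quoted 8 times, 75% was quoted 4 times.
forecasts = [0.25] * 8 + [0.75] * 4
outcomes = [1, 0, 0, 0, 0, 1, 0, 0] + [1, 1, 0, 1]

# Compare quoted probability with how often the event actually happened.
for prob, (count, freq) in calibration_table(forecasts, outcomes).items():
    print(f"quoted {prob:.0%} x{count}: happened {freq:.0%} of the time")

# Compare against a forecaster who always says 50% on the same events.
print("forecaster Brier:", brier_score(forecasts, outcomes))
print("always-50% Brier:", brier_score([0.5] * len(outcomes), outcomes))
```

The table is the "compare quoted 25% with the actual proportion" check from the quote above, and the Brier score gives a single number for comparing two forecasters on the same set of events. Neither says much with only one or two forecasts, which is why the question you ask matters.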
I'm not seeing a formula there.