
Early in the A-B craze (optimal shade of blue nonsense), I was talking to someone high up with an online hotel reservation company who was telling me how great A-B testing had been for them. I asked him how they chose stopping point/sample size. He told me experiments continued until they observed a statistically significant difference between the two conditions.

The arithmetic is simple and cheap. Understanding basic intro stats principles, priceless.



> He told me experiments continued until they observed a statistically significant difference between the two conditions.

Apparently, if you do the observing the right way, that is a sound way to do that. https://en.wikipedia.org/wiki/E-values:

“We say that testing based on e-values remains safe (Type-I valid) under optional continuation.”
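As a toy illustration of why optional continuation is safe (my own sketch, not from the article): with a likelihood-ratio e-value against a fixed alternative, the running product is a nonnegative martingale with mean 1 under the null, so Ville's inequality caps the chance it ever crosses 1/alpha, no matter when or how often you peek. The rates and alpha below are made up.

    # Toy e-process: test H0: conversion rate = p0 against a fixed alternative p1.
    import random

    def e_process(observations, p0=0.05, p1=0.08, alpha=0.05):
        e = 1.0
        for t, x in enumerate(observations, start=1):
            # Per-observation likelihood ratio of H1 vs H0.
            e *= (p1 if x else 1 - p1) / (p0 if x else 1 - p0)
            if e >= 1 / alpha:   # Ville's inequality: P(ever here | H0) <= alpha
                return t, e      # safe to stop here -- or to keep collecting data
        return None, e

    random.seed(0)
    data = [random.random() < 0.08 for _ in range(20_000)]  # true rate = p1
    print(e_process(data))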


This is correct. There's been a lot of interest in e-values and non-parametric confidence sequences in recent literature. It's usually referred to as anytime-valid inference [1]. Evan Miller explored a similar idea in [2]. For some practical examples, see my Python library [3] implementing multinomial and time-inhomogeneous Bernoulli / Poisson process tests based on [4]. See [5] for linear models / t-tests.

[1] https://arxiv.org/abs/2210.0194

[2] https://www.evanmiller.org/sequential-ab-testing.html

[3] https://github.com/assuncaolfi/savvi/

[4] https://openreview.net/forum?id=a4zg0jiuVi

[5] https://arxiv.org/abs/2210.08589


Did you link the thing that you intended to for [1]? I can't find anything about "anytime-valid inference" there.


Thanks for noting! This is the right link for [1]: https://arxiv.org/abs/2210.01948


Sounds like you already know this, but that's not great and will give a lot of false positives. In science this is called p-hacking. The rigorous way to use hypothesis testing is to calculate the required sample size for the expected effect size up front and run the test only once that sample size is reached. But this requires knowing the expected effect size.

If you are doing a lot of significance tests you need to adjust the p-level, dividing it by the number of implicit comparisons (a Bonferroni correction), so e.g. only accept p < 0.001 if running one test per day.
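A rough sketch of the fixed-sample-size calculation (baseline rate, lift, alpha and power are all illustrative assumptions), with the Bonferroni split for multiple planned comparisons:

    # Standard two-proportion power calculation, stdlib only.
    from statistics import NormalDist

    def required_n_per_arm(p_base, p_alt, alpha=0.05, power=0.8, comparisons=1):
        alpha = alpha / comparisons  # Bonferroni: split alpha across planned tests
        z_a = NormalDist().inv_cdf(1 - alpha / 2)
        z_b = NormalDist().inv_cdf(power)
        var = p_base * (1 - p_base) + p_alt * (1 - p_alt)
        return (z_a + z_b) ** 2 * var / (p_base - p_alt) ** 2

    # e.g. to detect a lift from 5% to 6% conversion in a single test:
    print(round(required_n_per_arm(0.05, 0.06)))  # ~8,155 users per arm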

Alternatively, just do Thompson sampling until one variant dominates.
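A minimal Thompson sampling sketch with Beta posteriors (the conversion rates are made up):

    import random

    true_rates = {"A": 0.05, "B": 0.06}  # unknown in practice
    wins = {"A": 0, "B": 0}
    losses = {"A": 0, "B": 0}

    random.seed(1)
    for _ in range(50_000):
        # Draw a plausible rate for each arm from its Beta posterior
        # and serve the arm with the highest draw.
        sampled = {k: random.betavariate(1 + wins[k], 1 + losses[k]) for k in true_rates}
        arm = max(sampled, key=sampled.get)
        if random.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1

    print(wins, losses)  # traffic drifts toward B as evidence accumulates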


To expand, the p-value tells you significance (more precisely, the likelihood of the effect if there were no underlying difference). But if you check it over and over again and act on whichever value you like, you've subverted the measure.
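To make that concrete, here's a quick simulation (all numbers are illustrative): under a true null, checking a z-test after every batch and stopping at the first p < 0.05 rejects far more often than 5% of the time.

    import random
    from statistics import NormalDist

    def peek_until_significant(batches=20, batch_size=500, p=0.05):
        a = b = n = 0
        for _ in range(batches):
            n += batch_size
            a += sum(random.random() < p for _ in range(batch_size))
            b += sum(random.random() < p for _ in range(batch_size))
            pooled = (a + b) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            if se > 0 and abs(a / n - b / n) / se > NormalDist().inv_cdf(0.975):
                return True  # "significant" -- but both variants are identical
        return False

    random.seed(2)
    rate = sum(peek_until_significant() for _ in range(500)) / 500
    print(f"false positive rate with peeking: {rate:.0%}")  # well above 5%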

Thompson/multi-armed bandit optimizes for outcome over the duration of the test, by progressively altering the treatment %. The test runs longer, but yields better outcomes while doing it.

It's objectively a better way to optimize, unless there is time-based overhead to the existence of the A/B test itself. (E.g. maintaining two code paths.)


I just wanted to affirm what you are doing here.

A key point here is that p-values optimize for detection of effects only if you do everything right, which, as you point out, is not common.

> Thompson/multi-armed bandit optimizes for outcome over the duration of the test.

Exactly.


The p-value is the probability of getting an effect at least as large as the observed one purely due to sampling error, under the assumption of perfectly random sampling with no real effect. It says very little.

In particular, if you aren't doing perfectly random sampling it is meaningless. If you are concerned about other types of error than sampling error it is meaningless.

A significant p-value is nowhere near proof of effect. All it does is suggestively wiggle its eyebrows in the direction of further research.


> likelihood of the effect if there were no underlying difference

By "effect" I mean "observed effect"; i.e. how likely are those results, assuming the null hypothesis.


Many years ago I was working for a large gaming company and I was the one who developed a very efficient and cheap way to split any cluster of users into A/B groups. The company was extremely happy with how well that worked. However, I did some investigation on my own a year later to see how the business development people were using it and... yeah, pretty much what you said. They were literally brute-forcing different configurations until they (more or less) got the desired results.
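(The original method isn't described here, so purely for context, this is just a common way to do cheap, deterministic bucketing: hash the user id with an experiment-specific salt.)

    import hashlib

    def assign_group(user_id: str, experiment: str, groups=("A", "B")) -> str:
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return groups[int(digest, 16) % len(groups)]

    print(assign_group("user-42", "checkout-button-color"))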


Microsoft has a seed finder specifically aimed at avoiding a priori bias in experiment groups, but IMO the main effect is pushing whales (which are possibly bots) into different groups until the bias evens out.

I find it hard to imagine obtaining much bias from a random hash seed in a large group of small-scale users, but I haven't looked at the problem closely.
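As I understand the idea (a rough sketch of the general approach, not Microsoft's actual implementation): re-salt the hash split until a pre-experiment metric, say prior spend, is roughly balanced across groups, so a few whales don't skew one arm.

    import hashlib, random, statistics

    def split(users, salt):
        groups = {"A": [], "B": []}
        for uid, spend in users:
            h = int(hashlib.sha256(f"{salt}:{uid}".encode()).hexdigest(), 16)
            groups["A" if h % 2 == 0 else "B"].append(spend)
        return groups

    def find_balanced_salt(users, tolerance=0.01, max_tries=1000):
        for salt in range(max_tries):
            g = split(users, salt)
            if abs(statistics.mean(g["A"]) - statistics.mean(g["B"])) < tolerance:
                return salt
        return None

    random.seed(3)
    # Mostly small spenders plus a heavy tail of big spenders (made-up data).
    users = [(f"u{i}", random.paretovariate(3)) for i in range(10_000)]
    print(find_balanced_salt(users))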


We definitely saw bias, and it made experiments hard to launch until the system started pre-identifying unbiased population samples ahead of time, so the experiment could just pull pre-vetted users.


This is a form of "interim analysis" [1].

[1] https://en.wikipedia.org/wiki/Interim_analysis
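For illustration, one group-sequential flavor of this (numbers made up): pre-plan the number of looks and calibrate a single stricter per-look threshold by simulation so the overall Type-I error stays near 5%, in the spirit of a Pocock-style constant boundary.

    import random
    from statistics import NormalDist

    def max_abs_z_under_null(looks=5, batch=500, p=0.05):
        a = b = n = 0
        zs = []
        for _ in range(looks):
            n += batch
            a += sum(random.random() < p for _ in range(batch))
            b += sum(random.random() < p for _ in range(batch))
            pooled = (a + b) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            zs.append(abs(a / n - b / n) / se)
        return max(zs)

    random.seed(4)
    sims = sorted(max_abs_z_under_null() for _ in range(2000))
    threshold = sims[int(0.95 * len(sims))]  # 95th percentile of the max |z|
    print(f"per-look threshold for 5 looks: {threshold:.2f} (vs 1.96 for a single look)")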


And yet this is the default. As commonly implemented, A/B testing is an excellent way to look busy, and people will actively resist changing processes to make them more reliable.

I think this is not unrelated to the fact that if you wait long enough you can get a positive signal from a neutral intervention, so you can literally shuffle chairs on the Titanic and claim success. The incentives are against accuracy because nobody wants to be told that the feature they've just had the team building for 3 months had no effect whatsoever.


This is surely more efficient if you do the statistics right? I mean, I'm sure they didn't, but the intuition that you can stop once there's sufficient evidence is correct.


Bear in mind many people aren’t doing the statistics right.

I’m not an expert but my understanding is that it’s doable if you’re calculating the correct MDE based on the observed sample size, though not ideal (because sometimes the observed sample is too small and there’s no way round that).

I suspect the problem comes when people don’t adjust the MDE properly for the smaller sample. Tools help but you’ve gotta know about them and use them ;)
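Something like this, as a rough sketch (baseline rate and power are illustrative assumptions): invert the usual power formula to see what lift a given sample per arm could actually detect.

    from statistics import NormalDist

    def mde(n_per_arm, p_base=0.05, alpha=0.05, power=0.8):
        z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
        # Approximation using the baseline variance for both arms.
        return z * (2 * p_base * (1 - p_base) / n_per_arm) ** 0.5

    for n in (2_000, 8_000, 32_000):
        print(n, f"{mde(n):.3%}")  # smaller samples can only detect bigger lifts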

Personally I’d prefer to avoid this and be a bit more strict due to something a PM once said: “If you torture the data long enough, it’ll show you what you want to see.”


Perhaps he was using a sequential test.


Which company was this? Was it by chance SnapTravel?



