There has been a lot of discussion about the problem of fishing expeditions in experiments. I have an idea, based on a paper by researchers at Microsoft, about how to minimize some of these problems.
One way to avoid the problems of fishing expeditions (namely, that researchers try several combinations of models until they find significant results at the 5% level) is to replicate studies. The problem, however, is that there isn’t much incentive for people to replicate studies. As a result, too few replication studies are published, allowing wrong claims to persist.
It is well known that one way to address the problem of multiple comparisons is to correct the analysis for it, whether with something like the Bonferroni correction or with multilevel Bayesian modeling. But a lot of people don’t know this. Thus, my proposal is something simpler than that, and one that will improve things at least a bit (or so I hope). My proposal is, in some sense, a way to get the power of replication without the need to wait for a replication.
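To make the correction concrete, here is a minimal sketch of the Bonferroni idea (the p-values below are made up for illustration): when you run m tests, each one must clear a threshold of alpha/m rather than alpha.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which tests survive the Bonferroni-corrected threshold alpha/m."""
    m = len(p_values)
    threshold = alpha / m  # each test now needs p < alpha/m
    return [p < threshold for p in p_values]

# Four hypothetical p-values; with m = 4 the corrected threshold is 0.05/4 = 0.0125.
print(bonferroni([0.04, 0.02, 0.03, 0.004]))  # [False, False, False, True]
```

Note that three of the four p-values would look “significant” at the naive 5% level, but only one survives the correction.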
In marketing, a simple experiment with a single treatment group and a single control group is called an A/B test. In general, A is the control group and B is the treatment group. The Microsoft folks mentioned above wrote a paper suggesting that people run an A/A test, meaning two control groups and no treatment. In this case, we know in advance that there is no difference between the groups beyond random sampling variation. This helps you avoid believing a difference is significant when we know in advance that there is none. They suggest this for big-data analyses, because with big data everything is significant. But I think we can adapt it to small-n studies and avoid the problem of fishing expeditions.
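As a rough illustration of the A/A logic (my own sketch, not the procedure from the paper), the simulation below draws both “arms” from the same distribution, so any “significant” difference is by construction a false positive, and a nominal 5% test flags one about 5% of the time:

```python
import math
import random

def aa_test_false_positive_rate(n_sims=2000, n=500, seed=1):
    """Simulate A/A tests: both arms are drawn from the SAME distribution,
    so every rejection of 'no difference' is a false positive."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        mean_a, mean_b = sum(a) / n, sum(b) / n
        var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
        var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
        se = math.sqrt(var_a / n + var_b / n)
        if abs((mean_a - mean_b) / se) > 1.96:  # nominal two-sided 5% test
            hits += 1
    return hits / n_sims

print(aa_test_false_positive_rate())  # close to 0.05, by construction
```

If your A/A comparison flags differences far more often than 5%, something in the design, implementation, or analysis is off.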
My suggestion is what I call a double control group experiment (I just invented this name!). In addition to an experiment with several treatments and one control group, let’s require that people include a second control group. This will serve as prior information about what kind of difference we can expect from sampling variation in the data and from the way things are implemented. So, this is a way of using prior information without the need to incorporate it formally. Of course, if you know how to include this prior information in a formal way, go for it. But my proposal is simpler and demands less of researchers. And, as Gelman pointed out, one reason people probably go on fishing expeditions is that they don’t know better. Requiring them to know better is laudable, but not quite realistic, at least in the short run.
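One informal way the second control group could be used, sketched below with made-up data: treat the observed gap between the two control groups as a benchmark for how large a “difference” noise and implementation quirks alone can produce. This is only an illustrative heuristic, not a formal test.

```python
from random import Random
from statistics import mean

def double_control_check(treated, control1, control2):
    """Return the effect estimate alongside a noise benchmark.
    Heuristic: if the treatment-vs-control gap is not clearly larger
    than the control-vs-control gap, be skeptical of the 'effect'."""
    effect = mean(treated) - mean(control1)
    noise_gap = abs(mean(control1) - mean(control2))
    return effect, noise_gap

rng = Random(7)
control1 = [rng.gauss(0, 1) for _ in range(200)]
control2 = [rng.gauss(0, 1) for _ in range(200)]
treated = [rng.gauss(0.5, 1) for _ in range(200)]  # hypothetical true effect of 0.5

effect, noise_gap = double_control_check(treated, control1, control2)
# An effect is more believable when it clearly exceeds noise_gap.
```

Here the control-vs-control gap gives a reader a concrete sense of scale, without any formal prior.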
What do you guys think of my proposal? One possible criticism is that people will game the system, the same way they do now. Well, that’s possible, but at least people who currently act in good faith will publish better findings, which is some improvement over what we have now.
Another criticism is that the double control group may look like merely splitting the control group in two while actually running the experiment with a single group. But they’re not the same thing. As the people from Microsoft have shown, you really should include two control groups and run the experiment twice. This allows one to detect not only what can be expected from sampling variability, but also from the research design, the way the experiment is implemented, and the modeling strategy.

The idea can also be extended to some quasi-experimental designs. For instance, in some regression discontinuity design studies, there is a law or rule that separates people into two groups, and it is claimed that these two groups are balanced. One example, in Brazil, is the rule for granting the Bolsa Familia benefit. Families with income below a given threshold (say, R$ 100,00) are allowed to receive the benefit, and those above it are not. You can then estimate the effect of Bolsa Familia on several outcomes (say, some measure of health).

Now consider another threshold, relatively close to the first one, but not real, i.e., one with no actual policy based on or correlated with it. The groups around this placebo threshold (say, R$ 130,00) should be balanced, but the threshold should have no effect on the outcome of interest. An estimate of this relationship then provides prior information about what we can expect from the estimates of the main model. So, in my (fake) example, you would assign 1 (as if receiving Bolsa Familia) to people below R$ 130,00 and 0 (as if not receiving it) to people above it. Since this assignment is arbitrary and we know in advance that both groups around it are above the real R$ 100,00 threshold, and thus neither receives Bolsa Familia, the estimate would provide prior information about a null effect.
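The placebo-threshold idea can be simulated (all numbers below are hypothetical, and the estimator is a crude difference in means near the cutoff rather than a proper local regression): the jump at the real cutoff recovers the effect, while the jump at the fake cutoff reflects sampling noise only.

```python
from random import Random
from statistics import mean

def jump_at_cutoff(incomes, outcomes, cutoff, bandwidth=10.0):
    """Difference in mean outcome just below vs just above the cutoff."""
    below = [y for x, y in zip(incomes, outcomes)
             if cutoff - bandwidth <= x < cutoff]
    above = [y for x, y in zip(incomes, outcomes)
             if cutoff <= x <= cutoff + bandwidth]
    return mean(below) - mean(above)

rng = Random(42)
incomes = [rng.uniform(50, 180) for _ in range(20000)]
# Hypothetical data: receiving the benefit (income below 100) raises the outcome by 2.
outcomes = [(2.0 if x < 100 else 0.0) + rng.gauss(0, 1) for x in incomes]

real_jump = jump_at_cutoff(incomes, outcomes, 100.0)     # close to 2: the real effect
placebo_jump = jump_at_cutoff(incomes, outcomes, 130.0)  # close to 0: noise only
```

The placebo jump plays the same role as the second control group: it tells you how big an “effect” you can get where none exists.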
Update: A reader pointed out in the comments that my idea already exists under another name. Thanks for pointing it out, Gustavo.