When is statistical significance not significant?

Note: the text below is in English. Since it is about a statistics topic, I suspect that of my two or three readers, one or two may not read English, but they are certainly not missing much.

The above is the title of an article published by Figueiredo Filho et al. in the Brazilian Political Science Review. In what follows, I'll provide some comments on the content of the article: the authors discuss some important things, but I disagree with them on a few points. Full disclosure: I'm a friend of Dalson, the first author. I intend to turn this into an article (or comment) and submit it to the journal. Any criticism or comments are welcome.

The paper does a good job of explaining what a p-value is, and what it is not. They even provide applied examples that help illustrate some common pitfalls and what to do to avoid them. For example, they clearly show that a low p-value doesn't mean your model fits your data well, which is quite an important point. And they second the idea of using graphics to visualize your data, which is good news for all of us who support the use of graphics to make inferences.

Furthermore, the paper is in the spirit of a long tradition in applied fields. In fact, the list of applied fields with discussions about how to properly interpret and/or use (if at all) significance tests and p-values is quite long: psychology (Rozeboom, 1960; Killeen, 2005; Hubbard & Lindsay, 2008), ecological studies (Anderson, Burnham & Thompson, 2000), educational research (Carver, 1978, 1993), medicine/epidemiology (Feinstein, 1998; Goodman, 1999; Sterne, Smith & Cox, 2001; Sterne, 2002), economics (McCloskey, 1985; Keuzenkamp & Magnus, 1995; McCloskey & Ziliak, 1996, 2004; Hoover & Siegler, 2008), organizational studies (Orlitzky, 2012) and the social sciences (Winch & Campbell, 1969; King, 1986; Gill, 1999). Even within the statistics profession we may be tempted to conclude that there is no agreement about how to properly interpret the p-value[1].

However, the fact that all those fields have discussed the matter for a long time, and yet no visible progress seems to have been made, makes me wonder how much this kind of discussion can help the methodological advancement of the field. And this is my first criticism. I do think it is better that papers like this be published, because they advance the methodological knowledge of the field. But their impact seems to be quite limited, which makes me wonder how useful they really are. I'm quite skeptical of papers like this. But, as should be obvious, I should be skeptical of comments like this one as well! All in all, I don't think this is a good way to make progress, but until we discover better ways, it is better to keep trying than to do nothing at all.

On page 33, the authors say that a p-value can provide “an objective measure to inform decisions about the validity of the generalization”. This phrasing (with a frequentist flavor) is unfortunate: while true in simple textbook cases, it is not in most applied settings. Consider, for instance, a regression model. A p-value depends on several regression assumptions and, critically, in my view, on model assumptions (which variables to include or not and the functional form of the model). There is no such thing as an objective measure of the validity of a generalization. All such measures are conditional on subjective assumptions made by the researcher. So what does a p-value mean in most Brazilian political science papers? Not much, as far as I can tell. I'll give one example to illustrate my point.
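Before getting to the example, here is a minimal simulated sketch (my own, not from the paper; all variable names and numbers are hypothetical) of how much the p-value of a coefficient can hinge on which variables are included in the model:

    # Minimal sketch: the p-value of a coefficient depends on model specification.
    # Hypothetical data: x has no direct effect on y, but shares a confounder z.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    n = 500
    z = rng.normal(size=n)               # confounder
    x = 0.7 * z + rng.normal(size=n)     # variable of interest, correlated with z
    y = 2.0 * z + rng.normal(size=n)     # x has no direct effect on y

    m1 = sm.OLS(y, sm.add_constant(x)).fit()                        # z omitted
    m2 = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()  # z included

    print("p-value of x with z omitted: ", m1.pvalues[1])  # tiny: x looks "significant"
    print("p-value of x with z included:", m2.pvalues[1])  # large: the "effect" vanishes

Same data, two defensible-looking specifications, and two very different p-values for the same variable.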

Consider the paper by Limongi and Figueiredo (2003), in which they study the effect of budget amendments on the voting pattern of deputies. In the usual framing of the issue, researchers would have written that they were testing the null hypothesis that the effect is exactly zero. But does anyone really believe it is possible that the effect of amendments is exactly zero for all deputies? We know ahead of time that this cannot possibly be true. The effect may be small, even practically irrelevant, but not zero. In fact, in the social sciences, rarely if ever does anything minimally plausible have exactly zero effect on anything else. So, if the effect isn't zero, then with a sufficiently large n the p-value will converge to zero and we will be able to reject the null hypothesis. Thus, in practice, the p-value is mostly a measure of sample size. Limongi and Figueiredo, being the great researchers that they are, acknowledge this, go beyond p-values and hypothesis tests in their paper, and discuss the fit of the model. Which is exactly my point here. The framework is flawed, and we should move to something better, possibly along the lines of the thesis promoted by Gelman and Shalizi (2011).
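To make the sample-size point concrete, here is a minimal sketch (my own simulation, not from their paper; the numbers are hypothetical): fix a tiny, practically irrelevant true effect and watch the p-value shrink as n grows.

    # Minimal sketch: with a non-zero but practically irrelevant true effect,
    # the p-value goes to zero as the sample size grows.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_effect = 0.02   # tiny, practically irrelevant, but not exactly zero

    for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
        treated = rng.normal(loc=true_effect, scale=1.0, size=n)
        control = rng.normal(loc=0.0, scale=1.0, size=n)
        t_stat, p_value = stats.ttest_ind(treated, control)
        print(f"n = {n:>9,}   p-value = {p_value:.4f}")

With n in the hundreds the "effect" is invisible; with n in the millions it is "highly significant", even though its size never changed.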

The reader may be skeptical of the claims made above, especially because it seems that most methodologically sophisticated papers published in the best international journals use this framework. But this is misleading. First, most of the best papers use clever research designs, such as quasi-experiments or experiments, to avoid the need for strong assumptions (regression and model assumptions). In some of these settings there is a genuine need to test a null hypothesis of no effect, since the claim of these articles is, in general, that the design or method ensures there is no imbalance between groups. That's the case of the regression discontinuity design, which, near the cutoff, delivers something close to random assignment. So, in this particular case, it does make sense to test a null hypothesis of no effect and to compute a p-value, much as Fisher did many years ago with the lady tasting tea experiment.
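As a reminder of what a sharp null that actually makes sense looks like, here is a minimal sketch of Fisher's exact test for the classic lady tasting tea setup (the standard 8-cup version, not anything from the paper):

    # Minimal sketch: Fisher's exact test for the lady tasting tea.
    # 8 cups, 4 milk-first and 4 tea-first; she identifies all 4 milk-first
    # cups correctly. Under the sharp null (she cannot discriminate at all),
    # how likely is a result at least this good?
    from scipy import stats

    # rows: truly milk-first / truly tea-first; columns: guessed milk-first / tea-first
    table = [[4, 0],
             [0, 4]]
    odds_ratio, p_value = stats.fisher_exact(table, alternative="greater")
    print(f"p-value = {p_value:.4f}")   # 1/70, about 0.0143

Here the null of "no ability at all" is a real possibility, so testing it is informative; that is rarely the case for the effect of one social variable on another.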

Second, although p-values are frequently reported, these papers also discuss the size and the direction of the effect, which is what we are mostly interested in. Thus, despite the presence of an ugly table with p-values and stars on the estimated coefficients, the most important part of these papers is the discussion of the effect size (and of the uncertainty of the effect, given by a confidence interval, if frequentist, or a credible interval, if Bayesian) and of the direction of the effect (positive or negative).

I'll provide a single piece of evidence as a case in point. Consider the winner of the 2012 Miller prize: Caughey and Sekhon (2011) study incumbency bias in close races for the US House of Representatives. Figure 2 of the paper presents a p-value. But, as I said, it is a test of imbalance in the context of a regression discontinuity design.

The problem of mixing the Neyman-Pearson and the Fisherian paradigms

The authors mix the Neyman-Pearson approach with the one advanced by Fisher. I won't dwell on the details here; there are several papers about the controversies between these authors and about whether and where their approaches can be combined. For the purposes of the paper, the fact that significance levels and type I and type II errors, which are all concepts based on the work of Neyman and Pearson, are mixed with p-values (a Fisherian concept) is really unfortunate. The significance level (the famous alpha) is not a measure of evidence, but the p-value (arguably) is. To put it another way: when you have a null hypothesis and an alternative hypothesis (as in the Neyman-Pearson approach), you might think that the lower the p-value, the higher the likelihood that the alternative is true. But this is not the case, and the confusion arises because you are mixing two approaches.

Being a little imprecise, we can say that the p-value was created in the context of a null hypothesis, without any alternative hypothesis, and it can't provide any evidence about an alternative hypothesis. It only says something about the null hypothesis. And even then, as the authors correctly point out, it is not the probability of the null hypothesis being true given the data, but the opposite, i.e., the probability of the data given the null hypothesis. I understand this is hard to grasp, and I'm afraid most people will interpret the p-value in the wrong way. But I don't think this is that bad, because in general the lower the p-value, the lower the probability of the null hypothesis being true.
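Here is a minimal simulated sketch (my own illustration, not from the paper; all the numbers are hypothetical) of why P(data | H0) and P(H0 | data) are different quantities: simulate many studies, half of which have a true null, and look at how often the null is actually true among the "significant" results.

    # Minimal sketch: P(data | H0) is not P(H0 | data).
    # In half of these hypothetical studies the null is true; among the results
    # with p < 0.05, the null is still true far more often than 5% of the time.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_studies, n_per_group, effect = 10_000, 50, 0.2

    null_true = rng.random(n_studies) < 0.5
    p_values = np.empty(n_studies)
    for i in range(n_studies):
        mu = 0.0 if null_true[i] else effect
        a = rng.normal(mu, 1.0, n_per_group)
        b = rng.normal(0.0, 1.0, n_per_group)
        p_values[i] = stats.ttest_ind(a, b).pvalue

    significant = p_values < 0.05
    print("Rejection rate when the null is true (about 0.05):",
          (p_values[null_true] < 0.05).mean())
    print("Share of 'significant' results where the null is TRUE:",
          null_true[significant].mean())

The first number is the p-value's frequentist guarantee; the second is what people usually want to know, and it depends on how plausible the null was to begin with.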

So, as I see it, I would have avoided any mention of type I and type II errors and focused on the p-value. At most, I would have said what I said above: don't mix the two approaches, because they don't fit well together. And I would stick with the Fisherian approach because, as Gelman likes to say, in the social sciences you never really make a Type I or Type II error. We know in advance that the null hypothesis isn't true (the effect is not exactly zero) and that the alternative hypothesis isn't true either (at least if it is a point hypothesis).

Last, but not least, I'd like to comment on the issue of non-random samples. Coincidentally, this issue was discussed in a recent post by Gelman. The authors claim that, if a sample is not random, a p-value has no meaning:

“It is pointless to estimate a p-value for non-random sample” (p. 39).

This is technically true. But what should we, as applied researchers, do with a statement like this? Rarely, if ever, do we work with random samples in Brazil. Should we then stop publishing quantitative papers based on our observational data or convenience samples? Of course not. As has been said elsewhere, “we go to war with the data we have, not the data we want”.

What is at issue here, I guess, is a (frequentist) purism that is counterproductive. We learn in our statistics classes that randomness comes from the sampling procedure, and that random sampling is what ensures that the central limit theorem and the like will work, allowing us to make inferences.

But the real point is to know whether your sample is representative of your population or not. We know (in a quantifiable way) that random sampling will, in general, generate good samples. But if our sampling scheme is not random, what we need to do is model the data-generating process, so that we can assess the uncertainty of our estimates. And although I prefer to do this in a Bayesian way, it is possible to do it in a frequentist way and, thus, to use and interpret p-values. Post-stratification is one example, bias modeling is another, and, in general, as long as we model our data-generating process, we can make inferences. Sure, we'll be making more assumptions, and our conclusions will only be as good as those assumptions. If the data are bad, no amount of modeling will circumvent that fact. As the saying goes, garbage in, garbage out. But if the data carry information about the population, then with good modeling choices and assumptions (some of which can be tested), we can learn things from the data at hand. So my conclusion is not that, without random sampling, p-values are pointless. I'd rather say that, without random sampling, you will need more assumptions and more modeling, and your conclusions will depend on how good your modeling assumptions are and on how much the data depart from them. But this is true of modeling assumptions in general, and I don't see why these are more troubling than, say, regression assumptions.
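Here is a minimal sketch of the post-stratification idea (the strata, shares, and numbers are all hypothetical, just to illustrate): reweight a convenience sample by known population shares so that the estimate reflects the population rather than the sample.

    # Minimal sketch of post-stratification with hypothetical numbers:
    # the convenience sample over-represents the college-educated stratum,
    # so we reweight each stratum by its known population share (e.g., from a census).
    import numpy as np

    rng = np.random.default_rng(7)

    population_share = {"college": 0.2, "no_college": 0.8}   # known (e.g., census)
    true_mean = {"college": 0.7, "no_college": 0.4}          # unknown in practice
    sample_size = {"college": 800, "no_college": 200}        # badly skewed sample

    sample = {g: rng.normal(true_mean[g], 1.0, sample_size[g]) for g in true_mean}

    naive = np.concatenate(list(sample.values())).mean()
    poststrat = sum(population_share[g] * sample[g].mean() for g in sample)
    target = sum(population_share[g] * true_mean[g] for g in true_mean)

    print(f"True population mean:     {target:.3f}")
    print(f"Naive sample mean:        {naive:.3f}")
    print(f"Post-stratified estimate: {poststrat:.3f}")

The naive mean is pulled toward the over-represented stratum; the reweighted estimate comes close to the population value, at the price of assuming the population shares are known and that the sampled units within each stratum are representative of it.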

All in all, this is a most welcome paper, especially in Brazil, where we rarely discuss methodological issues. I don't intend to pick on the authors here. I'm just trying to keep the ball (of methods discussion) rolling, now that they have published a paper like this in one of our best journals.


[1] See, for instance, the thread at Andrew Gelman's blog, where he and the statistician Larry Wasserman, among others, discussed how to properly interpret the p-value, without reaching agreement on the issue.


3 responses to When is statistical significance not significant?

  1. Rafael said:

    “it is possible to do this in a frequentist way and, thus, to use and interpret p-values. Post-stratification is an example, bias modeling is another, and, in general, as long as we model our data generating process, we can make inferences.”

    But for that you need to know the population parameters, don't you? Or at least have very good assumptions about them. Some variables are easier (sex), but how do you post-stratify by education level, for example? I imagine by using something like the PNAD, which, in turn, is itself based on sampling.

    In general, I think I'm a bit more critical than you about publishing didactic articles in research journals, unless they really bring something not available in textbooks. In any case, I think your observations are important and deserve space in the journal, especially the points about meaningless null hypotheses and the mixing of paradigms; you could even expand on them a bit more.

    Cheers

  2. 1. Agreed about post-stratification. But that's what the census is for, right? The census does have education level.

    2. On didactic articles in journals, I have mixed feelings. On the one hand, I agree with you: if there's nothing new, don't publish it. In the limit, you get what happened in medicine, where someone published a proof (as if it were new) of the fundamental theorem of calculus (about 3 or 4 years ago). But on the other hand, consider this: strictly speaking, much of what gets published in Political Analysis has already been published in econometrics or statistics journals, with more math and in a way that is inaccessible to the average political scientist (who is not a methodologist). That's what Gary King often does, for instance. So where does the novelty begin: in the clarity of exposition? In the use of examples from our own reality? In not using math? In any case, in Brazil the situation is so precarious that I'd rather have these articles published than not.

    3. I have a version of this post in which I elaborate on the mixing of paradigms and explain the issue in more detail. But I thought it was getting too long and decided to cut it. I'll include that part when I submit it to the journal.

    Thanks for the comments.

  3. Pingback: P-valor – again and again | Blog Pra falar de coisas
