15 March 2012

False Positive Science

Writing in the journal Psychological Science, Simmons et al. (2011, here in PDF) identify a problem in the psychological literature which they call "false positive psychology." They describe this phenomenon as follows:
Our job as scientists is to discover truths about the world. We generate hypotheses, collect data, and examine whether or not the data are consistent with those hypotheses. Although we aspire to always be accurate, errors are inevitable.

Perhaps the most costly error is a false positive, the incorrect rejection of a null hypothesis. First, once they appear in the literature, false positives are particularly persistent. Because null results have many possible causes, failures to replicate previous findings are never conclusive. Furthermore, because it is uncommon for prestigious journals to publish null findings or exact replications, researchers have little incentive to even attempt them. Second, false positives waste resources: They inspire investment in fruitless research programs and can lead to ineffective policy changes. Finally, a field known for publishing false positives risks losing its credibility.

In this article, we show that despite the nominal endorsement of a maximum false-positive rate of 5% (i.e., p ≤ .05), current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis.
Why does this phenomenon occur?
The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.

This exploratory behavior is not the by-product of malicious intent, but rather the result of two factors: (a) ambiguity in how best to make these decisions and (b) the researcher’s desire to find a statistically significant result. A large literature documents that people are self-serving in their interpretation of ambiguous information and remarkably adept at reaching justifiable conclusions that mesh with their desires (Babcock & Loewenstein, 1997; Dawson, Gilovich, & Regan, 2002; Gilovich, 1983; Hastorf & Cantril, 1954; Kunda, 1990; Zuckerman, 1979). This literature suggests that when we as researchers face ambiguous analytic decisions, we will tend to conclude, with convincing self-justification, that the appropriate decisions are those that result in statistical significance (p ≤ .05).

Ambiguity is rampant in empirical research.
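
To put rough numbers on the point that trying many analyses pushes the false-positive rate above 5%, here is a minimal sketch in Python. It assumes the analyses are independent tests of a true null hypothesis, which is a simplification: analyses run on the same data are correlated, so this is a rough guide rather than what any particular study would experience. The function name is my own illustration, not something from Simmons et al.

```python
def familywise_false_positive_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` independent tests of a
    true null hypothesis comes out "significant" at level `alpha`."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all of them avoid it with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for k in (1, 2, 5, 10, 20):
        rate = familywise_false_positive_rate(k)
        print(f"{k:>2} analyses tried -> {rate:.1%} chance of at least one false positive")
```

Under the independence assumption, ten analytic alternatives already give roughly a 40% chance of at least one "significant" result when nothing is there.
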
The problem of "false positive science" is of course not limited to the discipline of psychology or even the social sciences. Simmons et al. provide several excellent empirical examples of how ambiguity in the research process leads to false positives and offer some advice for how the research community might begin to deal with the problem.

Writing at The Chronicle of Higher Education, Geoffrey Pullum says that a gullible and compliant media makes things worse:
Compounding this problem with psychological science is the pathetic state of science reporting: the problem of how unacceptably easy it is to publish total fictions about science, and falsely claim relevance to real everyday life.
Pullum provides a nice example of the dynamics discussed here in the recent case of the so-called "QWERTY effect," which is also dissected here. On this blog I've occasionally pointed to silly science and silly reporting, as well as good science and good reporting -- which on any given topic are all mixed up together.

When prominent members of the media take on an activist bent, the challenge is further compounded. Of course, members of the media are not alone in their activism through science. The combination of ambiguity, researcher interest in a significant result and research as a tool of activism makes sorting through the thicket of knowledge a challenge in the best of cases, and sometimes just impossible.

The practical conclusion to draw from Simmons et al. is that much of what we think we know based on conventional statistical studies published in the academic literature stands a good chance of just not being so -- certainly a greater chance than the 5% implied by the conventional significance threshold. Absent solid research, we simply can't distinguish empirically between false and true positives, meaning that we apply other criteria, like political expediency. Knowing what to know turns out to be quite a challenge.


  1. Great post. Maybe Revkin should read this.

  2. I am unconvinced by the argument that false positives are usually more costly than false negatives. I think the relative cost of false positives and false negatives varies from case to case and the authors here are far too quick to generalize on the basis of hand-waving.

    At the same time, the dangers of the file-drawer effect and of foolish reliance on a 5% threshold ought, by this point, to be common knowledge. Ziliak and McCloskey's The Cult of Statistical Significance (U. Michigan, 2008) is the wittiest, but hardly the only, polemic against this. See also Mock and Weisberg, "Political Innumeracy: Encounters with Coincidence, Improbability, and Chance," Am. J. Pol. Sci. 36(4), 1023-46 (1992). Andrew Gelman has also blogged frequently on this and similar matters.

  3. -2-Jonathan Gilligan

    Thanks ... you write that the arguments presented by Simmons et al. "ought, by this point, to be common knowledge."

    Indeed, I had good fun in grad school with Mock and Weisberg (thanks for reminding me of it!), though I do think that Simmons et al. address a different issue than just the 5% threshold, namely ambiguity in research design and implementation.

    All that said, if we are still discussing the topic after 20 years, and there is good evidence that it is pervasive, then perhaps simply knowing about the effects is not enough (deficit model?).

    Simmons et al. take things a step further by recommending some courses of action by scientists and reviewers. To me they seem unsatisfactory/weak tea, which probably says more about the nature of the problem than anything else.


  4. The null hypothesis is boring and results that fail to depart from it may not even be publishable. This creates a huge conflict of interest for researchers who need the paper to get tenure or sustain a grant.

    I've seen proposals for a 'Journal of Negative Results', so that the null result at least gets written up -- this is actually an important issue -- and perhaps the researcher gets some credit for it. More potent but perhaps more difficult would be a parallel 'stick' approach (as opposed to 'carrot') where research that fails replication on more than one occasion is re-reviewed and, if necessary, forced to be withdrawn with prejudice.

    But the main problem is poor training and mentoring of scientists. I've seen uncountable instances of scientists repeating experiments multiple times until they finally get agreement with theory (which they then write up). And theorists will repeat the same calculations by multiple methods until they finally get agreement with experiment. Both are guaranteed, given enough reps, to work and thereby produce a false result. And the multiple failed tries will not be reported, of course, because to do so would prevent any weight being given to the false positive.

    By the way, this is Gerard Harbison. I've been forced to abandon my myopenid account because it won't verify any more.

  5. -4-Gerard Harbison

    Perhaps we should then ... reverse the null hypothesis? (Sorry, couldn't resist;-)

  6. Venture capitalists say that they are unable to replicate published findings half to 2/3 of the time. Stats profs say that misuse of stats leads to statements re 95% which are really only 50/50. Journal editor says a majority of published studies turn out to be flawed.

    When the subject matter has serious political implications, be even more wary. And when the researchers are also activists? Fuggedaboutit.

  7. I assume you are familiar with the paper by Dr. John P. A. Ioannidis, "Why Most Published Research Findings Are False" (http://goo.gl/q1Jmn). These problems are pervasive across all fields of science. People are surprised to discover that science is not done by Vulcans and is actually done by humans.

  8. This is an old problem, addressed in medicine by routinely reporting BOTH the Sensitivity and the Specificity of the results of a statistical 'test'.


  9. If researchers had to publish their hypothesis/experimental design/planned statistical analysis BEFORE they did the work in order to get published, the entire profession would be turned upside down. In a publish-or-perish world, everyone knows that the process is fudged constantly along the way to get something/anything published and on the CV.

  10. The scientific method has already provided a means to mitigate risk of this nature: competing interests. The effectiveness has been demonstrated and confirmed with the current global warming campaign.

  11. Further -- with the pressure to publish, reproducibility becomes the first victim: "The Truth Wears Off" http://psychology.okstate.edu/faculty/jgrice/psyc3214/TruthWearsOff.pdf

    Apparently, peer review ain't what it used to be, and maybe never was. "doveryai, no proveryai" -- trust, but verify.

  12. The problem has of course been known and discussed at least since the days of RA Fisher.

    I always designed my psych research so that the hypothesis was answered by a single number -- usually r. And even if r was ns at <.05 I wrote up the result anyway. I sometimes had to resubmit to several journals to get acceptance, but only a small minority of my papers never got published. So you CAN get "negative results" published, but it may take some effort.

  13. Roger, you mention a few examples of false positives, but is there any example of a whole, multi-disciplinary scientific field being overwhelmed by such an effect?

  14. To be more specific:

    Sensitivity = number of true positives/ (number of true positives + number of false negatives)


    Sensitivity = the probability of a positive test when the 'prediction' is 'true'.

    Specificity = number of true negatives/ (number of true negatives + number of false positives)


    Specificity = the probability of a negative test when the 'prediction' is 'false'.

  15. Thank you, JR; I hadn't heard anyone mention RA Fisher in a long time. I feel a tiny bit less ancient.

  16. Worshiping RA Fisher is the problem; sharp hypothesis testing is the wrong hammer for most screws: many problems are more appropriately framed as estimation problems (yes, even model selection).

  17. JS- I didn't say that I worshipped him; I just said I remembered him.
    I spent my early career with field experiments; actually posing nature a question and waiting for the answer.
    I continue to think that science is stronger when your hypotheses can be proven to be wrong.

  18. "is there any example of a whole, multi-disciplinary scientific field being overwhelmed by such an effect?"

    Well, there are whole fields which lay claim to scientific foundations which are complete nonsense from beginning to end. People continue to publish "scientific" papers on them though.

    Freudian Psychoanalysis.
    String Theory.

  19. Try to explain any of the issues raised here to a public who want to believe a certain point of view... that most studies don't have sufficient power to draw the conclusions they draw, or to phrase it another way, that most published research in many fields turns out to be wrong most of the time... Nor is this something researchers in these fields want to draw attention to. Working scientists tend to react defensively to arguments that there are methodological, sociocultural and psychological factors that influence their work. They prefer to call you a postmodernist as they barge out of the room, their white coat trailing behind them majestically. ;-)

  20. Sharon F., accusing you of hero worship is certainly unfair; I should have worded that comment differently. You say, "science is stronger when your hypotheses can be proven to be wrong."

    Yes, that's the problem: thinking that a p-value "proves" anything. Almost never does a model provide predictions sharp enough to be proven wrong. You'd need to have a model that puts zero probability on some event, and then observe that event to "prove" the model wrong. The case is far more often that no event is given zero probability, and we have several models which give various weights to many events. Proving models wrong is an uninteresting waste of time, because we already know all of our models are wrong (or else they wouldn't be very useful). On the other hand, measuring and estimating how much and when the model is in error is useful knowledge work because it provides valuable information to the people who may use those model results to support decisions some day.

    Briggs has an interesting post on this article too.
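
For readers who want to try the sensitivity and specificity definitions from comment 14, here is a minimal Python sketch. The counts are hypothetical, made up purely for illustration, and the function names are my own. The last function, positive predictive value, is not defined anywhere in the thread; it is added here as an assumption about what is useful, because it answers the question the post cares about: of the results that come up positive, how many are real?

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Probability of a positive test when the prediction is true."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Probability of a negative test when the prediction is false."""
    return true_neg / (true_neg + false_pos)


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Share of positive results that are true positives."""
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical counts: 100 hypotheses that are really true,
    # 900 that are really false, tested with a 5% false-positive rate.
    tp, fn = 80, 20
    tn, fp = 855, 45
    print("sensitivity:", sensitivity(tp, fn))
    print("specificity:", specificity(tn, fp))
    print("positive predictive value:", positive_predictive_value(tp, fp))
```

With these made-up counts the test keeps its nominal 5% false-positive rate (specificity of 0.95), yet about a third of the positive results are still false, because true hypotheses are scarce in the pool being tested.
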