The end of p values?

This is the second of a two-part post discussing the controversial decision by the editor of the scientific journal Basic and Applied Social Psychology to ban the use of p values in its publications. Part one can be found here.

The Royal Statistical Society (RSS) 2015 Conference featured a debate on the use of p values. The discussant was Dr. David Colquhoun of University College London. But why was there such a debate at the RSS conference? Earlier this year, the journal Basic and Applied Social Psychology (BASP) decided to ban the use of p values. The reason? Well, one of the reasons was that statistics might be used to support ‘lower-quality research’ (Woolston, 2015). So, the RSS debate was there to promote a discussion around the strict BASP decision and to rethink the use of p values.

The RSS 2015 debate was arguably one of the most effective scientific discussions of the year. It showcased both sides of the argument (for and against) in light of sound reasoning, and it featured a room packed with some of the best statisticians in the world. An important outcome of the debate was that it highlighted the need for university degrees to teach their students to be more critical of what they learn. This idea makes sense, because the goal of science is to ensure all claims are thoroughly critiqued; therefore it is important that all evidence is considered. p values are just one piece of evidence in any story, and hence should not be the be-all and end-all for the results of a study. For example, were you taught at your university to consider the possibility that the use of p values might be flawed for a given study? Have you learned any alternatives to hypothesis testing? Asking such questions is an important, more constructive approach to teaching statistics in the science classroom.

What is a p value?

In hypothesis testing, a p value is the probability of obtaining a result at least as extreme as the one actually observed, assuming that a specific hypothesis (the null hypothesis) is true. One also chooses a threshold (the significance level): if the p value is above it, one does not reject the null hypothesis; if the p value is below it, one rejects the null hypothesis in favor of the alternative one. A threshold of 5% is often used. The use of p values started in the 1770s, and it was later popularized in statistics by Ronald Fisher (Ibid.).
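To make the definition concrete, here is a minimal sketch in Python. The coin-flip scenario (16 heads in 20 flips, null hypothesis of a fair coin) and the function name `binomial_p_value` are assumptions made up for illustration; they do not come from the debate or the cited papers.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p value: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical data: 16 heads in 20 flips, tested against a fair coin.
p = binomial_p_value(successes=16, trials=20)
print(f"p = {p:.4f}")  # p = 0.0059
```

Here the p value is the chance of seeing 16 or more heads if the coin really were fair. It says nothing, on its own, about how likely it is that the coin is fair.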

p values are best used with a large, already existing data set, because the goal is generally to determine whether one specific set of data is unusually different from what normally happens. This can be done as part of either an inductive or a deductive process.
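One simple way to compare a single observation against an existing record is an empirical p value: the share of past observations at least as extreme as the new one. The sketch below is only an illustration. The rainfall figures are invented, and `empirical_p_value` is a hypothetical helper, not a method taken from the sources cited here.

```python
from typing import Sequence

def empirical_p_value(historical: Sequence[float], observed: float) -> float:
    """Share of values at least as large as `observed`, with add-one correction."""
    at_least_as_extreme = sum(1 for value in historical if value >= observed)
    # The +1 terms count the observed event itself, so the result is never 0.
    return (at_least_as_extreme + 1) / (len(historical) + 1)

# Invented rainfall totals (mm) for past events in one area.
historical_mm = [2.0, 5.5, 0.0, 12.1, 7.3, 3.2, 9.8, 1.1, 4.4, 6.0,
                 15.2, 0.5, 8.7, 2.9, 11.0, 3.8, 6.6, 10.4, 4.9]
observed_mm = 14.0  # the event we are asking about

print(empirical_p_value(historical_mm, observed_mm))  # 0.1
```

With only 19 past events the smallest possible value is 1/20, which is one reason a large record matters.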

Inductive and deductive

In all forms of scientific inquiry, there are two ways to draw reasonable conclusions about a topic: through a process of either induction or deduction. One of the first and most commonly cited definitions of inductive and deductive reasoning was written by John Dewey in 1910, who stated that “...building up the idea is known as inductive discovery; the movement toward developing, applying, and testing, as deductive proof.” (p 244).

In other words, induction is about taking evidence already known and attempting to make sense of it. The example Dewey gives is to think of someone who left a room for a period of time and comes back to find objects in the room scattered about in disarray. Inductive logic would be to consider the presence of a burglar. No burglar was observed directly, but given the limited evidence available, it seems to be the best conclusion. Using meteorology as an example, perhaps you use p values to determine that one particular rain event was statistically more intense than is usual for the area. You may suspect that something influenced this high rain event, but your p value does not tell you what that could be. At this point, you are using your information to establish evidence that something happened, but more research needs to be done to figure out what.

Deduction involves testing already existing theories. Therefore, in Dewey’s example, deduction begins once data collection and evidence derivation begin. The person may search their valuables, check windows and doors for entry marks, or look for anything unusual in the room that the burglar may have left behind. The idea during deductive logic is to confirm the original theory that a burglar was in the room and (potentially) identify who the burglar was. This is equivalent to using several p values to narrow down the reason for your high rain event in the meteorology example. Perhaps you use p values to explore other atmospheric phenomena leading up to your intense precipitation event. Maybe wind direction, wind speed, and temperature values were within expected ranges, but dewpoint values were higher than normal. What caused the higher dewpoint values? Further research would have to be done. This process of continually testing and hypothesizing using p values builds evidence and is deductive reasoning.

Therefore, the problem with relying too heavily on p values arises when they are used inductively (as one piece of evidence to suggest that something is different) without follow-up research to determine, for example, why the precipitation event was more intense than usual. We can then end up with conclusions drawn from poor use of scientific inquiry, which is not how p values were intended to be used. When conclusions are drawn prematurely using p values, it is inevitable that ‘false discoveries’ occur, which can lead to a whole host of new problems.

False discovery rate

One of the major issues in the use of p values was pointed out by Colquhoun in his address at the RSS 2015. The use of a threshold of 5% leads to a number of false positive tests (or false discoveries), giving a false discovery rate that could be as high as 30% or more (Colquhoun, 2014). So, what threshold should one use? In his talk, Colquhoun emphasized that a p value of 0.04 does not mean one has discovered something; it only means that ‘it might be worth another look’ (Colquhoun, 2015). In order to say one has discovered something, a threshold of 0.5% or 0.1% (p value of 0.005 or 0.001, respectively) should be used (Johnson, 2013). A p value of 0.001 (or less) will give a false discovery rate of less than 2% (Colquhoun, 2015).
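A rough way to see how a 5% threshold can produce a false discovery rate above 30% is to count false and true positives across many hypothetical tests. The sketch below rests on two illustrative assumptions that are not stated in this post: only 10% of the hypotheses tested are really true (`prior`), and a real effect is detected 80% of the time (`power`). The function name is also made up for this example. It is a simplified calculation, not a reproduction of the analysis in Colquhoun (2014).

```python
def false_discovery_rate(alpha: float, power: float, prior: float) -> float:
    """Fraction of 'significant' results that are false positives.

    alpha: significance threshold (chance of a positive when no effect exists)
    power: chance of a positive when a real effect exists (assumed)
    prior: share of tested hypotheses with a real effect (assumed)
    """
    false_positives = alpha * (1 - prior)  # no real effect, test positive
    true_positives = power * prior         # real effect, test positive
    return false_positives / (false_positives + true_positives)

fdr = false_discovery_rate(alpha=0.05, power=0.8, prior=0.1)
print(f"{fdr:.0%}")  # 36%
```

Under these assumptions, more than a third of the results declared positive at the 5% level would be false discoveries. The figure moves a lot with the assumed prior and power, so it should be read as an illustration rather than a fixed property of the 5% threshold.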

Concluding remarks

The p value discussion is still ongoing, but one thing is certain: we should not take what we learn at university for granted. One has to critically evaluate what is learned and keep pace with new ways of thinking. Most importantly, p values should be used in deductive reasoning, as a way to derive a variety of evidence to support a conclusion, not as a be-all and end-all for drawing conclusions. Finally, when working with hypothesis testing, one needs to keep the following points in mind:

- If you still decide to use p values, use them with caution. Be sure they are one piece of a variety of evidence, not the only piece of evidence.
- Remind yourself that a 5% threshold has a high false discovery rate, so you would be “wrong at least 30% of the time” (Colquhoun, 2014).
- Conduct tests at the 0.005 or 0.001 level of significance (Johnson, 2013) and, following Colquhoun (2014), “never use the word ‘significant’”. It is easily confused with “meaningful”.

- Michel and Morgan

References:

Colquhoun, D. (2015). "P-values debate." Retrieved 28 October, 2015, from https://rss.conference-services.net/programme.asp?conferenceID=4494&action=prog_list&session=33652

Colquhoun, D. (2014). "An investigation of the false discovery rate and the misinterpretation of p-values." Royal Society Open Science 1(3).

Dewey, J. (1910). "Systematic Inference: Induction and Deduction." How We Think. D.C. Heath & Company.

Johnson, V. E. (2013). "Revised standards for statistical evidence." Proceedings of the National Academy of Sciences 110(48): 19313-19317.

Woolston, C. (2015). "Psychology journal bans P values." Nature, Macmillan Publishers Limited. 519: 9.