The 20% Statistician: March 2015

If you teach methods and statistics courses, organize talks or symposia on good research practices, or simply want to read some good papers, check out this OSF page with a list of papers on research methods. Inspired by recent illustrations of the gender imbalance in contributors to special issues on how to improve research methods, and the gender imbalance in speakers in symposia on good research practices, a group of people came together to create a list of research methods papers first-authored by women. Feel free to contribute suggestions for papers not yet included, or join the Mendeley group created by Kirstie Whitaker to keep track of new additions. You might be able to use the list as inspiration when creating your course syllabus on research methods, or when inviting speakers for a symposium on good research practices. Below is the description of the list, copied from the wiki page on the OSF, and written by Michael Kane:

This document began in light of thoughtful blog posts (e.g., Ledgerwood, Haines, & Ratliff; Jussim & Vazire) -- and a lengthy and impassioned discussion thread on the ISCON Facebook page -- about the lack of diversity of voices in current debates about best practices in psychological research. Upon reading these, I nervously opened the pdf of the syllabus for my graduate research course on psychological research methods (PSY 624). Only 1 of the 40-some required primary articles featured a female lead author. I’m embarrassed to say that this had never occurred to me, despite my teaching a course in which about 70% of the students were women. I now note that, not only have the graduate students in my department had all of their formal instruction in methods and stats delivered by professors who are men, but virtually all the voices they’ve read in my course have also been men’s. Although many of our students are expertly mentored in research by faculty advisors who are women, I am concerned that I have deprived them of additional role models.

I tweeted (@kane_WMC_lab) about my syllabus and started a conversation with Rogier Kievit (@rogierK) and Sanjay Srivastava (@hardsci), in which we estimated that the lack of diversity in my syllabus was also likely true of many others. Rogier suggested to me a few nice papers by female lead authors that I should check out, and then had the wonderful idea to start a crowdsourced list of other work that more instructors and beginning students of research methods should be aware of.

Such a list is important for several reasons. For example, recent empirical research (Malinak et al., Sugimoto et al.) shows that women are half as likely to be first author of a paper, and even when they do take up leading author positions, they are consistently cited less than men, all else being equal

The list is available at the References List by Topic link; most of the initial entries were provided by Rogier Kievit (@rogierK), Jessica Logan (@jarlogan), and Kirstie Whitaker(@kirstie_j), with help from Michael Kane and Daniel Lakens (@lakens). This is a public document that we hope will grow to include dozens of papers that faculty and students alike may benefit from reading. To access any of these articles you're interested in, please join the linked Mendeley Group established by Kirstie Whitaker.

Via comments on the main page of this project (see the blue speech-bubble icon in the upper right corner), please add citations to your favorite articles, chapters, or books about psychological research methods and design, data interpretation, philosophy of science, or “best practices” written by women. We'll aim to incorporate suggested citations into the main list soon after they're suggested.

Ideally this list will focus on those articles that may be especially germane to introductory graduate courses.

-- Michael J. Kane (with Rogier Kievit, Daniel Lakens, Jessica Logan, Brian Nosek, Sanjay Srivastava, Simine Vazire, & Kirstie Whitaker)

Sugimoto, C. R., Lariviere, V., Ni, C. Q., Gingras, Y., & Cronin, B. (2013). Bibliometrics: Global gender disparities in science. Nature, 504(7479), 211-213.

Maliniak, D., Powers, R., & Walter, B. F. (2013). The gender citation gap in international relations. International Organization, 67(04), 889-922.

This blog post is presented in collaboration with a new interactive visualization of the distribution of p-values created by Kristoffer Magnusson (@RPsychologist) based on code by JP de Ruiter (@JPdeRuiter).

Question 1: Would you be inclined to interpret a p-value between 0.16- 0.17 as support for the presence of an effect, assuming the power of the study was 50%? Write down your answer – we will come back to this question later.

Question 2: If you have 95% power, would you be inclined to interpret a p-value between 0.04 and 0.05 as support for the presence of an effect? Write down your answer – we will come back to this question later.

If you gave a different answer on question 1 than on question 2, you are over relying on p-values, and you’ll want to read this blog post. If you have been collecting larger sample sizes, and continue to rely on p < 0.05 to guide your statistical inferences, you’ll also want to read on.

When we have collected data, we often try to infer whether the observed effect is random noise (the null hypothesis is true) or a signal (the alternative hypothesis is true). It is useful to consider how much more likely it is that we observed a specific p-value if the alternative hypothesis is true, than when the null-hypothesis is true. We can do this by thinking about (or simulating, or using the great visualization by Kristoffer Magnusson) how often we could expect to observe a specific p-value when the alternative hypothesis is true, compared to when the null-hypothesis is true.

The latter is easy. When the null-hypothesis is true, every p-value is equally likely. So we can expect 1% of p-values to fall between 0.04 and 0.05.

When the alternative hypothesis is true, we have a probability of finding a significant effect, which is the statistical power of the test. As power increases, the p-value distribution changes (play around with the visualization to see how it changes). Very high p-values (e.g., p = 0.99) become less likely, and low p-values (e.g., p = 0.01) become more likely. This increase is really slow. If you have 20% power, a p-value between 0.26 and 0.27 is still just as likely under the alternative hypothesis (1%) as under the null hypothesis (only higher p-values are less likely under the null hypothesis than under the alternative hypothesis). When power is 50%, a p-value between 0.17-0.18 is just as likely when the alternative hypothesis is true as when the null hypothesis is true (both are again 1% likely to occur).

If the power of the test is 50%, a p-value between 0.16-0.17 is 1.1% likely. That means it is slightly more likely under the alternative hypothesis, than under the null hypothesis, but not very much (see Question 1).

If the power of the test is 50% (not uncommon in psychology experiments), p-values between 0.04 and 0.05 can be expected around 3.8% of the time, while under the null hypothesis, these p-values can only be expected around 1% of the time. This means that it is 3.8 times more likely to observe this p-value assuming the alternative hypothesis is true, than assuming the null hypothesis is true (dear Bayesian friends: I am also assuming the null hypothesis and the alternative hypothesis are equally likely a-priori). That’s not a lot, but it is something. Take a moment to think about which ratio you would need before you would consider something ‘support’ for the alternative hypothesis (there is no single correct answer).

As power increases even more, most of the p-values from statistical tests will be below 0.01, and there will be relatively few p-values between 0.01 and 0.05. For example, if you have 99% power with an alpha of 0.05, you might at the same time have 95% power for an alpha of 0.01. This means that 99% of the p-values can be expected to be lower than 0.05, but 95% of the p-values will also be below 0.01. That leaves only 4% of the p-values between 0.01 and 0.05. I think you’ll see where I am going.

If you have 95% power (e.g., you have 484 participants, 242 in each of two conditions, and you expect a small effect size of Cohen’s d = 0.3 in a between participants design), and you observe a p-value between 0.04 and 0.05, the probability of observing this p-value is 1.1% when the alternative hypothesis is true. It is still 1% when the null hypothesis is true (see Question 2).

This example shows how a p-value between 0.16-0.17 can give us exactly the same signal to noise ratio as a p-value between 0.04-0.05. It shows that when interpreting p-values, it is important to take the power of the study into consideration.

If your answer on Question 1 and Question 2 is not the same, you are relying too much on p-values when drawing statistical inferences. In both scenarios, the probability of observing this p-value when the alternative hypothesis is true is 1.1%, and the probability of observing this p-value when the null-hypothesis is true is 1%. This is not enough to distinguish between the signal and the noise.

When power is higher than 96%, p-values between 0.04 and 0.05 become more likely under the null-hypothesis than under the alternative hypothesis. In such circumstances, it would make sense to say: “p = 0.041, which does not provide support for our hypothesis”. If you collect larger sample sizes, always calculate Bayes Factors for converging support (and be careful if Bayes Factors do not provide support for your hypothesis). Try out JASP for software that looks just like SPSS, but also provides Bayes Factors, and is completely free.

Even though it is often difficult to know how much power you have (it depends on the true effect size, which is unknown), for any study with 580 participants in each condition, power is 96% for a pretty small true effect size of Cohen’s d = 0.2. In very large samples p-values above 0.04 should not be interpreted as support for the alternative hypothesis. Now that we are seeing more and more large-scale collaborations, and people are making use of big data, it’s important to keep this fact in mind. Obviously you won’t determine whether a paper should be published or not based on the p-value (all interesting papers should be published), but these papers should draw conclusions that follow from the data.

One way to think about this blog post, is that in large sample sizes, we might as well use a stricter Type 1 error rate (e.g., 0.01 instead of 0.05). After all, there is nothing magical about the use of 0.05 as a cut-off, and we should determine the desired Type 1 error rate and Type 2 error rate when we design a study. Cohen suggests a ratio of Type 2 error rates to Type 1 error rates of 4:1, which is reflected in the well known ‘minimum’ recommendation to aim for 80% power (which would mean a 20% Type 2 error rate) when you have a 5% Type 1 error rate . If my Type 2 error rate is only 5% (because I can be pretty sure I have 95% power) than it makes sense to reduce my Type 1 error rate to 1.25 (or just 1) to approximately maintain the ratio between Type 1 errors and Type 2 errors Cohen recommended.

Relying too much on p-values when you draw statistical inferences is a big problem – everyone agrees on this, whether or not they think p-values can be useful. I hope that with this blog post, I’ve contributed a little bit to help you think about p-values in a more accurate manner. Below is some R code you can run to see the probability of observing a p-values between two limits (which I made before Kristoffer Magnusson created his awesome visualization of the original Mathematica code by JP de Ruiter). Go ahead and play around with the online visualization to see how much more likely p-values between 0.049 and 0.050 are when you have 50% power.

P.S.: But Daniel, aren’t p-values ‘fickle’ and don’t they dance around so that they are really useless to draw any statistical inferences? Well, when 95% of them end up below p=0.05, I think we should talk of the march of the p-values instead of the dance of the p-values. They will nicely line up and behave like good little soldiers. All they need is a good commander that designs studies in a manner where p-values can be used to distinguish between signal and noise. If you don’t create a good and healthy environment for our poor little p-values, we can’t blame the children for their parent’s mistake.

P.P.S.: You might think: “Daniel, be proud and say it loud: You are a Bayesian!” Indeed, I am asking you to consider what the true effect size is, so that you can draw an inference about your data, assuming the alternative hypothesis is true. As Beyoncé would say: “'Cause if you liked it, then you should have put a ring on it.” As a Bayesian would say: 'Cause if you liked it, you should have put a prior probability distribution on it’. Yes, I sometimes get a weird tingly feeling when I incorporate prior information in my statistical inferences. But I am not yet sure I always want to know the probability that a hypothesis is true, given the data, or that I even can come close to knowing the probability a hypothesis is true. But I am very sure I want to control my Type 1 error rate in the long term. This blog is only 10 months old, so let's see what happens.

The 20% Statistician

Monday, March 30, 2015

Diversifying Research Methods Syllabi

Friday, March 20, 2015

How a p-value between 0.04-0.05 equals a p-value between 0.16-017