A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, June 27, 2014

Too True to be Bad: When Sets of Studies with Significant and Non-Significant Findings Are Probably True


Most of this post is inspired by a lecture on probabilities by Ellen Evers during a PhD workshop we taught (together with Job van Wolferen and Anna van ‘t Veer) called ‘How do we know what’s likely to be true’. I’d heard this lecture before (we taught the same workshop at Eindhoven a year ago) but now she extended her talk to the probability of observing a mix of significant and non-significant findings. If this post is useful for you, credit goes to Ellen Evers.

A few days ago, I sent around some questions on Twitter (thanks for answering!) and in this blog post, I’d like to explain the answers. Understanding this is incredibly important and will change the way you look at sets of studies that contain a mix of significant and non-significant results, so you want to read until the end. It’s not that difficult, but you probably want to get a coffee. 42 people answered the questions, and all but 3 worked in science, anywhere from 1 to 26 years. If you want to do the questions before reading the explanations below (which I recommend), go here

I’ll start with the easiest question, and work towards the most difficult one.

Running a single study

I asked: You are planning a new study. Beforehand, you judge it is equally likely that the null-hypothesis is true, as that it is false (a uniform prior). You set the significance level at 0.05 (and pre-register this single confirmatory test to guarantee the Type 1 error rate). You design the study to have 80% power if there is a true effect (assume you succeed perfectly). What do you expect is the most likely outcome of this single study?

The four response options were:

1) It is most likely that you will observe a true positive (i.e., there is an effect, and the observed difference is significant).


2) It is most likely that you will observe a true negative (i.e., there is no effect, and the observed difference is not significant)


3) It is most likely that you will observe a false positive (i.e., there is no effect, but the observed difference is significant).


4) It is most likely that you will observe a false negative (i.e., there is an effect, but the observed difference is not significant)



59% of the people chose the correct answer: It’s most likely that you’ll observe a true negative. You might be surprised, because the scenario (5% significance level, 80% power, the null hypothesis (H0) and the alternative hypothesis (H1) are equally likely to be true) is pretty much the prototypical experiment. It thus means that a typical experiment (at least when you think your hypothesis is 50% likely to be true) is most likely not to reject the null-hypothesis (earlier, I wrote 'fail', but in the comments Ron Dotsch correctly points out not rejecting the null can be informative as well). Let’s break it down slowly.

If you perform a single study, the effect you are examining is either true or false, and the difference you observe is either significant or not significant. These four possible outcomes are referred to as true positives, false positives, true negatives, and false negatives. The percentage of false positives equals the Type 1 error rate (or α, the significance level), and false negatives (or Type 2 errors, β) equal 1 minus the power of the study. When the null hypothesis (H0) and the alternative hypothesis (H1) are a-priori equally likely, the significance level is 5%, and the study has 80% power, the relative likelihood of the four possible outcomes of this study before we collect the data is detailed in the table below.



                           H0 True (A-Priori 50% Likely)     H1 True (A-Priori 50% Likely)
Significant Finding        False Positive (α): 2.5%          True Positive (1-β): 40%
Non-Significant Finding    True Negative (1-α): 47.5%        False Negative (β): 10%
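If you want to check these numbers yourself, below is a minimal sketch in R (the variable names are mine, purely for illustration) that reproduces the four cells of the table:

# The four outcome probabilities for a single study, given a 50% prior,
# a 5% significance level, and 80% power.
prior_h1 <- 0.50   # a-priori probability that H1 is true
alpha    <- 0.05   # significance level (Type 1 error rate)
power    <- 0.80   # 1 - beta
true_positive  <- prior_h1 * power             # 0.400
false_negative <- prior_h1 * (1 - power)       # 0.100
false_positive <- (1 - prior_h1) * alpha       # 0.025
true_negative  <- (1 - prior_h1) * (1 - alpha) # 0.475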


The only way a true positive is most likely (the answer provided by 24% of the participants) given this a-priori likelihood of H0 is when the power is higher than 1-α, so in this example higher than 95%. After asking which outcome was most likely, I asked how likely this outcome was. In the sample of 42 people who filled out my questionnaire, there were people who responded intuitively, and people who did the math. Twelve people correctly reported 47.5%. What’s interesting is that 16 people (more than one-third) reported a percentage higher than 50%. These people might have simply ignored the information that the hypothesis was equally likely to be true as false (which implies no outcome can be more than 50% likely), and intuitively calculated probabilities assuming the effect was true, while ignoring the probability that it was not true. The modal response of people who had indicated earlier that a true positive was the most likely outcome also points in this direction, because they judged it to be 80% probable that this true positive would be observed.

Then I asked: 

“Assume you performed the single study described above, and have observed a statistical difference (p < .05, but you don’t have any further details about effect sizes, exact p-values, or the sample size). Simply based on the fact that the study is statistically significant, how likely do you think it is you observed a significant difference because you were examining a true effect?”

Eight people (who did the math) answered 94.1%, the correct answer. All but two people who responded intuitively underestimated the correct answer (the average answer was 57%). The remaining two answered 95%, which indicates they might have made the common error of assuming that observing a significant result means it’s 95% likely the effect is true (it’s not, see Nickerson, 2000). It’s interesting that people who responded intuitively overestimated the a-priori chance of a specific outcome, but then massively underestimated the probability that the effect was true after a significant result had been observed. The correct answer is 94.1% because, now that we know we did not observe a non-significant result, only the two significant outcomes remain: there was a 2.5% chance of a Type 1 error, and a 40% chance of a true positive. The probability that the effect is true, given that we observed a significant result, is therefore 40 divided by the total of 40+2.5, and 40/(40+2.5)=94.1%. Ioannidis (2005) calls this post-study probability that the effect is true the positive predictive value, or PPV (thanks to Marcel van Assen for pointing this out).
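In code, the PPV of this single study is simply the probability of a true positive divided by the total probability of a significant result (a minimal sketch, reusing the numbers from the table above):

# P(effect is true | significant result) = true positives / (true positives + false positives)
ppv <- 0.40 / (0.40 + 0.025)
ppv   # 0.941, the positive predictive value of this single significant study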

What happens if you run multiple studies?

Continuing the example as Ellen Evers taught it, I asked people to imagine they performed three of the studies described above, and found that two were significant but one was not. How likely would it be to observe this outcome if the alternative hypothesis is true? All people who did the math gave the answer 38.4%. This is the a-priori likelihood of finding 2 out of 3 studies to be significant with 80% power and a 5% significance level. If the effect is true, there’s an 80% probability of finding a significant effect, times an 80% probability of finding a significant effect, times a 20% probability of a Type 2 error: 0.8*0.8*0.2 = 12.8%. Because there are three orders in which you can observe two out of three significant results (S S NS; S NS S; NS S S), you multiply this by 3, and 3*12.8 gives 38.4%. Ellen prefers to focus on the single outcome you have observed, including the specific order in which it was observed.

I might not have formulated the question clearly enough (most probability statements are so unlike natural language that they can be difficult to formulate precisely), but I tried to ask not for the a-priori probability, but for the probability that, given these observations, the studies examined a true effect (similar to the single study case above, where the answer was not 80%, but 94.1%). In other words, the probability that H1 is true, conditional on the acceptance of H1, which Ioannidis (2005) calls the PPV. This is the likelihood of finding a true positive, divided by the total probability of finding a significant result (either a true positive or a false positive).

We therefore also need to know how likely it is to observe this finding when the null-hypothesis is true. In that case, we would find a Type 1 error (5%), another Type 1 error (5%), and a true negative (95%): 0.05*0.05*0.95 = 0.2375%. There are three ways to get this pattern of results, so if you want the probability of 2 out of 3 significant findings under H0 irrespective of the order, this probability is 0.7125%. That’s not very likely at all.

To answer the question, we need to calculate 12.8/(12.8+0.2375) (for the specific order in which the results were observed) or 38.4/(38.4+0.7125) (for any 2 out of 3 studies), and both calculations give us 98.18%. Although a-priori it is not extremely likely to observe 2 significant and 1 non-significant finding, after you have observed this outcome it is more than 98% likely that you were examining a true effect (and thus only 1.82% likely that you were not).
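The same logic extends to any number of studies. Below is a minimal R sketch (the function name ppv_studies is just mine, for illustration) that computes the probability that the effect is true after observing k significant results out of n studies, ignoring the order in which they were observed:

# PPV after observing k significant results in n studies, each run with the same
# alpha and power, given a prior probability that H1 is true.
ppv_studies <- function(k, n, alpha = 0.05, power = 0.80, prior = 0.50) {
  p_h1 <- dbinom(k, n, power)   # e.g., P(2 of 3 significant | H1 true) = 0.384
  p_h0 <- dbinom(k, n, alpha)   # e.g., P(2 of 3 significant | H0 true) = 0.007125
  (p_h1 * prior) / (p_h1 * prior + p_h0 * (1 - prior))
}
ppv_studies(2, 3)   # 0.9818, the 98.18% calculated above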

The probability that, given that you observed a mix of significant and non-significant studies, the effect you examined is true, is important to understand correctly if you do research. In a time when sets of 5 or 6 significant low-powered studies are criticized for being ‘too good to be true’, it’s important that we know when a set of studies with a mix of significant and non-significant results is ‘too true to be bad’. Ioannidis (2005) briefly mentions that you can extend the calculations to multiple studies, but focusses too much on when findings are most likely to be false. What struck me from the lecture Ellen Evers gave is how likely some sets of studies that include non-significant findings are to be true.

These calculations depend on the power, significance level, and a-priori likelihood that H0 is true. If Ellen and I ever find the time to work on a follow up to our recent article on Practical Recommendations to Increase the Informational Value of Studies, I would like to discuss these issues in more detail. To interpret whether 1 out of 2 studies is still support for your hypothesis, these values matter a lot, but to interpret whether 4 out of 6 studies are support for your hypothesis, they are almost completely irrelevant. This means that one or two non-significant findings in a larger set of studies do almost nothing to reduce the likelihood that you were examining a true effect. If you’ve performed three studies that all worked, and a close replication isn’t significant, don’t get distracted by looking for moderators, at least until the unexpected result is replicated.
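To see how much (or how little) these values matter, you can play around with the hypothetical ppv_studies sketch from earlier in this post:

# 1 out of 2 significant: the power and the prior change the conclusion quite a bit
ppv_studies(1, 2, power = 0.80, prior = 0.50)   # 0.77
ppv_studies(1, 2, power = 0.50, prior = 0.20)   # 0.57
# 4 out of 6 significant: the power and the prior hardly matter anymore
ppv_studies(4, 6, power = 0.80, prior = 0.50)   # 0.9997
ppv_studies(4, 6, power = 0.50, prior = 0.20)   # 0.9986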

I've taken the spreadsheet Ellen Evers made and shared with the PhD students, and extended it slightly. You can download it here, and use it to perform your own calculations with different levels of power, significance levels, and a-priori likelihoods of H0. On the second tab of the spreadsheet, you can perform these calculations for studies that have different power and significance levels. If you want to start trying out different options immediately, use the online spreadsheet below:



If we want to reduce publication bias, understanding (I mean, really understanding) that sets of studies that include non-significant findings are extremely likely, assuming H1 is true, is a very important realization. Depending on the number of studies, their power, significance level, and the a-priori likelihood of the idea you were testing, it can be no problem to submit a set of studies with mixed significant and non-significant results for publication. If you do, make sure that the Type 1 error rate is controlled (e.g., by pre-registering your study design). 

I want to end with a big thanks to Ellen Evers for explaining this to me last week, and thanks so much to all of you who answered my questionnaire about probabilities.

Thursday, June 12, 2014

The Null Is Always False (Except When It Is True)

An often heard criticism of null-hypothesis significance testing is that the null is always false. The idea is that average differences between two samples will never be exactly zero (there will practically always be a tiny difference, even if it is only 0.001). Furthermore, if the sample size is large enough, tiny differences can be statistically significant. Both these statements are correct, but they do not mean the null is never true.

The null-hypothesis assumes the difference between the means in the two populations is exactly zero. However, the two means in the samples drawn from these two populations vary with each sample (and the less data you have, the greater the variance). If the null is true, the difference between the two sample means will get closer and closer to zero as the sample size approaches infinity. This is a core assumption in Frequentist approaches to statistics. It’s therefore not important that the observed difference in your sample isn’t exactly zero, as long as the difference in the population is zero.

Some researchers, such as Cohen (1990) have expressed their doubt that the difference in the population is ever exactly zero. As Cohen says:

The null hypothesis, taken literally (and that's the only way you can take it in formal hypothesis testing), is always false in the real world. It can only be true in the bowels of a computer processor running a Monte Carlo study (and even then a stray electron may make it false). If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null is always false, what’s the big deal about rejecting it? (p. 1308).

One ‘big deal’ about rejecting it, is that to reject a small difference (e.g., a Cohen’s d of 0.001) you need a sample size of at least 31 million participants to have a decent chance of observing such a statistical difference in a t-test. With such sample sizes, almost all statistics we use (e.g., checks for normality) break down and start to return meaningless results.
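As a rough check of that number, here is a sketch using base R's power.t.test (assuming a two-sample t-test with SD = 1, so that delta equals Cohen's d):

# Sample size per group needed to detect d = 0.001 with 80% power at alpha = .05
power.t.test(delta = 0.001, sd = 1, sig.level = 0.05, power = 0.80)
# Returns an n of roughly 15.7 million per group, so over 31 million participants in total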

Another ‘big deal’ is that we don’t know whether the observed difference will remain equally large irrespective of the increase in sample size (as should happen, when it is an accurately measured true effect) or whether it will become smaller and smaller, without ever becoming statistically significant, the more measurements are added (as should happen when there is actually no effect). Hagen (1997) explains this latter situation in his article ‘In Praise of the Null-Hypothesis Significance Test’ to prevent people from mistakenly assuming that every observed difference will become significant if you simply add participants. He writes:

‘Thus, although it may appear that larger and larger Ns are chasing smaller and smaller differences, when the null is true, the variance of the test statistic, which is doing the chasing, is a function of the variance of the differences it is chasing. Thus, the "chaser" never gets any closer to the "chasee."’
 

What’s a ‘real’ effect?

The more important question is whether it is true that there are always real differences in the real world, and what the ‘real world’ is. Let’s consider the population of people in the real world. While you read this sentence, some individuals in this population have died, and some were born. For most questions in psychology, the population is surprisingly similar to an eternally running Monte Carlo simulation. Even if you could measure all people in the world in a millisecond, and the test-retest correlation was perfect, the answer you would get now would be different from the answer you would get in an hour. Frequentists (the people who use NHST) are not specifically interested in the exact value now, or in one hour, or next week Thursday, but in the average value in the ‘long’ run. The value in the real world today might never be zero, but it’s also never any other fixed value, because it’s continuously changing. If we want to make generalizable statements about the world, I think the fact that the null-hypothesis is never precisely true at any specific moment is not a problem. I’ll ignore more complex questions for now, such as how we can establish whether effects vary over time.

When perfect randomization to conditions is possible, and the null-hypothesis is true, every p-value is just as likely as any other. There’s a great blog post by Jim Grange explaining, with simulations in R, that p-values are uniformly distributed if the null is true. Take the script from his blog, and change the sample size (e.g., to 100000 in each group), or change the variances, and as long as the means of the two groups remain identical, p-values will be uniformly distributed. Although it is theoretically possible that differences are randomly fluctuating around zero in the long term, some researchers have argued this is often not true. Especially in correlational research, or in any situation where participants are not randomly assigned to conditions, this is a real problem.
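If you don't want to dig up that script, a minimal simulation along the same lines (not Jim Grange's code, just a sketch with arbitrary numbers) shows the same thing:

# Simulate 10,000 t-tests comparing two groups drawn from identical populations
set.seed(1)
p_values <- replicate(10000, t.test(rnorm(100), rnorm(100))$p.value)
hist(p_values, breaks = 20)   # roughly flat: p-values are uniform when the null is true
mean(p_values < .05)          # close to .05, the nominal Type 1 error rate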

Meehl talks about how in psychology every individual-difference variable (e.g., trait, status, demographic) correlates with every other variable, which means the null is practically never true. In these situations, testing against the null-hypothesis is not meaningless, but it is also not very informative. If everything correlates with everything else, you need to create good models, and test those. A simple null-hypothesis significance test will not get you very far. I agree.



Random Assignment vs. Crud

To illustrate when NHST can be used as a source of information in large samples, and when NHST is not informative in large samples, I’ll analyze data from a large dataset with 6344 participants from the Many Labs project. I’ve analyzed 10 dependent variables to see whether they were influenced by A) Gender, and B) Assignment to the high or low anchoring condition in the first study. Gender is a measured individual difference variable, and not a manipulated variable, and might thus be affected by what Meehl calls the crud factor. Here, I want to illustrate that this is A) probably often true for individual difference variables, but perhaps not always, and B) probably never true when analyzing differences between groups individuals were randomly assigned to.

You can download the CleanedData.sav Many Labs Data here, and my analysis syntax here. I perform 8 t-tests and 2 Chi-square tests on 10 dependent variables, while the factor is either gender, or the random assignment to the high or low condition for the first question in the anchoring paradigm. You can download the output here. When we analyze the 10 dependent variables as a function of the anchoring condition, none of the differences are statistically significant (even though there are more than 6000 participants). You can play around with the script, repeating the analysis for the conditions related to the other three anchoring questions (remember to correct for multiple comparisons if you perform many tests), and see how randomization does a pretty good job at returning non-significant results even in very large sample sizes. If the null is always false, it is remarkably difficult to reject. Obviously, when we analyze the answer people gave on the first anchoring question, we find a huge effect of the high vs. low anchoring condition they were randomly assigned to. Here, NHST works. There is probably something going on. If the anchoring effect was a completely novel phenomenon, this would be an important first finding, to be followed by replications and extensions, and finally model building and testing.

The results change dramatically if we use Gender as a factor. There are Gender effects on dependent variables related to quote attribution, system justification, the gambler’s fallacy, imagined contact, the explicit evaluation of arts and math, and the norm of reciprocity. There are no significant differences in political identification (as conservative or liberal), on the response scale manipulation, or on gain vs. loss framing (even though p = .025, such a relatively high p-value is stronger support for the null-hypothesis than for the alternative hypothesis with 5500 participants). It’s surprising that the null-hypothesis (gender does not influence the responses participants give) is rejected for seven out of ten effects. Personally (perhaps because I’ve got very little expertise in gender effects) I was actually extremely surprised, even though the effects are small (with Cohen’s ds of around 0.09). This, ironically, shows that NHST works - I've learned gender effects are much more widespread than I'd have thought before I wrote this blog post.


It also shows we have learned very little, because NHST when examining gender differences does not really tell us anything about WHY gender influences all these different dependent variables. We need better models to really know what’s going on. For the studies where there was no significant effect (such as political orientation), it is risky to conclude gender is irrelevant – perhaps there are moderators, and gender and political identification are related. 


Conclusion

We can reject the hypothesis that the null is always false. Generalizing statements about how the null-hypothesis is always false, and thus how null-hypothesis significance testing is a meaningless endeavor, are only partially accurate. The null hypothesis is always false when it is false, but it’s true when it’s true. It's difficult to know when a non-significant difference reflects a Type 2 error (there is an effect, but it will only become significant if the statistical power is increased, for example by collecting more data), or whether it actually means the null is true. Null-hypothesis significance testing cannot be used to answer these questions. NHST can only reject the null-hypothesis, and when observed differences are not statistically significant, the outcome of a significance test necessarily remains inconclusive. But testing against the assumption that the null-hypothesis is true in exploratory research, at least in experiments where random assignment to conditions is possible, is a useful statistical tool.

Saturday, June 7, 2014

Calculating confidence intervals for Cohen’s d and eta-squared using SPSS, R, and Stata

[Now with update for STATA by my colleague +Chris Snijders]
[Now with update about using the MBESS package for within-subject designs]
[Now with an update on using ESCI]

Confidence intervals are confusing intervals. I have nightmares where my students will ask me what they are, and then I try to define them, and I mumble something nonsensical, and they all point at me and laugh.

Luckily, I have had extensive training in reporting statistics I don’t understand completely when I studied psychology, so when it comes to simply reporting confidence intervals, I’m fine. Although these calculations are really easy to do, for some reason I end up getting a lot of e-mails about them, and it seems people don’t know what to do to calculate confidence intervals for effect sizes. So I thought I’d write one clear explanation, and save myself some time in the future. I’ll explain how to do it in four ways: first using SPSS, second using R, third using Stata (in a section by my colleague Chris Snijders), and finally with some brief comments about using ESCI.

CI for eta-squared in SPSS

First, download CI-R2-SPSS.zip from the website of Karl L Wuensch. His website is extremely useful (the man deserves an award for it) especially for the step-by-step guides he has created. The explanations he has written to accompany the files are truly excellent and if this blog post is useful, credit goes to Karl Wuensch.

This example focusses on designs where all factors in your ANOVA are fixed (e.g., manipulated), not random (e.g., measured), in which case you need to go here. All you need to do is open NoncF.sav (which refers to the non-central F-distribution, for an introduction, see the OSC blog), fill in some numbers in SPSS, and run a script. You’ll see an empty row of numbers, except .95 in the conf column (which happens to be a value you probably don’t want to use, see the end of this post).



Let’s say you have the following in your results section: F(1,198) = 5.72. You want to report partial η² and a confidence interval around it. You type in the F-value 5.72 in the first column, and then the degrees of freedom (1 in the second column, 198 in the third), and you change .95 into .90 (see below for the reason). Then, you just open NoncF3.sps, run the script, and you get the output in the remaining columns of your SPSS file:



We are only interested in the last three columns. From the r2 column, we get r² (or partial η²) = .028, with the lower (lr2) and upper (ur2) limits of the confidence interval to the right, which give us 90% CI [.003; .076]. Easy peasy.

CI for eta-squared in R (or R Studio)

I’m still not very good with R. I use it as a free superpowered calculator, which means I rarely use it to analyze my data (for which I use SPSS) but I use it for stuff SPSS cannot do (that easily). To calculate confidence intervals, you need to install the MBESS package (installing R, Rstudio and MBESS might take less time than starting up SPSS, at least on my computer).

To get the confidence interval for the proportion of variance (r², or η², or partial η²) in a fixed factor analysis of variance, we need the ci.pvaf function. MBESS has loads more options, but you just need to copy, paste, and run:

ci.pvaf(F.value=5.72, df.1=1, df.2=198, N=200, conf.level=.90)

This specifies the F-value, degrees of freedom, and the sample size (which is not needed in SPSS), and the confidence level (again .90, and not .95, see below). You’ll get the following output:


Here we see the by now familiar lower limit and upper limit (.003 and .076). Regrettably, MBESS doesn’t give partial η², so you need to request it in SPSS (or you can use my effect size spreadsheet).
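If you just want the partial η² point estimate to report alongside this interval, you can also compute it directly from the F-value and its degrees of freedom (a quick sketch, using the standard conversion):

# Partial eta-squared from an F-value and its degrees of freedom
F_value <- 5.72; df1 <- 1; df2 <- 198
partial_eta_sq <- (F_value * df1) / (F_value * df1 + df2)
partial_eta_sq   # 0.028, matching the r2 column in the SPSS output above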

UPDATE


I found out that for within designs, the MBESS package returns an error. For example:

Error in ci.pvaf(F.value = 25.73, df.1 = 2, df.2 = 28, N = 18, conf.level = 0.9) : N must be larger than df.1+df.2

This error is correct in between-subjects designs (where the sample size is larger than the degrees of freedom) but this is not true in within-designs (where the sample size is smaller than the degrees of freedom for many of the tests). Thankfully, Ken Kelley (who made the MBESS package) helped me out in an e-mail by pointing out you could just use the R Code within the ci.pvaf function and adapt it. The code below will give you the same (at least to 4 digits after the decimal) values as the Smithson script in SPSS. Just change the F-value, confidence level, and the df.1 and df.2.
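A sketch of what the adapted code looks like is below. It uses MBESS's conf.limits.ncf function to get confidence limits for the noncentrality parameter of the F-distribution, and then converts those limits to partial eta-squared with the same conversion the Smithson scripts use; the F-value and degrees of freedom are taken from the error message above, purely as an example:

library(MBESS)
# Confidence limits for the noncentrality parameter of the F-distribution
Lims <- conf.limits.ncf(F.value = 25.73, conf.level = 0.90, df.1 = 2, df.2 = 28)
# Convert the noncentrality limits to limits for partial eta-squared
Lower.lim <- Lims$Lower.Limit / (Lims$Lower.Limit + 2 + 28 + 1)
Upper.lim <- Lims$Upper.Limit / (Lims$Upper.Limit + 2 + 28 + 1)
Lower.lim; Upper.lim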


CI for Cohen’s d in SPSS


Karl Wuensch adapted the files by Smithson (2001) and created a zip file to compute confidence intervals around Cohen’s d, which works in almost the same way as the calculation for confidence intervals around eta-squared (except for a dependent t-test, in which case you can read more here or here). Open the file NoncT.sav. You’ll again see an almost empty row where you only need to fill in the t-value and the degrees of freedom. Note that (as explained in Wuensch’s help file) there’s a problem with the SPSS files if you fill in a negative t-value, so fill in a positive t-value, and reverse the signs of the upper and lower CI if needed.


If you have a t-test that yielded t(198) = 2.39, you fill in 2.39 in the first column, and 198 in the second column. For a one-sample t-test this would be enough; for a two-sample t-test you also need to fill in the sample sizes n1 (100 participants) and n2 (100 participants). Open T-D-2samples.sps and run it. In the last three columns, we get Cohen’s d (0.33) and the lower and upper limits, 95% CI [0.06, 0.62].


CI for Cohen’s d in R

In MBESS, you can calculate the 95% confidence interval using:

ci.smd(ncp=2.39, n.1=100, n.2=100, conf.level=0.95)

The ncp (or non-centrality parameter) sounds really scary, but it’s just the t-value (in our example, 2.39). n.1 and n.2 are the sample sizes in both groups. You’ll get the following output:


Yes, that’s really all there is to it. The step-by-step guides by Wuensch, and the help files in the MBESS package, should be able to help you if you run into any problems.
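If you also want the point estimate of Cohen's d to report alongside the interval, a quick sketch (assuming a two-sample t-test with equal variances):

# Cohen's d from a two-sample t-value and the two group sizes
t_value <- 2.39; n1 <- 100; n2 <- 100
d <- t_value * sqrt(1/n1 + 1/n2)
d   # 0.338, in line with the value reported by the SPSS script above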

Why should you report 90% CI for eta-squared?

Again, Karl Wuensch has gone out of his way to explain this in a very clear document, including examples, which you can find here. If you don’t want to read it, you should know that while Cohen’s d can be both positive and negative, r² or η² are squared, and can therefore only be positive. This is related to the fact that F-tests are always one-sided (so no, don’t even think about dividing that p = .08 you get from an F-test by two and reporting p = .04, one-sided). If you calculate a 95% CI, you can get situations where the confidence interval includes 0, even though the test reveals a statistical difference with p < .05. For a paper by Steiger (2004) that addresses this topic, click here. This means that a 95% CI around Cohen's d corresponds to a 90% CI around η² for exactly the same test. Furthermore, because eta-squared cannot be smaller than zero, a confidence interval for an effect that is not statistically different from 0 (and that would thus normally include zero) necessarily has to start at 0. You report such a CI as 90% CI [.00; .XX], where the XX is the upper limit of the CI. Confidence intervals are confusing intervals.

@Stata afterburner - a confidence interval around eta-squared
by Chris Snijders

In the fall semester I will be teaching a statistics class with Daniel. I have sleepless nights already about what that will do to my evaluations – who wants to appear old and slow next to such youthful brilliance? [Note that Chris has little to worry about - he was chosen as the best teacher at Eindhoven University of Technology a few years ago - DL]. As Daniel is SPSS and R, and I am just Stata, this also implies some extra work: trying to translate Daniel’s useful stuff to Stata - our students have had their basic course in Stata and I want them to be able to keep up. Luckily, in the case of (confidence intervals around) eta-squared I have an easy way out: Stata 13 includes several effect size measures by default.

The below example is heavily “inspired” by the Stata manual. It shows how to get eta-squared and its confidence interval in Stata. First, read in your data. I use the standard apple data set from Stata:

use http://www.stata-press.com/data/r13/apple

The details of the data do not matter here. Running an Anova goes like this:

anova weight treatment           (output omitted)

To get eta-squared and its confidence interval, type

estat esize

to get

Effect sizes for linear models
-------------------------------------------------------------------
             Source |   Eta-Squared     df     [95% Conf. Interval]
--------------------+----------------------------------------------
              Model |   .9147383         3     .4496586    .9419436
                    |
          treatment |   .9147383         3     .4496586    .9419436
-------------------------------------------------------------------

If you insist on 90% confidence interval, as does Daniel, type:

estat esize, level(90)

Voila!

Stata does several additional things too, such as calculating bootstrapped confidence intervals if you prefer, or calculating effect sizes directly.

Useful additional info can be found at
http://www.stata.com/stata13/effect-sizes/, or try the YouTube intro at https://www.youtube.com/watch?v=rfn5FY96BMc (using the ugly windowed interface).

ESCI update

For people who prefer to use the ESCI software by Cumming, please note that ESCI also has an option to provide a 95% CI around Cohen's d, both for independent and for dependent t-tests. However, the option is slightly hidden - you need to scroll to the right, where you can check a box which is placed out of view. I don't know why such an important option is hidden away like that, but I've been getting a lot of e-mails about this, so I've added a screenshot below to point out where you can find it. After clicking the check box, a new section appears on the left that allows you to calculate a 95% CI around Cohen's d (see second screenshot below).


Conclusion

I think you should report confidence intervals because calculating them is pretty easy and because they tell you something about the accuracy with which you have measured what you were interested in. Personally, I don’t feel them in the way I feel p-values, but that might change. Psychological Science is pushing confidence intervals as the ‘new’ statistics, and I’ve seen some bright young minds being brainwashed to stop reporting p-values altogether and only report confidence intervals. I’m pretty skeptical about the likelihood this will end up being an improvement. Also, I don’t like brainwashing. More seriously, I think we need to stop kidding ourselves that statistical inferences based on a single statistic will ever be enough. Confidence intervals, effect sizes, and p-values (all of which can be calculated from the test statistics and degrees of freedom) present answers to different but related questions. Personally, I think it is desirable to report every statistic that is relevant to your research question.



Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational and Psychological Measurement, 61(4), 605-632. doi:10.1177/00131640121971392