The 20% Statistician

A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, February 12, 2017

ROPE and Equivalence Testing: Practically Equivalent?

In a previous post, I compared equivalence tests to Bayes factors, and pointed out several benefits of equivalence tests. But a much more logical comparison, and one I did not give enough attention to so far, is the ROPE procedure using Bayesian estimation. I’d like to thank John Kruschke for feedback on a draft of this blog post. Check out his own recent blog comparing ROPE to Bayes factors here

When we perform a study, we would like to conclude there is an effect, when there is an effect. But it is just as important to be able to conclude there is no effect, when there is no effect. I’ve recently published a paper that makes Frequentist equivalence tests (using the two-one-sided tests, or TOST, approach) as easy as possible (Lakens, 2017). Equivalence tests allow you to reject the presence of any effect you care about. In Bayesian estimation, one way to argue for the absence of a meaningful effect is the Region of Practical Equivalence (ROPE) procedure (Kruschke, 2014, chapter 12), which is “somewhat analogous to frequentist equivalence testing” (Kruschke & Liddell, 2017).

In the ROPE procedure, a 95% Highest Density Interval (HDI) is calculated based on a posterior distribution (which is calculated based on a prior and the data). Kruschke suggests that: “if the 95 % HDI falls entirely inside the ROPE then we decide to accept the ROPE’d value for practical purposes”. Note that the same HDI can also be used to reject the null hypothesis, where in Frequentist statistics, even though the confidence interval plays a similar role, you would still perform both a traditional t-test and the TOST procedure.

The only real difference with equivalence testing is that instead of using a confidence interval, a Bayesian Highest Density Interval is used. If the prior used by Kruschke was perfectly uniform, ROPE and equivalence testing would identical, barring philosophical differences in how the numbers should be interpreted. The BEST package by default uses a ‘broad’ prior, and therefore the 95% CI and 95% HDI are not exactly the same, but they are very close, for single comparisons. When multiple comparisons are made, (for example when using sequential analyses, Lakens, 2014), the CI needs to be adjusted to maintain the desired error rate, but in Bayesian statistics, error rates are not directly controlled (they are limited due to ‘shrinkage’, but can be inflated beyond 5%, and often considerably so).

In the code below, I randomly generate random normally distributed data (with means of 0 and a sd of 1) and perform the ROPE procedure and the TOST. The 95% HDI is from -0.10 to 0.42, and the 95% CI is from -0.11 to 0.41, with mean differences of 0.17 or 0.15.

Indeed, if you will forgive me the pun, you might say these two approaches are practically equivalent. But there are some subtle differences between ROPE and TOST

95% HDI vs 90% CI

Kruschke (2014, Chapter 5) writes: “How should we define “reasonably credible”? One way is by saying that any points within the 95% HDI are reasonably credible.” There is not a strong justification for the use of a 95% HDI over a 96% of 93% HDI, except that it mirrors the familiar use of a 95% CI in Frequentist statistics. In Frequentist statistics, the 95% confidence interval is directly related to the 5% alpha level that is commonly deemed acceptable for a maximum Type 1 error rate (even though this alpha level is in itself a convention without strong justification).

But here’s the catch: The TOST equivalence testing procedure does not use a 95% CI, but a 90% CI. The reason for this is that two one-sided tests are performed. Each of these tests has a 5% error rate. You might intuitively think that doing two tests with a 5% error rate will increase the overall Type 1 error rate, but in this case, that’s not true. You could easily replace the two tests, with just one test, testing the observed effect against the equivalence bound (upper or lower) closest to it. If this test is statistically significant, so is the other – and thus, there is no alpha inflation in this specific case. That’s why the TOST procedure uses a 90% CI to have a 5% error rate, while the same researcher would use a 95% CI in a traditional two-sided t-test to examine whether the observed effect is statistically different from 0, while maintaining a 5% error rate (see also Senn, 2007, section 22.2.4)

This nicely illustrates the difference between estimation (where you just want to have a certain level of accuracy, such as 95%), and Frequentist hypothesis testing, where you want to distinguish between signal and noise, and not be wrong more than 5% of the time when you declare there is a signal. ROPE keeps the accuracy the same across tests, Frequentist approaches keep the error rate constant. From a Frequentist perspective, ROPE is more conservative than TOST, like the use of alpha = 0.025 is more conservative than the use of alpha = 0.05.

Power analysis

For an equivalence test, power analysis can be performed based on closed functions, and the calculations take just a fraction of a second. I find that useful, for example in my role in our ethics board, where we evaluate proposals that have to justify their sample size, and we often check power calculations. Kruschke has an excellent R package (BEST) that can do power analyses for the ROPE procedure. This is great work – but the simulations take a while (a little bit over an hour for 1000 simulations).

Because the BESTpower function relies on simulations, you need to specify the sample size, and it will calculate the power. That’s actually the reverse of what you typically want in a power analysis (you want to input the desired power, and see which sample size you need). This means you most likely need to run multiple simulations in BESTpower, before you have determined the sample size that will yield good power. Furthermore, the software requires your to specify the expected means and standard deviations, instead of simply an expected effect size. Instead of Frequentist power analysis, where the hypothesized effect size is a point value (e.g., d = 0.4), Bayesian power analysis models the alternative as a distribution, acknowledging there is uncertainty.

In the end, however, the result of a power analysis for ROPE and for TOST is actually remarkably similar. Using the code below to perform the power analysis for ROPE, we see that 100 participants in each group give us approximately 88.4% power (with 2000 simulations, this estimate is still a bit uncertain) to get a 95% HDI that falls within our ROPE of -0.5 to 0.5, assuming standard deviations of 1.

We can use the powerTOSTtwo.raw function in the TOSTER package (using an alpha of 0.025 instead of 0.05, to mirror to 95% HDI) to calculate the sample size we would need to achieve 88.4% power for independent t-test (using equivalence bounds of -0.5 and 0.5, and standard deviations of 1):


The outcome is 100 as well. So if you use a broad prior, it seems you can save yourself some time by using the power analysis for equivalence tests, without severe consequences.

Use of prior information

The biggest benefit of ROPE over TOST is that is allows you to incorporate prior information in your data analysis. If you have reliable prior information, ROPE can use this information, which is especially useful if you don’t have a lot of data. If you use priors, it is typically advised to check the robustness of the posterior against reasonable changes in the prior (Kruschke, 2013).


Using the ROPE procedure or the TOST procedure will most likely lead to very similar inferences. For all practical purposes, the differences are small. It’s quite a lot easier to perform a power analysis for TOST, and by default, TOST has greater statistical power because it uses 90% CI. But power analysis is possible for ROPE (which is a rare pleasure to see for Bayesian analyses), and you could choose to use a 90% HDI, or any other value that matches your goals. TOST will be easier and more familiar because it is just a twist on the classic t-test, but ROPE might be a great way to dip your toes in Bayesian waters and explore the many more things you can do with Bayesian posterior distributions.


Kruschke, J. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603.
Kruschke, J. (2014). Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan (2 edition). Boston: Academic Press.
Kruschke, J., & Liddell, T. M. (2017). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review.
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710.
Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science.
Senn, S. (2007). Statistical issues in drug development (2nd ed). Chichester, England ; Hoboken, NJ: John Wiley & Sons.

Sunday, January 29, 2017

Examining Non-Significant Results with Bayes Factors and Equivalence Tests

In this blog, I’ll compare two ways of interpreting non-significant effects: Bayes factors and TOST equivalence tests. I’ll explain why reporting more than only Bayes factors makes sense, and highlight some benefits of equivalence testing over Bayes factors. I’d like to say a big thank you to Bill (Lihan) Chen and Victoria Savalei for helping me out super-quickly with my questions as I was re-analyzing their data.

Does volunteering improve well being? A recent article by Ashley Whillans, Scott Seider, Lihan Chen, Ryan Dwyer, Sarah Novick, Kathryn Gramigna, Brittany Mitchell, Victoria Savalei, Sally Dickerson & Elizabeth W. Dunn suggests the answer is: Not so much. The study was published in Comprehensive Results in Social Psychology, one of the highest quality journals in social psychology, which peer-reviews pre-registrations of studies before they are performed.

People were randomly assigned to a volunteering program for 6 months, or to a control condition. Before and after, a wide range of well-being measures were collected. Bayes factors support the null for all measures. The main results (and indeed, except for some manipulation checks, the only results – not even means or standard deviations are provided in the article) are communicated in the form of Bayes factors in Table 2.

The Bayes factors were calculated using the Bayes factor calculator by Zoltan Dienes, who has a great open access paper in Frontiers, cited more than 200 times since 2014, on how to use Bayes to get most out of non-significant results. I won’t try to explain in detail how these Bayes factors are calculated – too many Bayesians on Twitter have told me I am too stupid to understand the math behind Bayes factors, and how I should have taken calculus in high school. They are right on both accounts, so just read Dienes (2014) for an explanation.

As Dienes (2014) discusses, you can also interpret non-significant results using Frequentist statistics. In a TOST equivalence test, which consists of two simple one-sided t-tests, you determine whether an effect falls between equivalence bounds set to the smallest effect size you care about (for an introduction, see Lakens, 2017). Dienes (2014) says it can be difficult to determine what this smallest effect size of interest is, but for me, if anything, it is easier to determine a smallest effect size of interest than to specify an alternative model in Bayesian statistics.

The authors examined whether well-being was improved by volunteering, and specified an alternative model (what would a true effect of improved well-being look like?) as follows (page 9): “Because our goal was to contrast the null hypothesis to an alternative hypothesis that the effect is moderate in size, we used a normal distribution prior with a mean of 0.50 and a standard deviation of 0.15 for the standardized effect size (e.g. the difference score between standardized T2 and T1 measures).

It is interesting to see the authors wanted to specify their alternative in terms of a ‘standardized effect size’. I fully agree that using standardized effect sizes is currently the easiest way to think about the alternative hypothesis, and it is the reason my spreadsheet and R package “TOSTER” allow you to specify equivalence bounds in standardized effect sizes when performing an equivalence test.

In equivalence testing, we can test whether the observed data is surprisingly smaller than anything we would expect. The authors seem to find a true effect of d = 0.5 a realistic alternative model. So, a good start is to try to reject an effect of d = 0.5. We can just fill in the means, standard deviations, and sample sizes from both groups, and test against the equivalence bound of d = 0.5 (see the code at the bottom of the post). Note that the authors perform a two-sided test (even though they have a one-sided hypothesis, as indicated in the title “Does volunteering improve well-being?”, but following the authors, I will test whether the effect is statistically smaller than d = 0.5, and statistically larger than d = -0.5, instead of only testing whether the effect is smaller than d = 0.5). The most important results are summarized in the Figure below:

Testing the effect for WSB, one of the well-being measures, the standardized effect size of 0.5 equals a raw effect of 0.762 in scale points on the original measure. Because the 90% confidence interval around the mean difference does not contain -0.762 or 0.762, the observed data is surprising (a.k.a statistically significant), if there was a true effect of d = -0.5 or d = 0.5 (see Lakens, 2017, for a detailed explanation). We can reject the hypothesis that d = -0.5 or d = 0.5, and if we do this, given our alpha of 0.05, we would be wrong a maximum of 5% of the time, in the long run. Other people might find smaller effects still of interest. They can collect more data, and perform an equivalence test in a meta-analysis. 

We could write: Using a TOST procedure to test the data against equivalence bounds of d = -0.5 and d = 0.5, the observed results were statistically equivalent to zero, t(78.24) = -2.86, p = 0.003. The mean difference was -0.094, 90% CI[-0.483; 0.295].

Benefits of equivalence tests compared to Bayes factors.

If we perform equivalence tests, we see that we can conclude statistical equivalence for all nine measures. You might wonder about whether we need to correct for the fact that we perform nine tests for all the different well-being measures. Would we conclude that volunteering has a positive effect on well-being, if any single one of these tests showed a significant effect? If so, we should indeed correct for multiple comparisons to control our overall Type 1 error rate, and you can do this in equivalence testing. There is no easy way to control error rates in Bayesian statistics. Some Bayesians simply don’t care about error control, and I don’t exactly know what Bayesian who care about error control do. I care about error control, and the attention p-hacking is getting suggests I am not alone. In equivalence testing, you can control the Type 1 error rate simply by adjusting the alpha level, which is one benefit of equivalence testing over Bayes factors.

To calculate a Bayes factor, you need to specify your prior by providing the mean and standard deviation of the alternative. Bayes factors are quite sensitive to how you specify these priors, and for this reason, not every Bayesian statistician would recommend the use of Bayes factors. Andrew Gelman, a widely known Bayesian statistician, recently co-authored a paper in which Bayes factors were used as one of three Bayesian approaches to re-analyze data. In footnote 3 it is written: “Andrew Gelman wishes to state that he hates Bayes factors” – mainly because of this sensitivity to priors. So not everyone likes Bayes factors (just like not everyone likes p-values!). You can discuss the sensitivity to priors in a sensitivity analysis, which would mean plotting Bayes factors for alternative models with a range of means and standard deviations and different distributions, but I rarely see this done in practice. Equivalence tests also depend on the choice of the equivalence bounds. But it is very easy to see the effect of different equivalence bounds on the test result – you can just check if the equivalence bound you would have chosen falls within the 90% confidence interval. So that is a second benefit of equivalence testing.

The authors used a power analysis to determine the sample size they needed (page 7): "To achieve 80% power to detect an effect size of r = 0.21 (d = 0.40), we required at least 180 participants to detect significant effects of volunteering on our SWB measures of interest." But what was the power of the study to support the null? Although you can simulate everything in R, there is no software to perform power analysis for Bayes factors (indeed, 'power' is a Frequentist concept). When performing an equivalence test, you can easily perform a power analysis to make sure you have a well-powered study if there is an effect, and when there is no effect (and the spreadsheet and R package allow you to do this). When pre-registering a study, you need to justify your sample size, both with an eye for when the alternative hypothesis is true, as when the null hypothesis is true. The ease with which you can perform power calculations is another benefit of equivalence tests.

A final benefit I’d like to discuss concerns the assumptions of statistical tests. You should not perform tests when their assumptions are violated. The authors in the paper examining the effect of volunteering on well-being correctly report Welch’s t-tests, because they have unequal sample sizes in each group, and the equal variances assumption is violated. This is excellent practice. I don’t know how Bayes factors deal with unequal variances (I think they don’t, and simply assume equal variances, but I’m sure the answer will appear in the comments, if there is one). My TOST equivalence test spreadsheet and R code use Welch’s t-test by default (just as R does), so unequal variances is no longer a problem. The equal variances assumption is not very plausible in many research questions in psychology (Delacre, Lakens, & Leys, under review), so not having to assume equal variances is another benefit of equivalence testing compared to Bayes factors.


Only reporting Bayes factors seems, to me, an incomplete description of the data. I think it makes sense to report an effect size, the mean difference, and the confidence interval around it. And if you do that, and have determined a smallest effect size of interest, then performing the TOST equivalence testing procedure is nothing more than checking and reporting whether the p-value for the TOST procedure is smaller than your alpha level to conclude the effect is statistically equivalent. And you can still add a Bayes factor, if you want.

All approaches to statistical inferences have strengths and weaknesses. In most situations, both Bayes factors and equivalence tests lead to conclusions that have the same practical consequences. Whenever they do not, it is never the case that one approach is correct, and one is wrong – the answers differ because the tests have different assumptions, and you will have to think about your data more, which is never a bad thing. In the end, as long as you share the data of your paper online, as the current authors did, anyone can calculate the statistics they like. But only reporting Bayes factors is not really enough to describe your data. You might want to at least report means and standard deviations, so that people who want to include the effect size in a meta-analysis don’t need to re-analyze your data. And you might want to try out equivalence tests next time you interpret null results.

Sunday, December 18, 2016

Why Type 1 errors are more important than Type 2 errors (if you care about evidence)

After performing a study, you can correctly conclude there is an effect or not, but you can also incorrectly conclude there is an effect (a false positive, alpha, or Type 1 error) or incorrectly conclude there is no effect (a false negative, beta, or Type 2 error).

The goal of collecting data is to provide evidence for or against a hypothesis. Take a moment to think about what ‘evidence’ is – most researchers I ask can’t come up with a good answer. For example, researchers sometimes think p-values are evidence, but p-values are only correlated with evidence.

Evidence in science is necessarily relative. When data is more likely assuming one model is true (e.g., a null model) compared to another model (e.g., the alternative model), we can say the model provides evidence for the null compared to the alternative hypothesis. P-values only give you the probability of the data under one model – what you need for evidence is the relative likelihood of two models.

Bayesian and likelihood approaches should be used when you want to talk about evidence, and here I’ll use a very simplistic likelihood model where we compare the relative likelihood of a significant result when the null hypothesis is true (i.e., making a Type 1 error) with the relative likelihood of a significant result when the alternative hypothesis is true (i.e., *not* making a Type 2 error).

Let’s assume we have a ‘methodological fetishist’ (Ellemers, 2013) who is adamant about controlling their alpha level at 5%, and who observes a significant result. Let’s further assume this person performed a study with 80% power, and that the null hypothesis and alternative hypothesis are equally (50%) likely. The outcome of the study has a 2.5% probability of being a false positive (a 50% probability that the null hypothesis is true, multiplied by a 5% probability of a Type 1 error), and a 40% probability of being a true positive (a 50% probability that the alternative hypothesis is true, multiplied by an 80% probability of finding a significant effect).

The relative evidence for H1 versus H0 is 0.40/0.025 = 16. In other words, based on the observed data, and a model for the null and a model for the alternative hypothesis, it is 16 times more likely that the alternative hypothesis is true than that the null hypothesis is true. For educational purposes, this is fine – for statistical analyses, you would use formal likelihood or Bayesian analyses.

Now let’s assume you agree that providing evidence is a very important reason for collecting data in an empirical science (another goal of data collection is estimation – but I’ll focus on hypothesis testing here). We can now ask ourselves what the effect of changing the Type 1 error or the Type 2 error (1-power) is on the strength of our evidence. And let’s agree that we will conclude that whichever error impacts the strength of our evidence the most, is the most important error to control. Deal?

We can plot the relative likelihood (the probability a significant result is a true positive, compared to a false positive) assuming H0 and H1 are equally likely, for all levels of power, and for all alpha levels. If we do this, we get the plot below:

 Or for a rotating version (yeah, I know, I am an R nerd): 

So when is the evidence in our data the strongest? Not surprisingly, this happens when both types of errors are low: the alpha level is low, and the power is high (or the Type 2 error rate is low). That is why statisticians recommend low alpha levels and high power. Note that the shape of the plot remains the same regardless of the relative likelihood H1 or H0 is true, but when H1 and H0 are not equally likely (e.g., H0 is 90% likely to be true, and H1 is 10% likely to be true) the scale on the likelihood ratio axis increases or decreases.

Now for the main point in this blog post: we can see that an increase in the Type 2 error rate (or a reduction in power) reduces the evidence in our data, but it does so relatively slowly. However, we can also see that an increase in the Type 1 error rate (e.g., as a consequence of multiple comparisons without controlling for the Type 1 error rate) quickly reduces the evidence in our data. Royall (1997) recommends that likelihood ratios of 8 or higher provide moderate evidence, and likelihood ratios of 32 or higher provide strong evidence. Below 8, the evidence is weak and not very convincing.

If we calculate the likelihood ratio for alpha = 0.05, and power from 1 to 0.1 in steps of 0.1, we get the following likelihood ratios: 20, 18, 16, 14, 12, 10, 8, 6, 4, 2. With 80% power, we get the likelihood ratio of 16 we calculated above, but even 40% power leaves us with a likelihood ratio of 8, or moderate evidence (see the figure above). If we calculate the likelihood ratio for power = 0.8 and alpha levels from 0.05 to 0.5 in steps of 0.05, we get the following likelihood ratios: 16, 8, 5.3, 4, 3.2, 2.67, 2.29, ,2, 1.78, 1.6. An alpha level of 0.1 still yields moderate evidence (assuming power is high enough!) but further inflation makes the evidence in the study very weak.

To conclude: Type 1 error rate inflation quickly destroys the evidence in your data, whereas Type 2 error inflation does so less severely.

Type 1 error control is important if we care about evidence. Although I agree with Fiedler, Kutzner, and Kreuger (2012) that a Type 2 error is also very important to prevent, you simply can not ignore Type 1 error control if you care about evidence. Type 1 error control is more important than Type 2 error control, because inflating Type 1 errors will very quickly leave you with evidence that is too weak to be convincing support for your hypothesis, while inflating Type 2 errors will do so more slowly. By all means, control Type 2 errors - but not at the expense of Type 1 errors.

I want to end by pointing out that Type 1 and Type 2 error control is not a matter of ‘either-or’. Mediocre statistics textbooks like to point out that controlling the alpha level (or Type 1 error rate) comes at the expense of the beta (Type 2) error, and vice-versa, sometimes using the horrible seesaw metaphor below:

But this is only true if the sample size is fixed. If you want to reduce both errors, you simply need to increase your sample size, and you can make Type 1 errors and Type 2 errors are small as you want, and contribute extremely strong evidence when you collect data.

Ellemers, N. (2013). Connecting the dots: Mobilizing theory to reveal the big picture in social psychology (and why we should do this): The big picture in social psychology. European Journal of Social Psychology, 43(1), 1–8.
Fiedler, K., Kutzner, F., & Krueger, J. I. (2012). The Long Way From -Error Control to Validity Proper: Problems With a Short-Sighted False-Positive Debate. Perspectives on Psychological Science, 7(6), 661–669.
Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. London ; New York: Chapman and Hall/CRC.

Friday, December 9, 2016

TOST equivalence testing R package (TOSTER) and spreadsheet

I’m happy to announce my first R package ‘TOSTER’ for equivalence tests (but don’t worry, there is an old-fashioned spreadsheet as well).

In an earlier blog post I talked about equivalence tests. Sometimes you perform a study where you might expect the effect is zero or very small. So how can we conclude an effect is ‘zero or very small’? One approach is to specify effect sizes we consider ‘not small’. For example, we might decide that effects larger than d = 0.3 (or smaller than d = -0.3 in a two-sided t-test), are ‘not small’. Now, if we observe an effect that falls between the two equivalence bounds of d = -0.3 and d = 0.3 we can act (in good old-fashioned Neyman-Pearson approach to statistical inferences) as if the effect is ‘zero or very small’. It might not be exactly zero, but it is small enough. You can check out a great interactive visualization of equivalence testing by RPsychologist.

We can use two one-sided tests to statistically reject effects -0.3, and ≥ 0.3. This is the basic idea of the TOST (two one-sided tests) equivalence procedure. The idea is simple, and it is conceptually similar to the traditional null-hypothesis test you probably use in your article to reject an effect of zero. But where all statistics programs will allow you to perform a normal t-test, it is not yet that easy to perform a TOST equivalence test (Minitab is one exception).

But psychology really needs a way to show effects are too small to matter (see ‘Why most findings in psychology are statistically unfalsifiable’ by Richard Morey and me). So I made a spreadsheet and R package to perform the TOST procedure. The R package is available from CRAN, which means you can install it using install.packages(“TOSTER”).

Let’s try a practical example (this is one of the examples from the vignette that comes with the R package).

Eskine (2013) showed that participants who had been exposed to organic food were substantially harsher in their moral judgments relative to those in the control condition (Cohen’s d = 0.81, 95% CI: [0.19, 1.45]). A replication by Moery & Calin-Jageman (2016, Study 2) did not observe a significant effect (Control: n = 95, M = 5.25, SD = 0.95, Organic Food: n = 89, M = 5.22, SD = 0.83). The authors have used Simonsohn’s recommendation to power their study so that they have 80% power to detect an effect the original study had 33% power to detect. This is the same as saying: We consider an effect to be ‘small’ when it is smaller than the effect size the original study had 33% power to detect.

With n = 21 in each condition, Eskine (2013) had 33% to detect an effect of d = 0.48. This is the effect the authors of the replication study designed their study to detect. The original study had shown an effect of d = 0.81, and the authors performing the replication decided that an effect size of d = 0.48 would be the smallest effect size they will aim to detect with 80% power. So we can use this effect size as the equivalence bound. We can use R to perform an equivalence test:

TOSTtwo(m1=5.25, m2=5.22, sd1=0.95, sd2=0.83, n1=95, n2=89, low_eqbound_d=-0.43, high_eqbound_d=0.43, alpha = 0.05)

Which gives us the following output:

Using alpha = 0.05 Student's t-test was non-significant, t(182) = 0.2274761, p = 0.8203089

Using alpha = 0.05 the equivalence test based on Student's t-test was significant, t(182) = -3.026311, p = 0.001417168

TOST results:
  t-value 1    p-value 1 t-value 2   p-value 2  df
1  3.481263 0.0003123764 -3.026311 0.001417168 182

Equivalence bounds (Cohen's d):
  low bound d high bound d
1       -0.48         0.48

Equivalence bounds (raw scores):
  low bound raw high bound raw
1    -0.4291159      0.4291159

TOST confidence interval:
  Lower Limit 90% CI raw Upper Limit 90% CI raw
1             -0.1880364              0.2480364

You see, we are just using R like a fancy calculator, entering all the numbers in a single function. But I can understand if you are a bit intimidated by R. So, you can also fill in the same info in the spreadsheet (click picture to zoom):

Using a TOST equivalence procedure with alpha = 0.05, and without assuming equal variances (because when sample sizes are unequal, you should report Welch’s t-test by default), we can reject effects larger than d = 0.48: t(182) = -3.03, p = 0.001.

The R package also gives a graph, where you see the observed mean difference (in raw scale units), the equivalence bounds (also in raw scores), and the 90% and 95% CI. If the 90% CI does not include the equivalence bounds, we can declare equivalence. 

Moery and Calin-Jageman concluded from this study: “We again found that food exposure has little to no effect on moral judgments” But what is ‘little to no”? The equivalence test tells us the authors successfully rejected effects of a size the original study had 33% power to reject. Instead of saying ‘little to no’ we can put a number on the effect size we have rejected by performing an equivalence test.

If you want to read more about equivalence tests, including how to perform them for one-sample t-tests, dependent t-tests, correlations, or meta-analyses, you can check out a practical primer on equivalence testing using the TOST procedure I've written. It's available as a pre-print on PsyArXiv. The R code is available on GitHub.