Monday, April 6, 2020

Power for inferiority tests in multi-lab replications

In a previous post, I provided a procedure for calculating power for inferiority tests for multi-lab studies. However, the power formula in this post was incorrect. I’m sorry for any confusion this caused (though based on the view count for that post, I gather that the damage was minimal). Here, I provide a corrected procedure, which I’ve validated with Monte Carlo simulations (information on the validation is provided on the bottom of this post).

Where did I go wrong?

So how did I mess this up? The short answer is that I checked my results against the power functions in TOSTER (Lakens, 2017; which are designed to calculate power for equivalence tests), when one bound was set to (negative) infinity. I assumed this was effectively giving power for an inferiority test. But this is not the case. Although an inferiority test is, in effect, just one of the two arms of an equivalence test, the math underlying the calculations for power is somewhat different.

Thankfully, I figured out a way to reasonably calculate power in the presence of heterogeneity. I describe it below. You’ll notice I’ve ripped off some of my old post.

Some basics

Researchers are increasingly interested in conducting multi-lab replications in order to, among other things, achieve high levels of statistical power that would otherwise be unattainable by a single lab. However, several recent multi-lab replications have turned up nonsignificant results (e.g., Many Labs 4; Klein et al, 2019). When the estimated effect across labs is small and nonsignificant, it can be useful to test whether the observed effect is smaller than a specified bound (e.g., the smallest effect that would have been theoretically meaningful; Anvari & Lakens, 2019). Testing against a specified bound to test whether the studied effect is smaller than that bound is known as an inferiority test. (When testing assess whether the studied effect falls within a set of two bounds, this approach is called an equivalence test; for a more thorough discussion, see Lakens, 2017). In a frequentist framework, an inferiority test can be carried out by checking whether the upper limit of a confidence interval excludes the specified inferiority bound – or equivalently, calculating a test statistic and corresponding p-value to assess whether the observed effect is significantly smaller than the bound.

Just as it’s important to conduct power analyses to determine a multi-lab study’s ability to detect effects, it’s important to assess the study’s ability to reject effects. Calculating power for inferiority tests is fairly straightforward, though the methods are less familiar to researchers than power calculations for typical significance tests. However, when conducting an inferiority test with data collected at multiple labs, heterogeneity across the labs can decrease the power of that test. I haven’t found a tutorial on how to calculate power for inferiority tests in this situation, so here’s my (second) attempt at devising a method.

Calculating power for inferiority tests in the presence of heterogeneity

Here, I am specifically focusing on a simple case of comparing two independent groups. Drawing directly from Julious (2004), the formula for power (1β) for inferiority tests is as follows:

1β=1pt(t1α,df,df,λ)

where pt() is the probability function of the t distribution, df=Ntotal2, α is the specified Type I error rate, and λ is the noncentrality parameter. The formula for λ needs to account for the heterogeneity across labs. To do this, we use the following formula:

λ=|δdbound212nm+Lm|

where n is the group size per lab, m is the number of labs, δ is the population effect size (as a standardized mean difference), and dbound is the inferiorty bound (also as a standardized mean difference), and L is the heterogeneity across labs (as explained variance). This formula for λ is adapted from the formula Julious provides, as well as the formula for the noncentrality parameter accounting for heterogeneity provided by Jake Westfall in this blog post.

As I noted previously, in many cases, it makes sense to assume δ is 0. This represents a situation in which the effect does not exist (on average), but there may be heterogeneity around 0. That is, in some labs, the effect might be above or below zero for reasons other than sampling variation. I’ll expand on this issue later and talk about some situations in which I think this is plausible.

As heterogeneity (L) increases, power decreases, so selecting a reasonable value for this parameter is important. In the last version of this post, I agrued that 1% is a reasonable starting point, based on the results of Many Labs 2. Jake Westfall has provided a way of converting τ into L (the proportion of lab variance). The formula he provides is L=τ21τ2. This formula can be useful if you have a good heterogeneity estimate from a meta-analysis. Many Labs 2 found that there was fairly little heterogeneity across labs for most effects they examined. Among effects for which there was some heterogeneity, the τ estimates were around .10. Using the equation above, this converts to L = .01 (1%). However, often we should be more conservative and assume there could be greater heterogeneity, but I think 1% is a good place to start.

Here’s an R function to calculate power using this method:

pwr.inferiority <- function(n, m, d_population = 0, d_bound, L, alpha = .05) {
  
  ncp <- abs((d_population - d_bound)/(2*(sqrt((1/(2*m*n)) + (L/m)))))
  
  DF <- 2*(m*n) - 2
  
  pwr <- 1 - pt(qt(1 - alpha, DF), DF, ncp)
  
  return(pwr)
  
} 

The arguments of this function match the math we were working through above: n is the group size per lab (not the total sample size!), m is the number of labs, d_population (0 by default) is the population effect size, d_bound is the inferiority bound, L is the proportion of lab heterogeneity, and alpha is the desired Type I error rate (.05 by default). The output is an estimate of power, given the specified parameters. To determine a sample size for acheiving a particular level of power, I suggest taking an iterative approach (i.e., try different sample sizes with reasonable values for the other parameters).

As an example can illustrate how the function works, let’s imagine a multi-lab study in which 5 labs each collect N = 200 (n = 100 per group) for a two-group between subjects experiment. First, what happens when there is no heterogeneity?

pwr.inferiority(100, 5, d_population = 0, d_bound = .20, L = 0)
## [1] 0.9351492

It looks like we are very well powered!

Now let’s imagine we have the same design, but there is 1% heterogeneity across labs.

pwr.inferiority(100, 5, d_population = 0, d_bound = .20, L = .01)
## [1] 0.5712866

Suddenly, power drops dramatically. Why? The lab heterogeneity has a huge effect on the precision of the effect size estimates.

One of the things Jake Westfall pointed out in the blog post I linked earlier was that, when there is heterogeneity, increasing the number of labs generally increases power to detect effects more efficiently than increasing the number of participants per lab. This principle applies for rejecting effects as well as detecting them. If we instead have 50 labs each collecting n = 10 per group (still total N = 1000), the power is much greater:

pwr.inferiority(10, 50, d_population = 0, d_bound = .20, L = .01)
## [1] 0.8925013

The larger number of labs is able to better cope with the heterogeneity, and the effect can be more precisely estimated.

Simulations suggest that the approach presented here is fairly accurate (see the bottom of the post for details), but it is limited in its applicability to two-group designs and under some circumstances may be a bit too liberal in its estimates. If you need precise estimates or are planning a more complex design, I suggest taking a simulation-based approach (though that, of course, is considerably more complex and demanding of resources).

An illustration based on the CLIMR project

In the previous version of this post, I illustrated this approach to power calculation using the CLIMR project, a multi-lab replication of key findings in the Construal Level Theory literature. I’m one of the coordinators of this project, along with Erik Mac Giolla, Sofia Calderon, and Karl Ask. Let’s have another look at that project, using the corrected method.

In this project, we’re planning to replicate four experiments, with a total of 64 labs (as of March 24, 2020). Each lab will collect approximately N = 25 participants per experiment (n = 12.5 per group on average). For this demonstration, we'll assume that the true effect of each experiment we replicate is zero but that procedural variations across labs cause the effects to vary in size. I’m not saying this is what will happen or that it’s even likely. But it’s useful to know how informative nonsignficant results can be. Let’s use the function above to create a power curve for inferiority tests. The inferiority bounds are on the x-axis, and power is on the y-axis.

Under these assumptions, we have a considerable amount of power to exclude quite small effect sizes. For instance, we’ll have 85.0% power to reject effects of d = .15 or larger and 97.3% power to reject effects of d = .20 or larger. This puts us in quite good shape for offering a severe test of these effects.

When and why should we care about heterogeneity in inferiority tests?

It’s reasonable to ask whether heterogeneity is even an issue likely to affect inferiority tests in multi-lab studies, but I think it’s important to consider. First, let’s consider why you would test for inferiority in the first place. Presumably, you are interested in assessing whether the observed average effect has been estimated with sufficient precision to reject some specified effect size. That effect size might represent the smallest value that would be of theoretical or practical significance. A well-planned multi-lab test of a theory could identify in advance how small an effect would have to be on average in order for the theory (or at least a crucial part of the theory) to be falsified. Any half-decent theory should at least specify the direction an effect is supposed to go, so typically the concern will be inferiority (“Is the effect smaller than .10?”) rather than equivalance (“Is the effect between -.10 and +.10?”). Thus, a significant inferiority test can be part of evidence useful for decisively demonstrating that an effect does not exist (or at least, that it does not exist in any meaningful way). Planning your sample size and required number of labs for inferiority testing, rather than only typical significance testing, is helpful for reducing the likelihood that you will end up with inconclusive results, like a nonsignificant finding that does not provide evidence against the smallest effect of interest.

So if planning for and testing inferiority is a good idea in principle, we ought to consider some of the factors that affect the power of the test. Heterogeneity is one of them. Variation in effect size might seem like a strange thing to consider when thinking about a situation in which the true effect is zero or very close to it. Why would there be variation in the effect if the effect does not exist? There are at least two situations I can think of in which heterogeneity is plausible even if the true effect is close to or exactly zero.

First, the true effect might occur only under selective circumstances. If the average situation in the multi-lab study was not one in which the phenomenon is likely to occur but sometimes it does (e.g., for some individuals or at some sites), you could end up with a very small mean effect size with nontrivial amounts of heterogeneity. Assuming that the test of the hypothesis was a fair representation of that prediction, this result would still be quite bad for a theory, but it could have important implications for future work.

Second, the effect might not be real at all, but procedural variations might cause “effects” to vary across labs. For instance, perhaps labs sampled differing proportions of non-naive participants who were able to guess the hypothesis. Methodological blemishes like this could result in artifactual heterogeneity. I think it’s quite likely that issues like these regularly arise but are frequently undetected.

References

Anvari, F., & Lakens, D. (2019). Using Anchor-Based Methods to Determine the Smallest Effect Size of Interest. https://doi.org/10.31234/osf.io/syp5a

Julious, S. A. (2004). Sample sizes for clinical trials with normal data. Statistics in medicine, 23(12), 1921-1986.

Klein, R. A., Cook, C. L., Ebersole, C. R., Vitiello, C., Nosek, B. A., Chartier, C. R., … & Cromar, R. (2019). Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement.

Lakens, D. (2017). Equivalence tests: a practical primer for t tests, correlations, and meta-analyses. Social psychological and personality science, 8(4), 355-362.

Validation

To validate this procedure for calculating power, I tested it in a series of Monte Carlo simulations. In these simulations, I simulated data with varying amounts of heterogeneity around an effect of 0, and I performed inferiority tests in three ways:

  1. with a one-sided test against an inferiority bound as though there was no heterogeneity (with the TOSTER package; α=.05),
  2. with a one-sided inferiority test using an estimate from a random effects meta-analysis (with the metafor and TOSTER packages; α=.05), and
  3. by fitting a linear mixed-effects model (fitting random slopes for the interaction between labs and condition) and checking whether the 90% CI for the effect excludes the inferiority bound (using lme4, calculating the CI using the t-distribution and the Satterthewaite DF).

Then I calculated the mean squared difference between the calculated power to the power simulated from 10,000 tests for 24 different combinations of inferiority bounds, samples sizes, labs, and heterogeneities. For the cases in which there was no heterogeneity, simulated power for the method that assumed there was no heterogeneity matched the calculations almost exactly (mean squared error < .0001). As one would expect, as heterogeneity increased, the simple method performed more poorly and became anti-conservative (mean squared error .05 when heterogeneity was 5%). Simulated power for the meta-analytic approach and the linear mixed model approach were both close to the calculated power overall (mean squared error .007 and .008 respectively). Thus, the power calculations were fairly accurate, though not exact. Generally, the meta-analytic and mixed model power in the simulations was a bit lower than the calculated power. The mixed model approach was especially conservative when there was no heterogeneity (mean squared error .02), possibly because the model’s random slopes were unwarranted and the standard error became inflated. Given these results, I would recommend treating power calculated using this method as a close but relatively liberal estimate.

There are some situations where you may need very precise estimates of power. Again, probably the most accurate and most flexible way to assess power for inferiority tests is to specify your own simulation-based power analysis, based on your specific design. However, using the formula and function I’ve provided here can give reasonable starting values for numbers of participants and labs in custom-built simulations.

Supplemental code

Code to reproduce the figure above

ds <- seq(from = 0, to = .60, by = .01)

dp <- 0

n <- 12.5
m <- 64

L <- .01

inf_power <- pwr.inferiority(n, m, d_population = dp, d_bound = ds, L = L)

inf_data <- data.frame(bound = ds, power = inf_power)

ggplot(inf_data,
       aes(
         x = bound,
         y = power
       )) +
  geom_line(
    size = 1.5,
    color = "#b91372"
  ) +
  geom_vline(
    xintercept = .475,
    linetype = "longdash"
  ) +
  geom_vline(
    xintercept = .20,
    linetype = "dotted"
  ) +
  geom_hline(
    yintercept = .80,
    linetype = "dashed"
  ) +
  geom_hline(
    yintercept = .90,
    linetype = "dashed"
  ) +
  geom_hline(
    yintercept = .95,
    linetype = "dashed"
  ) +
  geom_hline(
    yintercept = .99,
    linetype = "dashed"
  ) +
  scale_x_continuous(
    breaks = seq(0, .80, .05)
  ) +
  scale_y_continuous(
    breaks = c(.10, .33, .66, .80, .90, .95, .99),
    labels = c(".10", ".33", ".66", ".80", ".90", ".95", ".99")
  ) +
  labs(
    y = "Power",
    x = "Inferiority bound (d)",
    title = "Power for inferiority",
    subtitle = paste("Power for each experiment to reject effect sizes\n", m, 
" labs each collecting N = 25", 
" participants per experiment,\nassuming an average effect of ", 
dp, " and ", 100*L, "% heterogeneity across labs", 
sep = "")
  ) +
  guides(
    color = FALSE
  ) +
  coord_cartesian(
    ylim = c(0, 1)
  ) +
  theme_classic()

Saturday, March 28, 2020

Power for inferiority tests in multi-lab replications (SEE UPDATE)

UPDATE: THE CONCLUSIONS OF THIS POST ARE WRONG

In my efforts to validate this approach with Monte Carlo simulations, I discovered that the formula I used to calculate power in this post is incorrect. DO NOT USE THE POWER CALCULATIONS IN THIS POST.

To view a corrected and updated post on this topic, with an explanation of what went wrong, click here. I’m retaining the original post below, as a record.

The previous (incorrect) post

Researchers are increasingly interested in conducting multi-lab replications in order to, among other things, achieve high levels of statistical power that would otherwise be unattainable by a single lab. However, several recent multi-lab replications have turned up nonsignificant null results (e.g., Many Labs 4; Klein et al, 2019). When the estimated effect across labs is small and nonsignificant, it can be useful to test whether the observed effect is smaller than a specified bound (e.g., the smallest effect that would have been theoretically meaningful; Anvari & Lakens, 2019). Testing against a specified bound to test whether the studied effect is smaller than that bound is known as an inferiority test. (When testing assess whether the studied effect falls within a set of two bounds, this approach is called an equivalence test; for a more thorough discussion, see Lakens, 2017).

Just as it’s important to conduct power analyses to determine a multi-lab study’s ability to detect effects, it’s important to assess the study’s ability to reject effects. Calculating power for inferiority tests is fairly straightforward, though the methods are less familiar to researchers than power calculations for typical significance tests. Dan Lakens’s R package TOSTER provides handy functions for calculating power for equivalence and (non)-inferiority tests. However, when conducting an inferiority test with data collected at multiple labs, heterogeneity across the labs can decrease the power of that test. I haven’t found a tutorial on how to calculate power for inferiority tests in this situation, so I put this one together.

A method for estimating power for two-group inferiority tests in multi-lab studies

Here, I’ve borrowed substantially from Jake Westfall (particularly this blog post from a few years ago) and from Dan Lakens’s work on the TOSTER package to offer a method for calculating power for two-group inferiority tests (between-subjects).

Based on the calculations Dan Lakens uses in TOSTER and based on Julious’s (2004) guide to samples size estimation for equivalence tests, this procedure I’m suggesting here assumes that power (1β) for an inferiority test is given as

1β=2pt(t1α,df,df,λ)1

where pt() is the probability function of the t distribution, df=Ntotal2 and α is the specified Type I error rate. The most important part of this equation is the noncentrality parameter λ. I’ve calculated the noncentrality parameter as

λ=|δdbound212nm+Lm|

where n is the group size per lab, m is the number of labs, δ is the population effect size, and dbound is the inferiorty bound (i.e., the value you’re testing against to see if your observed effect size is smaller). For many purposes, it may make sense to assume that δ is zero, to represent a case in which you are studying a null effect. Setting dbound requires theoretical or practical reasoning (see Anvari & Lakens, 2019).

In the formula above, the value L is the proportion of total variance in the effect size explained by the lab by condition interaction. In other words, this value represents the extent to which labs vary in their effect sizes. This is not mere sampling variation; we’re talking about true variation in the effects at different replication sites. In practice, this could be due to procedural differences, cultural differences, or just about anything that could vary across labs.

As I mentioned above, I am borrowing heavily from the work of Jake Westfall, who provided a version of the noncentrality parameter equation which I adapted into this one. He also provides a way of calculating L from a τ – a common estimator of heterogeneity in meta-analyses: L=τ21τ2

Jake takes .05 (5%) to be a reasonable value for L, based on data from Many Labs 1. This seems like a mostly fine assumption to me. But Many Labs 2, which was designed specifically to examine heterogeneity across labs, found that there was relatively little heterogeneity. Among effects for which there was some heterogeneity, the τ estimates were around .10. Using Jake’s equation, this gives us a reasonable default L of .01 (1%) – but perhaps in some cases, we should be more conservative and assume there could be unusually large amounts of variation in effects across labs.

Let’s go back to the noncentrality parameter. This value is, very roughly speaking, the t value that represents a nonzero effect size in the population, and it specifies the location of the noncentral t distribution. The typical noncentrality parameter is essentially a t value, calculated by the mean difference divided by the standard error. The larger the standard error, the greater the variance. And the larger the variance, the lower the power and precision. This version of the noncentrality parameter we’re using here takes into account additional variance components. That is, it lets us specify how variation other than sampling error – namely lab heterogeneity – play a role in the effect size estimates we will observe in our multi-lab study. The greater the lab heterogeneity, the lower the power.

Here is a simple R function for calculating power using this approach:

pwr.inferiority <- function(n, m, d_population = 0, d_bound, L, alpha = .05) {
  
  ncp <- abs((d_population - d_bound)/(2*(sqrt((1/(2*m*n)) + (L/m)))))
  
  DF <- 2*(m*n) - 2
  
  pwr <- 2*(1 - pt(qt(1 - alpha, DF), DF, ncp)) - 1
  
  return(pwr)
  
} 

The arguments of this function match the math we were working through above: n is the group size per lab (not the total sample size!), m is the number of labs, d_population (0 by default) is the population effect size, d_bound is the inferiority bound, and L is the proportion of lab heterogeneity. The output is an estimate of power, given the specified parameters. To determine a sample size for acheiving a particular level of power, I suggest taking an iterative approach (i.e., try different sample sizes with reasonable values for the other parameters).

As an example can illustrate how the function works, let’s imagine a multi-lab study in which 5 labs each collect N = 200 (n = 100 per group) for a two-group between subjects experiment. When L is set to 0 (no lab heterogeneity), pwr.inferiority and TOSTER give similar results.

pwr.inferiority(100, 5, d_population = 0, d_bound = .20, L = 0)
## [1] 0.8702984
powerTOSTtwo(alpha = .05, N = 100*5, low_eqbound_d = -Inf, high_eqbound_d = .20)
## The statistical power is 87.08 % for equivalence bounds of -Inf and 0.2 .
## 
## [1] 0.8708419

My procedure is slighly more conservative with its estimates because TOSTER uses the normal distribution to calculate power, and my procedure uses the t-distribution.

Now let’s imagine we have the same design, but there is 1% heterogeneity across labs.

pwr.inferiority(100, 5, d_population = 0, d_bound = .20, L = .01)
## [1] 0.1425731

Suddenly, power drops dramatically. Why? The lab heterogeneity has a huge effect on the precision of the effect size estimates.

One of the things Jake Westfall pointed out in the blog post I linked earlier was that, when there is heterogeneity, increasing the number of labs generally increases power to detect effects more efficiently than increasing the number of participants per lab. This principle applies for rejecting effects as well as detecting them. If we instead have 50 labs each collecting n = 10 per group (still total N = 1000), the power is much greater:

pwr.inferiority(10, 50, d_population = 0, d_bound = .20, L = .01)
## [1] 0.7850025

The larger number of labs is able to better cope with the heterogeneity, and the effect can be more precisely estimated.

Let’s consider a real example: The CLIMR project, a multi-lab replication of several key findings of Construal Level Theory. I’m one of the coordinators of this project, along with Erik Mac Giolla, Sofia Calderon, and Karl Ask. As of March 24, a total of 64 labs had committed to assisting with data collection. We’ve described what this means for our ability to detect effects on our website. What about interpreting nonsignificant effects? Let’s assume that the true effect of each experiment we replicate is zero but that procedural variations across labs cause the effects to vary in size. I’m not saying this is what will happen or that it’s even likely. But it’s useful to know how informative nonsignficant results can be.

So on these assumptions, how much power to reject effects do we have?

A pretty considerable amount of power, as it turns out! With this many labs, we have about 95% power to reject effects of .20 or greater. Again, we’re making a lot of assumptions here, but this bodes well for our ability to rule out fairly small effects if the results are nonsignificant.

A caveat

I think this is a reasoanble approach for estimating power, but I would be more confident after verifying these calculations with Monte Carlo simulations. At present, I haven’t done this validation. If and when I do, I’ll update this post.

References

Anvari, F., & Lakens, D. (2019). Using Anchor-Based Methods to Determine the Smallest Effect Size of Interest. https://doi.org/10.31234/osf.io/syp5a

Julious, S. A. (2004). Sample sizes for clinical trials with normal data. Statistics in medicine, 23(12), 1921-1986.

Klein, R. A., Cook, C. L., Ebersole, C. R., Vitiello, C., Nosek, B. A., Chartier, C. R., … & Cromar, R. (2019). Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement.

Lakens, D. (2017). Equivalence tests: a practical primer for t tests, correlations, and meta-analyses. Social psychological and personality science, 8(4), 355-362.

Supplemental code

Code to reproduce the figure above:

ds <- seq(from = 0, to = .60, by = .01)

dp <- 0

n <- 12.5
m <- 64

L <- .01

inf_power <- pwr.inferiority(n, m, d_population = dp, d_bound = ds, L = L)

inf_data <- data.frame(bound = ds, power = inf_power)

ggplot(inf_data,
       aes(
         x = bound,
         y = power
       )) +
  geom_line(
    size = 1.5,
    color = "#b91372"
  ) +
  geom_vline(
    xintercept = .475,
    linetype = "longdash"
  ) +
  geom_vline(
    xintercept = .20,
    linetype = "dotted"
  ) +
  geom_hline(
    yintercept = .80,
    linetype = "dashed"
  ) +
  geom_hline(
    yintercept = .90,
    linetype = "dashed"
  ) +
  geom_hline(
    yintercept = .95,
    linetype = "dashed"
  ) +
  geom_hline(
    yintercept = .99,
    linetype = "dashed"
  ) +
  scale_x_continuous(
    breaks = seq(0, .80, .05)
  ) +
  scale_y_continuous(
    breaks = c(.10, .33, .66, .80, .90, .95, .99),
    labels = c(".10", ".33", ".66", ".80", ".90", ".95", ".99")
  ) +
  labs(
    y = "Power",
    x = "Inferiority bound (d)",
    title = "Power for inferiority",
    subtitle = paste("Power for each experiment to reject effect sizes\n", m, 
" labs each collecting N = 25", 
" participants per experiment,\nassuming an average effect of ", 
dp, " and ", 100*L, "% heterogeneity across labs", 
sep = "")
  ) +
  guides(
    color = FALSE
  ) +
  coord_cartesian(
    ylim = c(0, 1)
  ) +
  theme_classic()

Update(s)

2020-03-29: In the original version of this post, I used τ as the symbol for both the noncentrality parameter (following Julious, 2004) and the meta-analytic heterogeneity estimator. Now, I use λ as the noncentrality parameter. I don’t think I’ve ever seen the noncentrality parameter for a t distribution symbolized as λ, so don’t assume that other sources are going to match the notation here.