Composite endpoints and sample size estimation

08 February 2018

Papers: Composite endpoints in acute heart failure research
              Power and sample size estimation for rank-based composite endpoints 
Programs: PaulBrownPhD/cepower

The need for composites

The Medical Research Council guideline on Developing and Evaluating Complex Interventions states:
"A single primary outcome, and a small number of secondary outcomes, is the most straightforward from the point of view of statistical analysis. However, this may not represent the best use of the data, and may not provide an adequate assessment of the success or otherwise of an intervention which may have effects across a range of domains" (ref).
Thus, for certain disease states there is a shift away from designating a single endpoint as the primary outcome of a clinical trial. When the disease condition can be represented by multiple endpoints, allowing conclusions to be dictated by a significance test on just one of them is inadequate. The dilemma is more acute when the statistical power afforded by the endpoints is inversely related to their importance. For example, in heart failure trials, clinical outcomes with low incidence (such as mortality) yield impractical sample sizes, yet a sensitive biomarker that provides sufficient power remains a surrogate outcome. Therefore, the trend has been to combine endpoints into a univariate outcome that measures total benefit. Potentially, this 'composite endpoint' offers reasonable statistical power while tracking the treatment response across a constellation of symptoms and avoiding the usual problem that arises from multiple testing, i.e., an inflated type I error rate (alpha).

The selection of endpoints to form the composite is not restricted and is somewhat arbitrary. The component outcomes should all reflect the underlying clinical condition, but they should not be too highly correlated with each other, otherwise the information gained by combining them is minimal. Also, the anticipated effect of treatment should be in the same direction across all component outcomes, though not necessarily of the same magnitude.

In the CJC paper (available at the above web link) we considered four composites: the average Z-score (ZS), win-ratio (WR), global rank (GR) and clinical composite (CC). Construction of the WR and GR is illustrated in the following two figures.
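To make the idea of a hierarchical pairwise comparison concrete, here is a minimal sketch of an unmatched win-ratio calculation in Python. It is not the construction used in the paper: the two-level hierarchy (death, then a biomarker change where lower is better), the tuple layout and the function names are all assumptions for illustration, and follow-up time and censoring are ignored.

```python
from itertools import product

def compare(a, b):
    """Return 1 if patient a wins over b, -1 if a loses, 0 if tied.

    Each patient is a hypothetical (died, biomarker_change) tuple.
    """
    died_a, bio_a = a
    died_b, bio_b = b
    # Level 1: survival takes priority over everything else.
    if died_a != died_b:
        return 1 if not died_a else -1
    # Level 2: consulted only when level 1 is tied; lower is better.
    if bio_a != bio_b:
        return 1 if bio_a < bio_b else -1
    return 0

def win_ratio(treated, control):
    """Compare every treated patient with every control; return wins / losses."""
    results = [compare(t, c) for t, c in product(treated, control)]
    wins, losses = results.count(1), results.count(-1)
    return float("inf") if losses == 0 else wins / losses

# Made-up data, purely to exercise the functions.
treated = [(False, -30.0), (False, -10.0), (True, 5.0)]
control = [(False, -20.0), (True, 0.0), (True, 10.0)]
print(win_ratio(treated, control))  # 6 wins, 3 losses -> 2.0
```

Because the second level is only consulted when the first is tied, outcomes higher in the hierarchy dominate the result by construction.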

The ZS and WR were proposed by statisticians and the GR and CC by clinicians, roughly speaking. Composites vary in their attempt to maximise clinical meaning while retaining statistical power, depending on who is advocating them, and it is unclear whether any achieve the right balance. They are usually pulled too far in one direction or the other. For example, statisticians have argued against the dichotomisation of endpoints for the sake of clinical meaning (ref). A reasonable point. But has this led us to become too fixated on power (ref) at the expense of clinical meaning (ref)?

Data simulations

The JMASM paper (web link above) considers ZS and GR only (the WR and CC are easily coded and hence not included). The macros described in the paper and made available at the web link use data simulations to estimate power given certain assumptions about effect sizes and correlations among the outcomes. The following figure illustrates how random samples satisfying the pre-specified conditions are generated using iteration:
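The general logic of simulation-based power estimation for the ZS can be sketched in Python as below. This is not a port of the SAS macros: instead of the iterative scheme described in the paper, it draws correlated normal outcomes from a simple shared-factor model (every pair of outcomes has the same correlation `rho`), and it tests the difference in mean ZS with a large-sample z-test. All function names, parameters and the example values are assumptions.

```python
import random
from statistics import NormalDist, mean, stdev

def simulate_arm(n, effects, rho, rng):
    """Draw n patients; outcome j ~ N(effects[j], 1), pairwise correlation rho."""
    rows = []
    for _ in range(n):
        shared = rng.gauss(0, 1)  # common factor that induces the correlation
        rows.append([
            d + rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1)
            for d in effects
        ])
    return rows

def average_z(rows_t, rows_c):
    """Standardise each outcome on the pooled sample, then average per patient."""
    pooled = rows_t + rows_c
    k = len(pooled[0])
    mus = [mean(r[j] for r in pooled) for j in range(k)]
    sds = [stdev(r[j] for r in pooled) for j in range(k)]

    def score(r):
        return mean((r[j] - mus[j]) / sds[j] for j in range(k))

    return [score(r) for r in rows_t], [score(r) for r in rows_c]

def estimate_power(n_per_arm, effects, rho, n_sims=500, alpha=0.05, seed=1):
    """Proportion of simulated trials in which the two-sided z-test rejects."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        zs_t, zs_c = average_z(
            simulate_arm(n_per_arm, effects, rho, rng),
            simulate_arm(n_per_arm, [0.0] * len(effects), rho, rng),
        )
        se = (stdev(zs_t) ** 2 / len(zs_t) + stdev(zs_c) ** 2 / len(zs_c)) ** 0.5
        if abs((mean(zs_t) - mean(zs_c)) / se) > crit:
            rejections += 1
    return rejections / n_sims

# Hypothetical design: three outcomes, standardised effects of 0.2 to 0.4.
print(estimate_power(100, [0.2, 0.3, 0.4], rho=0.3))
```

Calling `estimate_power` over a grid of `n_per_arm` values and picking the smallest one that reaches the target power gives a sample size estimate under the stated assumptions.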

The macros can be implemented within SAS as follows:

It is necessary to route the (superfluous) output to an .lst file using proc printto; otherwise the output window fills up, SAS stalls, and the window has to be cleared intermittently. If memory serves, the programs took about five days to run. The code was validated in several ways, for example by replicating the sample sizes for the BLAST study, which used ZS for the primary outcome, and for FIGHT, which used GR.

Using these data simulations we can evaluate the performance of the composites as the assumed effect size varies. The following figure shows how power is affected when the assumed effect size on a given outcome shifts from pessimistic to optimistic. In this way we can see that the power of the overall composite is more sensitive to some outcomes than others, and that this depends on how the composite is constructed. For example, the WR does a better job than the GR of favouring outcomes higher up in the hierarchy, because the cut-offs employed for the GR restrict the influence of those outcomes; the WR, in turn, restricts the influence of BNP. Thus the composites are implicitly weighting the individual outcomes.

Weighting and power are inextricably linked, and although a hoped-for increase in power is a common justification for employing a composite, it is hardly persuasive. Sun et al. (ref in our paper) showed that a single outcome can yield more power than a CC. This is partly because composites discard data with seeming indifference during their construction, e.g., time-to-first ignores recurrent events, event types (and hence the correlations among them) and event severity. Adding an outcome does not necessarily compensate for this loss if the additional outcome is not sensitive to treatment (ref). Often clinical windows are imposed on the data, and consequently events that fall beyond the cut-off time are dismissed as irrelevant.
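The data loss from a time-to-first composite with a clinical window is easy to see in a toy Python sketch. The event history, the 180-day window and the function name are all made up for illustration.

```python
def time_to_first(events, window):
    """Collapse an event history to a (time, had_event) record.

    events: list of (day, event_type) tuples; window: cut-off in days.
    Patients with no event inside the window are censored at the cut-off.
    """
    in_window = sorted(e for e in events if e[0] <= window)
    if not in_window:
        return window, False
    # Only the earliest event survives; its type is dropped as well.
    return in_window[0][0], True

# Hypothetical patient: two hospitalisations, then death after the window.
history = [(40, "hospitalisation"), (95, "hospitalisation"), (200, "death")]
print(time_to_first(history, window=180))  # (40, True)
```

Here the second hospitalisation and the death at day 200 contribute nothing to the analysis, and the patient is indistinguishable from one who had a single mild event on day 40.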

Related post:

Probability index for composite endpoints