Using the probability index to evaluate composite endpoints

05 April 2018

Papers: How do we measure the effect size?
             Influence of component outcomes on the composite
Programs: PaulBrownPhD/probindex

Types of composite

There are a number of motivating factors for employing a composite. For example, a composite may be contrived to handle missing data[ref] or competing risks[ref]; to yield phase II results that better predict phase III[ref]; to offer a more succinct, clinically meaningful measure[ref]; to capture risk-benefit; or to increase statistical power[ref]. Various algorithms for constructing a composite from given component outcomes have been described, and the construction often reveals the motivation.

Some composites combine endpoints of the same type, i.e. time-to-event or binary, while others combine miscellaneous types. The former often simply collapse the relevant endpoints, e.g. time-to-first[ref] or any-versus-none[ref]. The latter, on the other hand, may attempt to rank patients from the most adverse response to the most favourable while respecting a select group of prioritised outcomes, e.g. a clinical score[ref], the global rank[ref] and the unmatched win-ratio[ref] (the unmatched win-ratio was described for time-to-event outcomes but is easily adapted to multiple noncommensurate outcomes[ref]).

Other composites unify outcomes to create a measure that is itself intrinsically meaningful, e.g. days-alive-and-out-of-hospital[ref] and a trichotomous clinical composite[ref]. Finally, there are composites that standardise responses from disparate outcomes before taking a sum or average across components[ref1, ref2]. More than simply being proposed, composites of disparate outcomes are becoming increasingly common as primary outcomes in important randomised controlled trials[ref1, ref2, ref3]. These composites have been compared[ref1, ref2].

The problems with composites

Because composites are often employed as the primary endpoint in clinical trials, they bear on current debates in cardiovascular medicine, such as the benefit of statin therapies. A survey found that approximately 50% of cardiovascular clinical trials adopted a composite[ref]. Although the research environment has changed and new composites have emerged, the conclusions of a literature review conducted over 20 years ago still ring true: "There are serious deficiencies in the methodology currently used in the construction of [composites]. First, many authors develop ad hoc arbitrarily constructed [composites] for immediate use (often as a primary outcomes measure) in descriptive or comparative studies. Construction of such ad hoc [composites] without evaluation of their measurement properties is scientifically debatable"[ref].

Papers proposing new composites rarely include data simulations to evaluate their performance; these normally appear in the literature much later [e.g. ref]. Criticism highlighting the limitations of composites has nonetheless accompanied their use [e.g. ref1, ref2]. Indeed, the European Medicines Agency guideline on research in acute heart failure specifically recommends against composites comprising disparate outcomes[ref].

Fundamentally, composites are complex constructions that often yield limited ordinal responses. Regarding the weighting of component outcomes, researchers often declare that no weighting has been employed. What is normally meant by 'weighting' is the set of numerical coefficients specified by an investigator to yield a weighted estimate of the treatment effect. However, weighting can also be implied by the construction of the composite, and it can be data-dependent: any time outcomes are prioritised, a weighting mechanism is at play. For example, a global rank may ignore almost completely those outcomes given low priority or, conversely, be dominated by them[ref]. Even a time-to-first composite favours the outcomes with higher incidence rates; a moderate difference in mortality may be drowned out by a less important event-type on which no difference is observed. Any such masking of effects implies a weighting of outcomes, or favouritism. The outcomes are thus represented disproportionately in the composite, in a way that is difficult to anticipate, inadvertent and often unknown.
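The dilution at work in a time-to-first composite can be shown with a binary analogue and a few lines of arithmetic (the probabilities below are my own illustrative numbers, not those of the paper): a 40% relative reduction in a rare outcome all but vanishes in an any-event composite once a common, unaffected outcome is included.

```python
# Assumed event probabilities (illustrative only): treatment gives a 40%
# relative reduction in death but has no effect on hospitalisation.
p_death = {"control": 0.05, "active": 0.03}
p_hosp  = {"control": 0.40, "active": 0.40}

# Any-event ('any-versus-none') composite, assuming independent events.
p_any = {arm: 1 - (1 - p_death[arm]) * (1 - p_hosp[arm])
         for arm in ("control", "active")}

rr_death = p_death["active"] / p_death["control"]  # relative risk, death alone
rr_any   = p_any["active"] / p_any["control"]      # relative risk, composite
```

Here the relative risk for death alone is 0.60, yet the composite relative risk is about 0.97: the frequent, unaffected outcome dominates the composite.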


To illustrate this point we used data simulations (see link to paper above). The probability index serves as the effect measure, with the assumed effect size of the individual component plotted on the horizontal axis and the resulting effect size for the composite on the vertical axis (see figure in paper). We may then define the slope of this line as the Influence of the component outcome. An investigator could explore such a plot when designing a trial to get a sense of how the composite weights the components, e.g. whether some components can overwhelm the composite while others are suppressed. An estimate of the slope quantifying Influence could be reported, although it depends on the assumed effect sizes and not just on the definition of the composite.
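Under simplifying assumptions, Influence even has a closed form. Suppose k independent standard-normal components are averaged into a z-score composite and only one component carries a standardised mean difference delta (a toy set-up of my own, not the simulation design of the paper). Then the component's probability index is Phi(delta/sqrt(2)), the composite's is Phi(delta/sqrt(2k)), and the slope near the null works out to 1/sqrt(k):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def component_pi(delta):
    """Probability index for one normal outcome with standardised
    mean difference delta between arms (variance 1 in each arm)."""
    return norm_cdf(delta / sqrt(2.0))

def composite_pi(delta, k):
    """Probability index of an average z-score over k independent
    normal outcomes when only one component carries the effect delta."""
    return norm_cdf(delta / sqrt(2.0 * k))

def influence(k, delta=0.01):
    """Influence as a local slope: change in composite PI per unit
    change in component PI, evaluated near the null."""
    return (composite_pi(delta, k) - 0.5) / (component_pi(delta) - 0.5)
```

With k = 4 equally weighted components the slope is about 1/sqrt(4) = 0.5: the composite moves only half as fast as the affected component, and the attenuation worsens as components are added.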

For example, in a global rank the outcomes will be favoured according to the hierarchy that prioritises them, but the extent to which outcomes with lower priority are ignored depends on the data. We used data simulations to generate the figure below for two composites comprising the outcomes mortality, dyspnea, troponin, creatinine and NT-proBNP (as per Felker & Maisel [ref]). The global rank composite is insensitive to the biomarker NT-proBNP (positioned last in the hierarchy of outcomes) and more influenced by dyspnea, a subjective outcome with large variance that was prioritised immediately after mortality (a low death rate was assumed). The average z-score, on the other hand, shows a more congruent relationship with the individual outcomes because it is a straight average of z-scores. This weighting, or Influence, is not explicit and hardly intentional.

Since an unmatched win-ratio or a global rank of multiple endpoints attempts to arrange patients according to their overall response, we might say that the relative contribution of each outcome to the composite is beside the point. However, if after the statistical analysis we declare the new treatment superior, we would like to know what is driving the result. If the mortality and hospital readmission rates are low, then the result may well be dominated by a biomarker; i.e. the composite would be very highly correlated with an endpoint which is evidently considered tenuous, otherwise it would itself have been the primary outcome. If we do not know what the composite is in effect made up of, we cannot make sense of the result. We should of course look at the results from separate (under-powered) analyses of the components, but this can give rise to contentious discussion when it becomes clear that effects have been masked, subdued or counteracted in the composite.

Our results show that the Influence of the individual outcomes that comprise a composite is not well anticipated. This is somewhat analogous to data-dependent methods of covariate adjustment, e.g. stepwise selection. In the analysis plan or protocol we can describe the algorithm for selecting the covariates to retain in the model, but we cannot say what the analysis will ultimately adjust for. This is perhaps one reason why such methods are out of favour. As Senn says: "the wisest course open to the frequentist is to make a list of covariates suspected to be important and to fit these regardless" [ref]. Likewise with composite endpoints: if an analysis plan states that the primary endpoint is a global rank of several outcomes, or a time-to-first of a number of adverse events, we are informed of the algorithm only. The degree to which the amalgamated outcome represents the component outcomes is speculation. In other words, we cannot articulate exactly what our primary outcome is. Influence calls into question the belief that a composite measures the "overall" effect of treatment.

It is an obvious sleight of hand: there is a gain in power while appearing to use clinical outcomes. And there is no obligation to specify post hoc what the contribution of the individual outcomes turned out to be, i.e. what the primary endpoint turned out to be. If the audience were informed of this, would it not affect their interpretation of the findings? Would it not affect the reproducibility of the results? It would not be too surprising if such analyses led to ambiguous and contentious results. It is conceivable that our global rank reduces to a biomarker, i.e. is highly correlated with it, or that a time-to-first endpoint neglects mortality, and that we learn this only after a massive investment of resources in the trial. There ought to be some awareness of this risk, and also of the risk of opposing effects. Influence, as we have defined it, could be gauged using data simulations at the design stage to highlight these issues.

My concern is that convention becomes a safeguard for suboptimal methods. This is seen in drug development and the 'slow march to market', where convention often entails the efficient (repeated) use of inefficient methods. Convention and simplicity make statisticians and programmers efficient and their work less prone to error. Interestingly, though, composites have been promoted largely by clinicians, because clinical understanding is needed to inform the construction of the composite. Statistics journals have paid little heed[ref], belatedly publishing data simulations that identify the faults with certain composites. It seems statisticians may be failing to influence the discussion.


The probability index (PI) is easily obtained in SAS using proc npar1way. It is derived from the Mann-Whitney U statistic (for survival outcomes we would use Gehan's generalised Wilcoxon). U may be thought of as the number of 'wins' that would result if every patient in the Active group were compared with every patient in the Control group. The probability index is this number divided by the total number of such comparisons (i.e. the number of patients in one group multiplied by the number in the other). Obtaining the confidence intervals is a little more difficult; the full SAS code is in GitLab (link above).
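As a language-agnostic cross-check of the point estimate (an illustration of the definition above, not the repository's SAS program), the probability index can be computed directly from the pairwise comparisons, with ties counted as half-wins:

```python
def probability_index(active, control):
    """Probability index: the chance that a randomly chosen Active patient
    beats a randomly chosen Control patient, counting ties as half-wins.
    Equals the Mann-Whitney U statistic divided by n1 * n2."""
    wins = sum((a > c) + 0.5 * (a == c) for a in active for c in control)
    return wins / (len(active) * len(control))

pi = probability_index([3, 4, 5], [1, 2, 3])  # 8.5 wins out of 9 pairs
```

A value of 0.5 indicates no treatment effect; values near 1 indicate that Active responses almost always exceed Control responses.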

Related post:

Composite endpoints and sample size estimation