Background: Clinical prediction models aim to facilitate treatment decisions by providing estimates of absolute risk based on individual patient characteristics (such as age, gender, and treatments received). Prediction models are usually evaluated for their ability to predict the target outcome. However, widely used measures of performance regarding discrimination and calibration do not quantify a model's ability to predict treatment benefit, i.e. the absolute difference in risk arising due to treatment. For this reason, the concordance-statistic for benefit ('c-for-benefit') has recently been proposed. This employs a procedure where individuals are matched according to their predicted benefit. The statistical properties of c-for-benefit are currently unclear, thus hampering its implementation and interpretation.
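The matching idea behind the c-for-benefit can be sketched as follows. This is a simplified illustration only: treated and untreated patients are paired by rank of predicted benefit (a common simplification of the matching step), the observed benefit of a pair is the untreated minus the treated outcome, and concordance is computed over pairs with unequal observed benefit. It is not the exact algorithm of the original proposal.

```python
import numpy as np

def c_for_benefit(pred_benefit, y, treated):
    """Sketch of the concordance-for-benefit computation.

    Treated and untreated patients are sorted by predicted benefit and
    matched one-to-one in rank order (a simplified matching step).
    """
    order_t = np.argsort(pred_benefit[treated == 1])
    order_u = np.argsort(pred_benefit[treated == 0])
    pb_t = pred_benefit[treated == 1][order_t]
    pb_u = pred_benefit[treated == 0][order_u]
    y_t = y[treated == 1][order_t]
    y_u = y[treated == 0][order_u]
    n = min(len(pb_t), len(pb_u))
    # Observed pairwise benefit: untreated outcome minus treated outcome (-1, 0, or 1).
    obs = y_u[:n] - y_t[:n]
    # Predicted pairwise benefit: average of the pair's predicted benefits.
    pred = (pb_t[:n] + pb_u[:n]) / 2
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            if obs[i] == obs[j]:
                continue  # only pairs with unequal observed benefit contribute
            if (obs[i] - obs[j]) * (pred[i] - pred[j]) > 0:
                conc += 1
            else:
                disc += 1
    return conc / (conc + disc) if (conc + disc) else float("nan")
```

Like the ordinary c-statistic, the result lies between 0 and 1, with 0.5 indicating no ability to discriminate between pairs with larger and smaller observed benefit.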
Objectives: We aim to assess the statistical properties of c-for-benefit and to propose alternative measures for quantifying a model's ability to predict treatment benefit.
Methods: We explore the potential advantages and limitations of c-for-benefit using theoretical arguments. We then perform a series of simulations, aiming to demonstrate these properties in practice. To this end, we generate datasets with a binary outcome where the treatment effect is either constant or modified by patient-level covariates. These datasets are then used to develop prediction models that account for the received treatment. Subsequently, we assess the performance of all developed models using the c-for-benefit, and illustrate its properties in real clinical datasets.
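A data-generating mechanism of the kind described above can be sketched as follows. The coefficients and sample size here are illustrative only, not those used in the study: a binary outcome is drawn from a logistic model in which a treatment-by-covariate interaction modifies the treatment effect, and the true benefit is the difference between untreated and treated risk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)            # patient-level covariate
t = rng.integers(0, 2, size=n)    # randomized treatment indicator

# Linear predictor with a treatment-by-covariate interaction
# (set the interaction coefficient to 0 for a constant treatment effect).
lp = -1.0 + 0.8 * x - 0.5 * t - 0.7 * t * x
p = 1 / (1 + np.exp(-lp))
y = rng.binomial(1, p)            # binary outcome

# True treatment benefit: risk if untreated minus risk if treated.
lp_untreated = -1.0 + 0.8 * x
lp_treated = lp_untreated - 0.5 - 0.7 * x
benefit = 1 / (1 + np.exp(-lp_untreated)) - 1 / (1 + np.exp(-lp_treated))
```

A prediction model fitted to `(y, x, t)` that includes the interaction term would be correctly specified for this mechanism, and its predicted benefits could then be evaluated with the c-for-benefit.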
Results: In accordance with our theoretical arguments, in many simulated scenarios the estimated c-for-benefit provided a distorted picture of the model's predictive performance. Simulations showed that estimates of c-for-benefit are often close to 0.5 and tend to be inaccurate, even for correctly specified models. Large values of c-for-benefit were only obtained when the interaction effect was very strong and depended on continuous covariates.
Conclusions: The c-for-benefit may be problematic for validating predictions of treatment benefit, unless there are strong interactions between treatment and continuous covariates. For this reason, we recommend alternative measures to quantify model performance.