Introduction

The prognostic prediction model (commonly known as the “prognostic model”) is an integral part of current clinical practice [1] and contributes valuable evidence for lifestyle modification and optimal therapeutic interventions [2, 3]. As global cancer prevalence grows, oncology remains a key area demanding useful prognostic models to estimate patient prognosis and guide therapy [4], as recommended by clinical guidelines such as those of the National Comprehensive Cancer Network, the American Society of Clinical Oncology, and the European Society of Medical Oncology [5,6,7]. However, despite the rapid increase in the number of prognostic models for cancer patients in recent years [8], most are based on static data and can therefore only predict prognosis at diagnosis, failing to reflect changes in patient survival probability during follow-up [9]. Furthermore, the effects of time-dependent covariates, whose values change over follow-up, are often overlooked in static prognostic models. Static models thus become outdated and inaccurate and are prone to calibration drift, one of the major pitfalls of using prognostic models in practice [10]. Worse still, the reporting quality and methodological quality of oncology predictive models have mostly been poor [4, 11, 12], yielding insufficiently accurate predictions [13]. These drawbacks collectively impede the applicability of prognostic models in clinical practice.

The dynamic prediction model (DPM) is a novel approach to addressing the inaccuracy and calibration drift of static models [14]. The DPM accounts for the patient’s status at each prediction time point and for the time-dependent effects of prognostic factors, which may evolve over time [15,16,17]. Prognosis can therefore be predicted throughout a cancer patient’s follow-up using dynamic models that consider the time elapsed since diagnosis or treatment, the event history, and the time-dependent effects of prognostic factors [15,16,17]. DPM studies on cancer prognosis have been increasingly published in recent years [18,19,20]. However, whether the reporting and methodological quality of current DPMs for cancer prognosis has improved remains unclear, as do the methodological characteristics of these novel models; this uncertainty affects the future development of DPMs and their applicability in clinical practice.
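For orientation, dynamic prediction is commonly formalized as a conditional survival probability that is updated as follow-up accrues; the following is a standard textbook formulation rather than one taken from the included studies:

```latex
% Dynamic prediction at follow-up time s for an additional horizon t:
% the probability of surviving to s + t, given survival to s and the
% covariate history \bar{X}(s) observed up to s.
\pi(t \mid s) = P\bigl(T > s + t \,\big|\, T > s,\ \bar{X}(s)\bigr)
             = \frac{S\bigl(s + t \mid \bar{X}(s)\bigr)}{S\bigl(s \mid \bar{X}(s)\bigr)}
```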

Therefore, this study aimed to systematically evaluate the reporting quality and methodological quality of DPM studies on cancer prognosis and to summarize their methodological characteristics. It is expected to serve as a useful reference on the current state of the field and to help improve the future development and application of DPMs.

Materials and methods

The protocol for this study is publicly available on the Open Science Framework (OSF) at DOI: https://doi.org/10.17605/OSF.IO/Y7DGZ. All stages of the literature review, including study selection, appraisal, and data abstraction, were conducted in accordance with the protocol. However, we amended the protocol by introducing a test to compare overall TRIPOD adherence scores across different study types and publication years, aiming to explore potential variations in TRIPOD adherence scores across these categories.

Literature search

We systematically searched the Ovid MEDLINE, Ovid EMBASE, and Cochrane Library databases from inception to November 11, 2021. No language restrictions were imposed. The keywords were based on search filters published on the Search Filters Resource website of the InterTASC Information Specialists’ Sub-Group [21], Cochrane reviews of prediction models [11, 22], and DPM reviews [14]. The detailed search strategies are shown in Table S1. The reference lists of all included studies and relevant reviews were also checked manually to identify additional studies.

Study selection

Studies that aimed to develop, validate, or compare a DPM for cancer prognosis were included. Studies were excluded if they were (1) conference abstracts, reviews (including meta-analyses), comments, letters, protocols, or non-human studies; (2) articles published in languages other than English or Chinese; or (3) methodological studies emphasizing research methodology rather than clinical practice. When the same study provided more than one DPM for multiple outcomes, we selected the DPM corresponding to the primary outcome with the largest number of events.

Two authors independently screened the titles and abstracts to exclude ineligible studies, followed by full-text assessment. The reasons for exclusion in full-text screening were documented. Any disagreement was resolved through discussion.

Data collection

Data collection covered four major sections: general characteristics, methodological characteristics, reporting quality, and methodological quality of the DPM studies. Four authors were trained to collect the data using a predefined electronic data extraction form containing the items of the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement [23, 24] and the Prediction model Risk of Bias Assessment Tool (PROBAST) [25, 26]. The form was also based on the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies (CHARMS) [27]. The trained authors were divided into two groups for independent information extraction, and any discrepancies were resolved by discussion. For studies that aimed to validate a previously published prediction model, we also retrieved information from the original papers, including comprehensive details of model development such as the full model specification (the complete prediction model, including all regression coefficients and the model intercept or baseline survival at a given time point). Detailed information on the extraction is available in Text S1.

Study quality assessment

The TRIPOD statement was used to assess the reporting quality of each included study. TRIPOD contains 22 main criteria comprising 37 items in six sections. Reporting quality was assessed at both the study level and the item level. At the study level, reporting quality was quantified by the overall TRIPOD adherence score, defined as the number of TRIPOD items with “yes” responses divided by the total number of TRIPOD items applicable to the particular study [28]. At the item level, reporting quality was measured by the overall adherence per TRIPOD item, i.e., the number of studies that adhered to a specific TRIPOD item divided by the number of studies in which that item was applicable [28].
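As a concrete illustration of the study-level score, a minimal sketch in R follows; the function and the example responses are ours, for illustration only, and are not part of TRIPOD:

```r
# Study-level TRIPOD adherence score: the number of applicable items
# answered "yes" divided by the number of applicable items.
tripod_adherence <- function(responses) {
  # responses: one entry per TRIPOD item, "yes", "no", or NA (not applicable)
  applicable <- responses[!is.na(responses)]
  sum(applicable == "yes") / length(applicable)
}

# Illustrative study: 37 items, 2 not applicable, 26 of the rest reported
example <- c(rep("yes", 26), rep("no", 9), rep(NA, 2))
tripod_adherence(example)  # 26/35 = 0.743, i.e., 74.3% adherence
```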

The PROBAST tool was designed to assess the risk of bias and to evaluate the methodological quality of studies reporting the development, validation, or updating of prediction models [29], and it has been widely used for this purpose [29, 30]. PROBAST contains 20 signaling questions in four domains: participants, predictors, outcome, and statistical analysis. Each signaling question is answered as yes (Y), probably yes (PY), no (N), probably no (PN), or no information (NI). Each domain and the overall assessment were rated as having a low, high, or unclear risk of bias.

Statistical analysis

Statistical analyses and figure preparation were performed using R, version 4.1.2. Continuous variables are reported either as the mean ± SD or as the median with interquartile range (IQR), depending on normality tests (Kolmogorov-Smirnov and Shapiro-Wilk). Categorical variables are expressed as frequencies and percentages. The Kruskal-Wallis rank test and the two-sample Kolmogorov-Smirnov test were used to compare the overall TRIPOD adherence scores among study types, and the Mann-Whitney U test was used for pairwise comparisons between study types, with the false discovery rate used to correct for multiple comparisons. Spearman’s rank correlation was used to assess the association between reporting quality and methodological quality. A two-tailed α level of 0.05 was used for all statistical tests.
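A minimal sketch of these comparisons in base R, assuming a hypothetical data frame `dat` with one row per study and columns `score` (overall TRIPOD adherence), `type` (study type), and `probast` (methodological quality coded numerically); the column names are ours, not the study’s:

```r
# dat: hypothetical data frame, one row per included study
#   score   - overall TRIPOD adherence score (%)
#   type    - study type, e.g., factor with levels "D-IV", "D", "D+EV"
#   probast - methodological quality, coded numerically for correlation

# Overall comparison of adherence scores among study types
kruskal.test(score ~ type, data = dat)

# Pairwise comparisons with false-discovery-rate correction
pairwise.wilcox.test(dat$score, dat$type, p.adjust.method = "fdr")

# Association between reporting quality and methodological quality
cor.test(dat$score, dat$probast, method = "spearman")
```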

Results

Study selection

Of the 19,170 publications identified by the search, 15,740 underwent title and abstract screening, of which 15,398 were excluded as irrelevant. We then assessed 342 full texts, and 34 studies were finally included (Fig. 1). The reasons for full-text exclusion are provided in Text S2, and the list of included studies in Text S3.

Fig. 1 Flow chart of study selection

Characteristics of included studies

The annual number of publications on DPMs for cancer prognosis is shown in Figure S1. The first study was published in 2005; the number of DPM studies rose sharply in 2018 and has increased steadily since. Among the included studies, eight developed a model without internal validation, 17 developed a model with internal validation, and the remaining nine developed a model with external validation, of which one used two external validation sets [31] (Table 1 and Table S2).

Table 1 Characteristics of the included studies by study type

Patients were mainly recruited from Europe, North America, and Asia. Most of the included studies (n = 30, 88.2%) were retrospective cohorts, and the data came mainly from hospitals. The studies covered 19 cancer types, most often soft tissue sarcomas (n = 4), breast cancer (n = 4), gastrointestinal cancer (n = 3), and prostate cancer (n = 3). The most common primary outcome was overall survival (n = 23, 67.6%), with follow-up durations ranging from 1.1 years to over 21 years. Other outcomes included disease-specific survival, progression-free survival, thromboembolism, relapse-free survival, and infection rates. The studies used a broad range of candidate predictors, including demographic details, clinical characteristics, treatment approaches, and biological markers, which are crucial for understanding the context of the models developed (Table 1 and Table S2).

Methodological characteristics

The methodological characteristics of the included studies are summarized in Fig. 2 and Table S3. Most DPM studies (n = 24) considered time-dependent variables in model development. The landmark model (n = 12), joint model (n = 6), time-dependent Cox model (n = 6), and conditional survival model (n = 6) were the most common modeling methods. All but five studies reported a discrimination evaluation, mainly using the C-statistic (n = 16), the area under the curve (n = 10), or the Brier score (n = 5). More than 35% of the studies (n = 12) did not report whether model calibration was evaluated; of the remaining 22 studies, 12 used calibration plots, five used heuristic shrinkage factors, and five used Brier scores. Regarding internal validation, 11 studies did not report it, three used random splits, and the remaining 58.8% used cross-validation (n = 12) or bootstrapping (n = 8). All studies except one [32] reported a model presentation format; the most common was the full model (n = 19), followed by nomograms (n = 11) and web calculators (n = 3). However, only one study reported both the full model and a visual presentation (a web calculator) [33].

Fig. 2 Methodological characteristics of dynamic prediction models for cancer prognosis. AUC, area under the curve

Reporting quality

At the study level, the overall TRIPOD adherence scores of the 34 studies ranged from 41.38 to 85.71%, with a median of 75% (Table S4). Development studies with external validation had higher adherence scores (median, 78.12%; IQR, 11.55%) than development-only studies. Among development-only studies, those with internal validation (median, 75.00%; IQR, 10.71%) scored better than those without (median, 70.20%; IQR, 6.5%). However, no significant difference in adherence scores was found among study types (Table S4 and Fig. 3). The four studies [9, 19, 34, 35] that reported considering or following TRIPOD had higher and more stable overall adherence scores (median, 81.98%; IQR, 1.65%) than the other studies (median, 72.87%; IQR, 9.85%). Additionally, the 28 studies published after the release of TRIPOD (since 2015) showed higher adherence scores, with a median of 75.43% versus 59.67% for earlier studies (Table S4). This trend continued for the 20 studies published after the release of both TRIPOD and PROBAST (since 2019), with a median of 76.50% versus 67.25% for studies published before (Table S4). Similarly, 9 studies published since 2018 showed higher overall adherence scores, with a median of 75.86% compared with 65.52% for those published earlier (Table S4).

Fig. 3 The overall TRIPOD adherence scores of individual studies for (A) different study types and (B) different risks of bias. D-IV, development without internal validation; D, development with internal validation; D + EV, development with external validation. FDR, false discovery rate

At the item level, the overall adherence per TRIPOD item is shown in Fig. 4 and Table S5. Particularly problematic areas were the title (item 1, 23.53%), model interpretation covering clinical and research implications (item 20, 29.41%), explanation of how to use the DPM (item 15b, 50%), and full model presentation (item 15a, 55.88%). Other commonly poorly reported items were sample size justification (item 8, 8.82%) and details of the treatment patients received (item 5c, 32.35%).

Fig. 4 Overall adherence per TRIPOD item. *Items not applicable for a development study; †items that might not be applicable for a specific study

Methodological quality

Figure 5 summarizes the methodological quality of the included studies. Only one study (2.94%) was of high quality, three (8.82%) were of unclear quality, and the remaining 30 (88.24%) were of low quality. The single high-quality study updated and externally validated a DPM for patients with extremity soft tissue sarcoma using the landmark model [35].

Fig. 5 Risk of bias assessment of the included studies. (A) The proportion of risk-of-bias ratings across all included studies for the four domains and overall, (B) the risk of bias for each included study, and (C) the proportion of answers across all included studies for each signaling question. MLCWG, Multidisciplinary Larynx Cancer Working Group; Y, yes; PY, probably yes; NI, no information; PN, probably no; N, no

At the domain level, 33 studies were rated as high quality for the participants, predictors, and outcome domains, but only one study (2.94%) was rated as high quality for the statistical analysis domain.

At the signaling question level, low quality was mainly driven by inappropriate handling of predictors (N/PN, 35.29%) and use of univariable analysis to select predictors before multivariable modeling (N/PN, 41.18%; NI, 23.53%). Inadequate handling of missing data (N/PN, 44.12%; NI, 29.41%), inappropriate evaluation of model performance (N/PN, 44.12%), and lack of adjustment for overfitting (N/PN, 44.12%) were also common reasons (Table S6).

The correlation between methodological quality and reporting quality

Figure 3 and Table S4 show the overall TRIPOD adherence scores of studies with different methodological quality. The adherence score of the one high-quality study (77.14%) was slightly higher than that of studies with an unclear risk of bias (median, 75.86%; IQR, 8.31%), which in turn exceeded that of low-quality studies (median, 74.17%; IQR, 10.32%). However, no correlation was detected between reporting quality and methodological quality (rs = -0.11, P = 0.536).

Discussion

Principal findings

The DPMs included in this review primarily used the landmark model and the joint model to dynamically predict the overall survival of cancer patients. However, most DPMs were of poor quality owing to (1) suboptimal reporting of the model presentation format, explanation, and title, and (2) poor methodology for selecting and handling predictors, in addition to the familiar problems of reporting and developing predictive models (e.g., sample size, missing data, and model performance).

Study implications for research and practice

This study found that the reporting quality of DPMs has improved over time, especially after the publication of TRIPOD and PROBAST. However, DPMs still suffered from poor model presentation, interpretation, and titles, making them difficult to apply in practice. Approximately 60% of the DPMs lacked a presentation format that users could readily operate, 50% lacked an explanation of how to use the model, and over 70% failed to provide a proper title (as recommended by TRIPOD), impeding model use and information retrieval. Presentation formats that are easy to understand and operate (e.g., nomograms and websites), chosen for the intended users, timing, and settings of use and supplied with clear instructions, would be highly valuable [36]. We also strongly recommend reporting the full regression model, as illustrated below, which would facilitate the rapid expansion and updating of approved models. Moreover, to further promote the understanding and application of DPMs, researchers are encouraged to give a working example of the DPM and to discuss its potential clinical use and implications for future research [24].
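As a concrete illustration of what “reporting the full regression model” entails for a Cox-type model (all coefficients plus baseline survival at the prediction time point), here is a minimal sketch using the R `survival` package; the bundled `lung` data and the covariate choices are ours, for illustration only, and not from any included study:

```r
library(survival)

# Illustrative Cox model on the bundled lung dataset
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# The two ingredients a reader needs to reproduce predictions:
coef(fit)                                # all regression coefficients
summary(survfit(fit), times = 365)$surv  # baseline survival at 1 year
                                         # (at the mean covariate values)

# A new patient's predicted 1-year survival is then S0(365)^exp(lp),
# where lp is the linear predictor centered as in the fitted model.
```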

It is important to recognize that different modeling methods can influence modeling results [37]. Consequently, researchers should take the characteristics of the various modeling methods into account when designing studies and selecting approaches [38]. Among the included DPMs, there was a noticeable preference for landmark and joint models. Landmark models evaluate the effect of time-dependent covariates by conditioning on a fixed time point, the “landmark time”, after which the risk of an event is reassessed; they are particularly useful when the risk of an event changes over time and can be influenced by events occurring after the landmark time [39]. Joint models, on the other hand, combine the analysis of longitudinal and time-to-event data within a single statistical framework; they are designed to handle the complex relationships between repeated measurements and time-to-event outcomes, providing a comprehensive view of the data structure [40]. Generally, time-dependent Cox, landmark, and joint models outperformed the conditional survival (CS) model when time-dependent covariates were considered as potential predictors in time-to-event data, although the CS model provided more accurate information about the patient’s prognosis, especially once the patient had survived beyond a pre-specified landmark time [41, 42]. Compared with time-dependent Cox models, landmark models were more transparent, especially in the present context of a binary time-dependent covariate [39], while joint models provided a better fit and were more robust [43]. The landmark model was more widely used than the joint model, probably because it does not require specification of a longitudinal model [44] and is easier to implement [45, 46]. So far, there is no clear conclusion on the comparative predictive performance of these two approaches [44, 45, 47], and more studies are needed to systematically compare them to facilitate researchers’ choices in the development and validation of DPMs.
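To make the landmark approach concrete, a minimal sketch in R with the `survival` package follows. The data frame `dat`, the covariates, and the helper function are hypothetical; real landmark analyses typically repeat this over a grid of landmark times and stack the results into a landmark “super model”:

```r
library(survival)

# Hypothetical data: one row per patient with follow-up time, event
# indicator, and a covariate value as known at the landmark time.
landmark_fit <- function(dat, s, horizon) {
  # Keep only patients still at risk at landmark time s
  at_risk <- subset(dat, time > s)
  # Administrative censoring at the prediction horizon s + horizon
  at_risk$time2  <- pmin(at_risk$time, s + horizon)
  at_risk$event2 <- ifelse(at_risk$time <= s + horizon, at_risk$event, 0)
  # Cox model from the landmark onward, using only information known at s
  coxph(Surv(time2, event2) ~ biomarker_at_s + age, data = at_risk)
}

# e.g., predict over the next 3 years for patients alive at 2 years:
# fit <- landmark_fit(dat, s = 2, horizon = 3)
```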

Finally, we suggest that the selection and handling of DPM predictors be improved in future studies. About 24% of the DPMs failed to report how the predictors were selected, and more than 33% selected predictors based solely on univariable analysis, a practice that can lead to model overfitting, especially when combined with small sample sizes or collinearity among variables [27, 48]. Additional information, such as the literature, expert opinion, clinical knowledge, and the reliability, consistency, applicability, usability, and cost of predictor measurements, should therefore be considered [26, 49]. Furthermore, more than half of the DPM studies converted continuous predictors into binary or multi-categorical variables for modeling, which loses information and reduces predictive performance [50]. Converting continuous predictors to categorical variables is therefore not recommended unless the predictor has a well-established, constant risk relationship with the outcome over a range of values; instead, nonlinear relationships can be explored, for example with splines, as sketched below [51].
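A minimal sketch of keeping a predictor continuous while allowing nonlinearity, using natural cubic splines in R; the `lung` data and the choice of 3 degrees of freedom are illustrative assumptions, not recommendations from the included studies:

```r
library(survival)
library(splines)

# Model a continuous predictor flexibly instead of categorizing it:
# a natural cubic spline with 3 degrees of freedom for age.
fit_spline <- coxph(Surv(time, status) ~ ns(age, df = 3) + sex, data = lung)

# Compare with a linear term to check whether the nonlinearity is needed
fit_linear <- coxph(Surv(time, status) ~ age + sex, data = lung)
anova(fit_linear, fit_spline)  # likelihood-ratio test for nonlinearity
```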

Comparison with other studies

To our knowledge, no previous study has specifically evaluated the reporting and methodological quality of DPMs. We have therefore benchmarked our findings against studies of static predictive models.

Regarding reporting quality, we identified significant shortcomings in how DPMs report certain aspects of the TRIPOD guidelines, particularly the title (item 1), which was adequately addressed in fewer than 25% of the models. This aligns with a previous article [52] and indicates a broader issue with the reporting of prognostic model development in the published literature. There is a clear need for researchers to employ more rigorous methods and to enhance the reporting of their studies.

Furthermore, our study observed an improvement in the reporting of full models compared with earlier studies [53, 54]. The adoption of innovative methodologies, such as landmark and joint models, appears to have contributed to this improvement to some extent.

Regarding methodological quality, our research revealed that the methodological quality of current DPMs in cancer prognosis is suboptimal, particularly in areas such as model performance evaluation, handling of missing data, and selection of predictors. These findings align with the methodological shortcomings identified in previous reviews of static predictive models for cancer prognosis [52, 55, 56].

Strengths and limitations

To our knowledge, this is the first investigation to assess the reporting and methodological quality of DPM studies and to summarize their methodological characteristics. Our findings could provide a valuable reference for the development and application of DPMs. Nevertheless, DPM research is a newly emerging field, and only 34 DPM studies on cancer prognosis (excluding methodological studies) were identified, so our conclusions will need updating as the field grows.

Several limitations of this study should be noted. First, no commonly accepted reporting or methodological assessment tools for DPM studies are currently available, so we used TRIPOD and PROBAST, which were designed for predictive models in general. Nearly all items in these tools were applicable to DPM studies, although some DPM-specific characteristics are not covered by them. For example, a report should identify the model as a DPM in the title, explain why dynamic predictions are made, describe how dynamically changing variables (e.g., repeated measurement data and time-dependent covariates) are measured, and describe how such variables are handled in the model. Second, to better evaluate the quality of DPM studies, non-clinical methodological research was excluded, which may have affected the characterization of the methodological landscape. Third, given the current state of the DPM field, searching for related reports is highly time-consuming, and certain DPM reports may have been missed. In summary, the DPM is a highly valuable prognostic tool with great potential; however, well-acknowledged modeling and reporting guidance and relevant assessment tools for DPMs are urgently needed.

Conclusion

DPM studies on cancer prognosis involving time-dependent effects of predictors or repeated measurement data showed great potential, especially those based on landmark and joint models. However, their reporting quality and methodological quality are suboptimal. An informative title and the full model equation should always be presented so that users can retrieve and correctly validate the model, and easy-to-operate presentation formats and clear interpretation of DPMs are urgently needed for their application. Predictors should be properly selected and handled to enhance the quality of DPMs.