1 Introduction

Situations where subjects start from an initial condition (state 0) and could subsequently develop a final event (state 2 or death), either directly or passing through another intermediate condition (state 1 or illness), may be described by the simple illness-death multistate model (Meira-Machado et al. 2009). This framework may be applied also in situations where the transition into the intermediate state does not consist in the occurrence of a disease, but instead represents the administration of a therapeutic intervention. Consider, as an example, patients affected by severe end-stage cardiomyopathy that enter a waiting list (state 0) to receive heart transplant: they may be transplanted (state 1) after some time or die (state 2) while still on list. In this context, a key issue is often to assess the impact of the intermediate event on the subsequent hazard of mortality which is done, in practice, by comparing two transition hazards: the hazard from state 0 to state 2 and the hazard from state 1 to state 2. A key aspect to consider to this end is the choice of the scale used to measure time after transition to state 1 (Iacobelli and Carstensen 2013; Meier-Hirmer and Schumacher 2013; Rebora et al. 2015). The most obvious one is the original (or “clock forward”) time scale, where time is always measured starting from the entry into the initial state 0. As an alternative, one may resort to the “clock reset” scale, where time is set back to 0 as patients enter the intermediate state 1. In practice, for those who experience the intermediate event, the observation time since origin can be split in two parts: time elapsed while in state 0 (also called “waiting time to intermediate event”) and time elapsed while in state 1 (“time since intermediate event”). Depending on the relation between the hazard of transition from state 1 to state 2 and the two time scales, three types of data generating process can be defined (Putter et al. 2007). Briefly, the process is:

  • Markovian, if the hazard depends only on time since origin;

  • semi-Markovian, if the hazard depends only on time since the intermediate event;

  • extended semi-Markovian, if the hazard depends on time since the intermediate event but also on the waiting time to the intermediate event.

In a previous paper (Bernasconi et al. 2016), a non-parametric estimator of the survival function after the intermediate state, proposed by Simon and Makuch (1984), was reviewed. It was also shown that this method provides unbiased estimates only under the Markov assumption. Finally, the paper presented a novel estimator based on the clock-reset scale which was proved to be suitable for semi-Markovian scenarios. In a subsequent paper (Tassistro et al. 2020), we discussed different possibilities to model the hazard of death before and after the intermediate event depending on the validity of Markov assumption, using a hazard regression model (e.g. Cox, Poisson or some suitable full-parametric model). Within that framework, we showed how to assess the impact of the intermediate event on the transition to the final state.

In this work we show that the well known Mantel-Byar extension (1974) of the log-rank test, used to compare the hazards between two time-dependent groups, provides valid results only for markovian processes. We thus present an alternative method, based on the clock-back scale, that is suitable under semi-Markov assumption. Finally, we propose an extension of the method for extended semi-Markov scenarios that allows to account for the effect of waiting time to the intermediate event. The performance of the three methods is compared in a simulation protocol.

In Sect. 2, we set up the notation, review existing methods and introduce a new procedure suitable for non-markovian contexts. In Sect. 3, a simulation study to compare the performance of the methods in different scenarios is presented and results are discussed. In Sect. 4, we show an application of the methods on a real set of clinical data. Finally, in Sect. 5, we summarize and discuss our findings in the context of existing literature.

2 Methods

2.1 Notation and recall of standard log-rank test

Considering the illness-death model depicted in Fig. 1 where all patients entry in state 0 (i.e. initial event) and they can possibly move to the absorbing state 2 (i.e. develop the final event) either directly or passing through state 1 (i.e. without or after an intermediate event), we define the random variables involved in the process as follows:

  • \(T_{01}\): potential time to state 1 starting from state 0;

  • \(T_{02}\): potential time to state 2 starting from state 0 and without passing through state 1;

  • \(T_{12}\): potential time to state 2 starting from state 1;

  • \(T_{012}=T_{01}+T_{12}\): potential time to state 2 starting from state 0 after passing through state 1 at some time point defined by \(T_{01}\);

  • \(X=I(T_{01}\le T_{02})\): indicator of the intermediate event so that \(T_{02}\) is observable only when \(X=0\) while \(T_{01}\), \(T_{12}\) and \(T_{012}\) are observable only when \(X=1\);

  • \(T=(1-X)\times T_{02}+X\times T_{012}\): observable time to state 2 starting from state 0 either with or without passing through state 1;

  • C: time to right censoring starting from state 0;

  • \(Z=\min (T,C)\): observed time to final event (without or after the intermediate event) or right censoring;

  • \(\Delta =I(T\le C)\): indicator of the final event or censoring;

  • \(E=I(T_{01}\le Z)\): indicator of the intermediate event.

For each patient \(i=1,...,n\), we observe the following data:

  • \(t_i\): time to final event (either without or after the intermediate event) or right censoring;

  • \(\delta _i\): final event (\(\delta _i=1\)) or censoring (\(\delta _i=0\)) indicator;

  • \(\varepsilon _i\): indicator of the intermediate event (\(\varepsilon _i=1\) if occurred and \(\varepsilon _i=0\) otherwise);

  • \(t_i^{01}\): time to the intermediate event, observable only if \(\varepsilon _i=1\).

The vector of observed data \((t_i; \delta _i; \varepsilon _i; \varepsilon _i\cdot t_i^{01})\) is the sampling realization of \((Z; \Delta ; E; E\cdot T_{01})\). Moreover, we define:

  • \(t_j, j=1,...,J\): ordered observed times of final event from entry in state 0, both for patients who do not enter state 1 (patients with \(\varepsilon _i=0\)) and for patients who enter state 1 (patients with \(\varepsilon _i=1\));

  • \(t_k, k=1,...,K\): ordered observed times of final event, however, for patients who do not enter state 1 (patients with \(\varepsilon _i=0\)) this time is calculated from entry in state 0, for patients who enter state 1 (patients with \(\varepsilon _i=1\)) this time is calculated from entry in state 1.

Finally, we denote \(\lambda _{02}(t)\) and \(\lambda _{12}(t)\) the transition hazards to the final state starting either from state 0 or from state 1, respectively.

The focus is now directed to non-parametric approaches to test the null hypothesis of equality between these two hazards: \( \lambda _{02}(t)\) and \(\lambda _{12}(t) \). In the simple situation where, at the start of follow-up, all patients are either in state 0 or already in state 1 and further transition into state 1 cannot be observed, the null hypothesis can be written as: \(H_0: \lambda _{02}(t) = \lambda _{12}(t)\). The standard non-parametric approach to test \(H_0\) is the log-rank test. In this case there are two time-fixed groups, patients never treated \((E=0)\) and patients treated from start \((E=1)\). At each final event time \(t_j\) the following quantities can be computed:

  • \(d_{1j}=\sum _{i=1}^{n} (\delta _i\cdot \varepsilon _i\cdot I(t_i=t_j))\): number of events in group E=1 at time \(t_j\);

  • \(n_{1j}=\sum _{i=1}^{n} (\varepsilon _i\cdot I(t_i>t_j))\): number of patients at risk in group E=1 at time \(t_j\);

  • \(d_{.j}=\sum _{i=1}^{n} (\delta _i\cdot I(t_i=t_j))\): overall number of events in both groups at time \(t_j\);

  • \(n_{.j}=\sum _{i=1}^{n} (I(t_i\ge t_j))\): overall number of patients at risk in both groups at time \(t_j\).

Under \(H_0\) the distribution of \(d_{1j}\) is hypergeometric and the following test statistic can be defined:

$$\begin{aligned} Q = \frac{[\sum _{j=1}^{J} (d_{1j} - \frac{d_{.j}}{n_{.j}}n_{1j})]^2}{\sum _{j=1}^{J}Var(d_{1j})}, \end{aligned}$$
(1)

where \(Var(d_{1j})=d_{1j}\frac{d_{.j}}{n_{.j}}(1-\frac{d_{.j}}{n_{.j}})\frac{n_{.j}-n_{1j}}{n_{.j}-1}\), which under \(H_0\) follows a Chi-square distribution with 1 degree of freedom.

When the transition into state 1 may occur over time, as in the case of an illness-death model, alternative approaches should be considered as illustrated in Sects. 2.2 and 2.3.

Fig. 1
figure 1

Illness-death model with clock forward (t) and clock reset (\(t-t^{01}\)) time scales. Boxes and arrows represent the states and the possible transitions respectively

2.2 Mantel-Byar test

An extension of the logrank test to deal with time-dependent groups was proposed by Mantel and Byar (1974) in the context of heart-transplant. From here on, we denote this method as “MB-test”. Time is always measured using the original scale t (time since entry in state 0) and the null hypothesis tested is still \(H_0: \lambda _{02}(t) = \lambda _{12}(t)\) . At time 0 all patients are in state 0 and thus the risk-set of the intermediate event (i.e. transplant) group is empty. At each time \(t_j\), the quantities \(d_{1j}\) and \(n_{1j}\) are now replaced by the following ones:

  • \(d_{1j}^{MB}=\sum _{i=1}^{n} (\delta _i\cdot \varepsilon _i\cdot I(\varepsilon _i t_i^{01}\le t_j)\cdot I(t_i=t_j))\): number of events at time \(t_j\) among those who already had the intermediate event;

  • \(n_{1j}^{MB}=\sum _{i=1}^{n} (\varepsilon _i\cdot I(\varepsilon _i t_i^{01}\le t_j)\cdot I(t_i>t_j))\): number of patients at risk at time \(t_j\) among those who already had the intermediate event.

A test statistic analogous to the log-rank test can now be defined:

$$\begin{aligned} Q^{MB} = \frac{[\sum _{j=1}^{J} (d_{1j}^{MB} - \frac{d_{.j}}{n_{.j}}n_{1j}^{MB})]^2}{\sum _{j=1}^{J}Var(d_{1j}^{MB})}, \end{aligned}$$
(2)

where \(Var(d_{1j}^{MB})=d_{1j}^{MB}\frac{d_{.j}}{n_{.j}}(1-\frac{d_{.j}}{n_{.j}})\frac{n_{.j}-n_{1j}^{MB}}{n_{.j}-1}\), which again under \(H_0\) follows a Chi-square distribution with 1 degree of freedom.

To illustrate the rationale beyond this method, lets consider the following example on patients affected by severe cardiopathy. For all patients, the starting event is the inclusion in a waiting list for receiving heart trasplant (state 0), the intermediate event (state 1) is represented by the transplant itself and the final event (state 2) is death, either while in list or after transplant. Suppose that a patient is transplanted at time \(t^{01}=t^*\), since then the patients is moved to the risk-set of the transplanted group. In other words, the observation of that patient is split in two parts: i) from time 0 to \(t^*\) the patient is ascribed to the waiting-list group and the observation is censored at \(t^*\); ii) from \(t^*\) onwards the patient is ascribed to the transplanted group and the observation is left-truncated at \(t^*\). If the process is markovian, the hazard of mortality under transplant, after t time units since entry in the waiting list, does not depend on the waiting time to transplant \(t^{01}\) nor on the time since transplant \(t-t^{01}\). Thus, under markovianity, the MB-test is suitable to compare the hazard of mortality of a patient in waiting list with the hazard of a hypothetical patient transplanted at an arbitrary time \(t^{01}\) (e.g. at \(t^{01}=0\)). For this reason, the null hypothesis can also be written as \(H_0: \lambda _{02}(t) = \lambda _{12;t^{01}=0}(t)\), where \(\lambda _{12;t^{01}=0}(t)\) is the hazard of mortality of a patient transplanted at time \(t^{01}=0\).

If the process is not markovian, however, there is an impact of time since transplant \(t-t^{01}\) (semi-Markov process) and possibly also of time to transplant \(t^{01}\) (extended semi-Markov process). As we will show by the simulation study, in these scenarios the MB-test is not suitable to evaluate the above null hypothesis because the hazard after transplant is obtained by averaging heterogeneous hazards of patients at t time units since entry in list, when patients have spent a different amount of time on list and after transplant.

2.3 “Clock-reset logrank-type” test

For semi-markovian processes, we propose a simple modification of the Mantel-Byar test that can be applied by adopting a double time scale to measure time: the original time scale t before transplant and the clock reset scale \(t-t^{01}\) after transplant. From here on, we denote this method as “CR-test”. At each time \(t_k\), the following quantities can be computed:

  • \(d_{1k}^{CR}=\sum _{i=1}^{n} (\delta _i\cdot \varepsilon _i\cdot I(t_i-\varepsilon _i t_i^{01}= t_k))\): number of events at time \(t_k\) after the intermediate event (defined only for subjects that experience the intermediate event);

  • \(n_{1k}^{CR}=\sum _{i=1}^{n} (\varepsilon _i\cdot I(t_i-\varepsilon _i t_i^{01}\ge t_k))\): number of patients at risk at time \(t_k\) after the intermediate event (defined only for subjects that experience the intermediate event);

  • \(d_{.k}^{CR}=\sum _{i=1}^{n} (\delta _i\cdot (1-\varepsilon _i) \cdot I(t_i=t_k)) + d_{1k}^{CR}\): sum of the number of events at time \(t_k\) among those without the intermediate event and the number of events at time \(t_k\) after the intermediate event;

  • \(n_{.k}^{CR}=\sum _{i=1}^{n} ((1-\varepsilon _i) \cdot I(t_i\ge t_k)+\varepsilon _i \cdot I(\varepsilon _i t_i^{01}\ge t_k)) + n_{1k}^{CR}\): sum of the number of patients at risk at time \(t_k\) among those without the intermediate event, the number of patients who will have the intermediate event but are still in state 0 at time \(t_k\) and the number of patients at risk at time \(t_k\) after the intermediate event.

A test statistic analogous to the log-rank test (and the MB-test) can now be defined:

$$\begin{aligned} Q^{CR} = \frac{[\sum _{j=k}^{K} (d_{1k}^{CR} - \frac{d_{.k}^{CR}}{n_{.k}^{CR}}n_{1k}^{CR})]^2}{\sum _{k=1}^{K}Var(d_{1k}^{CR})}, \end{aligned}$$
(3)

where \(Var(d_{1k}^{CR})=d_{1k}^{CR}\frac{d_{.k}^{CR}}{n_{.k}^{CR}}(1-\frac{d_{.k}^{CR}}{n_{.k}^{CR}})\frac{n_{.k}^{CR}-n_{1k}^{CR}}{n_{.k}^{CR}-1}\), which again under \(H_0\) follows a Chi-square distribution with 1 degree of freedom.

Lets illustrate the rationale of this method with reference to the example on heart transplant. Again, the observation of a patient transplanted at \(t^{01}=t^*\) is split in two parts: i) from time \(t=0\) to \(t=t^*\) the patient is ascribed to the waiting-list group and the observation is censored at \(t^*\); ii) from time \(t-t^{01}=0\) onwards the patient is ascribed to the transplanted group. Under the semi-Markov property, the hazard of mortality at time \(t-t^{01}\) since transplant does not depend on the waiting time to transplant \(t^{01}\). Using the CR-test, the hazard after transplant is calculated by averaging hazards of patients at \(t-t^{01}\) time units since transplant, disregarding that they spent a different amount of time on list. Thus, under semi-markovianity, the test is suitable to compare the hazard of mortality of a patient on waiting list with the hazard of an hypothetical patient transplanted at an arbitrary time \(t^{01}\) (e.g. at \(t^{01}=0\)). For this reason, the null hypothesis can also be written as \(H_0: \lambda _{02}(t) = \lambda _{12;t^{01}=0}(t-t^{01})\).

One may wonder whether the null hypothesis of interest should be formulated as \(H_0: \lambda _{02}(t) = E(\lambda _{12}(t-t^{01})|T^{01}<t)\) given that patients are transplanted after an heterogeneous waiting time. However, if the goal is to compare the mortality of patients who underwent transplantation to that of patients still on the waiting list, and thus to evaluate whether the transplantation option has a favorable impact on the survival of these patients, the null hypothesis \(H_0: \lambda _{02}(t) = \lambda _{12;t^{01}=0}(t-t^{01})\) is of interest. Considering a patient who might receive transplantation at a given time, to assess whether there is a real advantage in transplanting them, one should compare the mortality rate from that moment onwards if the patient were transplanted at that moment to the mortality rate if the patient will never be transplanted. This question could be posed at any moment of follow-up, but it seems natural to initially set the time-point as the moment when the patient could first receive the transplant, that is, their entry onto the waiting list. The fact that the data available to answer this question come from a real-world scenario where transplantation occurs at different times for each patient generates complexity in how the analysis should be conducted, but it does not alter the question of interest. On the other hand, the null hypothesis \(H_0: \lambda _{02}(t) = E(\lambda _{12}(t-t^{01})|T^{01}<t)\) considers a comparison between the hazard in the population of those who are not transplanted with the hazard in a population that is not well-defined because it refers to a dynamic cohort (those transplanted within time t). Thus, the issue of real-world data complexity is shifted to an unclear formulation of the question of interest. A study of this type of approach to the analysis of a time-dependent treatment, both from a theoretical and practical perspective, is presented in the paper by Rebora et al. (2015).

When the waiting time on list \(t^{01}\) has an impact on the hazard after transplant (extended semi-Markov process), this has to be taken into account. Thus, we propose the following two-steps approach (we refer to this method as “ECR-test”, where E stands for “extended”):

  1. 1.

    fit a Cox model only on post-transplant data to check the impact of time to transplant \(t^{01}\) on the hazard of mortality after transplant measured using the clock reset scale (Meier-Hirmer and Schumacher 2013): \(\lambda _{12}(t-t^{01}) = \lambda _0 (t-t^{01}) \exp (\beta _{t^{01}} t^{01})\);

  2. 2.

    use the estimated coefficient \(\hat{\beta _{t^{01}}}\) to adjust the size of the risk-set in the transplanted group \(n_{1j}\) within the “clock reset logrank-type” test.

We should note that the model proposed assumes that the effect of waiting time is constant over time (i.e. proportional hazards assumption) and that it is linear. In case of violation of one or both assumptions, a more complex model should be considered, for instance including an interaction term between \(t-t^{01}\) and \(t^{01}\) (to overcome the proportional hazard assumption) or using a spline function to model the effect of \(t^{01}\) (to overcome the linearity assumption). These modifications would obviously affect the way we account for the effect of \(t^{01}\) in our second step. For simplicity, we now refer to the situation where the model \(\lambda _{12}(t-t^{01}) = \lambda _0 (t-t^{01}) \exp (\beta _{t^{01}} t^{01})\) is valid considering also that mild violations to the assumptions would have little impact of the results of the test (this will be invesitgated in the simulation study). The second step can be done using the same idea behind the Breslow estimator of the baseline cumulative hazard commonly adopted to make profile-specific survival predictions from a Cox model. The idea consists in substituting, at each final event time \(t_k\), \(n_{1k}^{CR}\) and \(n_{.k}^{CR}\) with, respectively:

  • \(n_{1k}^{CR}*=\sum _{i=1}^{n} (\exp (\hat{\beta _{t^{01}}} \cdot t_i^{01})\varepsilon _i\cdot I(t_i-\varepsilon _i t_i^{01}\ge t_k)) \): sum over subjects at risk at time \(t-t^{01}=t_k\) after transplant, which resembles the expected counts one would observe if patients were all transplanted at \(t^{01}=0\).

  • \(n_{.k}^{CR}*=\sum _{i=1}^{n} ((1-\varepsilon _i) \cdot I(t_i\ge t_k)+\varepsilon _i \cdot I(\varepsilon _i t_i^{01}\ge t_k)) + n_{1k}^{CR}*\): sum of number of patients at risk at time \(t_k\) among those without the intermediate event, the number of patients who will have the intermediate event but are still in state 0 at time \(t_k\) and the expected number of patients at risk at time \(t_k\) after the intermediate event if all were transplanted at \(t^{01}=0\).

The test statistic now can be defined as:

$$\begin{aligned} Q^{CR}* = \frac{{U^{CR}}^2}{Var(U^{CR})} = \frac{[\sum _{k=1}^{K} (d_{1k}^{CR} - \frac{d_{.k}^{CR}}{n_{.k}^{CR}*}n_{1k}^{CR}*)]^2}{Var(U^{CR})}, \end{aligned}$$
(4)

where \(Var(U^{CR})\) has to be estimated using resampling methods (e.g. bootstrap). Under \(H_0\), \(Q^{CR}*\) follows a Chi-square distribution with 1 degree of freedom.

3 Simulation protocol

3.1 Data generation

We compared the performance of CR-test and ECR-test vs MB-test in the following eight scenarios:

  1. 1.

    Markov under the null hypothesis (M0): absence of effect of waiting time on the clock forward scale and absence of effect of the intermediate event on transition to the final state;

  2. 2.

    Markov under an alternative hyphotesis (M1): absence of effect of waiting time on the clock forward scale and presence of effect of the intermediate event on transition to the final state;

  3. 3.

    Semi-markov under the null (SM0): absence of effect of waiting time on the clock reset scale and absence of effect of the intermediate event on transition to the final state;

  4. 4.

    Semi-markov under an alternative (SM1): absence of effect of waiting time on the clock reset scale and presence of effect of the intermediate event on transition to the final state;

  5. 5.

    Extended semi-markov under the null hypothesis with constant effect of waiting time (ESMph0): presence of constant effect of waiting time on the clock reset scale and absence of effect of the intermediate event on transition to the final state;

  6. 6.

    Extended semi-markov under an alternative hypothesis with constant effect of waiting time (ESMph1): presence of constant effect of waiting time on the clock reset scale and presence of effect of the intermediate event on transition to the final state;

  7. 7.

    Extended semi-markov under the null hypothesis with time-varying effect of waiting time (ESMtv0): presence of time-varying effect of waiting time on the clock reset scale and absence of effect of the intermediate event on transition to the final state;

  8. 8.

    Extended semi-markov under an alternative hypothesis with time-varying effect of waiting time (ESMtv1): presence of time-varying effect of waiting time on the clock reset scale and presence of effect of the intermediate event on transition to the final state.

In each scenario, we generated 1000 dataset of 50, 100, and 200 observations each using the inversion method of Bender et al. (2005). The potential times \(T_{02}\) and \(T_{01}\) were simulated according to a weibull (scale = 0.6, shape = 0.7) and an exponential distribution (0.5) respectively, while the censoring time C was directly simulated using a uniform distribution (a = 0.25, b = 8). The potential time \(T_{12}\) was generated in a different way for each scenario (\(\beta \) is the effect of the intermediate event and \(\beta _{t^{01}}\) is the effect of the waiting time to the intermediate event):

  1. 1.

    M0: weibull (scale=0.6, shape=0.7) and then shifted back of \(T_{01}\) units (see “Appendix A” for more details);

  2. 2.

    M1: weibull (scale=\(0.6\cdot \exp (\beta )\), shape=0.7) and then shifted back of \(T_{01}\) units (see “Appendix A” for more details), where \(\beta =-\log (3)/0.7=\log (0.208)\);

  3. 3.

    SM0: weibull (scale=0.6, shape=0.7);

  4. 4.

    SM1: weibull (scale=\(0.6\cdot \exp (\beta )\), shape=0.7), where \(\beta =-\log (3)/0.7=\log (0.208)\);

  5. 5.

    ESM0ph: weibull (scale=\(0.6\cdot \exp (\beta _{t^{01}}\cdot t^{01})\), shape=0.7), where \(\beta _{t^{01}}=0.5/0.7=\log (2.043)\);

  6. 6.

    ESM1ph: weibull (scale=\(0.6\cdot \exp (\beta )\cdot \exp (\beta _{t^{01}}\cdot t^{01})\), shape=0.7), where \(\beta =-\log (3)/0.7=\log (0.208)\) and \(\beta _{t^{01}}=0.5/0.7=\log (2.043)\);

  7. 7.

    ESM0tv: weibull (scale=\(0.6\cdot \exp (\gamma _{t^{01}}\cdot t^{01}\cdot \log (\frac{t_{12}}{5}))\), shape=0.7), where \(\gamma _{t^{01}}=-0.2=\log (0.819)\);

  8. 8.

    ESM1tv: weibull (scale=\(0.6\cdot \exp (\beta )\cdot \exp (\gamma _{t^{01}}\cdot t^{01}\cdot \log (\frac{t_{12}}{5}))\), shape=0.7), where \(\beta =-\log (3)/0.7=\log (0.208)\) and \(\gamma _{t^{01}}=-0.2=\log (0.819)\).

The choice of time functions, parameters and coefficients used for the simulation is based on a criterion of simplicity but also on plausibility with respect to the example of heart transplant that we use to illustrate the application of the methods (Sect. 4). It shoud be noted that in all scenarios both \(T_{01}\) and \(T_{12}\) are generated independently of \(T_{02}\). Under the alternative hypthotesis (scenarios 2, 4, 6, 8), the effect of the intermediate event \(\beta \) is constant (proportional hazards assumption). The hazard functions involved in the process for each simulation scenario are defined in Table 1 and a graphical representation is presented in Figs. 2, 3, 4, 5. R functions to run MB-test, CR-test and ECR-test and code for simulations and data analysis is available on a GitHub repository: https://github.com/davidebernasconi/Clock-reset-logrank-type-test.

Table 1 Hazard functions involved in the process for each simulation scenario
Fig. 2
figure 2

Hazards of final event after the intermediate event \(\lambda _{12}(t)\) (thin lines) in Markov scenarios for three patients with times to intermediate event \(t^{01}=0, 1, 2\). On the x-axis, time is represented in the clock forward scale (i.e. measured since the initial event). The hazard of final event without the intermediate event \(\lambda _{02}(t)\) is also reported (grey thick line)

Fig. 3
figure 3

Hazards of final event after the intermediate event \(\lambda _{12}(t)\) (thin lines) in semi-Markov scenarios for three patients with times to intermediate event \(t^{01}=0, 1, 2\). On the x-axis, time is represented in the clock reset scale (i.e. measured since the intermediate event). The hazard of final event without the intermediate event \(\lambda _{02}(t)\) is also reported (grey thick line). Obviously, for this quantity, time on the x-axis is represented in the clock forward scale (i.e. measured since the initial event)

Fig. 4
figure 4

Hazards of final event after the intermediate event \(\lambda _{12}(t)\) (thin lines) in extended semi-Markov scenarios with fixed effect of \(t^{01}\) for three patients with times to intermediate event \(t^{01}=0, 1, 2\). On the x-axis, time is represented in the clock reset scale (i.e. measured since the intermediate event). The hazard of final event without the intermediate event \(\lambda _{02}(t)\) is also reported (grey thick line). Obviously, for this quantity, time on the x-axis is represented in the clock forward scale (i.e. measured since the initial event)

Fig. 5
figure 5

Hazards of final event after the intermediate event \(\lambda _{12}(t)\) (thin lines) in extended semi-Markov scenarios with time-varying effect of \(t^{01}\) for three patients with times to intermediate event \(t^{01}=0, 1, 2\). On the x-axis, time is represented in the clock reset scale (i.e. measured since the intermediate event). The hazard of final event without the intermediate event \(\lambda _{02}(t)\) is also reported (grey thick line). Obviously, for this quantity, time on the x-axis is represented in the clock forward scale (i.e. measured since the initial event)

3.2 Simulation results

Summary statistics (median and range over all simulations) describing data generated in each scenario are shown in Table 2. The number of subjects who are never observed having the intermediate event and the number of final events (and censored observations) among these subjects does not vary across scenarios. The reason is that the hazard of transition to the intermediate event \(\lambda _{01}(t)\) and the hazard from the starting event directly to the final event \(\lambda _{02}(t)\) are assumed the same in all scenarios.

The number of final events among subjects who also experienced the intermediate state is generally lower in Markov and semi-Markov scenarios under the alternative hypothesis, where the protective effect of the intermediate event is highest and does not decrease with time, and it is higher in extended semi-Markov scenarios under the null hypothesis, where no effect of the intermediate event is assumed if it happens immediately and an unfavourable effect of waiting time is also postulated.

Simulation results for Markov (1: M0 and 2: M1), Semi-Markov (3: SM0 and 4: SM1), extended Semi-Markov with constant effect of waiting time (5: ESM0ph and 6: ESM1ph) and extended Semi-Markov with time-varying effect of waiting time (7: ESM0tv and 8: ESM1tv) scenarios are presented in Figs. 6, 7, 8 and 9 respectively. For all scenarios under the null hypothesis of no effect of the intermediate event on the transition to the final state (1,3,5,7) we checked that the fraction of false rejections of H0 was close to the nominal level, while for scenarios under the alternative hypothesis of a protective effect of the intermediate event (2,4,6,8) we compared the power (i.e. fraction of true rejections of H0) of CR-test (or ECR-test) vs MB-test (Table 3). Monte Carlo standard errors are also reported (Morris et al. 2019).

Table 2 Description of data generated under each simulation scenario
Fig. 6
figure 6

Simulation results for Markov scenarios. Left panels represent the scenario under the null hypothesis (M0) while right panels represent the scenario under an alternative hypothesis (M1). Top panels show results with a sample size of n = 50, in middle panels n = 100, in bottom panels n = 200

Fig. 7
figure 7

Simulation results for semi-Markov scenarios. Left panels represent the scenario under the null hypothesis (SM0) while right panels represent the scenario under an alternative hypothesis (SM1). Top panels show results with a sample size of n = 50, in middle panels n = 100, in bottom panels n = 200

Fig. 8
figure 8

Simulation results for extended semi-Markov scenarios with constant (PH) effect of the time to the intermediate event. Left panels represent the scenario under the null hypothesis (ESM0ph) while right panels represent the scenario under an alternative hypothesis (ESM1ph). Top panels show results with a sample size of n = 50, in middle panels n = 100, in bottom panels n = 200

Fig. 9
figure 9

Simulation results for extended semi-Markov scenarios with time-varying (non PH) effect of the time to the intermediate event. Left panels represent the scenario under the null hypothesis (ESM0tv) while right panels represent the scenario under an alternative hypothesis (ESM1tv). Top panels show results with a sample size of n = 50, in middle panels n = 100, in bottom panels n = 200

Table 3 Relative frequency of \(p-value<\alpha \) over 1000 simulation (with Monte Carlo standard errors)

Results show that, under \(H_0: \lambda _{02}(t) = \lambda _{12;t^{01}=0}(t)\), MB-test has a rejection rate close to the nominal level only when the process is Markovian. As expected, CR-test is not suitable for these scenarios as the rejection rate under the null is too high. CR-test and ECR-test show a desirable behaviour both in terms of false rejections and power in semi- and extended semi-Markov scenarios, respectively. Correctly, the rejection rate under the null is invariant with respect to the sample size, while the power increases as more observations are included in the analysis. When the effect of waiting time is not constant over time and under the null hypothesis (scenario 7: ESMtv0) the false rejection rate of ECR-test is slighlty higher than the nominal level with high sample size (n = 200). In a further simulation study under scenario 7 and a sample size of 200 observations, we explored the performance of the ECR test with the follwing modification: the coefficient of waiting time was not assumed as fixed but was made time-dependent by introducing an interaction term between the waiting time and a function (natural logarithm) of time. As expected, we obtained slightly better results than with the ECR test without the modification: the false rejection rates at nominal levels 0.01, 0.05 and 0.1 were 0.013, 0.068 and 0.112 respectively. In real applications, where it is generally harder to choose the proper time function to model the interaction term, one can resort to polynomial transformations or spline functions making the modification of the test even more complex. However, when the violation of the proportional hazards assumption for the effect of waiting time is not heavy, the ECR test described in Sect. 2.3 is expected to provide a fairly good performance in controlling the type I error rate.

4 Application to real heart transplant data

We consider observational data extracted from the Italian National Heart Transplant Database on patients affected by severe cardiomyopathy who were included in a waiting list for receiving heart transplant (HTx) between 2012 and 2016. We considered only patients who had apparent stable conditions at time of entry in the waiting list and who were assigned the lowest priority to receive HTx (N=969). All patients were followed from the time of inclusion in the waiting list until death or independent right censoring with a median (Q1–Q3) follow-up of 36.7 (22.7–51.3) months (follow-up distribution estimated by the reverse Kaplan-Meier method). The number of patients who actually received HTx was 595 (51%), after a median (Q1–Q3) [min-max] of 5.4 (1.8–12) [0–56] months. The primary aim of the study is to compare the mortality of patients while still on list (139 observed deaths) versus after receiving HTx (123 observed deaths). Obviously, many factors that can potentially have a prognostic role both during the waiting time to HTx and after the intervention can be considered and included in the analysis. However, one may also be interested in performing an overall comparison between the survival experience observed on patients before and after HTx.

To this purpose, non parametric survival analysis methods derived from Kaplan-Meier estimator and log-rank test appear attractive albeit it is crucial to consider the time-dependent nature of the intervention group. Many papers in statistical and medical literature warned about the incorrect application of traditional methods to assess the impact of a time-dependent treatment which may lead to severe “immortal time” bias (see for example (Yadav and Lewis 2021)).

In order to perform an hypothesis testing of the equality of the hazards between the two time-dependent groups, one may be tempted to resort directly to the MB-test without preliminarily looking at the process memory assumptions. In this example, the chi-squared distributed test statistic is 15.14 (1 degree of freedom) which leads to a p-value < 0.001 thus strongly supporting evidence of a difference in mortality rate (higher for post-HTx experience). A better workflow would start from checking the Markov assumption before running the MB-test. In practice, this could be done by estimating and testing the association between the waiting time to HTx and the hazard of mortality after HTx, using the clock-forward time scale (i.e. time is measured since entry in the waiting list). To this purpose, we may simply fit a Cox model on patients who actually received HTx, including the waiting time as the only covariate and also accounting for the delayed entry in the risk set of transplanted patients by left-truncating the observation at the time of HTx (again, the waiting time to the intervention). The result of this analysis is an estimated hazard ratio for one year increase in waiting time of 4.38 (95%CI: 3.11–6.18) and a p-value < 0.001 (according to Wald test, Score test or Likelihood Ratio test). Thus, the Markov assumption is really implausible in this context.

As suggested in Tassistro et al. (2020) , the next step should be the assessment of semi-Markov assumption. In practice, this could be done by estimating and testing the association between the waiting time to HTx and the hazard of mortality after HTx, using the clock-reset time scale (i.e. time is measured since HTx). To this purpose, we may simply fit a Cox model on patients who actually received HTx, including the waiting time as the only covariate and considering as the observed outcome the time from HTx to death (or censoring). The result of this analysis is an estimated hazard ratio for one year increase in waiting time of 1.24 (95%CI: 1.03–1.50) and a p-value = 0.023 (according to Wald test, similar result for Score test or Likelihood Ratio test). Thus, there is some evidence that an impact of the waiting time to HTx on mortality is present (i.e. also the semi-Markov hypothesis should be rejected). Looking at regression diagnostics (Fig. 10), the effect of waiting time to HTx seems constant in time (proportional hazards assumption, panel A) even if not really linear on the logarithmic scale (panel B).

Taking advantage of the results about process memory properties and assuming an extended semi-Markov scenario beyond this data, we may try to apply the ECR-test, which seems the most appropriate to test the null hypothesis of equal mortality rate while on list and never transplanted versus transplanted immediately after inclusion on list (\(H_0: \lambda _{02}(t) = \lambda _{12;t^{01}=0}(t-t^{01})\)), accounting for the effect of waiting time to HTx. The chi-squared distributed test statistic is 1.036 (1 degree of freedom) which leads to a p-value = 0.309 thus leading to the conclusion that, in the hypothetical scenario where all patients are transplanted immediately after inclusion on list, it is not possible to demonstrate a clear advantage in survival provided by HTx.

Interestingly, if we ignored the impact of the waiting time to transplant and applied the CR-test as if the process was semi-Markovian, we would obtain a Chi-square statistic of 6.379 and a p-value of 0.012.

The procedure that we followed to perform the analysis is described in general terms in “Appendix B”.

Fig. 10
figure 10

Check of proportional hazards assumption by the analysis of Schoenfeld residuals (A) and linearity assumption by an hazard ratio spline plot (B) for the effect of the waiting time to HTx on the hazard of death evaluated by a Cox model fitted only on post-HTx observation in the data of the motivating application (function belonging to “survminer” and “Greg” R packages were employed)

5 Discussion

In a illness-death model, the impact of the occurrence of the intermediate event on the hazard of the final event can be investigated non-parametrically with logrank-type tests that account for the time-dependent nature of the indicator of the intermediate event. The only non-parametric test present in the literature to address this issue is the MB-test, proposed back in 1974, without a clear definition of the underlying null hypothesis. The hazard of the final event after passing through the intermediate one which is involved in the MB-test is also the quantity guiding the product-limit survival estimator of the survival since the intermediate event proposed by Simon and Makuch (1984). In previous works (Bernasconi et al. 2016, 2018) we showed that the Simon-Makuch method provides an unbiased estimator of the survival for patients who switches to the intermediate event immediately after the initial one (i.e. null waiting time to the intermediate event), but only when the process is markovian.

Similarly, we showed here that the MB-test is a valid approach to test this particular null hypothesis \(H_0: \lambda _{02}(t) = \lambda _{12;t^{01}=0}(t-t^{01})\) only when the Markov assumption is true. For non-Markov processes, we proposed a test based on the clock-reset time scale (directly suitable for semi-Markov processes) that can possibly account for the effect of the time to the intermediate event (thus becoming suitable also for extended semi-Markov scenarios).

Our considerations are supported by results of the simulation protocol, where we considered Markov, semi-Markov and extended semi-Markov scenarios both under the null hypothesis of no effect of the intermediate event administered immediately after the initial state on the hazard of the final state and under an alternative hypothesis. For the extended semi-Markov scenarios, we considered both the possibility of a constant or a time-varying effect of the time to the intermediate event. We compared the two methods in terms of false rejection rate (for scenarios under the null hypothesis) and power (for scenarios under the alternative hypothesis).

For simplicity, in all scenarios under the alternative hypothesis, data were generated under the assumption of proportionality between the hazard of the final event without and after the intermediate one (i.e. constant effect of the intermediate event). However, due to their non-parametric nature, all of the approached here discussed are suitable to tackle situations were this assumption is not fulfilled. Moreover, extensions analogous to those applied to the standard log-rank test, can also be considered for these methods in order to increase the efficiency in non-PH situations.

A Cox model where time after transition to the intermediate state is measured using the clock-reset scale and where the time-varying indicator of transition to the intermediate state and the fixed waiting time to this state are included in the model as covariates, can also be considered as a possible alternative to the non-parametric approach here proposed to assess the impact of the intermediate event (Tassistro et al. 2020). However, this approach is not directly suitable for situations where hazards before and after the intermediate event are not proportional.

Obviously, in order to choose the most suitable test, one has to preliminarily check the process memory assumptions. Several approaches have been proposed in the literature for this purpose such as the procedure shown in Tassistro et al. (2020), a non-parametric approach for the illness-death model (Rodriguez-Girondo and Uña-Alvarez 2012) and the general class of tests recently discussed by Putter and Titman (2022).

The work presented in this paper focus on the simple irreversible illness-death model. Methods for the estimation of transition probabilities in more extensive non-Markov models have been proposed, using plug-in approach or landmarking (Uña-Álvarez and Meira-Machado 2015; Titman 2015; Putter and Spitoni 2018). Moreover, the test proposed in this paper is only suitable for an unadjusted analysis of the association between the time-dependent treatment and the survival outcome. To account for the impact of other covariates in the estimation of transition probabilities in a non-Markov setting, one can resort to regression methods based on pseudo-observations (Andersen et al. 2022).

In conclusion, in this paper we proposed two alternative methods to the classical MB-test to assess the impact of the intermediate event to the final one in a illness-death model. We showed that MB-test is valid only under the Markov assumption while our methods are suitable for semi-Markov and extended semi-Markov scenarios. We recommend researchers who wish to assess the association between a binary time-dependent treatment and a time-to-event endpoint to study process memory assumptions before choosing the most appropriate method to perform the analysis. To provide clear guidance, a general step-by-step procedure to assess the impact of the intermediate event on the hazard of the final event in a non-markovian illness-death model using a non-parametric "log-rank-type" test is summarized in “Appendix B”.