Decision Tree for Prognostic Classification


Decision Tree for Prognostic Classification of Multivariate Survival Data and Competing Risks

1. Introduction

A decision tree (DT) is one way to represent the rules underlying data. It is among the most popular tools for exploring complex data structures. Moreover, it has become one of the most flexible, intuitive and powerful data-analytic tools for determining distinct prognostic subgroups with similar outcomes within each subgroup but different outcomes between subgroups (i.e., prognostic grouping of patients). DTs are hierarchical, sequential classification structures that recursively partition the set of observations. Prognostic groups are important in assessing disease heterogeneity and for the design and stratification of future clinical trials. Because patterns of medical treatment are changing so rapidly, it is important that the results of the present analysis be applicable to contemporary patients.

Due to their mathematical simplicity, linear regression for continuous data, logistic regression for binary data, proportional hazard regression for censored survival data, marginal and frailty regression for multivariate survival data, and proportional subdistribution hazard regression for competing risks data are among the most commonly used statistical methods. These parametric and semiparametric regression methods, however, may not lead to faithful data descriptions when the underlying assumptions are not satisfied. Sometimes, model interpretation can be problematic in the presence of high-order interactions among predictors.

DT has evolved to relax or remove the restrictive assumptions. In many cases, DT is used to explore data structures and to derive parsimonious models. DT is selected to analyze the data rather than the traditional regression analysis for several reasons. Discovery of interactions is difficult using traditional regression, because the interactions must be specified a priori. In contrast, DT automatically detects important interactions. Furthermore, unlike traditional regression analysis, DT is useful in uncovering variables that may be largely operative within a specific patient subgroup but may have minimal effect or none in other patient subgroups. Also, DT provides a superior means for prognostic classification. Rather than fitting a model to the data, DT sequentially divides the patient group into two subgroups based on prognostic factor values (e.g., tumor size < 2 cm vs tumor size ≥ 2 cm). The repeated partitioning creates “bins” of observations that are approximately homogeneous. This permits the use of some summary functions (e.g., Kaplan-Meier or cumulative incidence function (CIF)) to compare prognosis between the “bins.” The combination of binning and the interpretability of the resulting tree structure make DT extremely well suited for developing prognostic stratifications.

The landmark work on DT in the statistical community is the Classification and Regression Trees (CART) methodology of Breiman et al. (1984). A different approach is C4.5, proposed by Quinlan (1993). The original DT methods were used for classification and regression with categorical and continuous response variables, respectively. In a clinical setting, however, the outcome of primary interest is often duration of survival, time to event, or some other incomplete (that is, censored) outcome. Therefore, several authors have developed extensions of the original DT methods to the setting of censored survival data (Banerjee & Noone, 2008).

In science and technology, interest often lies in studying processes which generate events repeatedly over time. Such processes are referred to as recurrent event processes and the data they provide are called recurrent event data, which fall under multivariate survival data. Such data arise frequently in medical studies, where information is often available on many individuals, each of whom may experience transient clinical events repeatedly over a period of observation. Examples include the occurrence of asthma attacks in respirology trials, epileptic seizures in neurology studies, and fractures in osteoporosis studies. In business, examples include the filing of warranty claims on automobiles, or insurance claims for policy holders. Since multivariate survival times frequently arise when individuals under observation are naturally clustered or when each individual might experience multiple events, further extensions of DT have been developed for such data.

In some studies, patients may be simultaneously exposed to several events, each competing for their mortality or morbidity. For example, suppose that a group of patients diagnosed with heart disease is followed in order to observe a myocardial infarction (MI). If by the end of the study each patient was either observed to have MI or was alive and well, then the usual survival techniques can be applied. In real life, however, some patients may die from other causes before experiencing an MI. This is a competing risks situation because death from other causes prohibits the occurrence of MI. MI is considered the event of interest, while death from other causes is considered a competing risk. The group of patients who died of other causes cannot be considered censored, since their observations are not incomplete.

The extension of DT can also be employed for competing risks survival time data. These extensions can make one apply the technique to clinical trial data to aid in the development of prognostic classifications for chronic diseases.

This chapter will cover DT for multivariate and competing risks survival time data as well as their application in the development of medical prognoses. Two kinds of multivariate survival time regression models, i.e. the marginal and frailty regression models, have their own DT extensions. The extension of DT for competing risks, in turn, has two types of tree: first, the “single event” DT, which is developed based on a splitting function using one event only; second, the “composite events” tree, which uses all the events jointly.

2. Decision Tree

A DT is a tree-like structure used for classification, decision theory, clustering, and prediction functions. It depicts rules for dividing data into groups based on the regularities in the data. A DT can be used for categorical and continuous response variables. When the response variables are continuous, the DT is often referred to as a regression tree. If the response variables are categorical, it is called a classification tree. However, the same concepts apply to both types of trees. DTs are widely used in computer science for data structures, in medical sciences for diagnosis, in botany for classification, in psychology for decision theory, and in economic analysis for evaluating investment alternatives.

DTs learn from data and generate models containing explicit rule-like relationships among the variables. DT algorithms begin with the entire set of data, split the data into two or more subsets by testing the value of a predictor variable, and then repeatedly split each subset into finer subsets until the split size reaches an appropriate level. The entire modeling process can be illustrated in a tree-like structure.

A DT model consists of two parts: creating the tree and applying the tree to the data. To achieve this, DTs use several different algorithms. The most popular algorithm in the statistical community is Classification and Regression Trees (CART) (Breiman et al., 1984). This algorithm helps DTs gain credibility and acceptance in the statistics community. It creates binary splits on nominal or interval predictor variables for a nominal, ordinal, or interval response. The most widely used algorithms by computer scientists are ID3, C4.5, and C5.0 (Quinlan, 1993). The first versions of C4.5 and C5.0 were limited to categorical predictors; however, the most recent versions are similar to CART. Other algorithms include Chi-Square Automatic Interaction Detection (CHAID) for categorical response (Kass, 1980), CLS, AID, TREEDISC, Angoss KnowledgeSEEKER, CRUISE, GUIDE and QUEST (Loh, 2008). These algorithms use different approaches for splitting variables. CART, CRUISE, GUIDE and QUEST use the statistical approach, while CLS, ID3, and C4.5 use an approach in which the number of branches off an internal node is equal to the number of possible categories. Another common approach, used by AID, CHAID, and TREEDISC, is one in which the number of branches off an internal node varies from two to the maximum number of possible categories. Angoss KnowledgeSEEKER uses a combination of these approaches. Each algorithm employs different mathematical processes to determine how to group and rank variables.

Let us illustrate the DT method in a simplified example of credit evaluation. Suppose a credit card issuer wants to develop a model that can be used for evaluating potential candidates based on its historical customer data. The company's main concern is the default of payment by a cardholder. Therefore, the model should be able to help the company classify a candidate as a possible defaulter or not. The database may contain millions of records and hundreds of fields. A fragment of such a database is shown in Table 1. The input variables include income, age, education, occupation, and many others, determined by some quantitative or qualitative methods. The model building process is illustrated in the tree structure in Figure 1.

The DT algorithm first selects a variable, income, to split the dataset into two subsets. This variable, and also the splitting value of $31,000, is selected by a splitting criterion of the algorithm. There exist many splitting criteria (Mingers, 1989). The basic principle of these criteria is that they all attempt to divide the data into clusters such that variations within each cluster are minimized and variations between the clusters are maximized.

Name      Age   Income   Education     Occupation    ...   Default
Andrew    42    45600    College       Manager       ...   No
Allison   26    29000    High School   Self Owned    ...   Yes
Sabrina   58    36800    High School   Clerk         ...   No
Andy      35    37300    College       Engineer      ...   No
...       ...   ...      ...           ...           ...   ...

Table 1. Partial records and fields of a database table for credit evaluation

The follow-up splits are similar to the first one. The process continues until an appropriate tree size is reached. Figure 1 shows a segment of the DT. Based on this tree model, a candidate with an income of at least $31,000 and at least a college degree is unlikely to default on the payment, but a self-employed candidate whose income is less than $31,000 and whose age is less than 28 is more likely to default.
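As a concrete illustration, the following Python sketch fits a CART-style classification tree to a small, made-up table of the same shape as Table 1 using scikit-learn; the column names and values below are invented for illustration only and are not the data of this example.

    # A minimal sketch (Python, scikit-learn) of fitting a CART-style
    # classification tree to credit data shaped like Table 1.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    credit = pd.DataFrame({
        "income":  [45600, 29000, 36800, 37300, 27500, 52000, 30500, 41000],
        "age":     [42, 26, 58, 35, 24, 61, 30, 45],
        "college": [1, 0, 0, 1, 0, 1, 1, 0],   # 1 = at least a college degree
        "default": ["No", "Yes", "No", "No", "Yes", "No", "No", "Yes"],
    })
    X, y = credit[["income", "age", "college"]], credit["default"]

    # Gini impurity is the default splitting criterion; min_samples_leaf plays
    # the role of a minimum node size.
    tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=2,
                                  random_state=0).fit(X, y)

    # Print the rule-like structure, e.g. "income <= ... -> ...".
    print(export_text(tree, feature_names=["income", "age", "college"]))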

We begin with a discussion of the general structure of a popular DT algorithm in the statistical community, i.e. the CART model. A CART model describes the conditional distribution of y given X, where y is the response variable and X is a set of predictor variables (X = (X1,X2,…,Xp)). This model has two main components: a tree T with b terminal nodes, and a parameter Θ = (θ1,θ2,…, θb) ⊂ Rk which associates the parameter value θm with the mth terminal node. Thus a tree model is fully specified by the pair (T, Θ). If X lies in the region corresponding to the mth terminal node then y|X has the distribution f(y|θm), where we use f to represent a conditional distribution indexed by θm. The model is called a regression tree or a classification tree according to whether the response y is quantitative or qualitative, respectively.

2.1 Splitting a tree

The DT T subdivides the predictor variable space as follows. Each internal node has an associated splitting rule which uses a predictor to assign observations to either its left or right child node. The observations in an internal node are thus partitioned into two child nodes using the splitting rule. For quantitative predictors, the splitting rule is based on a split point c, and assigns observations for which {xi < c} or {xi ≥ c} to the left or right child node respectively. For qualitative predictors, the splitting rule is based on a category subset C, and assigns observations for which {xi ∈ C} or {xi ∉ C} to the left or right child node, respectively.

For a regression tree, the conventional algorithm models the response in each region Rm as a constant θm. Thus the overall tree model can be expressed as (Hastie et al., 2001):

f(X) = Σ_{m=1}^{b} θm I(X ∈ Rm),    (1)

where Rm, m = 1, 2,…,b constitute a partition of the predictor space and thus correspond to the b terminal nodes. If we adopt the method of minimizing the sum of squares as our criterion to characterize the best split, it is easy to see that the best θ̂m is just the average of yi in region Rm:

θ̂m = (1/Nm) Σ_{xi ∈ Rm} yi,    (2)

where Nm is the number of observations falling in node m. The residual sum of squares is

Qm(T) = (1/Nm) Σ_{xi ∈ Rm} (yi − θ̂m)²,    (3)

which will serve as an impurity measure for regression trees.
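To make the split search concrete, the sketch below (Python; the function and variable names are mine) scans the candidate cutpoints of a single predictor and picks the one minimizing the pooled residual sum of squares of the two child nodes, i.e. the split minimizing Σm Nm Qm(T) in the sense of (2)-(3).

    # A minimal sketch of the least-squares split search for a regression tree.
    import numpy as np

    def best_split_rss(x, y):
        """Return the cutpoint c minimizing the pooled residual sum of squares
        of the two child nodes {x < c} and {x >= c} (cf. eqs. (2)-(3))."""
        best_c, best_rss = None, np.inf
        for c in np.unique(x)[1:]:                 # candidate cutpoints
            left, right = y[x < c], y[x >= c]
            # each child is fitted by its mean (eq. (2)); pooled RSS = sum_m N_m Q_m(T)
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_c, best_rss = c, rss
        return best_c, best_rss

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 200)
    y = np.where(x < 4, 1.0, 3.0) + rng.normal(0, 0.5, 200)
    print(best_split_rss(x, y))                    # cutpoint should be close to 4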

If the response is a factor taking outcomes 1, 2, ..., K, the impurity measure Qm(T) defined in (3) is not suitable. Instead, we represent a region Rm with Nm observations by

p̂mk = (1/Nm) Σ_{xi ∈ Rm} I(yi = k),    (4)

which is the proportion of class k (k ∈ {1, 2,…,K}) observations in node m. We classify the observations in node m to the class k(m) = arg max_k p̂mk, the majority class in node m. Different measures Qm(T) of node impurity include the following (Hastie et al., 2001):

Misclassification error: (1/Nm) Σ_{xi ∈ Rm} I(yi ≠ k(m)) = 1 − p̂m,k(m)

Gini index: Σ_{k ≠ k'} p̂mk p̂mk' = Σ_k p̂mk (1 − p̂mk)

Cross-entropy or deviance: − Σ_k p̂mk log p̂mk

(5)

For binary outcomes, if p is the proportion of the second class, these three measures are 1 - max(p, 1 - p), 2p(1 - p) and -p log p - (1 - p) log(1 - p), respectively.

All three definitions of impurity are concave, having minimums at p = 0 and p = 1 and a maximum at p = 0.5. Entropy and the Gini index are the most common, and generally give very similar results except when there are two response categories.
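The small sketch below (Python; names are mine) evaluates the three impurity measures in (5) for a binary node as a function of the class-2 proportion p, which makes the common maximum at p = 0.5 easy to verify.

    # Node impurities for a binary node with class-2 proportion p (eq. (5)).
    import math

    def impurities(p):
        misclass = 1 - max(p, 1 - p)
        gini = 2 * p * (1 - p)
        entropy = 0.0
        for q in (p, 1 - p):
            if q > 0:
                entropy -= q * math.log(q)
        return misclass, gini, entropy

    for p in (0.0, 0.1, 0.5, 0.9, 1.0):
        m, g, e = impurities(p)
        print(f"p={p:.1f}  misclassification={m:.3f}  Gini={g:.3f}  entropy={e:.3f}")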

2.2 Pruning a tree

To be consistent with conventional notation, let us define the impurity of a node h as I(h) ((3) for a regression tree, and any one of the measures in (5) for a classification tree). We then choose the split with maximal impurity reduction

ΔI(s, h) = p(h) I(h) − p(hL) I(hL) − p(hR) I(hR),    (6)

where hL and hR are the left and right child nodes of h and p(h) is the proportion of the sample falling in node h.

How large should we grow the tree then? Clearly a very large tree might overfit the data, while a small tree may not be able to capture the important structure. Tree size is a tuning parameter governing the model's complexity, and the optimal tree size should be adaptively chosen from the data. One approach would be to continue the splitting procedure until the decrease in impurity due to the split exceeds some threshold. This strategy is too short-sighted, however, since a seemingly worthless split might lead to a very good split below it.

The preferred strategy is to grow a large tree T0, stopping the splitting process when some minimum number of observations in a terminal node (say 10) is reached. Then this large tree is pruned using a pruning algorithm, such as the cost-complexity or split-complexity pruning algorithm.

To prune the large tree T0 using the cost-complexity algorithm, we define a subtree T ⊆ T0 to be any tree that can be obtained by pruning T0, that is, by collapsing any number of its internal nodes, and define T̃ to be the set of terminal nodes of T. As before, we index terminal nodes by m, with node m representing region Rm. Let |T̃| denote the number of terminal nodes in T (|T̃| = b). We use |T̃| instead of b following the conventional notation and define the risk (cost) of a tree as

Regression tree: R(T) = Σ_{m=1}^{|T̃|} Nm Qm(T),

Classification tree: R(T) = Σ_{h ∈ T̃} p(h) r(h),

(7)

where r(h) measures the impurity of node h in a classification tree (it can be any one of the measures in (5)).

We define the cost-complexity criterion (Breiman et al., 1984)

Rα(T) = R(T) + α|T̃|,    (8)

where α (> 0) is the complexity parameter. The idea is, for each α, to find the subtree Tα ⊆ T0 that minimizes Rα(T). The tuning parameter α > 0 "governs the tradeoff between tree size and its goodness of fit to the data" (Hastie et al., 2001). Large values of α result in smaller trees Tα, and conversely for smaller values of α. As the notation suggests, with α = 0 the solution is the full tree T0.

To find Tα we use weakest link pruning: we successively collapse the internal node that produces the smallest per-node increase in R(T), and continue until we produce the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can show that this sequence must contain Tα. See Breiman et al. (1984) and Ripley (1996) for details. Estimation of α, giving α̂, is achieved by five- or tenfold cross-validation, and our final tree is then the subtree corresponding to α̂.

It follows that, in CART and related algorithms, classification and regression trees are produced from data in two stages. In the first stage, a large initial tree is produced by splitting one node at a time in an iterative, greedy fashion. In the second stage, a small subtree of the initial tree is selected, using the same data set. Whereas the splitting procedure proceeds in a top-down fashion, the second stage, known as pruning, proceeds from the bottom-up by successively removing nodes from the initial tree.

Theorem 1 (Breiman et al., 1984, Section 3.3) For any value of the complexity parameter α, there is a unique smallest subtree of T0 that minimizes the cost-complexity.

Theorem 2 (Zhang & Singer, 1999, Section 4.2) If α2 > α1, the optimal subtree corresponding to α2 is a subtree of the optimal subtree corresponding to α1.

More generally, suppose we end up with m thresholds, 0 < α1 < α2 < … < αm, and let α0 = 0. Also, let the corresponding optimal subtrees be T(α0), T(α1), …, T(αm); then

T(α0) ⊇ T(α1) ⊇ … ⊇ T(αm),    (9)

where T(αi) ⊇ T(αi+1) means that T(αi+1) is a subtree of T(αi). These are called nested optimal subtrees.
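As a rough sketch of this two-stage grow-then-prune procedure on conventional (uncensored) data, scikit-learn exposes weakest-link pruning through cost_complexity_pruning_path and the ccp_alpha parameter; here α is chosen by five-fold cross-validation on synthetic data (the data and settings are illustrative, not part of the chapter's examples).

    # Grow a large regression tree, then choose the cost-complexity parameter
    # alpha by cross-validation over the weakest-link pruning path.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 10, size=(300, 3))
    y = np.where(X[:, 0] < 5, 2.0, 5.0) + 0.3 * X[:, 1] + rng.normal(0, 0.5, 300)

    # Stage 1: grow a large tree T0 (small minimum node size).
    full_tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=0).fit(X, y)

    # Stage 2: the pruning path gives the nested subtree sequence indexed by alpha.
    alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas
    cv = [cross_val_score(DecisionTreeRegressor(min_samples_leaf=10, ccp_alpha=a,
                                                random_state=0), X, y, cv=5).mean()
          for a in alphas]
    best_alpha = alphas[int(np.argmax(cv))]
    final_tree = DecisionTreeRegressor(min_samples_leaf=10, ccp_alpha=best_alpha,
                                       random_state=0).fit(X, y)
    print("chosen alpha:", best_alpha, " leaves:", final_tree.get_n_leaves())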

3. Decision Tree for Censored Survival Data

Survival analysis is the phrase used to describe the analysis of data that correspond to the time from a well-defined time origin until the occurrence of some particular event or end-point. It is important to state what the event is and when the period of observation starts and finishes. In medical research, the time origin will often correspond to the recruitment of an individual into an experimental study, and the end-point is the death of the patient or the occurrence of some adverse event. Survival data are rarely normally distributed, but are skewed, typically comprising many early events and relatively few late ones. It is these features of the data that necessitate the special methods of survival analysis.

The specific difficulties relating to survival analysis arise largely from the fact that only some individuals have experienced the event and, subsequently, survival times will be unknown for a subset of the study group. This phenomenon is called censoring and it may arise in the following ways: (a) a patient has not (yet) experienced the relevant outcome, such as relapse or death, by the time the study has to end; (b) a patient is lost to follow-up during the study period; (c) a patient experiences a different event that makes further follow-up impossible. Generally, censoring times may vary from individual to individual. Such censored survival times underestimate the true (but unknown) time to event. Visualising the survival process of an individual as a time-line, the event (assuming it is to occur) is beyond the end of the follow-up period. This situation is often called right censoring. Most survival data include right-censored observations.

In many biomedical and reliability studies, interest focuses on relating the time to event to a set of covariates. The Cox proportional hazards model (Cox, 1972) has been established as the major framework for analysis of such survival data over the past three decades. But often in practice, one primary goal of survival analysis is to extract meaningful subgroups of patients determined by prognostic factors such as patient characteristics that are related to the level of disease. Although the proportional hazards model and its extensions are powerful in studying the association between covariates and survival times, they are usually problematic for prognostic classification. One approach to classification is to compute a risk score based on the estimated coefficients from regression methods (Machin et al., 2006). This approach, however, may be problematic for several reasons. First, the definition of risk groups is arbitrary. Secondly, the risk score depends on the correct specification of the model. It is difficult to check whether the model is correct when many covariates are involved. Thirdly, when there are many interaction terms and the model becomes complicated, the result becomes difficult to interpret for the purpose of prognostic classification. Finally, a more serious problem is that an invalid prognostic group may be produced if no patient is included in a covariate profile. In contrast, DT methods do not suffer from these problems.

Owing to the development of fast computers, computer-intensive methods such as DT methods have become popular. Since these investigate the significance of all potential risk factors automatically and provide interpretable models, they offer distinct advantages to analysts. Recently a large number of DT methods have been developed for the analysis of survival data, where the basic concepts for growing and pruning trees remain unchanged, but the choice of the splitting criterion has been modified to incorporate the censored survival data. The application of DT methods to survival data is described by a number of authors (Gordon & Olshen, 1985; Ciampi et al., 1986; Segal, 1988; Davis & Anderson, 1989; Therneau et al., 1990; LeBlanc & Crowley, 1992; LeBlanc & Crowley, 1993; Ahn & Loh, 1994; Bacchetti & Segal, 1995; Huang et al., 1998; Keleş & Segal, 2002; Jin et al., 2004; Cappelli & Zhang, 2007; Cho & Hong, 2008), including the text by Zhang & Singer (1999).

4. Decision Tree for Multivariate Censored Survival Data

Multivariate survival data frequently arise when we face the complexity of studies involving multiple treatment centres, family members and measurements repeatedly made on the same individual. For example, in multi-centre clinical trials, the outcomes for groups of patients at several centres are examined. In some instances, patients in a centre might exhibit similar responses due to uniformity of surroundings and procedures within a centre. This would result in correlated outcomes at the level of the treatment centre. For the situation of studies of family members or litters, correlation in outcome is likely for genetic reasons. In this case, the outcomes would be correlated at the family or litter level. Finally, when one person or animal is measured repeatedly over time, correlation will most definitely exist in those responses. Within the context of correlated data, the observations which are correlated for a group of individuals (within a treatment centre or a family) or for one individual (because of repeated sampling) are referred to as a cluster, so that from this point on, the responses within a cluster will be assumed to be correlated.

Analysis of multivariate survival data is complex due to the presence of dependence among survival times and unknown marginal distributions. Multivariate survival times frequently arise when individuals under observation are naturally clustered or when each individual might experience multiple events. A successful treatment of correlated failure times was made by Clayton and Cuzick (1985), who modelled the dependence structure with a frailty term. Another approach is based on a proportional hazards formulation of the marginal hazard function, which has been studied by Wei et al. (1989) and Liang et al. (1993). Notably, Prentice et al. (1981) and Andersen & Gill (1982) also suggested two alternative approaches to analyze multiple event times.

Extension of tree techniques to multivariate censored data is motivated by the classification issue associated with multivariate survival data. For example, clinical investigators design studies to form prognostic rules. Credit risk analysts collect account information to build up credit scoring criteria. Frequently, in such studies the outcomes of ultimate interest are correlated times to event, such as relapses, late payments, or bankruptcies. Since DT methods recursively partition the predictor space, they are an alternative to conventional regression tools.

This section is concerned with the generalization of DT models to multivariate survival data. In attempting to extend DT methods to multivariate survival data, additional difficulties need to be circumvented.

4.1 Decision tree for multivariate survival data based on marginal model

DT methods for multivariate survival data are not many. Almost all the multivariate DT methods have been based on between-node heterogeneity, with the exception of Molinaro et al. (2004) who proposed a general within-node homogeneity approach for both univariate and multivariate data. The multivariate methods proposed by Su & Fan (2001, 2004) and Gao et al. (2004, 2006) concentrated on between-node heterogeneity and used the results of regression models. Specifically, for recurrent event data and clustered event data, Su & Fan (2004) used likelihood-ratio tests while Gao et al. (2004) used robust Wald tests from a gamma frailty model to maximize the between-node heterogeneity. Su & Fan (2001) and Fan et al. (2006) used a robust log-rank statistic while Gao et al. (2006) used a robust Wald test from the marginal failure-time model of Wei et al. (1989).

The generalization of DT for multivariate survival data is developed by using the goodness-of-split approach. A DT by goodness of split is grown by maximizing a measure of between-node difference. Therefore, only internal nodes have associated two-sample statistics. The tree structure is different from CART because, for trees grown by minimizing within-node error, each node, either terminal or internal, has an associated impurity measure. This is why the CART pruning procedure is not directly applicable to such types of trees. However, the split-complexity pruning algorithm of LeBlanc & Crowley (1993) has made trees by goodness of split well-developed tools.

This modified tree technique not only provides a convenient way of handling survival data, but also enlarges the applied scope of DT methods in a more general sense. Especially for those situations where defining prediction error terms is relatively difficult, growing trees by a two-sample statistic, together with the split-complexity pruning, offers a feasible way of performing tree analysis.

The DT procedure consists of three parts: a method to partition the data recursively into a large tree, a method to prune the large tree into a subtree sequence, and a method to determine the optimal tree size.

In the multivariate survival trees, the between-node difference is measured by a robust Wald statistic, which is derived from a marginal approach to multivariate survival data developed by Wei et al. (1989). We use split-complexity pruning borrowed from LeBlanc & Crowley (1993) and a test sample for determining the right tree size.

4.1.1 The splitting statistic

We consider n independent subjects, where each subject may have K potential types or numbers of failures. If there are unequal numbers of failures across subjects, then K is the maximum. We let Tik = min(Yik, Cik), where Yik is the time of the failure of the ith subject for the kth type of failure and Cik is the potential censoring time of the ith subject for the kth type of failure, with i = 1,…,n and k = 1,…,K. Then δik = I(Yik ≤ Cik) is the indicator of failure, and the vector of covariates is denoted Zik = (Z1ik,…, Zpik)T.

To partition the data, we consider the hazard model for the ith unit for the kth type of failure, using the distinguishable baseline hazard as described by Wei et al. (1989), namely

λk(t; Zik) = λ0k(t) exp{β I(Zik < c)},    (10)

where the indicator function I(Zik < c) equals 1 if Zik < c and 0 otherwise, which corresponds to a split, say s, based on a continuous covariate Zj (j = 1,..,p). If the covariate is categorical, then I(Zik ∈ A) for any subset A of its categories needs to be considered.

The parameter β is estimated by maximizing the partial likelihood. If the observations within the same unit were independent, the partial likelihood functions for β for the distinguishable baseline model (10) would be

Lk(β) = Π_{i=1}^{n} [ exp{β I(Zik < c)} / Σ_{l ∈ Rk(Tik)} exp{β I(Zlk < c)} ]^{δik},  k = 1, …, K,    (11)

where Rk(t) = {l : Tlk ≥ t} is the set of subjects at risk for the kth type of failure at time t.

Since the observations within the same unit are not independent for multivariate failure time, we refer to the above functions as the pseudo-partial likelihood.

The estimator β̂ can be obtained by maximizing the likelihood, that is, by solving the score equation ∂ log L(β)/∂β = 0, where L(β) = Π_{k=1}^{K} Lk(β). Wei et al. (1989) showed that β̂ is asymptotically normally distributed. However, the usual estimate, a⁻¹(β̂), of the variance of β̂, where

a(β) = − ∂² log L(β) / ∂β²,    (12)

is not valid. We refer to a⁻¹(β) as the naïve estimator. Wei et al. (1989) showed that the correct estimated (robust) variance estimator of β̂ is

d(β̂) = a⁻¹(β̂) b(β̂) a⁻¹(β̂),    (13)

where b(β̂) is the middle ('meat') term of the sandwich, so that d(β̂) is often referred to as the robust or sandwich variance estimator. Hence, the robust Wald statistic corresponding to the null hypothesis H0 : β = 0 is

W = β̂² / d(β̂).    (14)

4.1.2 Tree growing

To grow a tree, the robust Wald statistic is evaluated for every possible binary split of the predictor space Z. The split, s, could be of several forms: splits on a single covariate, splits on linear combinations of predictors, and Boolean combinations of splits. The simplest form of split relates to only one covariate, where the form of the split depends on whether the covariate is ordered or nominal.

The “best split” is defined to be the one corresponding to the maximum robust Wald statistic. Subsequently the data are divided into two groups according to the best split.
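A sketch of how such a splitting statistic might be computed in Python with the lifelines package is given below. It assumes that lifelines' CoxPHFitter accepts strata (for the distinguishable baseline hazards), cluster_col and robust (for a sandwich variance clustered on subjects), and that the data are stacked in long format with one row per subject and failure type; all column names are mine, and this is only an illustration of the idea, not the authors' implementation.

    # Sketch: robust Wald statistic of a Wei-et-al.-style marginal Cox model
    # for the candidate split I(z < cutpoint). Columns assumed in df:
    #   id (subject), ftype (failure type), time, status (0/1), z (covariate).
    import pandas as pd
    from lifelines import CoxPHFitter

    def robust_wald_for_split(df, cutpoint):
        d = df.assign(split=(df["z"] < cutpoint).astype(int))
        cph = CoxPHFitter()
        cph.fit(d[["time", "status", "split", "ftype", "id"]],
                duration_col="time", event_col="status",
                strata=["ftype"],        # separate baseline hazard per failure type
                cluster_col="id",        # sandwich variance clustered on subject
                robust=True)
        coef = cph.params_["split"]
        se = cph.standard_errors_["split"]   # robust standard error
        return (coef / se) ** 2              # robust Wald statistic for H0: beta = 0

    def best_split(df, cutpoints):
        stats = {c: robust_wald_for_split(df, c) for c in cutpoints}
        best = max(stats, key=stats.get)
        return best, stats[best]

In practice the candidate cutpoints would be the distinct observed values of the covariate, and the search is repeated for every covariate in the node.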

Apply this splitting scheme recursively to the learning sample until the predictor space is partitioned into many regions. There will be no further partition to a node when any of the following occurs:

  1. The node contains less than, say 10 or 20, subjects, if the overall sample size is large enough to permit this. We suggest using a larger minimum node size than used in CART where the default value is 5;
  2. All the observed times in the subset are censored, which results in unavailability of the robust Wald statistic for any split;
  3. All the subjects have identical covariate vectors. Or the node has only complete observations with identical survival times. In these situations, the node is considered as 'pure'.

The whole procedure results in a large tree, which could be used for the purpose of data structure exploration.

4.1.3 Tree pruning

Let T denote either a particular tree or the set of all its nodes. Let S and T̃ denote the set of internal nodes and the set of terminal nodes of T, respectively; therefore, T = S ∪ T̃. Also let |·| denote the number of nodes. Let G(h) represent the maximum robust Wald statistic on a particular (internal) node h. In order to measure the performance of a tree, a split-complexity measure Gα(T) is introduced as in LeBlanc and Crowley (1993). That is,

Gα(T) = G(T) − α|S|,  with G(T) = Σ_{h ∈ S} G(h),    (15)

where the number of internal nodes, |S|, measures complexity; G(T) measures the goodness of split in T; and the complexity parameter α acts as a penalty for each additional split.

Start with the large tree T0 obtained from the splitting procedure. For any internal node h of T0, i.e. h ∈ S0, a function g(h) is defined as

g(h) = G(Th) / |Sh|,    (16)

where Th denotes the branch with h as its root and Sh is the set of all internal nodes of Th. Then the 'weakest link' h̃1 in T0 is the node such that

g(h̃1) = min_{h ∈ S0} g(h).    (17)

Let α1 = g(h̃1) and T1 be the subtree obtained after pruning off the branch Th̃1. In addition, let α2 = g(h̃2) and T2 be the tree obtained after pruning off the branch Th̃2 from T1. Repeating this procedure leads to a nested sequence of subtrees TM ⊂ … ⊂ T1 ⊂ T0, where TM is the root node, together with the sequence of complexity values 0 = α0 < α1 < … < αM < αM+1 = ∞.

It is important to note that the αm form an increasing sequence. For any α such that αm ≤ α < αm+1, and in particular for the geometric mean αm′ = (αm αm+1)^{1/2}, it follows that T(α) = T(αm) = T(αm′) = Tm. This implies that we can obtain the best pruned subtree for any penalty α from the pruning algorithm.

4.1.4 The best-sized tree based on test sample

Now we need to select one or several appropriately sized trees from the nested sequence. Several methods have been suggested for this purpose, one of them is test sample method.

When the sample size is large enough, the test sample is the preferred method of determining the right tree size. To do this, the whole sample is divided into two parts: a learning sample L1 and a test sample L2. Usually, the proportion is 2:1.

A large tree T0 is grown and pruned to obtain a nested sequence of subtrees using the learning sample L1. Then the test sample L2 is sent down the large tree T0 and the splitting statistics, G(h), are recalculated for each internal node, h ∈ S, using the validation sample. The tree that maximizes the split-complexity measure is chosen as the best-sized subtree, where a constant penalty αc is applied for each split. It has been suggested by LeBlanc and Crowley (1993) that αc be typically chosen such that 2 ≤ αc ≤ 4, where αc = 2 is in the spirit of the AIC criterion and αc = 4 corresponds roughly to the 0.05 significance level for a single split based on a χ²(1) test.
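A small sketch of this selection step is given below (Python; names and numbers are invented), assuming the per-node statistics G(h) have already been recomputed on the test sample for every subtree in the nested sequence.

    # Pick the subtree maximizing the validated split-complexity G(T) - a_c * |S|.
    def best_sized_tree(subtree_stats, penalty=3.0):
        """subtree_stats: list of (label, g_values), where g_values holds the
        test-sample statistic G(h) for every internal node of that subtree."""
        best_label, best_score = None, float("-inf")
        for label, g_values in subtree_stats:
            score = sum(g_values) - penalty * len(g_values)
            if score > best_score:
                best_label, best_score = label, score
        return best_label, best_score

    # Hypothetical nested sequence T0, T1, T2 (T2 has the fewest internal nodes).
    sequence = [("T0", [12.3, 6.1, 2.0, 0.8]),
                ("T1", [12.3, 6.1, 2.0]),
                ("T2", [12.3, 6.1])]
    print(best_sized_tree(sequence, penalty=3.0))   # -> ('T2', 12.4)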

Finally, a marginal Kaplan-Meier survival curve is prepared separately (i.e., for each type of failure) for all groups resulting from the best-sized tree. For example, if the best-sized tree classifies patients into three groups, then we prepare three corresponding marginal Kaplan-Meier survival curves for each type of failure.

4.1.5 Application: Bladder cancer data

In this section, we shall illustrate the proposed methods using the well-known bladder tumour cancer data reported by Byar (1980). The data were from a randomized clinical trial conducted by the Veterans Administration Co-operative Urological Group between 1971 and 1976 and consisted of 117 patients with superficial bladder tumours. The tumors were removed transurethrally, and patients were then randomly assigned to one of three treatments: placebo, pyridoxine (Vitamin B6), or intravesical thiotepa (triethylenethiophosphoramide). Thiotepa is a member of the class of alkylating agents, which were among the first anticancer drugs used. Alkylating agents are highly reactive and bind to certain chemical groups found in nucleic acids. These compounds inhibit proper synthesis of DNA and RNA, which leads to apoptosis or cell death. However, since alkylating agents cannot discriminate between cancerous and normal cells, both types of cells will be affected by this therapy. For example, normal cells can become cancerous due to alkylating agents. Thus, thiotepa is a highly cytotoxic compound and can potentially have adverse effects. Consequently, the effects of thiotepa on cancer recurrence and death are not obvious (Ghosh & Lin, 2000).

Treatment was aimed at preventing bladder cancer recurrence following the removal of superficial bladder tumours. Patients were examined every 3 months for recurrence of tumour and any new tumors were removed. We used the version of data presented in original paper by Wei et al. (1989) which is only available for the placebo and the thiotepa groups. There were 38 patients in the thiotepa group and 48 placebo patients. The outcome variable was number of months to the event since last tumour occurrence. Patients are censored when they die, immediately after their fourth event or when the end of the study is reached. Besides the treatment the number of initial tumours and diameter of the largest initial tumour were also recorded for each patient. Particularly, the number of initial tumours ranged from 1 to 8 with the respective counts of patients equal to 50, 11, 10, 4, 5, 2, 0, and 3.

The root node was split by the number of initial tumours, with the best cutoff being fewer than five versus at least five initial tumours (robust Wald statistic = 9.99). The subgroup with at least five initial tumours formed terminal node 3. On the opposite side of the tree, the subgroup with fewer than five initial tumours was next split by the number of initial tumours again (best cutoff < 4 vs. ≥ 4; robust Wald statistic = 275.22). Neither of these subgroups was further split, and they formed terminal nodes 4 and 5 in the tree. Hence, the best-sized tree formed three groups of patients.

Prognostic grouping of the patients was based on the terminal nodes in Figure 2. We chose the survival probability to first and second recurrence, represented by marginal Kaplan-Meier curves, as the measure of prognosis, and ranked each terminal node in the tree according to that measure. The survival probability of first and second recurrence for the three groups of patients is presented in Figure 3. Among these three groups, patients with at least five initial tumours have the poorest prognosis, that is, they are more likely to develop a recurrence. Patients with at most three initial tumours have the best prognosis, because they are less likely to develop a recurrence.

4.2 Decision tree for multivariate survival data based on frailty model

In contrast to marginal models, frailty models directly account for and estimate the within-subject correlation. A parameter estimate of the within-subject propensity for events is obtained directly.

The random effects approach to frailty models involves the assumption that there are unknown factors within a subject causing similarity (homogeneity) in failure times within the subject and thus differences (heterogeneity) in failure times between different subjects. The reason such factors are referred to as unknown is that if they were known to the investigator, they could be included in the analysis, resulting in independence within a subject. Frailty modeling (so called because it examines the tendency for failures/recurrences within a subject to occur at similar times, i.e., to experience similar frailties) involves specification of independence within a subject, conditional on an unobservable frailty vi. This frailty for the ith subject is incorporated (conditionally) into the proportional hazard function previously examined as follows

λik(t | vi) = vi λ0(t) exp(β′Zik),    (18)

where the number of failures for the ith subject may differ, i.e. k = 1,…,Ki.

4.2.1 The splitting statistic and tree growing

In this method, the splitting rule is defined as the Wald test statistic evaluating the covariate effect based on the frailty model. Thus, we create the tree by maximizing the between-node separation.

To split a node, a splitting function needs to be evaluated at every cutoff point. We consider the following model:

λik(t | vi) = vi λ0(t) exp{β I(Zik < c)}.    (19)

The parameter β corresponds to the effect of separation between the two daughter nodes induced by cutoff point c and will be estimated by the penalized likelihood method. The penalized likelihood is developed by re-parameterizing model (19) as follows (Therneau & Grambsch, 2000):

λk(t) = λ0(t) exp{β I(Zk < c) + Xk γ},  k = 1, …, N,    (20)

where γ = log(v) = {γ1, γ2,…,γn} is the vector of parameters for the re-parameterized frailty, and X is the corresponding design matrix, an N by n matrix with N = Σ_{i=1}^{n} Ki the total number of failures, such that Xkl = 1 when the kth failure belongs to the lth subject and Xkl = 0 otherwise, with k = 1,…, N and l = 1, 2,…, n; Xk denotes the kth row of X. With this parameterization, model (20) has a similar structure to the classical Cox model. Let θ be the index parameter of the frailty distribution in the node. Then, the penalized log-likelihood can be expressed as

ℓp(β, γ) = PL(β, γ; data) − g(γ; θ),    (21)

where PL(β, γ; data) is the partial likelihood and g(γ; θ) is a penalty function. Specifically, PL(β, γ; data) is the usual partial likelihood for β and γ

 

(22)

where Xi is the ith row of the design matrix, corresponding to the ith recurrence in the node, Wi(t) is an indicator variable such that Wi(t) = 1 when the item is at risk at time ti and 0 otherwise, and R(ti) is the risk set at time ti. The second term g(γ; θ) is a penalty function of γ. It has been shown (Therneau & Grambsch, 2000) that the following penalty function gives exactly the same solution as the EM algorithm (Klein, 1992) when the frailty has a gamma distribution indexed by parameter θ,

 

(23)

The parameters β and γ can be estimated by solving the score functions, and the parameter θ is estimated by maximizing the profile log-likelihood (Therneau & Grambsch, 2000). In turn, following Gray (1992), the variance-covariance matrix is V(β, γ) = H⁻¹ and

 

(24)

where A is the second derivative matrix of the usual partial likelihood with respect to β and γ, and g'' is the second derivative of the penalty function with respect to γ. The first diagonal element of H⁻¹ corresponds to the variance of β̂ (Therneau & Grambsch, 2000; Gray, 1992). Compared to the EM algorithm, the penalized likelihood method is computationally faster and has been incorporated into standard statistical packages.

The splitting function is defined as the Wald statistic W(c) = β̂² / v̂(β̂), where β̂ and v̂(β̂) are the penalized likelihood estimator and its estimated variance. In summary, when a tree is constructed, a conditional proportional hazard structure given the frailty is assumed within each node. The splitting function is evaluated at each allowable split, and the best cutoff point c* is chosen corresponding to the maximum Wald statistic. This process is applied recursively until none of the nodes can be split further; that is, until the covariate space within each node becomes homogeneous or only a few failures are left within the node. Consequently, the growing procedure leads to a large initial tree, denoted by T0.

4.2.2 Tree pruning and the best-sized tree selection based on test sample

The data in a node are sent to one of two child nodes according to the selected split point or split set. This procedure is repeated until a certain criterion is met. A stopping rule might not detect significant splits which occur at later nodes. To avoid this possibility, a pruning technique can be applied to eliminate insignificant nodes after splitting as many times as possible, until each node has fewer than a pre-specified number of cases.

The prediction error of the maximal tree is usually larger than that of a parsimonious tree when estimated on an independent sample. Thus, some nodes need to be pruned. We adopt the split-complexity pruning technique of LeBlanc & Crowley (1993), defining the Wald statistic as the goodness of split of a tree. The approach is to find a tree T maximizing the split-complexity as follows:

1. Given a complexity parameter α ≥ 0 and a tree T, define the split-complexity function Gα(T) as Gα(T) = G(T) − α|S|, where G(T) = Σ_{h ∈ S} G(h) is the sum of the maximum Wald statistics over the internal nodes of T and |S| is the number of internal nodes in T.

2. For each α ≥ 0, there exists a tree maximizing the split-complexity. If α = 0, the tree maximizing the split-complexity is the maximal tree. The larger the α, the smaller the tree maximizing the split-complexity.

3. Let h be any node and Th be the branch of a tree T with root node h; then we define a function g(h) = G(Th)/|Sh|, where Sh is the set of all internal nodes of Th.

4. Prune the branch at the node h̃ for which g(h̃) = min_{h ∈ S} g(h).

5. Repeat the above steps to obtain a nested sequence of trees TM ⊂ … ⊂ T1 ⊂ T0, where T0 is the maximal tree, Tm is a tree with some branches of Tm−1 pruned, and TM is the trivial tree with only the root node.

The final tree is selected by evaluating the sequence of maximal split-complexity trees. However, the goodness of split of the trees is over-estimated by the learning sample. That is, the estimation of the goodness of split by the learning sample is too optimistic. Thus, if an independent test sample exists, the final tree can be selected as follows:

1. Let T0 ⊃ T1 ⊃ … ⊃ TM be the pruned sequence of trees obtained from a learning sample L1. Pour a test sample L2 down each tree.

2. Estimate the split-complexity G(Tm) of each tree Tm with the test sample L2.

3. Select the tree with the largest value of G(Tm).

4.2.3 Application: Chronic granulomatous disease (CGD) data

In this section, we illustrate the DT for multivariate survival data based on the frailty model by analyzing the CGD data in Therneau & Grambsch (2000). CGD is a heterogeneous group of uncommon inherited disorders characterized by recurrent pyogenic infections that usually begin early in life and may lead to death in childhood. Interferon gamma is a principal macrophage-activating factor shown to partially correct the metabolic defect in phagocytes, and for this reason it was hypothesized that it would reduce the frequency of serious infections in patients with CGD. In 1986, Genentech Inc. conducted a randomized, double-blind, placebo-controlled trial in which patients received humanized interferon gamma (rIFN-γ) or placebo three times daily for a year. The primary endpoint of the study was the time to the first serious infection. However, data were collected on all serious infections until cessation of follow-up, which occurred near day 400 for most patients. Thirty patients in the placebo group had at least one serious infection. The total number of infections was 56 and 20 in the placebo and treatment groups, respectively. A question of interest is whether distinct groups of patients exist based on their recurrent infections.

Covariates include the enrolling hospital and randomization data, age, height, weight, sex, use of antibiotics or corticosteroids at the time of enrolment, and the pattern of inheritance. The data set had 203 observations on 128 subjects.

Next, we applied the proposed DT method to these data. We first developed a tree using all the data; the best-sized tree with seven terminal nodes and its corresponding Kaplan-Meier survival curves are presented in Figures 4 and 5, respectively. The method first splits on whether treatment is rIFN-γ or placebo. Then age (< 5.5 versus ≥ 5.5 years) and height (< 132.9 versus ≥ 132.9 cm) are chosen as the best partitions, respectively. After 6 partitions, a best-sized tree with 7 terminal nodes was developed, where circles and squares represent internal nodes and terminal nodes respectively, and the value χ² denotes the splitting function for each partition. The final tree with seven terminal nodes leads to seven risk groups: placebo with age < 5.5 years (node 4), rIFN-γ with height < 132.9 cm (node 6), placebo with age ≥ 5.5 years and height < 158.5 cm (node 10), rIFN-γ with 132.9 ≤ height < 148 cm (node 14), rIFN-γ with height ≥ 148 cm (node 15), placebo with 5.5 ≤ age < 20.5 years and height ≥ 158.5 cm (node 22), and placebo with age ≥ 20.5 years and height ≥ 158.5 cm (node 23).

The second DT was constructed by using the test-sample method. We built the tree using a learning sample L1 comprising two thirds of the data and validated it using a test sample L2 comprising the remaining third. First, the whole data set is divided into two parts: the learning sample L1 and the test sample L2. Then the initial tree T0 is grown and pruned using L1. At the stage of tree size determination, the goodness-of-split measure G(Tm) was recalculated, or validated, using the test sample L2. The subtree that maximizes the validated Gα is the best-sized tree. The best-sized tree had 2 terminal nodes, forming two risk groups of patients (Figure 6). Since the two risk groups were characterized by treatment, it is clear from Figure 7 that the rIFN-γ patients had a lower risk of first recurrence than the placebo group, even though the Kaplan-Meier curves for the second recurrence cross each other.

5. Decision Tree for Competing Risks Data

Competing risks arise in many applications in biomedical sciences. In the competing risks framework subjects may fail due to one of the J possible causes. A competing risk can be defined as an event, whose occurrence precludes the occurrence of other events under examination. The classic example of competing risks is competing causes of death, such as cancer, heart disease, AIDS, etc. In cancer studies, common competing risks are relapse and death in remission (or treatment related mortality). These two risks represent the two reasons a patient may fail the therapy, namely a recurrence of the disease (relapse) or a death due to the toxicity of the therapy itself.

In many situations, the focus is on the development of groups of individuals which have different survival properties. In this section, we develop a method to create groups of individuals with competing risks data based on the covariate values. The method extends the classification and regression trees (CART) algorithm to competing risks data by modeling the subdistribution hazard, i.e. the hazard function associated with the CIF.

The “single event” and “composite event” decision tree types refer to whether or not the device used to create each split discriminates based on one event only, or the differences among all the events jointly. When survival data involve competing risks, under the “single event” categories, we will build up the tree by focusing on the event of interest and treating other competing risks as nuisance events that have to be adjusted for. Under the “composite event” categories, we propose to build up the decision tree by accounting for all events simultaneously.

In section 5.1 we discuss between-node univariate decision trees for competing risks based on the difference between CIFs, as proposed by Ibrahim et al. (2008). The tree addresses the problem of identifying risk subgroups when there is one competing risk event of primary interest. In section 5.2 we develop a decision tree for competing risks based on maximizing the between-node difference of the CIF with regard to more than one competing risk event. This decision tree is designed to address the problem of identifying subgroups when multiple competing risk events are of interest to the researcher.

5.1 “Single event” decision tree for competing risks data

In the development of the decision tree for competing risks, we use the deviance as the measure of between-node difference. The deviance is derived from the likelihood ratio test statistic of the proportional subdistribution hazards model developed by Fine & Gray (1999). Therefore, only internal nodes have associated deviance statistics. The tree structure is different from CART in that, for the original tree, each node, either terminal or internal, has an associated impurity measure. This is why the CART pruning procedure is not directly applicable to this type of tree. However, Segal's pruning algorithm (Segal, 1988), which exerts little computational burden, has resulted in trees that have become well-developed tools.

We consider a typical setting for competing risks survival data. Suppose that there are n individuals, each subject to J (J ≥ 2) types of event. Let Yi be the time at which the ith unit experiences an event of the jth type, and let Ci be the corresponding censoring time, where j = 1, 2, …, J; i = 1, 2, …, n. The sample consists of the set of vectors {(Ti, δi, Zi): i = 1, 2, ..., n}. Here, Ti = min(Yi, Ci) are the observed failure times; δi = Di I(Yi < Ci), where I(·) is the indicator function and Di ∈ {1, 2, …, J}; Zi ∈ Rp denotes the covariate vector for the ith unit. Since recursive partitioning handles covariates one by one, we assume p = 1 for ease of illustration. In order to ensure identifiability, we also assume that the failure time Yi is independent of the censoring time Ci conditional on the covariate Zi, for any i = 1, …, n.

In the subdistribution approach by Fine & Gray (1999), the subdistribution hazard for each type of failure is formulated with the proportional hazards model. Since we only consider splitting on a single covariate, the subdistribution hazard function is assumed to take the following form:

λj(t | Zi) = λj0(t) exp{β I(Zi < c)},    (25)

where λj0(t) is an unspecified baseline subdistribution hazard function and β = β(c) is an unknown regression parameter corresponding to cutpoint c. We assume that there is a change-point effect of Zi on the subdistribution hazard function with cutpoint c.

When there is no censoring, β can be estimated in exactly the same way as in the Cox model for right-censored data using a modified risk set. Here the risk set, R(t), at time t consists of all individuals yet to experience any event plus all those individuals who experienced an event of a type other than the jth at a time prior to t. The risk set leads to a partial likelihood:

L(β) = Π_{i: δi = j} [ exp{β I(Zi < c)} / Σ_{l ∈ R(Ti)} exp{β I(Zl < c)} ].    (26)

The log partial likelihood is

ℓ(β) = Σ_{i: δi = j} [ β I(Zi < c) − log Σ_{l ∈ R(Ti)} exp{β I(Zl < c)} ].    (27)

For the case where the competing risks data contain censored observations, the score function is constructed using weights derived from the inverse probability of censoring weighting technique. The value of β that solves the score function is the desired estimator.

Given the estimated β̂, the deviance is defined as D = −2ℓ(β̂). This is a summary measure of agreement between the model and the data, where a smaller value corresponds to better goodness of fit (Collett, 1994).

The splitting function is defined in terms of the deviance of the model fitted at each candidate cutpoint. This statistic can be derived from the likelihood ratio test of the significance of β, where β̂ is its maximum likelihood estimator.

In summary, when a tree is constructed, a proportional subdistribution hazard structure is assumed within each node. The splitting function is evaluated at each allowable split, and the best cutpoint c* is chosen as the one with the smallest deviance (equivalently, the largest likelihood ratio statistic). This process is applied recursively until none of the nodes can be further split.
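The sketch below illustrates the deviance splitting statistic in Python. It relies on the simplification already noted above: when there is no censoring, the Fine-Gray subdistribution risk set can be mimicked by pushing the times of subjects who fail from a competing cause beyond the end of follow-up and coding them as non-events, after which an ordinary Cox fit of the split indicator can be used; with censoring, the IPCW-weighted score equations described above are required. The column names and the helper are mine, not the authors' code.

    # Sketch: likelihood-ratio (deviance-based) splitting statistic for the event
    # of interest under a subdistribution hazard model, assuming NO censoring.
    # Columns assumed in df: time, status (0 = censored, 1..J = event type), z.
    from lifelines import CoxPHFitter

    def subdist_split_statistic(df, cutpoint, event_of_interest=1):
        d = df.copy()
        horizon = d["time"].max() + 1.0
        competing = (d["status"] != 0) & (d["status"] != event_of_interest)
        d.loc[competing, "time"] = horizon     # competing failures stay "at risk"
        d["event"] = (d["status"] == event_of_interest).astype(int)
        d["split"] = (d["z"] < cutpoint).astype(int)

        cph = CoxPHFitter()
        cph.fit(d[["time", "event", "split"]], duration_col="time", event_col="event")
        # Likelihood-ratio statistic of the split against the null model; choosing
        # the cutpoint with minimum deviance is equivalent to maximizing this value.
        return cph.log_likelihood_ratio_test().test_statistic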

5.1.1 Algorithm to Grow Tree

To grow a tree, the deviance statistic is evaluated for every possible binary split of the predictor space Z. The split, s, could be of several forms: splits on a single covariate, splits on linear combinations of predictors, and boolean combination of splits. The simplest form, in which each split relates to only one covariate, can be described as follows:

1. If Zk is ordered, then the data will be split into two groups specified by {Zk < c} and {Zk ≥ c} respectively;

2. If Zk is nominal, then any subset of the possible nominal values could induce a split.

The "best split" is defined to be the one corresponding to the minimum deviance statistic. Subsequently the data are partitioned into two groups according to the best split. The partitioning continues until a criterion met. Some DT methods use a direct stopping rule that stops splitting if a further partitioned does not improve some criterion. However, this may miss finding a good split later on. Other methods grow a large tree, and then prune it back to a smaller size.

5.1.2 Algorithm to Prune Tree

The idea of pruning is to iteratively cut off branches of the initial tree, T0, in order to locate a limited number of candidate subtrees from which an optimally sized tree is selected. For our method, we adopt Segal's pruning algorithm (Segal, 1988), which exerts little computational burden. The steps of this algorithm are as follows:

1. Initially grow a large tree.

2. To each of the internal nodes in the original tree, assign the maximal splitting statistic contained in the corresponding branch. This statistic reflects the strength of the link of the branch to the tree.

3. Among all these internal nodes, find the one with the smallest statistic. That is, find the branch that has the weakest link and then prune off this branch from the tree.

4. The second pruned tree can be obtained in a similar manner by applying the above two steps to the first pruned tree.

5. Repeat this process until the pruned tree contains only the root node; finally, a sequence of nested trees is obtained.

The desired tree can be obtained by plotting the size of these trees against their weakest linking statistics. The tree corresponding to the “kink” point in the curve is chosen as the best one.

5.1.3 Application: Breast cancer data

As an application of competing risks DT, we used breast cancer data from Fyles et al. (2004). Between December 1992 and June 2000, 639 women 50 years of age or older who had undergone breast-conserving surgery for an invasive adenocarcinoma 5 cm or less in diameter (pathological stage T1 or T2) were randomly assigned to receive breast irradiation plus tamoxifen, RT+Tam, (319 women) or tamoxifen alone, Tam, (320 women). Participating centers included the Princess Margaret Hospital, the Women's College Campus of the Sunnybrook and Women's College Health Science Centre in Toronto, and the British Columbia Cancer Agency centers in Vancouver and Victoria. Table 2 contains the list of variables and their descriptions.

The events that might occur in breast cancer study are relapse, second malignancy and death. Here we chose relapse as the event of interest. The patient's survival time was the time length between the date of randomization and the occurrence of one event or last follow-up date.

Variable name   Description

tx              Randomized treatment: 0 = tamoxifen, 1 = radiation + tamoxifen

Variables assessed at the time of randomization

pathsize        Size of the tumour (cm)
hist            Histology: 1 = ductal, 2 = lobular, 3 = medullary, 4 = mixed, 5 = other
hrlevel         Hormone receptor level: 0 = negative, 1 = positive
hgb             Haemoglobin (g/l)
nodedis         Whether axillary node dissection was done: 0 = Yes, 1 = No
age             Age (years)

Outcome variables

time            Time from randomization to event (relapse, second malignancy or death) or last follow-up (years)
d               Status at last follow-up: 0 = censored, 1 = relapse, 2 = malignancy, 3 = death

Table 2. Description of variables in the breast cancer study

Since the goal of regression tree is to partition patients into groups on the basis of similarity of their responses to treatment, we constructed a separate regression tree for each treatment (tamoxifen alone and tamoxifen plus radiation). The partitioning is based on baseline characteristics such as patient demographics and clinical measurements. The final tree structure provides treatment effect within each group of patients. The question to be answered by this type of analysis is - for whom does treatment work best?

The partitioning exploration is employed to find groups of patients for each treatment by using the decision tree. With respect to the cumulative incidence of relapse, we obtained four groups of patients treated by tamoxifen alone, and three groups of patients treated by tamoxifen plus radiation (Figure 8).

The description of the four groups of patients treated by tamoxifen alone is:

1. Node 2: tumour size < 1.3 cm

2. Node 4: tumour size ≥ 1.3 cm and age < 59.5 years

3. Node 6: tumour size ≥ 1.3 cm and age ≥ 59.5 years and haemoglobin < 139 g/l

4. Node 7: tumour size ≥ 1.3 cm and age ≥ 59.5 years and haemoglobin ≥ 139 g/l

This group formation reveals that node 2 has the lowest cumulative incidence of relapse up to about 8 years, whereas node 4 has the highest cumulative incidence (Figure 9).

For patients with tamoxifen plus radiation, there were three groups resulting from the decision tree, namely:

1. Node 2: tumour size < 2 cm

2. Node 4: tumour size ≥ 2 cm and hormone receptor level negative

3. Node 5: tumour size ≥ 2 cm and hormone receptor level positive

Women with tamoxifen plus radiation whose tumour size is less than 2 cm have the lowest probability of relapse, whereas the highest probability is for those whose tumour size is ≥ 2 cm with a negative hormone receptor level (Figure 10).

Overall comparison of both treatments reveals that the poorest and the best prognoses both come from the tamoxifen plus radiation group. We found that tamoxifen plus radiation is not effective for those women whose tumour size is greater than 2 cm and whose hormone receptor level is negative. This group is more likely to relapse compared to the others. On the other hand, patients with tamoxifen plus radiation and tumour size less than 2 cm have the best prognosis, because they are less likely to relapse (Figure 11).

5.2 “Composite event” decision tree for competing risks data

In this section we introduce a DT for data with competing risks for the situation where more than one competing risk is of interest to the researcher. This method uses a between-node separation approach via a composite measure of the risk of two or more events, based on the multivariate test of CIFs (Luo & Turnbull, 1999). In sections 5.2.1 and 5.2.2, we outline the methods for splitting, growing, pruning, and selecting the final tree. In Section 5.2.3, we illustrate the techniques by analyzing contraceptive discontinuation data.

5.2.1 The growing stage

We begin by selecting a splitting function based on event-specific CIFs. The CIF is a marginal probability of failure for the event of interest. Instead of the overall survival function, the CIF is used to summarize data with competing risks. After splitting on the covariate, we use a method based on one proposed by Luo & Turnbull (1999) to compare the CIFs for events 1 and 2 in group 1 with the CIFs for events 1 and 2 in group 2. In this method, Luo & Turnbull extend the rank-based tests for data with a single event (e.g., the log-rank test, the generalized Wilcoxon test, and the Fleming-Harrington test) to data with multiple competing risks of interest. For our method, we repeat the process of splitting and comparing CIFs until we find the split whose CIFs show the maximum between-node heterogeneity.
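The actual between-node statistic is the multivariate rank test of Luo & Turnbull (1999); as a simpler illustration of the ingredients, the sketch below estimates the event-specific CIFs of the two candidate child nodes with the Aalen-Johansen estimator in lifelines (the column names are mine, and the availability of AalenJohansenFitter is assumed).

    # Sketch: event-specific CIFs for the two child nodes induced by I(z < cutpoint).
    # Columns assumed in df: time, status (0 = censored, 1..J = event type), z.
    from lifelines import AalenJohansenFitter

    def child_node_cifs(df, cutpoint, events=(1, 2)):
        cifs = {}
        for name, child in (("left", df[df["z"] < cutpoint]),
                            ("right", df[df["z"] >= cutpoint])):
            for j in events:
                ajf = AalenJohansenFitter()
                ajf.fit(child["time"], child["status"], event_of_interest=j)
                cifs[(name, j)] = ajf.cumulative_density_   # estimated CIF over time
        return cifs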

For any split, we let k = 1, 2 denote the two groups. Recall that in the competing risk setting, “failures” or “events” are categorized into several mutually exclusive types. Suppose there are j = 1, ..., J different event types. Without loss of generality, we assume that there are two events of interest, J = 2, and censoring, but the following can be generalized to three or more events. Let Yik and Cik be the true failure time and censoring time, respectively, for the ith individual (i = 1, ... , n) in group k. Let Dik ∈ {1, ... , J} be the indicator of the true event type. For each individual i in a set of data, we observe (Tik, δik), where Tik = min(Yik,Cik) and δik = Dik I(Tik ≤ Cik). We assume that observations from different individuals within a group are independent and identically distributed, but we assume that the underlying competing risk processes for each individual may be dependent. We also assume that the censoring time Cik is independent of the failure time Yik for any event j.

For event j and group k, the CIF is defined as Fjk(t) = P(Yik ≤ t, δik = j). For each possible split of a covariate for a node, we need to test whether the CIFs are equal for each event j, that is, we need to test

H0 : Fj1(t) = Fj2(t) for all t and for each j = 1, …, J.    (28)

Fjk(t) has sub-density fjk(t). We also define Sk(t) = 1 − Σ_{j=1}^{J} Fjk(t) as the survival function for individuals in group k. Let nk be the sample size of the kth sample. The jth component of the multivariate rank-based test statistic is

 

(29)

where

 

 

 

 

and Ŵj(s) is a weight function that converges to some weight, Wj(s), in probability. We consider weights in the same class as those proposed by Gray (1988),

 

(30)

where F̂j is the estimated CIF for event j when data from group 1 and group 2 are combined. The parameter r controls the assignment of the weights with respect to time. When r = −1, more weight is given to early differences.


