9.4.4. Cox Regression
Apart from time and status variables, data for Survival Analysis often contain measurements on one or more continuous variables, such as temperature, dosage, age or one or more categorical variables such as gender, region, treatment. In such cases it is desirable to construct Life Tables (or survival functions) which reflect the effects of these continuous or categorical variables (which are also called covariates).
A method which combines the elements of nonparametric Life Table analysis and the parametric Regression Analysis was introduced by D R Cox in 1972. It is known as the Cox Regression or Cox’s proportional hazards model. The latter reflects a fundamental assumption of this model, namely that the hazard function of an individual in one group is proportional to the hazard function of another in another group at any time period. In graphical terms, this is equivalent to assuming that the hazard curves of different groups do not cross each other.
Let X be a matrix of p independent variables (covariates) for n cases. Then the hazard function for the j^{th} case can be written as:
where:
· h_{0}(t) is the baseline hazard function and
· is the vector of coefficients.
Rearranging the terms we obtain:
Although the righthand side of this equation is linear, the existence of another unknown on the lefthand side, h_{0}(t), makes it impossible to solve it using the Linear Regression method. Instead, a maximum likelihood solution for a loglikelihood function that is derived from the above equation is found:
where R(t_{j}) is the set of cases at risk at the start of time j and d_{j} is the number of cases that terminate at time j.
The maximum likelihood solution of this proportional hazards model is computed employing a modified NewtonRaphson minimisation algorithm. The nature of this method implies that a solution (convergence) cannot always be achieved. In such cases, you are advised to edit the convergence parameters provided, in order to find the right levels for the particular problem at hand. Three convergence parameters can be edited in a dialogue that pops up just after the Variable Selection Dialogue (see 9.4.4.2. Cox Regression Intermediate Inputs).
Stratified Analysis: It may be the case that with some data sets the assumption of proportionality of individual hazard functions cannot be maintained. Under such circumstances it may be possible to define some subgroups within which the hazard functions can assumed to be proportional. This type of analysis is called a stratified analysis and the subgroups within which the proportionality assumption is maintained are called the strata.
The strata are defined by the levels of a factor column, which can be selected from the Variable Selection Dialogue. The program then assumes that different strata have different baseline hazard functions, but that they all have the same coefficient estimates for covariates. A maximum of six strata can be fitted simultaneously.
Baseline Functions at Means of Covariates: In order to maintain numerical stability, UNISTAT always centres the covariates by subtracting their respective means first. Most results, including the coefficient estimates, are invariant to centring. However, the baseline functions are different with and without centring and some sources prefer reporting the centred baseline functions while the others the uncentred ones (baseline functions relative to the origin). UNISTAT makes both sets of results available. See 9.4.4.2. Cox Regression Intermediate Inputs below for how to make this selection. The only other result that is affected by centring is the Linear Fit (XBeta) reported under Case (Diagnostic) Statistics (see 9.4.4.3.2. Cox Regression Further Output Options).
Predictions: If a case (row) of data does not contain any missing values except for the time variable, then a prediction of survival times is generated for this case. If at least one such case exists in data, then a fourth option Predicted Survival is added to the Baseline Functions dialogue. An additional column of predicted survival times is output for each predicted case.
9.4.4.1. Cox Regression Variable Selection
Variable: Any number of columns containing numeric data can be selected as independent variables (covariates). However, unlike any other regression procedures, a Cox regression can be run without selecting any independent variables. This is called the initial model (or the null model). UNISTAT reports the 2Log Likelihood value for the initial model for any model fitted.
Interaction: This button is used to create independent variables, which are the products of existing numeric variables. If only one variable is highlighted, then the new independent variable created will be the product of the selected column by itself. If two or more variables are highlighted, then the new term will be the product of these variables. Maximum threeway interactions are allowed. Interactions of dummy variables or lags are not allowed. In order to create interaction terms for dummy variables, create interactions first, and then create dummy variables for them. For further information see 2.1.4. Creating Interaction, Dummy and Lag/Lead Variables.
Dummy: This button is used to create n or n – 1 new independent (dummy or indicator) variables for a factor column containing n levels. Each dummy variable corresponds to a level of the factor column. A case in a dummy column will have the value of 1 if the factor contains the corresponding level in the same row, and 0 otherwise. If the selected variable is an interaction term, then dummy variables will be created for this interaction term. Up to threeway interactions are allowed and short or long string columns can be selected as factors. It is possible to include all n levels in the analysis or to omit the first or the last level to remove the inherent overparameterisation of the model. For further information see 2.1.4. Creating Interaction, Dummy and Lag/Lead Variables.
Full: This button becomes activated when two or more categorical variables are highlighted. Like the [Dummy] button, it is also used to create dummy variables. The only difference is that this button will create all necessary dummy variables and their interactions to specify a complete model. For instance, if two categorical variables are highlighted, then this button will create two sets of dummy variables representing the main effects and a third set representing the interaction term. Maximum threeway interactions are supported.
Time: It is compulsory to select at least one column containing numeric data as the time variable. Two types of data can be selected for Cox Regression; (1) Enter Durations where one column is selected containing the duration of each case and (2) Enter Begin and End Times where one column is selected as the starting time and another column as the termination time. In the latter case, the program subtracts the second column from the first one to obtain the duration of each case. If this results in some non positive values, then they are treated as missing cases (see 3.0.2.4. Time Data).
Censored: This variable is selected to show whether the termination time of a case is not known or the termination event has occurred. If the value in the column is zero then it is assumed that the case has been censored, which means that it has been excluded from the study before the termination event has occurred. Nonzero values (usually one) indicate that the termination event has occurred for this case. If a censor variable is not selected, then it is assumed that termination event has occurred for all cases. The default value of 0 indicating the censored cases can be edited to any other value from the next dialogue (see 9.4.4.2. Cox Regression Intermediate Inputs).
Weight: A column containing continuous numeric data can be selected as a weight variable. In this case, the program will normalise this column so that its sum is equal to the valid number of cases and then multiply each covariate by its square root before running the regression.
Factor: A categorical variable containing a limited number of distinct values can be selected as a factor. Unlike other Survival Analysis procedures though, this variable is not used in selecting subgroups to be included in the analysis. All cases are included in the analysis, but a stratified model is fitted. For more information on stratified analysis see the beginning of section 9.4.4. Cox Regression.
9.4.4.2. Cox Regression Intermediate Inputs
Tolerance: This value is used to control the sensitivity of nonlinear minimisation procedure employed. Under normal circumstances, you do not need to edit this value. If a convergence cannot be achieved, then larger values of this parameter can be tried by removing one or more zeros.
Maximum Number of Iterations: When convergence cannot be achieved with the default value of 100 function evaluations, a higher value can be tried.
Expansion Factor: When the nonlinear minimisation algorithm moves away from a valid range of input parameters, this expansion factor is used to start a new search in a wider area. Under normal circumstances, you do not need to edit this value. If convergence cannot be achieved, then other values of similar (but larger) magnitude can be tried.
Value for Censored Cases: If a column of the spreadsheet has been selected as a censor variable, UNISTAT will consider cases with a zero entry in this column as censored. However, you can specify any value here to indicate which cases are to be treated as censored. Once a change is made here, it will apply to all Survival Analysis procedures throughout the session.
At Mean of Covariates: This box is used to select the type of baseline survival and hazard functions. A zero value results in baseline functions being reported relative to the origin and other values relative to the covariate means.
One other result that is affected by this choice is the linear fit (XBeta) vector reported under the Case (Diagnostic) Statistics option.
Omit Level: This box will appear only when one or more dummy variables have been included in the model from the Variable Selection Dialogue. Three options are available; (0) do not omit any levels, (1) omit the first level and (2) omit the last level. When no levels are omitted, the model will usually be overparameterised (see 2.1.4. Creating Interaction, Dummy and Lag/Lead Variables).
9.4.4.3. Cox Regression Output Options
When the maximum likelihood model has converged or the maximum number of iterations has been reached without convergence, an Output Options Dialogue will provide access to the following options.
When no covariates have been included in the model, only the 2 Log likelihood value will be reported in the Regression Results output. In this case, a Life Table will be displayed with a survival function that is identical to the KaplanMeier estimate of the same model (see 9.4.2. KaplanMeier Analysis).
9.4.4.3.1. Cox Regression Coefficient Output
Regression Results: The main regression output displays a table for coefficients of the estimated regression equation, their standard errors, Wald statistics, probability values and confidence intervals for the significance level specified in the Variable Selection Dialogue. If any independent variables have been omitted due to multicollinearity, they are reported at the end of the table. If you do not wish to display these variables enter the following line in the [Options] section of Documents\Unistat10\Unistat10.ini file:
DispCollin=0
In StandAlone Mode, the estimated regression coefficients can be saved to the data matrix by clicking on the UNISTAT icon situated on the Output Medium Toolbar.
Wald Statistic: This is defined as:
and has a chisquare distribution with one degree of freedom.
Confidence Intervals: The confidence intervals for regression coefficients are computed from:
where each coefficient’s standard error, , is the square root of the diagonal element of covariance matrix.
2 LogLikelihood Initial Model: This is the value when all independent variables are excluded from the model:
2 LogLikelihood Final Model: This is 2 times the value of the log likelihood function when convergence is achieved.
Pseudo Rsquared: In Cox Regression, an rsquared statistic as in the OLS regression is not available. This is because Cox Regression employs an iterative maximum likelihood estimation method. Equivalent statistics to test the goodness of fit have been proposed using the initial (L_{0}) and maximum (L_{1}) likelihood values.
McFadden:
Adjusted McFadden:
Cox & Snell:
Nagelkerke:
Likelihood Ratio: This is a test statistic for the null hypothesis that “all regression coefficients for covariates are zero”. It is equal to 2 times the difference between the initial and final model likelihood values and has a chisquare distribution with k degrees of freedom (the number of independent variables in the model).
If no covariates have been included in the model, then only the 2 Log likelihood value will be reported.
Hazard Ratio:
Hazard ratio is defined as:
Values of hazard ratio above 1 indicate an increased hazard as the values of the corresponding covariate increase at any time period, and vice versa for values below 1.
The standard error of the hazard ratio is found as:
and its confidence intervals as:
which are simply the exponential of the coefficient confidence intervals.
This option is not available when no covariates have been included in the model.
Correlation Matrix of Regression Coefficients: This is a symmetric matrix with unity diagonal elements. The offdiagonal elements give the correlations between the regression coefficients.
This output is not available when no covariates have been included in the model.
Covariance Matrix of Regression Coefficients: This is a symmetric matrix where diagonal elements are the square of parameter standard errors.
This output is not available when no covariates have been included in the model.
9.4.4.3.2. Cox Regression Case Output
Case (Diagnostic) Statistics: Case statistics are useful in determining the influence of individual observations on the overall fit of the model. For further information see 7.2.1.2.2. Linear Regression Case Output.
The values reported here are not sorted according to strata and the time variable. Therefore, when this output is sent to the Data Processor for further analysis, a casebycase correspondence with the original data will be maintained.
Survival Function:
Cumulative Hazard Function:
CoxSnell Residuals:
These are identical to the cumulative hazard function.
Modified CoxSnell Residuals:
where is if the case terminates at time j and if it is censored.
Martingale Residuals:
Deviance Residuals:
Score (Partial) Residuals:
for uncensored cases and missing otherwise for all covariates i = 1,…,p, where:
DeltaBeta:
where is the p x p covariance matrix of regression coefficients and is an n x p matrix, elements of which are defined as:
for j = 1,…,n, i = 1,…, p where is defined as in score residuals above.
Deltabeta is defined as the change in an estimated coefficient when a case is omitted from the analysis. However, unlike in Linear Regression (see 7.2.1.2.2. Linear Regression Case Output), where an exact estimate can be computed without having to run n regressions, the method employed here is an approximation (see Pettitt & Bin Daud, 1989). The user must be aware that different sources and statistical packages may adopt different approximation methods. The best way to find out which method gives the most accurate results is to run a small example and compute each deltabeta by running n regressions, each time omitting one row from the analysis.
Standardised DeltaBeta:
for i = 1,…, p.
This is deltabeta divided by its standard error for each coefficient.
Likelihood Displacement:
where is the m x m covariance matrix of regression coefficients and is an n x m matrix, elements of which are as defined in deltabetas. This statistic reflects the effect of deleting a case on the likelihood function.
Linear Fit (XBeta):
Defined as:
Alongside baseline functions, this is the only result that is affected by whether the baseline functions are reported centred or relative to the origin (see 9.4.4.2. Cox Regression Intermediate Inputs).
Baseline Functions: Baseline survival, hazard and cumulative hazard functions are reported for each stratum, which are ordered according to the time variable in ascending order. Baseline functions reported here would in general be different for a model where the covariates are centred. If there are no predicted cases, then the Baseline Functions dialogue contains three options.
If at least one case (row) of data does not contain any missing values except for the time variable, then a fourth option Predicted Survival is added to the Baseline Functions dialogue.
Baseline Survival:
where is obtained by solving the following equation:
where D(t_{j}) is the set of cases that terminate at time j.
Baseline Hazard:
where is as defined for the baseline survival.
Baseline Cumulative Hazard:
Predicted Survival: This is computed for cases containing no missing values except for the time variable. For each such case, a column of predicted survival times is computed as:
Plot of Baseline Survival Function: The baseline survival function is plotted against time. The appearance of this graph depends on the selection of estimation at covariate means (see 9.4.4.2. Cox Regression Intermediate Inputs).
Plot of Baseline Cumulative Hazard: The baseline cumulative hazard function is plotted against time. The appearance of this graph depends on the selection of estimation method, at origin or at covariate means (see 9.4.4.2. Cox Regression Intermediate Inputs).
9.4.4.4. Cox Regression Examples
Example 1
Data on survival times of patients in a study on multiple myeloma is given in Table 1.3 (p. 9), in Collett, D. (1994).
Open SURVIVAL and select Statistics 2 → Survival Analysis → Cox Regression. From the Variable Selection Dialogue select the data option 1 Enter Durations and Survival time (C10) as [Time], Status (C11) as [Censored] and Age, Sex, BUN, CA, HB, PC and BJ (C12 to C18) as [Variable]s.
The Regression Results output reproduces the coefficient estimates and their standard errors as Collett reports them on Table 3.1 (p. 71):
Cox Regression
Time Variable: Survival time
Censor Variable: Status
Number of Cases Censored: 12 ( 25.0%)
Valid Number of Cases: 48, 0 Omitted
Regression Results

Coefficient 
Standard Error 
Wald statistic 
Probability 
Lower 95% 
Upper 95% 
Age 
0.0194 
0.0279 
0.4806 
0.4882 
0.0741 
0.0354 
Sex 
0.2509 
0.4023 
0.3890 
0.5328 
1.0394 
0.5376 
BUN 
0.0208 
0.0059 
12.3396 
0.0004 
0.0092 
0.0324 
CA 
0.0131 
0.1324 
0.0098 
0.9211 
0.2465 
0.2727 
HB 
0.1352 
0.0689 
3.8537 
0.0496 
0.2703 
0.0002 
PC 
0.0016 
0.0066 
0.0587 
0.8085 
0.0145 
0.0113 
BJ 
0.6404 
0.4267 
2.2529 
0.1334 
1.4767 
0.1959 
2 Log likelihood: 

Initial Model = 
215.9399 
Final Model = 
199.7009 
Pseudo Rsquared: 

McFadden = 
0.0752 
Adjusted McFadden = 
0.0104 
Cox & Snell = 
0.2870 
Nagelkerke = 
0.2903 
Likelihood Ratio Statistic: 

ChiSquare Statistic = 
16.2390 
Degrees of Freedom = 
7 
RightTail Probability = 
0.0230 
Example 2
Survival times of patients classified according to age group and whether or not they have had a nephrectomy is given in Table 3.3 (p. 76), in Collett, D. (1994). Example 3.2 (p. 69).
Open SURVIVAL and select Statistics 2 → Survival Analysis → Cox Regression. From the Variable Selection Dialogue select the data option 1 Enter Durations and Time (C27) as [Time], Status (C28) as [Censored]. Then highlight Age group (C30) and click [Dummy] then highlight Nephrectomy (C29) and click [Dummy] again. This will add two terms to the model: Dummy(C30 Age Group) and Dummy(C29 Nephrectomy). The next dialogue will ask for a number of model parameters. Leave all entries unchanged as suggested by the program, but change the last one Omit Level? from 0 to 1, to omit the first level of each factor from the model.
The Regression Results output reproduces the coefficient estimates and their standard errors as Collett reports them in Example 3.12 (p. 100):
Cox Regression
Time Variable: Time
Censor Variable: Status
Number of Cases Censored: 4 ( 11.1%)
Valid Number of Cases: 36, 0 Omitted
Regression Results

Coefficient 
Standard Error 
Wald Statistic 
Signifi cance 
Lower 95% 
Upper 95% 
Age Group = 2 
0.0125 
0.4246 
0.0009 
0.9765 
0.8197 
0.8447 
3 
1.3416 
0.5918 
5.1396 
0.0234 
0.1817 
2.5014 
Nephrectomy = 2 
1.4115 
0.5152 
7.5044 
0.0062 
2.4213 
0.4016 
2 Log likelihood: 

Initial Model = 
177.6665 
Final Model = 
165.5084 
Pseudo Rsquared: 

McFadden = 
0.0684 
Adjusted McFadden = 
0.0347 
Cox & Snell = 
0.2866 
Nagelkerke = 
0.2887 
Likelihood Ratio Statistic: 

ChiSquare Statistic = 
12.1581 
Degrees of Freedom = 
3 
RightTail Probability = 
0.0069 
Hazard Ratio

Hazard Ratio 
Standard Error 
Lower 95% 
Upper 95% 
Age Group = 2 
1.0126 
0.4299 
0.4406 
2.3273 
3 
3.8250 
2.2635 
1.1993 
12.1996 
Nephrectomy = 2 
0.2438 
0.1256 
0.0888 
0.6692 
Correlation Matrix of Regression Coefficients

Age Group = 2 
3 
Nephrectomy = 2 
Age Group = 2 
1.0000 
0.3119 
0.0767 
3 
0.3119 
1.0000 
0.0818 
Nephrectomy = 2 
0.0767 
0.0818 
1.0000 
The baseline functions are reported as in Collett Table 3.10 (p. 101):
Cox Regression
Baseline Functions
Time 
Baseline Survival 
Baseline Hazard 
Baseline Cumulative Hazard 
5 
0.9502 
0.0498 
0.0510 
6 
0.8516 
0.1038 
0.1607 
8 
0.7552 
0.1132 
0.2808 
9 
0.5761 
0.2371 
0.5514 
10 
0.5339 
0.0733 
0.6275 
12 
0.4861 
0.0896 
0.7214 
14 
0.4335 
0.1082 
0.8359 
15 
0.3831 
0.1163 
0.9595 
17 
0.3326 
0.1318 
1.1008 
18 
0.2379 
0.2848 
1.4360 
21 
0.1939 
0.1851 
1.6406 
26 
0.1198 
0.3822 
2.1222 
35 
0.0920 
0.2319 
2.3860 
36 
0.0512 
0.4430 
2.9712 
38 
0.0369 
0.2792 
3.2985 
48 
0.0259 
0.2993 
3.6543 
52 
0.0114 
0.5598 
4.4748 
56 
0.0070 
0.3823 
4.9565 
68 
0.0041 
0.4207 
5.5024 
72 
0.0022 
0.4673 
6.1323 
84 
0.0009 
0.5975 
7.0424 
108 
0.0002 
0.8072 
8.6886 
Example 3
Data on times to removal of a catheter following a kidney infection is given in Table 5.1 (p. 157), in Collett, D. (1994), Example 3.2 (p. 69).
Open SURVIVAL and select Statistics 2 → Survival Analysis → Cox Regression. From the Variable Selection Dialogue select the data option 1 Enter Durations and Time (C32) as [Time], Status (C33) as [Censored]. Then select Age (C34) and Sex (C35) as [Variable]s. From the Output Options Dialogue select Case (Diagnostic) Statistics and at the next dialogue click All to select all options.
Different types of residuals calculated are reported by Collett in Table 5.2 (p. 157). Collett also reports the approximate estimated deltabetas in Table 5.4 (p. 172) and the likelihood displacement values in Table 5.5 (p.176).
Cox Regression
Case (Diagnostic) Statistics
Row 
Survival Function 
Cumulative Hazard Function 
CoxSnell Residuals 
Modified CoxSnell Residuals 
Martingale Residuals 
Deviance Residuals 
1 
0.7200 
0.3286 
0.3286 
0.3286 
0.6714 
0.9398 
2 
0.9245 
0.0785 
0.0785 
0.0785 
0.9215 
1.8020 
3 
0.2386 
1.4331 
1.4331 
1.4331 
0.4331 
0.3828 
4 
0.9104 
0.0939 
0.0939 
0.0939 
0.9061 
1.7087 
5 
0.1697 
1.7736 
1.7736 
1.7736 
0.7736 
0.6334 
6 
0.7322 
0.3117 
0.3117 
1.3117 
0.3117 
0.7895 
7 
0.7668 
0.2655 
0.2655 
0.2655 
0.7345 
1.0877 
8 
0.5836 
0.5386 
0.5386 
0.5386 
0.4614 
0.5611 
9 
0.1916 
1.6523 
1.6523 
1.6523 
0.6523 
0.5480 
10 
0.2409 
1.4234 
1.4234 
1.4234 
0.4234 
0.3751 
11 
0.2415 
1.4207 
1.4207 
1.4207 
0.4207 
0.3730 
12 
0.0914 
2.3927 
2.3927 
2.3927 
1.3927 
1.0201 
… 
… 
… 
… 
… 
… 
… 
Row 
Score (Partial) Residuals Age 
Score (Partial) Residuals Sex 
DeltaBeta Age 
DeltaBeta Sex 
Standardised DeltaBeta Age 
Standardised DeltaBeta Sex 
1 
1.0850 
0.2416 
0.0020 
0.1977 
0.0750 
0.1804 
2 
14.4930 
0.6644 
0.0004 
0.5433 
0.0155 
0.4957 
3 
3.1291 
0.3065 
0.0011 
0.0741 
0.0402 
0.0677 
4 
10.2215 
0.4341 
0.0119 
0.5943 
0.4528 
0.5423 
5 
16.5882 
0.5504 
0.0049 
0.0139 
0.1869 
0.0127 
6 
* 
* 
0.0005 
0.1192 
0.0206 
0.1088 
7 
17.8286 
0.0000 
0.0095 
0.1269 
0.3607 
0.1158 
8 
7.6201 
0.0000 
0.0032 
0.0346 
0.1235 
0.0315 
9 
17.0910 
0.0000 
0.0073 
0.0733 
0.2771 
0.0669 
10 
10.2390 
0.0000 
0.0032 
0.2023 
0.1232 
0.1846 
11 
2.8575 
0.0000 
0.0060 
0.2158 
0.2279 
0.1970 
12 
5.5338 
0.0000 
0.0048 
0.1939 
0.1829 
0.1770 
… 
… 
… 
… 
… 
… 
… 
Row 
Likelihood Displacement 
Linear Fit (XBeta) 
1 
0.0328 
1.8604 
2 
0.3387 
4.0852 
3 
0.0046 
1.7389 
… 
… 
… 
9 
0.1334 
3.5993 
10 
0.0353 
4.1156 
11 
0.0611 
4.5104 
12 
0.0432 
4.4800 
… 
… 
… 