UNISTAT - the ultimate Excel statistics add-in

7.2.6. Logistic Regression

The Logistic Regression procedure is suitable for estimating Linear Regression models when the dependent variable is a binary (or dichotomous) variable, that is, it consists of two values such as Yes or No, or in general 0 and 1. In such cases, where the dependent variable has an underlying binomial or binary distribution (and thus the predicted Y values should lie between 0 and 1) the Linear Regression procedure cannot be employed.

Logistic Regression is closely related to Logit / Probit / Gompit. For a brief discussion of similarities and differences of these two procedures see 7.2.5. Logit / Probit / Gompit.

Like Linear Regression, Logistic Regression can be used to estimate models with or without a constant term, and regressions may be run on a subset of cases as determined by the levels of an unlimited number of factor columns. An unlimited number of dependent variables (numeric or string) can be selected in order to run the same model on each of them in turn. Interaction terms, dummy variables and lag/lead variables can be included in the model without having to create them as spreadsheet columns first.

Predictions (Interpolations): UNISTAT provides the user with two different methods of making predictions for a Logistic Regression model:

1)       Actual and Fitted Values (see 7.2.6.3.1. Logistic Regression Main Output Options),

2)       The spreadsheet Reg function (see 3.4.2.6.3. UNISTAT Functions). This will give the logit of predicted values, which can be transformed back as:

      $\hat{Y} = \dfrac{1}{1 + e^{-z}}$

      where $z$ is the logit value returned by the Reg function.
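The back-transformation from a logit to a probability can be illustrated with a short Python sketch. The function name `inverse_logit` is hypothetical and not part of UNISTAT; its argument stands for whatever logit value the Reg function returns.

```python
import math

def inverse_logit(z):
    """Convert a logit (log-odds) value into a probability between 0 and 1."""
    # Two algebraically identical branches are used so that exp() is only
    # ever called with a non-positive argument and cannot overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# A logit of zero corresponds to a probability of one half.
print(inverse_logit(0.0))
```

A positive logit gives a probability above 0.5 and a negative logit gives one below 0.5.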

      For the first option, there are three conditions under which predictions will be computed for estimated Y values:

1)       If, for a case, all independent variables are non-missing, but only the dependent variable is missing,

2)       If a case does not contain missing values but it has been omitted from the analysis by Data Processor’s Data → Select Row function,

3)       If a case does not contain missing values but it has been omitted from the analysis by selecting a [Factor] variable from the Variable Selection Dialogue (see 2.1.2. Categorical Data Analysis).

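      In all three cases, the excluded cases are not used in the estimation of the regression coefficients. When, however, the Actual and Fitted Values option is selected, the program will detect these cases and compute and display the fitted (estimated) Y values. Therefore, a prediction can be obtained for any case that is excluded from the estimation by one of the above three methods.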

Multicollinearity: Variables causing multicollinearity will be displayed with a zero coefficient at the end of the coefficients table. If you do not wish to display these variables enter the following line in the [Options] section of Documents\Unistat60\Unistat60.ini file:

DispCollin=0

      The rest of the coefficients will be determined as if the regression were run without the variables causing collinearity.

7.2.6.1. Logistic Regression Model Description

Logistic Regression employs the logit model as explained in Logit / Probit / Gompit (see 7.2.5.1. Logit / Probit / Gompit Model Description). However, the log of likelihood function for the logistic model can be expressed more explicitly as:

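$\ln L = \sum_{i=1}^{n} \left[ y_i z_i - \ln\left(1 + e^{z_i}\right) \right]$

with first derivatives:

$\dfrac{\partial \ln L}{\partial \beta_j} = \sum_{i=1}^{n} \left( y_i - p_i \right) x_{ij}$

where:

$z_i = \sum_j \beta_j x_{ij}, \qquad p_i = \dfrac{1}{1 + e^{-z_i}}$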
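As a minimal sketch of the logistic log-likelihood and its first derivatives, the following Python code evaluates both for a small data set. It assumes unweighted cases and a design matrix whose first column is 1 when a constant term is wanted; the function names are hypothetical and this is not UNISTAT's own implementation.

```python
import math

def _linear_predictor(beta, row):
    """z = sum of coefficient times covariate for one case."""
    return sum(b * x for b, x in zip(beta, row))

def log_likelihood(beta, X, y):
    """Sum over cases of y*z - ln(1 + exp(z))."""
    total = 0.0
    for row, yi in zip(X, y):
        z = _linear_predictor(beta, row)
        # ln(1 + exp(z)) written so that exp() never overflows.
        log1pexp = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += yi * z - log1pexp
    return total

def gradient(beta, X, y):
    """First derivatives: sum over cases of (y - p) * x_j."""
    grad = [0.0] * len(beta)
    for row, yi in zip(X, y):
        z = _linear_predictor(beta, row)
        e = math.exp(-abs(z))
        p = 1.0 / (1.0 + e) if z >= 0 else e / (1.0 + e)
        for j, x in enumerate(row):
            grad[j] += (yi - p) * x
    return grad

# Illustrative data only: a constant column and one covariate.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.3]]
y = [1, 0, 1, 0]
print(log_likelihood([0.0, 0.0], X, y), gradient([0.0, 0.0], X, y))
```

An iterative maximiser would repeat these two evaluations until the derivatives are close to zero.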

7.2.6.2. Logistic Regression Variable Selection

As in Linear Regression, it is possible to create interaction terms, dummy variables, lag/lead terms, select multiple dependent variables (see 2.1.4. Creating Interaction, Dummy and Lag/Lead Variables) and run regressions on subsamples defined by several factor columns with or without weights (see 7.2.1.1. Linear Regression Variable Selection).

It is compulsory to select at least one column containing numeric or String Data as a dependent variable. The program recodes the dependent variable internally such that the minimum value that occurs in the column is zero and the rest are one. Note that if the selected column is not a categorical variable, there may be too few 0s and too many 1s in the transformed dependent variable and convergence may not be achieved.
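The recoding rule described above can be sketched in Python as follows. The function name `recode_binary` is hypothetical, and the sketch assumes the column contains no missing values.

```python
def recode_binary(values):
    """Map the minimum observed value to 0 and every other value to 1."""
    lowest = min(values)
    return [0 if v == lowest else 1 for v in values]

# A two-valued column is recoded as expected.
print(recode_binary(["No", "Yes", "No", "Yes"]))
# A continuous column yields a single 0 and 1s everywhere else,
# which is the situation the note above warns about.
print(recode_binary([3.2, 1.5, 4.8, 2.9]))
```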

When more than one dependent variable is selected, the analysis will be repeated as many times as the number of dependent variables, each time only changing the dependent variable and keeping the rest of selections unchanged.

A column containing numeric data can be selected as a weights column. Unlike the Linear Regression procedure, however, weights here are frequency weights. All independent variables are multiplied by this column internally by the program.

An intermediate inputs dialogue is displayed next.

Tolerance: This value is used to control the sensitivity of the nonlinear minimisation procedure employed. Under normal circumstances, you do not need to edit this value. If convergence cannot be achieved, larger values of this parameter can be tried by removing one or more zeros.

Maximum Number of Iterations: When convergence cannot be achieved with the default value of 100 function evaluations, a higher value can be tried.

Omit Level: This field will appear only when one or more dummy variables have been created in the Variable Selection Dialogue. Three options are available: (0) do not omit any levels, (1) omit the first level and (2) omit the last level. When no levels are omitted, the model will usually be over-parameterised (see 2.1.4. Creating Interaction, Dummy and Lag/Lead Variables).
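The effect of the three Omit Level settings can be illustrated with a small Python sketch. The function `dummy_columns` is hypothetical, and ordering the levels by sorted value is an assumption made for the illustration, not a statement about how UNISTAT orders them.

```python
def dummy_columns(factor, omit=1):
    """Build 0/1 dummy columns for a factor.

    omit=0 keeps every level, omit=1 drops the first level,
    omit=2 drops the last level (levels taken in sorted order here).
    """
    levels = sorted(set(factor))
    if omit == 0:
        kept = levels
    elif omit == 1:
        kept = levels[1:]
    elif omit == 2:
        kept = levels[:-1]
    else:
        raise ValueError("omit must be 0, 1 or 2")
    return {level: [1 if v == level else 0 for v in factor] for level in kept}

factor = ["a", "b", "c", "a"]
print(dummy_columns(factor, omit=1))
```

With `omit=0` the dummy columns add up to 1 in every row, so together with a constant term they are linearly dependent, which is why the model is over-parameterised in that case.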

7.2.6.3. Logistic Regression Output Options

When the calculations are finished an Output Options Dialogue will provide access to the following options.

7.2.6.3.1. Logistic Regression Main Output Options

Classification Threshold Probability: In Logistic Regression, the fitted Y is a continuous variable consisting of probabilities. The estimated group membership (in terms of 0 and 1) is therefore dependent on a critical cut-off point provided by the user. This cut-off point is also called the Classification Threshold Probability and its default value is 0.5. The estimated group membership is 0 for any case with an estimated probability (fitted Y value) less than this critical probability and it is 1 otherwise. Changing this value will affect the output in Classification by Group, Actual and Fitted Values and Predicted Group in Case (Diagnostic) Statistics output options.

Regression Results: The main regression output displays a table for coefficients of the estimated regression equation, their standard errors, Wald statistics, probability values and confidence intervals for the significance level specified in the Variable Selection Dialogue. If any independent variables have been omitted due to multicollinearity, they are reported at the end of the table with a zero coefficient. If you do not wish to display these variables enter the following line in the [Options] section of Documents\Unistat60\Unistat60.ini file:

DispCollin=0

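      Wald Statistic: This is defined as:

      $W_i = \left( \dfrac{\beta_i}{\sigma_{\beta_i}} \right)^2$

      where $\sigma_{\beta_i}$ is the standard error of the ith coefficient, and has a chi-square distribution with one degree of freedom.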

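      Confidence Intervals: The confidence intervals for regression coefficients are computed from:

      $\beta_i \pm z_{\alpha/2} \, \sigma_{\beta_i}$

      where $z_{\alpha/2}$ is the critical value of the standard normal distribution and each coefficient’s standard error, $\sigma_{\beta_i}$, is the square root of the corresponding diagonal element of the covariance matrix.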
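As a hedged illustration, the following Python sketch computes the Wald statistic, its probability and the confidence limits for one coefficient. The input values are the Constant row of the example output in 7.2.6.4; the function name is hypothetical, and the use of a standard normal critical value is an assumption that happens to reproduce the limits printed there.

```python
import math
from statistics import NormalDist

def wald_and_interval(coef, se, level=0.95):
    """Return (Wald statistic, probability, lower limit, upper limit)."""
    wald = (coef / se) ** 2
    # With one degree of freedom the chi-square right tail equals the
    # two-sided standard normal tail, which erfc gives directly.
    probability = math.erfc(math.sqrt(wald / 2.0))
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    return wald, probability, coef - z * se, coef + z * se

wald, prob, lower, upper = wald_and_interval(6.3141, 6.3765)
print(round(wald, 4), round(prob, 4), round(lower, 4), round(upper, 4))
```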

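      -2 Log-Likelihood Initial Model: This is the value when all independent variables are excluded from the model:

      $-2 \ln L_0 = -2 \left[ n_1 \ln\left( \dfrac{n_1}{n} \right) + n_0 \ln\left( \dfrac{n_0}{n} \right) \right]$

      where $n_1$ and $n_0$ are the numbers of cases with a dependent variable value of 1 and 0 respectively, and $n = n_0 + n_1$.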
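The initial model value can be checked by hand with a short Python sketch. It assumes a model with a constant term only, so that every case receives the observed proportion of 1s as its fitted probability; the function name is hypothetical. The counts used (36 ones and 12 zeros) are the observed group sizes in the example in 7.2.6.4.

```python
import math

def initial_minus2_loglik(n_ones, n_zeros):
    """-2 log-likelihood of a constant-only binary logistic model."""
    n = n_ones + n_zeros
    total = 0.0
    for count in (n_ones, n_zeros):
        if count > 0:  # a group with no cases contributes nothing
            total += count * math.log(count / n)
    return -2.0 * total

print(round(initial_minus2_loglik(36, 12), 4))
```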

      -2 Log-Likelihood Final Model: This is -2 times the value of the log likelihood function when convergence is achieved.

      Pseudo R-squared: In Logistic Regression (as well as in Multinomial Regression, Poisson Regression and Cox Regression procedures), an R‑squared statistic as in Linear Regression is not available. This is because Logistic Regression employs an iterative maximum likelihood estimation method. Equivalent statistics to test the goodness of fit have been proposed using the initial (L0) and maximum (L1) likelihood values.

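            McFadden:

                 $R^2 = 1 - \dfrac{\ln L_1}{\ln L_0}$

           Adjusted McFadden:

                 $R^2 = 1 - \dfrac{\ln L_1 - K}{\ln L_0}$

           Cox & Snell:

                 $R^2 = 1 - \left( \dfrac{L_0}{L_1} \right)^{2/n}$

           Nagelkerke:

                 $R^2 = \dfrac{1 - \left( L_0 / L_1 \right)^{2/n}}{1 - L_0^{2/n}}$

           where $K$ is the number of estimated coefficients (including the constant term, if any) and $n$ is the number of valid cases.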
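The four pseudo R-squared statistics can be computed directly from the two -2 log-likelihood values that the program reports. The Python sketch below is an illustration under the assumption that the number of parameters counts the constant term; the function and argument names are hypothetical. The inputs are taken from the example in 7.2.6.4 (48 valid cases, five covariates plus a constant).

```python
import math

def pseudo_r_squared(m2ll_initial, m2ll_final, n_cases, n_params):
    """Pseudo R-squared values from -2 log-likelihoods.

    m2ll_initial, m2ll_final: -2 ln L0 and -2 ln L1.
    n_params: number of estimated coefficients, constant included.
    """
    mcfadden = 1.0 - m2ll_final / m2ll_initial
    # Subtracting K from ln L1 is the same as adding 2K to -2 ln L1.
    adjusted = 1.0 - (m2ll_final + 2.0 * n_params) / m2ll_initial
    cox_snell = 1.0 - math.exp(-(m2ll_initial - m2ll_final) / n_cases)
    nagelkerke = cox_snell / (1.0 - math.exp(-m2ll_initial / n_cases))
    return {
        "McFadden": mcfadden,
        "Adjusted McFadden": adjusted,
        "Cox & Snell": cox_snell,
        "Nagelkerke": nagelkerke,
    }

for name, value in pseudo_r_squared(53.9842, 49.5642, 48, 6).items():
    print(f"{name} = {value:.4f}")
```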

      Likelihood Ratio: This is a test statistic for the null hypothesis that “all regression coefficients for covariates are zero”. It is equal to the difference between the ‑2 log-likelihood values of the initial and final models and has a chi‑square distribution with k degrees of freedom (the number of independent variables in the model).
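As an illustration, the likelihood ratio statistic and its right-tail probability can be reproduced in Python without external libraries. The helper `chi2_sf` is a hypothetical name; it uses the closed-form series for the chi-square tail that exists for integer degrees of freedom. The inputs are the -2 log-likelihood values from the example in 7.2.6.4.

```python
import math

def chi2_sf(x, df):
    """Right-tail probability of a chi-square variate with integer df."""
    if df < 1 or df != int(df):
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^r / r!.
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    chi = math.sqrt(x)
    tail = math.erfc(chi / math.sqrt(2.0))
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + 2.0 * density * total

lr_statistic = 53.9842 - 49.5642
print(round(lr_statistic, 4), round(chi2_sf(lr_statistic, 5), 4))
```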

      Goodness of Fit: This is also known as Pearson’s chi-square statistic and is for the observed versus expected number of responses. It has a chi-square distribution with n ‑ k degrees of freedom (the number of valid cases minus the number of independent variables, including the constant term, if any).

Odds Ratio: 

      Values of the odds ratio indicate the influence of one unit change in a covariate on the regression. It is defined as:

           $OR_i = e^{\beta_i}$

      The standard error of the odds ratio is found as:

           $\sigma_{OR_i} = e^{\beta_i} \, \sigma_{\beta_i}$

      where $\sigma_{\beta_i}$ is the ith coefficient’s standard error, and its confidence intervals as:

           $e^{\beta_i \pm z_{\alpha/2} \, \sigma_{\beta_i}}$

      which are simply the exponential of the coefficient confidence intervals.
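A short Python sketch shows how an odds ratio, its standard error and its confidence limits follow from a coefficient and its standard error. The function name is hypothetical, and a standard normal critical value is assumed. The inputs are the Age row of the example in 7.2.6.4.

```python
import math
from statistics import NormalDist

def odds_ratio_summary(coef, se, level=0.95):
    """Return (odds ratio, standard error, lower limit, upper limit)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    odds = math.exp(coef)
    # The limits are the exponentials of the coefficient limits.
    return odds, odds * se, math.exp(coef - z * se), math.exp(coef + z * se)

print([round(v, 4) for v in odds_ratio_summary(-0.0469, 0.0593)])
```

Because the limits are obtained by exponentiation, the interval is not symmetric around the odds ratio itself.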

Classification by Group: Observed group membership (the actual Y in terms of 0 and 1) is tabulated against the estimated group membership. Estimated group membership is 0 for any case with an estimated probability (fitted Y value) less than a user-defined critical probability (the Classification Threshold Probability, with a default value of 0.5) and it is 1 otherwise.

      Sensitivity: This is the percentage ratio of correctly classified 1s out of total number of observed 1s.

      Specificity: This is the percentage ratio of correctly classified 0s out of total number of observed 0s.

      Correctly Classified: This is the percentage ratio of correctly classified 0s and 1s out of total number of observations.
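The classification table and the three percentages can be sketched in Python as follows. The function name is hypothetical and the sketch assumes that both observed groups contain at least one case. The data are the five cases listed under Actual and Fitted Values in 7.2.6.4.

```python
def classification_summary(actual, fitted, threshold=0.5):
    """Cross-tabulate observed against predicted groups.

    Returns (counts, sensitivity %, specificity %, correctly classified %),
    where counts[(observed, predicted)] is a cell of the table.
    """
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(actual, fitted):
        predicted = 0 if p < threshold else 1
        counts[(y, predicted)] += 1
    n_ones = counts[(1, 0)] + counts[(1, 1)]
    n_zeros = counts[(0, 0)] + counts[(0, 1)]
    sensitivity = 100.0 * counts[(1, 1)] / n_ones
    specificity = 100.0 * counts[(0, 0)] / n_zeros
    correct = 100.0 * (counts[(0, 0)] + counts[(1, 1)]) / (n_ones + n_zeros)
    return counts, sensitivity, specificity, correct

actual = [1, 0, 1, 1, 1]
fitted = [0.5146, 0.7216, 0.8058, 0.7613, 0.6224]
print(classification_summary(actual, fitted, threshold=0.5))
print(classification_summary(actual, fitted, threshold=0.75))
```

Raising the threshold moves cases from predicted group 1 to predicted group 0, which typically lowers sensitivity and raises specificity.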

Correlation Matrix of Regression Coefficients: This is a symmetric matrix with unity diagonal elements. The off-diagonal elements give correlations between regression coefficients.

Covariance Matrix of Regression Coefficients: This is a symmetric matrix where diagonal elements are the square of parameter standard errors. The off-diagonal elements are covariances between the regression coefficients.
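The relationship between the two matrices can be sketched in Python: each correlation is the covariance divided by the product of the two standard errors. The function name is hypothetical and the 2 x 2 matrix is made-up illustrative data, not UNISTAT output.

```python
import math

def correlation_from_covariance(cov):
    """Convert a covariance matrix (list of lists) to a correlation matrix."""
    size = len(cov)
    # Standard errors are the square roots of the diagonal elements.
    sd = [math.sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(size)]
            for i in range(size)]

cov = [[4.0, 2.0],
       [2.0, 9.0]]
print(correlation_from_covariance(cov))
```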

Actual and Fitted Values: A character plot of estimated and observed values is generated for the dependent variable. The estimated values are displayed and, like residuals, they can be added to the data matrix for further analysis. This option can also be used to make predictions for the dependent variable (see 7.2.6. Logistic Regression).

Residuals: This will display the difference between the observed and predicted values of the dependent variable for each observation in the form of a character plot. Scaling will be made according to the largest deviation and all residuals will be plotted between minus and plus the maximum deviation. Alongside the plot the values of residuals will also be displayed. It is possible to send residuals to Data Processor for further analysis.

7.2.6.3.2. Logistic Regression Further Output Options

Case (Diagnostic) Statistics: Case statistics are useful to determine the influence of individual observations on the overall fit of the model. For further information see 7.2.1.2.2. Linear Regression Further Output Options.

      Statistics available under this option are defined as follows.

      Fitted Values:

           $p_i = \dfrac{1}{1 + e^{-z_i}}$

      where:

           $z_i = \sum_j \beta_j x_{ij}$

      Predicted Group:

      If $p_i \geq$ cut-off (classification threshold) probability, then the group is 1, and 0 otherwise.

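      Leverage:

      $h_i = p_i (1 - p_i) \, \mathbf{x}_i \left( X' V X \right)^{-1} \mathbf{x}_i'$

where the vector:

      $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$

      contains the values of the independent variables for case i (including 1 for the constant term, if any), $p_i$ is the fitted value and $V$ is the diagonal matrix with elements $p_i (1 - p_i)$.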

      Cook’s Distance:

           $D_i = \dfrac{s_i^2 \, h_i}{1 - h_i}$

where $s_i$ is the standardised residual as defined below and $h_i$ is the leverage.

      Deviance:

           $d_i = \sqrt{g_i}$ if $y_i = 1$ and $d_i = -\sqrt{g_i}$ otherwise, where:

           $g_i = -2 \left[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \right]$

           and $p_i$ is the fitted value for case i.

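      Residuals:

           $e_i = y_i - p_i$

      Standardised Residuals:

           $s_i = \dfrac{y_i - p_i}{\sqrt{p_i (1 - p_i)}}$

      Logit Residuals:

           $l_i = \dfrac{y_i - p_i}{p_i (1 - p_i)}$

      Studentised Residuals:

           $t_i = \dfrac{d_i}{\sqrt{1 - h_i}}$

      where $p_i$ is the fitted value, $d_i$ is the deviance and $h_i$ is the leverage of case i.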

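      Delta-Beta:

           $\Delta \beta_i = \left( X' V X \right)^{-1} \mathbf{x}_i' \, \dfrac{y_i - p_i}{1 - h_i}$

      where $\mathbf{x}_i$ is the vector of independent variable values for case i, $p_i$ is the fitted value, $h_i$ is the leverage and $V$ is the diagonal matrix with elements $p_i (1 - p_i)$.

      Delta-beta is defined as the change in an estimated coefficient when a case is omitted from the analysis. An estimate can be computed from the above formula without having to run n regressions.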
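Several of the case statistics above depend only on the observed value, the fitted value and the leverage of a case, so they can be illustrated with a small Python sketch. The formulas used here are assumptions chosen because they reproduce the figures printed for rows 1 and 2 of the example in 7.2.6.4; the function name is hypothetical, the leverage is taken as an input rather than computed, and delta-beta is not covered.

```python
import math

def case_statistics(y, p, leverage):
    """Residual-type diagnostics for one case (y is 0 or 1, 0 < p < 1)."""
    residual = y - p
    variance = p * (1.0 - p)
    standardised = residual / math.sqrt(variance)
    logit_residual = residual / variance
    # Deviance carries the sign of the residual: positive when y = 1.
    g = -2.0 * math.log(p if y == 1 else 1.0 - p)
    deviance = math.copysign(math.sqrt(g), residual)
    studentised = deviance / math.sqrt(1.0 - leverage)
    cooks = standardised ** 2 * leverage / (1.0 - leverage)
    return {
        "Residuals": residual,
        "Standardised Residuals": standardised,
        "Logit Residuals": logit_residual,
        "Deviance": deviance,
        "Studentised Residuals": studentised,
        "Cook's Distance": cooks,
    }

# Row 1 of the example: actual 1, fitted 0.5146, leverage 0.1138.
for name, value in case_statistics(1, 0.5146, 0.1138).items():
    print(f"{name} = {value:.4f}")
```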

Receiver Operating Characteristic (ROC) Plot: Sensitivity and specificity values are computed for a continuum of classification threshold (cut-off) probabilities between 0 and 1. For each probability, the sensitivity value is plotted against 1 ‑ specificity.

      Area Enclosed: UNISTAT calculates the area enclosed under the ROC curve, which is a measure of the predictive power of the model. A value of 0.5 (which means that the curve is a 45° line) shows that the model has no predictive power. A value of 1 (the theoretical maximum) means full predictive power.

      Best Cut-off Probability: UNISTAT also determines the coordinates of the best classification threshold probability, which is the point nearest to the north-west corner of the plot. This point is represented by a different symbol, which can be changed from the Edit → Data Series dialogue.
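The construction of the ROC curve, the area under it and the best cut-off can be sketched in Python. This is an illustration under stated assumptions, not UNISTAT's algorithm: the cut-offs tried are the distinct fitted values, the area is obtained by the trapezoidal rule, and the function name and data are hypothetical.

```python
import math

def roc_analysis(actual, fitted):
    """Return (area under curve, best cut-off, list of ROC points).

    Each point is (1 - specificity, sensitivity, cut-off). A case is
    predicted as 1 when its fitted value is at least the cut-off.
    """
    n_ones = sum(actual)
    n_zeros = len(actual) - n_ones
    # Start with a cut-off that predicts no 1s, then lower it step by step.
    cutoffs = [math.inf] + sorted(set(fitted), reverse=True)
    points = []
    for c in cutoffs:
        tp = sum(1 for y, p in zip(actual, fitted) if y == 1 and p >= c)
        fp = sum(1 for y, p in zip(actual, fitted) if y == 0 and p >= c)
        points.append((fp / n_zeros, tp / n_ones, c))
    # Trapezoidal rule over consecutive points of the curve.
    area = sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1, _), (x2, y2, _) in zip(points, points[1:]))
    # Best cut-off: the point nearest to the top-left corner (0, 1).
    best = min(points, key=lambda pt: math.hypot(pt[0], 1.0 - pt[1]))
    return area, best[2], points

actual = [0, 0, 0, 0, 1, 1]
fitted = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]
area, best_cutoff, _ = roc_analysis(actual, fitted)
print(area, best_cutoff)
```

Plotting the second coordinate of each point against the first gives the ROC curve; plotting sensitivity and specificity against the third coordinate gives the cut-off plot.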

Sensitivity and Specificity versus Probability Cut-off: This plot is similar to the ROC plot, except that sensitivity and specificity values are plotted against the classification threshold (cut-off) probability.

7.2.6.4. Logistic Regression Examples

Open SURVIVAL and select Statistics 1 → Regression Analysis → Logistic Regression. From the Variable Selection Dialogue select Status (C11) as [Dependent] and Age, Sex, BUN, CA and HB (C12 to C16) as [Variable]s. Some of the following results have been shortened due to space considerations.

Logistic Regression

Regression Results

Valid Number of Cases: 48, 0 Omitted

Dependent Variable: Status

Minimum of dependent variable is encoded as 0 and the rest as 1.

 

 

          Coefficient  Standard Error  Wald Statistic  Probability  Lower 95%  Upper 95%
Constant       6.3141          6.3765          0.9805       0.3221    -6.1836    18.8119
Age           -0.0469          0.0593          0.6247       0.4293    -0.1632     0.0694
Sex           -0.6672          0.8549          0.6091       0.4351    -2.3428     1.0084
BUN            0.0037          0.0111          0.1098       0.7404    -0.0182     0.0255
CA             0.1719          0.2779          0.3828       0.5361    -0.3727     0.7166
HB            -0.2949          0.1636          3.2493       0.0715    -0.6155     0.0257

 

-2 Log likelihood:
      Initial Model = 53.9842
      Final Model = 49.5642

Pseudo R-squared:
      McFadden = 0.0819
      Adjusted McFadden = -0.1404
      Cox & Snell = 0.0880
      Nagelkerke = 0.1303

Likelihood Ratio Statistic:
      Chi-Square Statistic = 4.4200
      Degrees of Freedom = 5
      Right-Tail Probability = 0.4907

Goodness of Fit:
      Chi-Square Statistic = 45.6956
      Degrees of Freedom = 41
      Right-Tail Probability = 0.2833

 

Odds Ratio

 

          Odds Ratio  Standard Error  Lower 95%  Upper 95%
Age           0.9542          0.0566     0.8494     1.0719
Sex           0.5131          0.4387     0.0961     2.7413
BUN           1.0037          0.0112     0.9820     1.0259
CA            1.1876          0.3300     0.6889     2.0474
HB            0.7446          0.1218     0.5404     1.0261

 

Classification by Group

Observed\Predicted          0          1
0                           0         12
                        0.00%    100.00%
1                           1         35
                        2.78%     97.22%

Classification Threshold Probability = 0.5000
Sensitivity = 97.22%
Specificity = 0.00%
Correctly Classified = 72.92%

 

Correlation Matrix of Regression Coefficients

 

          Constant       Age       Sex       BUN        CA        HB
Constant    1.0000   -0.8151   -0.6444   -0.0620   -0.5603   -0.5393
Age        -0.8151    1.0000    0.3750   -0.1119    0.1563    0.3360
Sex        -0.6444    0.3750    1.0000    0.0005    0.2630    0.4330
BUN        -0.0620   -0.1119    0.0005    1.0000    0.1341    0.0538
CA         -0.5603    0.1563    0.2630    0.1341    1.0000   -0.0593
HB         -0.5393    0.3360    0.4330    0.0538   -0.0593    1.0000

 

Covariance Matrix of Regression Coefficients

 

          Constant       Age       Sex       BUN        CA        HB
Constant   40.6599   -0.3084   -3.5130   -0.0044   -0.9929   -0.5625
Age        -0.3084    0.0035    0.0190   -0.0001    0.0026    0.0033
Sex        -3.5130    0.0190    0.7309    0.0000    0.0625    0.0606
BUN        -0.0044   -0.0001    0.0000    0.0001    0.0004    0.0001
CA         -0.9929    0.0026    0.0625    0.0004    0.0772   -0.0027
HB         -0.5625    0.0033    0.0606    0.0001   -0.0027    0.0268

 

Actual and Fitted Values

Row  Actual *  Fitted +  0.0000
1      1.0000    0.5146                                +
**2    0.0000    0.7216  *                                          +
3      1.0000    0.8058                                                  +
4      1.0000    0.7613                                               +
5      1.0000    0.6224                                       +

Cases marked by ‘**’ are misclassified.

Residuals

Row  Residuals  -0.8726                                              0.8726
1       0.4854                                                *
2      -0.7216       *
3       0.1942                                      *
4       0.2387                                        *
5       0.3776                                            *

 

Case (Diagnostic) Statistics

Row  Fitted Values  Predicted Group  Leverage  Cook’s Distance  Deviance  Residuals
1           0.5146           1.0000    0.1138           0.1211    1.1528     0.4854
2           0.7216           1.0000    0.0621           0.1716   -1.5993    -0.7216
3           0.8058           1.0000    0.2079           0.0632    0.6571     0.1942
4           0.7613           1.0000    0.0678           0.0228    0.7386     0.2387
5           0.6224           1.0000    0.0656           0.0426    0.9739     0.3776

Row  Standardised Residuals  Logit Residuals  Studentised Residuals  Delta-Beta Constant  Delta-Beta Age  Delta-Beta Sex
1                    0.9713           1.9434                 1.2245              -0.7998          0.0080         -0.0095
2                   -1.6101          -3.5923                -1.6513               0.7213         -0.0073          0.0864
3                    0.4909           1.2409                 0.7383              -0.5143         -0.0035          0.1127
4                    0.5600           1.3136                 0.7650               0.0399          0.0030         -0.0581
5                    0.7790           1.6068                 1.0075              -0.1383          0.0028         -0.0490

Row  Delta-Beta BUN  Delta-Beta CA  Delta-Beta HB
1           -0.0003        -0.0093         0.0435
2            0.0015        -0.0478        -0.0046
3            0.0000         0.0608         0.0007
4           -0.0008        -0.0009        -0.0077
5           -0.0005        -0.0072         0.0154

 
