UNISTAT - the ultimate Excel statistics add-in

7.3.2. General Linear Model

UNISTAT’s GLM procedure can handle unbalanced designs, where the number of observations in each cell is not necessarily equal. Data may contain missing values and / or missing cells. Factor columns may be numeric, string, date or time variables and need not be sorted. It is possible to select an unlimited number of dependent variables, factors, repeated measures and covariates. When more than one dependent variable is selected, the analysis will be repeated for each dependent variable with the rest of the settings unchanged.

The GLM procedure assumes that each factor has a maximum of 1000 levels, though this number may be increased by the user, if necessary. It can also perform multiple comparison tests on the fitted GLM model.

This procedure allows more flexibility in defining the ANOVA model than the dedicated ANOVA procedures. All the previous models can be built using the General Linear Model procedure. In addition, it can handle four-way and higher-order interactions and nested terms, and it allows the denominator of each F-ratio to be specified.

Note that the GLM procedure always adopts the Regression Approach to sum of squares. This means the factors can be selected in any order and the same sum of squares will be obtained.
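The order independence described above can be illustrated with a small sketch. It assumes that the Regression Approach means each term is adjusted for all other terms in the model, so that its sum of squares is the increase in the residual sum of squares when that term alone is dropped. The data, the helper functions `solve` and `rss` and all variable names are illustrative and are not part of UNISTAT.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss(columns, y):
    """Residual sum of squares from regressing y on the given columns (normal equations)."""
    xtx = [[sum(u * v for u, v in zip(c1, c2)) for c2 in columns] for c1 in columns]
    xty = [sum(u * v for u, v in zip(c, y)) for c in columns]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, columns)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


# Illustrative unbalanced data (not from the manual): two factors with two levels each.
one = [1.0] * 6
fa = [0, 0, 0, 0, 1, 1]
fb = [0, 1, 1, 1, 0, 1]
y = [3.0, 5.0, 6.0, 7.0, 4.0, 9.0]

# Sequential decomposition: the sum of squares for A depends on when A enters.
ss_a_entered_first = rss([one], y) - rss([one, fa], y)
ss_a_entered_last = rss([one, fb], y) - rss([one, fb, fa], y)

# Regression approach (as assumed here): A is always adjusted for B,
# so listing the factors as A, B or as B, A gives the same value.
ss_a_order_ab = rss([one, fb], y) - rss([one, fa, fb], y)
ss_a_order_ba = rss([one, fb], y) - rss([one, fb, fa], y)

print("sequential, A first:", round(ss_a_entered_first, 4))
print("sequential, A last: ", round(ss_a_entered_last, 4))
print("adjusted, order A B:", round(ss_a_order_ab, 4))
print("adjusted, order B A:", round(ss_a_order_ba, 4))
```

In a balanced design the sequential and adjusted values would coincide; the difference only appears because the cells hold unequal numbers of observations.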

7.3.2.1. GLM Variable Selection


The variable selection for General Linear Model is slightly different from the ANOVA procedures. When a selection is made from the Variables Available list on the left, the variable remains there, allowing it to be selected again. The [Dependent] and [Covariate] buttons work as before (see 7.3.1.1. ANOVA Variable Selection). The [Factor], [Interaction], [Full] and [Nest] buttons all add terms to the factor list box. They do this in the following way:

Factor: is used to select a single factor variable. If multiple variables are highlighted in the Variables Available list, then all these variables will be added to the factors list.

Interaction: is used to select an interaction term. This will only be active when multiple variables are selected in the Variables Available list. A single factor of the form FactA × FactB × ... is added. You can always highlight multiple nonadjacent variables by pressing on the [Ctrl] key and then clicking on the desired variable.

      Note that a special cross sign is used between the variables in an interaction term. If there is a problem with this character on a non-English operating system, you can enter or edit the following line under the [Options] section of the Documents\Unistat60\Unistat60.ini file to display any other character (say x):

   InteractionCross=×

      This character also appears in all ANOVA and GLM output with interaction terms.

Nested: is used to create a nested term. This will only be active when a variable is selected in the Variables Available list and a factor is highlighted in the factors list on the right. When selected, this will create a nested factor of the form FactA(FactB). This can be repeated to create a multilevel nested factor, e.g. FactA(FactB(FactC)). Nests on interactions, such as FactA(FactB × FactC), are not allowed.

Full: is used to add all the factors and interactions for a fully factorial model. This will only be active when multiple variables are selected in the Variables Available list. You can always highlight multiple nonadjacent variables by pressing on the [Ctrl] key and then clicking on the desired variable.

Dependent: It is compulsory to select at least one column containing numeric data. When more than one data variable is selected, the analysis will be repeated as many times as the number of data variables, each time only changing the data variable and keeping the rest of selections unchanged.

Covariate: A covariate is a variable included in the analysis in order to remove its effect on the dependent variable. Any number of covariates can be selected from the Variables Available list.

Repeated: This button is used to select a factor to define Within and Between Subjects terms. These can be used in the F-ratio selection to create a number of different repeated measure designs. If a variable is selected as [Repeated], then at least one factor must use the Error Within term in its F-ratio (see 7.3.2.2. GLM F-Ratio Selection).
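The set of terms that [Full] adds for a fully factorial model can be sketched as follows: every main factor, followed by every interaction among the highlighted variables. The function name `full_factorial_terms` and its `cross` parameter are hypothetical and only mimic the naming convention described above; they are not a UNISTAT interface.

```python
from itertools import combinations


def full_factorial_terms(factors, cross=" × "):
    """List all main effects and interactions of a fully factorial model."""
    terms = []
    for order in range(1, len(factors) + 1):
        # combinations() keeps the variables in the order they were highlighted.
        for combo in combinations(factors, order):
            terms.append(cross.join(combo))
    return terms


print(full_factorial_terms(["Region", "Type"]))
print(full_factorial_terms(["FactA", "FactB", "FactC"]))
```

With k highlighted variables this produces 2^k - 1 terms, so three factors give seven terms, including the single three-way interaction.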

7.3.2.2. GLM F-Ratio Selection


This dialogue allows the denominator in the F-ratio of each factor to be specified. The following options are available:

Residual: The denominator is given by the total error of the model. This is the default in GLM and it is also the denominator used in ANOVA.

None: The F-statistic and its tail probability are not calculated.

also known as Fixed Factor. That is, the sum of squares calculated is squares is not adjusted for this factor. The sum of squares is reported

Any Factor’s Mean Square: The denominator is given by the mean square of the selected factor.

Error Between Subjects: This option will only be available when a column is selected as [Repeated] in the variable selection. The denominator is given by the Error Between Subjects mean square. Its sum of squares is calculated as the sum of squares of the selected column minus the sums of squares of all the factors whose denominator is selected as Error Between Subjects. So the mean square value depends on which factors are selected as Error Between Subjects.

Error Within Subjects: This option will only be available when a column is selected as [Repeated] in the variable selection. This term partitions the residual sum of squares with the Error Between Subjects. So the mean square value depends on which factors are selected as Error Between Subjects.

7.3.2.3. GLM Output Options


The main output of the GLM procedure is the ANOVA table. However, GLM also reports a wide range of diagnostic statistics, post hoc tests and plots.

Use Least Squares Means: Note that the following three output options are preceded by an asterisk in the Output Options Dialogue:

·      Table of Means,

·      Multiple Comparisons and

·      Profile Plot.

      The common point for these options is that they display information based on cell and / or marginal means (where a cell is defined as a unique combination of factor levels used in the model and a marginal mean is the mean of a group of cell means). It is possible to display these output options based on the least squares means (LSMeans or adjusted means) instead of arithmetic means. To do this check the Use Least Squares Means box at the top of the dialogue. Then these three output options will use the least squares means and this will be indicated on the output. Note that the least squares means output will not be available when a nested term exists in the model.

      Normally, in balanced designs arithmetic and least squares means will not be different. However, in unbalanced designs with more than one effect, the arithmetic mean of a group may not represent an appropriate response for that group, since it does not take other effects in the model into account. Least squares means can also be described as within-group means adjusted for the other effects in the model. In other words, least squares means are predicted population margins which estimate the marginal means over a balanced population.

      The least squares means are computed as:

           $\mathrm{LSMeans} = L \hat{\beta}$

      where L is the hypothesis matrix and $\hat{\beta}$ is the vector of estimated regression coefficients as displayed in the third output option:

           $\hat{\beta} = (X'X)^{-1} X'y$

      The standard error of LSMeans is defined as:

           $\mathrm{SE} = \sqrt{\mathrm{MSE} \cdot L (X'X)^{-1} L'}$

      where MSE is the mean square error for the model.
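      A minimal sketch of this computation is given below. It assumes the usual form of the formulas, that is, the least squares means are L times the coefficient vector and their standard errors are the square roots of MSE times the diagonal of L (X'X)^-1 L'. The data are illustrative, a full-rank dummy coding is used so that an ordinary inverse suffices, and the helper `inverse` and all variable names are hypothetical rather than taken from UNISTAT.

```python
import math


def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(map(float, row)) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


# Illustrative unbalanced data (not from the manual): factors A and B with two levels each.
a = [0, 0, 0, 0, 1, 1]
b = [0, 1, 1, 1, 0, 1]
y = [3.0, 5.0, 6.0, 7.0, 4.0, 9.0]

# Full-rank design matrix: constant, dummy for A = 1, dummy for B = 1.
X = [[1.0, ai, bi] for ai, bi in zip(a, b)]
n, p = len(X), len(X[0])

xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
xtx_inv = inverse(xtx)
beta = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]

residuals = [yi - sum(bj * xj for bj, xj in zip(beta, row)) for row, yi in zip(X, y)]
mse = sum(r * r for r in residuals) / (n - p)

# Each row of L fixes a level of A and averages over the two levels of B (weight 0.5 each).
L = [[1.0, 0.0, 0.5],   # A = 0
     [1.0, 1.0, 0.5]]   # A = 1

lsmeans = [sum(l * bj for l, bj in zip(row, beta)) for row in L]
std_errors = [
    math.sqrt(mse * sum(row[i] * xtx_inv[i][j] * row[j] for i in range(p) for j in range(p)))
    for row in L
]
arithmetic = [sum(yi for yi, ai in zip(y, a) if ai == level) / a.count(level) for level in (0, 1)]

print("LSMeans:         ", [round(v, 4) for v in lsmeans])
print("Standard errors: ", [round(v, 4) for v in std_errors])
print("Arithmetic means:", arithmetic)
```

      For A = 0 the arithmetic mean is pulled towards the level of B that dominates that group, whereas the least squares mean weights both levels of B equally. For A = 1, where B happens to be balanced, the two means agree.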

      To save the hypothesis matrix L in a file enter the following line under the [Options] section of Documents\Unistat60\Unistat60.ini file:

      GLMSaveLMatrix=1

      The matrix will be saved in:

           ..\Documents\Unistat60\GLMLMatrix.txt

      To save the $(X'X)^{-1}$ matrix enter the following line under the [Options] section of Unistat60.ini:

      GLMSaveInvXpXMatrix=1

      The matrix will be saved in the file:

           ..\Documents\Unistat60\GLMInvXpXMatrix.txt

Note that some of the GLM Output Options (such as coefficients, fitted values, residuals and their plots) are based on the underlying regression model for the GLM model. These are normally identical to the results obtained from the Linear Regression procedure if the same model is constructed using dummy variables (see 7.3.2.4. GLM Example below).

ANOVA: The factors in the model appear at the top of the table. Any factors that have been calculated but are not in the model appear at the bottom of the table in a section separated with a dividing line. These values may be used as denominators in F-statistic calculations. If a repeated measure has been specified, then the factor will be split into Between Subjects and Within Subjects terms.

* Table of Means: Number of cases, mean, standard deviation, standard error and the lower and upper limits of the confidence interval are displayed for each cell and marginal mean of the model. Missing cells are omitted from the analysis at the outset. This output option is similar to the Table of Means procedure available under Tests for ANOVA, but has the advantage of matching exactly the model specified in the GLM procedure. When the Use Least Squares Means box is checked, LSMeans will be displayed instead of arithmetic means.

Coefficients: Estimated coefficients for the underlying regression model are displayed. The Row Labels of this table are identical to those of the Table of Means output above. Variables causing multicollinearity will be displayed with a zero coefficient at the end of the coefficients table. If you do not wish to display these variables enter the following line in the [Options] section of the Documents\Unistat60\Unistat60.ini file:

DispCollin=0

Actual and Fitted Values: Actual and fitted Y values for the underlying regression model are displayed.

Residuals: Residual Y values for the underlying regression model are displayed.

* Multiple Comparisons: This output option is similar to the Multiple Comparisons procedure available under Tests for ANOVA, but has a major advantage. While the latter is always based on a one-way ANOVA model, this option takes the Mean Square Error and its Degrees of Freedom from the estimated GLM model. Note that both procedures give you the opportunity to override the suggested Mean Square Error and its Degrees of Freedom. When the Use Least Squares Means box is checked, comparisons will be made between LSMeans instead of arithmetic means.


      When the data used in multiple comparisons has already been transformed with a function such as the natural (base e) or base 10 logarithm, the results need to be transformed back to the original scale, which would otherwise have to be done manually. As of this version of UNISTAT, an Antilog box on this dialogue allows the user to specify which back-transformation is to be applied in the output. The output values affected by this control are:

·      means,

·      difference between means, and

·      lower and upper confidence limits for difference between means.

      Let X be the value entered into the Antilog box.

·      If X = 0 then no back-transformation is performed.

·      If X = 1 then the natural antilog of the output value Y, Exp(Y) is displayed.

·      If 1 < X ≤ 16, then X-based power of Y, X^Y is displayed.

      When a back-transformation base value is specified, the columns affected will be marked by an asterisk in the output.
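      The rule for the Antilog box can be written out as a short sketch. The function name `back_transform` is hypothetical, and raising an error for values of X outside the three documented cases is an assumption, since the text does not say how such values are treated.

```python
import math


def back_transform(y, x):
    """Back-transform an output value y according to the Antilog box value x."""
    if x == 0:
        return y              # no back-transformation
    if x == 1:
        return math.exp(y)    # natural antilog, Exp(Y)
    if 1 < x <= 16:
        return x ** y         # X-based power of Y, X^Y
    # Assumption: the manual does not describe values outside these cases.
    raise ValueError("Antilog value must be 0, 1 or in the range 1 < X <= 16")


# A mean of 2.0 on a base 10 logarithmic scale corresponds to 100 on the original scale.
print(back_transform(2.0, 10))
print(back_transform(2.0, 0))
```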

Plot of Actual and Fitted Values: Select this option to plot actual and fitted Y values against row numbers (index).

Plot of Residuals: Residuals are plotted against row numbers (index).

Normal Plot of Residuals: Residuals are plotted against the normal probability (probit) axis.

* Profile Plot: The cell means for an unlimited number of factors can be plotted. If only one factor is selected, the plot is similar to the Means Plot with Error Bars option available in X-Y Plots (see 4.1.1.3. Means Plot). When the Use Least Squares Means box is checked, LSMeans will be plotted instead of arithmetic means.

      If two or more factors are selected, the first factor’s levels are represented on the X-Axis and the second factor’s levels as separate lines. This plot is known as Interaction Plot and it is different from a Means Plot with two factors (where interaction terms are plotted either on the X-Axis or as separate lines). The Interaction Plot is useful for comparing means of interaction terms. If the lines are parallel, then it can be concluded that there is no interaction between the two factors.
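      The parallel-lines reading can also be checked numerically. The sketch below uses the Region × Type cell means from the Table of Means in 7.3.2.4. GLM Example and computes the vertical gap between the two Type lines at each Region; parallel lines would show the same gap everywhere. The variable names are illustrative.

```python
# Cell means for Regions 1, 2 and 3, one list per Type line (from the GLM example).
cell_means = {
    1: [103.6271, 111.5200, 107.5875],   # Type = 1
    2: [108.6900, 106.9476, 108.0764],   # Type = 2
}

# Vertical distance between the two lines at each Region.
gaps = [round(t2 - t1, 4) for t1, t2 in zip(cell_means[1], cell_means[2])]
spread = max(gaps) - min(gaps)

print("Gap between Type lines by Region:", gaps)
print("Spread of the gaps:", round(spread, 4))
```

      Here the gaps change sign between Regions, so the plotted lines cross. A plot alone does not establish significance, however: in the same example the Region × Type term has a tail probability of 0.1586 in the ANOVA table.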


      To change the default factor selections, you can click on the [Opt] button situated to the left of this output option. A further Variable Selection Dialogue is displayed. It is compulsory to select a [Row Factor] variable, in which case a Means Plot is displayed for the selected factor. If an optional [Column Factor] is also selected, then an Interaction Plot is displayed.


      A third [Factor] button allows selection of an unlimited number of further factor columns. If at least one such factor is selected, then the program displays a further dialogue showing the levels of this factor. If two or more factors are selected then the dialogue shows the combinations of all selected factors. Only one level (or level combination) needs to be selected and the final plot is drawn for this selected subsample. When [Factor] columns are selected, selection of a [Column Factor] still remains optional.


      In earlier versions of UNISTAT, error bars for the means plot represented only the standard error of the mean. Now (as in Means Plot), the Error Bars control on the Edit → Data Series dialogue allows selecting one of the following dispersion measures: None, t-interval, Z-interval, Standard Error, Standard Deviation, Variance.

7.3.2.4. GLM Example

Published examples using GLM have already been introduced in section 7.3.0.2. ANOVA Designs. Here we shall demonstrate the full GLM output with a simple example. Next, we shall run the same model using the Linear Regression procedure and emphasise the similarities and differences.

Open DEMODATA and select Statistics 1 → ANOVA and GLM → General Linear Model. Highlight Region (C10) and Type (C11) on the Variables Available list and then click on the button [Full]. This will add two main factors and an interaction term to the Variables Selected list. Also select Output2 (C9) as [Dependent]. Accept default values suggested in the next two dialogues. Some of the following results have been shortened due to space considerations.

General Linear Model

ANOVA

Dependent Variable: Output2

 

Due To            Sum of Squares   DoF   Mean Square    F-Stat     Prob
Constant              669862.660     1    669862.660   281.165   0.0000
Region                    70.934     2        35.467     0.774   0.4664
Type                       1.122     1         1.122     0.024   0.8762
Region × Type            174.832     2        87.416     1.908   0.1586
Explained                206.738     5        41.348     0.902   0.4866
Error                   2382.454    52        45.816
Total                   2589.193    57        45.424

R-squared = 0.0798
Adjusted R-squared = -0.0086
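The entries of the ANOVA table are linked by simple arithmetic, which the following sketch reproduces from the reported sums of squares and degrees of freedom. It assumes the default Residual denominator, so each F-statistic is the term's mean square divided by the Error mean square; the variable names are illustrative.

```python
# Values transcribed from the ANOVA table: term -> (sum of squares, degrees of freedom).
anova = {
    "Region": (70.934, 2),
    "Type": (1.122, 1),
    "Region × Type": (174.832, 2),
    "Explained": (206.738, 5),
}
error_ss, error_df = 2382.454, 52
total_ss, total_df = 2589.193, 57

mse = error_ss / error_df   # Error mean square, the default (Residual) denominator

f_stats = {}
for term, (ss, df) in anova.items():
    mean_square = ss / df
    f_stats[term] = mean_square / mse
    print(f"{term:15s} MS = {mean_square:8.3f}   F = {f_stats[term]:6.3f}")

r_squared = anova["Explained"][0] / total_ss
adj_r_squared = 1 - (1 - r_squared) * total_df / error_df
print(f"R-squared = {r_squared:.4f}, Adjusted R-squared = {adj_r_squared:.4f}")
```

The negative adjusted R-squared simply reflects that five model degrees of freedom explain less than 8% of the total variation.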

 

Table of Means

 

                        Cases       Mean   Standard Deviation   Standard Error   Lower 95%   Upper 95%
whole sample               58   107.4679               6.7398           0.8850    105.6958    109.2401
Region = 1                 14   106.1586               6.9131           1.8476    102.1671    110.1501
2                          26   107.8269               6.7597           1.3257    105.0966    110.5572
3                          18   107.9678               6.8330           1.6106    104.5698    111.3658
Type = 1                   16   107.0837               6.9366           1.7342    103.3875    110.7800
2                          42   107.6143               6.7430           1.0405    105.5130    109.7156
Region × Type = 1 × 1       7   103.6271               7.7791           2.9402     96.4327    110.8216
1 × 2                       7   108.6900               5.2991           2.0029    103.7892    113.5908
2 × 1                       5   111.5200               1.7219           0.7701    109.3820    113.6580
2 × 2                      21   106.9476               7.2320           1.5782    103.6557    110.2396
3 × 1                       4   107.5875               7.3881           3.6940     95.8315    119.3435
3 × 2                      14   108.0764               6.9572           1.8594    104.0594    112.0934
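As a quick consistency check on the Table of Means, each standard error equals the standard deviation divided by the square root of the number of cases. The sketch below verifies this for a few rows transcribed from the table; the dictionary layout is illustrative.

```python
import math

# label -> (cases, standard deviation, reported standard error), from the Table of Means.
rows = {
    "whole sample": (58, 6.7398, 0.8850),
    "Region = 1": (14, 6.9131, 1.8476),
    "Type = 2": (42, 6.7430, 1.0405),
    "Region × Type = 3 × 1": (4, 7.3881, 3.6940),
}

computed = {}
for label, (cases, std_dev, reported_se) in rows.items():
    computed[label] = std_dev / math.sqrt(cases)
    print(f"{label:22s} computed SE = {computed[label]:.4f}   reported SE = {reported_se:.4f}")
```

The confidence limits in the table are consistent with the mean plus or minus a t critical value times this standard error, although the critical values themselves are not reproduced here.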

 

Coefficients

 

                        Coefficient   Standard Error   t-Statistic   Probability   Lower 95%   Upper 95%
Constant                   108.0764           1.8090       59.7426        0.0000    104.4463    111.7065
Region = 1                   0.6136           3.1333        0.1958        0.8455     -5.6740      6.9011
2                           -1.1288           2.3355       -0.4833        0.6309     -5.8153      3.5576
3                                 *
Type = 1                    -0.4889           3.8375       -0.1274        0.8991     -8.1895      7.2117
2                                 *
Region × Type = 1 × 1       -4.5739           5.2742       -0.8672        0.3898    -15.1574      6.0096
1 × 2                             *
2 × 1                        5.0613           5.1060        0.9912        0.3262     -5.1848     15.3074
2 × 2                             *
3 × 1                             *
3 × 2                             *

* omitted due to multicollinearity
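The link between the Coefficients table and the Table of Means can be checked directly: for this fully factorial model, adding the coefficients that apply to a cell should reproduce that cell's mean, with the omitted (reference) levels contributing zero. The sketch below does this with the reported values; the function name `fitted_cell_mean` and the dictionary layout are illustrative.

```python
# Coefficients transcribed from the Coefficients table of the GLM example.
coef = {
    "Constant": 108.0764,
    ("Region", 1): 0.6136,
    ("Region", 2): -1.1288,
    ("Type", 1): -0.4889,
    ("Region × Type", 1, 1): -4.5739,
    ("Region × Type", 2, 1): 5.0613,
}

# Cell means transcribed from the Table of Means: (Region, Type) -> mean.
reported = {
    (1, 1): 103.6271, (1, 2): 108.6900,
    (2, 1): 111.5200, (2, 2): 106.9476,
    (3, 1): 107.5875, (3, 2): 108.0764,
}


def fitted_cell_mean(region, type_):
    """Sum the coefficients for one cell; omitted terms count as zero."""
    return (coef["Constant"]
            + coef.get(("Region", region), 0.0)
            + coef.get(("Type", type_), 0.0)
            + coef.get(("Region × Type", region, type_), 0.0))


for cell, mean in reported.items():
    print(cell, round(fitted_cell_mean(*cell), 4), mean)
```

The Constant is therefore the mean of the reference cell 3 × 2, and the small differences in the last decimal place come from rounding in the printed coefficients.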

 

Actual and Fitted Values

Row   Actual *   Fitted +   93.1100                                        114.9700
1     103.3600   108.0764                               *           +
2     105.1300   108.6900                                   *         +
3     105.9700   106.9476                                      * +
4     106.6100   107.5875                                       *  +
5     107.1900   108.6900                                         *   +
6     107.6300   106.9476                                        + *

 

Residuals

Row   Residuals   -14.6064                                             14.6064
1       -4.7164                       *
2       -3.5600                         *
3       -0.9776                               *
4       -0.9775                               *

 

Dunnett

For Output2, classified by Region

Control Group: 1, Two-Tailed Test

Mean Square Error: 45.8164292353523, Degrees of Freedom: 52

** denotes significantly different pairs. Vertical bars show homogeneous subsets.

A pairwise test result is significant if its q stat value is greater than the table q.

 

Group   Cases       Mean   1
1          14   106.1586   |
2          26   107.8269   |
3          18   107.9678   |

Comparison   Difference   Standard Error   q Stat   Table q   Probability
3 - 1            1.8092           2.4120   0.7501    2.2581        0.6578
2 - 1            1.6684           2.2438   0.7435    2.2581        0.6623

Comparison   Lower 95%   Upper 95%   Result
3 - 1          -3.6375      7.2559
2 - 1          -3.3985      6.7352

Homogeneous Subsets:

Group 1: 1 2 3
Pooled mean = 107.4679
95% Confidence Interval = 105.6845 <> 109.2514
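The Dunnett output above can be retraced from the group sizes, the group means and the Mean Square Error taken from the GLM model. The sketch assumes the usual form of the comparison: the standard error is the square root of MSE times the sum of the reciprocal group sizes, the q statistic is the absolute difference divided by that standard error, and the confidence limits are the difference plus or minus Table q times the standard error. The critical value is copied from the output rather than computed, and the variable names are illustrative.

```python
import math

mse = 45.8164292353523   # Mean Square Error reported in the Dunnett header
table_q = 2.2581         # critical value reported in the Table q column
groups = {1: (14, 106.1586), 2: (26, 107.8269), 3: (18, 107.9678)}   # group -> (cases, mean)
control = 1

results = {}
n_c, mean_c = groups[control]
for g in (3, 2):
    n_g, mean_g = groups[g]
    diff = mean_g - mean_c
    se = math.sqrt(mse * (1 / n_g + 1 / n_c))
    q_stat = abs(diff) / se
    lower, upper = diff - table_q * se, diff + table_q * se
    results[g] = (diff, se, q_stat, lower, upper)
    verdict = "significant" if q_stat > table_q else "not significant"
    print(f"{g} - {control}: diff = {diff:.4f}, SE = {se:.4f}, q = {q_stat:.4f}, "
          f"CI = ({lower:.4f}, {upper:.4f}), {verdict}")
```

Both q statistics are well below Table q, which agrees with the empty Result column and with all three groups falling into a single homogeneous subset.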

 


Now we run the same model using Linear Regression.

Open DEMODATA and select Statistics 1 → Regression Analysis → Linear Regression. Highlight Region (C10) and Type (C11) on the Variables Available list and then click on the button [Full]. This will add two main factors and an interaction term to the Variables Selected list. Also select Output2 (C9) as [Dependent]. From the Output Options Dialogue select only the Regression Results, ANOVA of Regression, Actual and Fitted Values and Residuals options. The following results are obtained.

Note that the Regression Results, Actual and Fitted Values and Residuals options produce exactly the same results as the GLM model. In the ANOVA of Regression table, the Regression, Error and Total terms are also identical to the Explained, Error and Total terms of GLM’s ANOVA table. Moreover, if we add the two interaction sums of squares (1 × 1 and 2 × 1) we obtain the Region × Type sum of squares of GLM. The main effects are, however, different. This is because the GLM procedure adopts the Regression Approach when it computes the sum of squares, whereas Linear Regression’s sum of squares decomposition is strictly sequential. See 7.3.1.3. ANOVA Approaches.

Linear Regression

Regression Results

Valid Number of Cases: 58, 0 Omitted

Dependent Variable: Output2

* omitted due to multicollinearity

 

 

                        Coefficient   Standard Error   t-Statistic   Probability   Lower 95%   Upper 95%
Constant                   108.0764           1.8090       59.7426        0.0000    104.4463    111.7065
Region = 1                   0.6136           3.1333        0.1958        0.8455     -5.6740      6.9011
2                           -1.1288           2.3355       -0.4833        0.6309     -5.8153      3.5576
Type = 1                    -0.4889           3.8375       -0.1274        0.8991     -8.1895      7.2117
Region × Type = 1 × 1       -4.5739           5.2742       -0.8672        0.3898    -15.1574      6.0096
2 × 1                        5.0613           5.1060        0.9912        0.3262     -5.1848     15.3074
Region = 3                        *
Type = 2                          *
Region × Type = 1 × 2             *
2 × 2                             *
3 × 1                             *
3 × 2                             *

 

Residual Sum of Squares = 2382.4543
Standard Error = 6.7688
Mean of Y = 107.4679
Stand Dev of y = 6.7398
Correlation Coefficient = 0.2826
R-squared = 0.0798
Adjusted R-squared = -0.0086
F(5,52) = 0.9025
Probability of F = 0.4866
Durbin-Watson Statistic = 0.1641
log of likelihood = -190.2131
Press Statistic = 2916.1898

 

ANOVA of Regression

Due To                  Sum of Squares   DoF   Mean Square   F-Stat     Prob
Region = 1                      31.639     1        31.639    0.691   0.4098
2                                0.211     1         0.211    0.005   0.9462
Type = 1                         0.057     1         0.057    0.001   0.9721
Region × Type = 1 × 1          129.815     1       129.815    2.833   0.0983
2 × 1                           45.017     1        45.017    0.983   0.3262
Regression                     206.738     5        41.348    0.902   0.4866
Error                         2382.454    52        45.816
Total                         2589.193    57        45.424    0.991   0.5142
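The comparison between the two decompositions can be verified with the reported figures. The sketch below adds up the sequential sums of squares from the ANOVA of Regression table and sets them against the GLM ANOVA table: the two interaction rows should add up to the GLM Region × Type term and all five rows to the Explained term, while the Region and Type rows should not match their GLM counterparts. The variable names are illustrative.

```python
# Sequential sums of squares from the ANOVA of Regression table.
sequential_ss = {
    "Region = 1": 31.639,
    "Region = 2": 0.211,
    "Type = 1": 0.057,
    "Region × Type = 1 × 1": 129.815,
    "Region × Type = 2 × 1": 45.017,
}

# Sums of squares from the GLM ANOVA table (Regression Approach).
glm_ss = {"Region": 70.934, "Type": 1.122, "Region × Type": 174.832, "Explained": 206.738}

interaction = sequential_ss["Region × Type = 1 × 1"] + sequential_ss["Region × Type = 2 × 1"]
region = sequential_ss["Region = 1"] + sequential_ss["Region = 2"]
type_ = sequential_ss["Type = 1"]
explained = sum(sequential_ss.values())

print(f"Region × Type: sequential {interaction:.3f}   GLM {glm_ss['Region × Type']:.3f}")
print(f"Explained:     sequential {explained:.3f}   GLM {glm_ss['Explained']:.3f}")
print(f"Region:        sequential {region:.3f}   GLM {glm_ss['Region']:.3f}")
print(f"Type:          sequential {type_:.3f}   GLM {glm_ss['Type']:.3f}")
```

The interaction terms agree because they are entered last in the sequential decomposition and are therefore already adjusted for everything else in the model.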

 

Actual and Fitted Values

Row   Actual *   Fitted +   93.1100                                        114.9700
1     103.3600   108.0764                               *           +
2     105.1300   108.6900                                   *         +
3     105.9700   106.9476                                      * +
4     106.6100   107.5875                                       *  +
5     107.1900   108.6900                                         *   +
6     107.6300   106.9476                                        + *

 

Residuals

Row   Residuals   -14.6064                                             14.6064
1       -4.7164                       *
2       -3.5600                         *
3       -0.9776                               *
4       -0.9775                               *
5       -1.5000                             *