UNISTAT - the ultimate Excel statistics add-in

2.1.4. Creating Interaction, Dummy and Lag/Lead Variables

Variable Selection-Creating Interaction, Dummy and Lag/Lead Variables

In most regression models, it is possible to add new terms to the model using transformations of existing variables, thus eliminating the need to create them as data columns in the spreadsheet first. In these procedures, interaction, dummy and lag/lead terms can be specified during the variable selection phase. The program will then create these terms in its temporary memory internally. This feature is available in the following procedures:

      Statistics 1

      Matrix Statistics

Statistics 1 → Regression Analysis →

      Linear Regression

      Stepwise Regression

      Logit / Probit / Gompit

      Logistic Regression

      Multinomial Regression

      Poisson Regression

Statistics 2 → Survival Analysis →

      Cox Regression

Note that Matrix Statistics, although it is not a regression procedure as such, can be used with the models selected in other regression procedures. A particularly useful feature here is the option to send the entire final raw data (X) matrix of the regression model to the Output Medium, so that you can see the actual values of the interaction, dummy and lag/lead terms generated by the program internally. In Stand-Alone Mode, this output can then be sent to the Data Processor and used in other procedures if necessary (see 7.1. Matrix Statistics).

On the Variable Selection Dialogue of the above procedures, up to four smaller buttons [Interaction], [Dummy], [Full] and [Lag/Lead] will appear just under the [Variable] button. The Cox Regression procedure does not have a [Lag/Lead] button, as it is irrelevant for this procedure.

The behaviour of these buttons differs from other standard task buttons in that:

1)    They can be used to select items from either the Variables Selected list or the Variables Available list.

2)    When they are used on a selection, the source (selected) items are not omitted from the list.

3)    To de-select the new variables created by these buttons from the Variables Selected list, the [Variable] button should be used (not the button with which they were created).

The functions of these buttons are as follows.

Interaction: This button becomes activated when one or more items are selected from either the Variables Available or the Variables Selected list. If only one variable is highlighted, a new term is added to the Variables Selected list which is the product of this variable with itself, e.g. C2 Wages × C2 Wages. If two variables are highlighted, the new term is the product of these two variables, e.g. C10 Region × C11 Type. Interactions of up to three variables (three-way) are allowed. Interactions of string, date, time, dummy or lag/lead variables are not allowed. To obtain dummy variables for an interaction, create the interaction first and then create dummy variables for it.

      Note that a special cross sign is used between the variables in an interaction term. If there is a problem with this character on a non-English operating system, you can enter and edit the following line in the Unistat60.ini file under the [Options] group to display any other character (say *):

   InteractionCross=×

      This sign also appears in ANOVA and GLM output with interaction terms.
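      The arithmetic behind an interaction term can be sketched in a few lines of Python. This is an illustration only, not UNISTAT code: the function name interaction and the sample values are hypothetical, and the sketch assumes that an interaction term is simply the row-wise product of the highlighted numeric columns.

```python
# Illustrative sketch, not UNISTAT code: names and values are made up.
from math import prod


def interaction(*columns):
    """Return the row-wise product of one, two or three numeric columns."""
    if not 1 <= len(columns) <= 3:
        raise ValueError("only one-, two- or three-way terms are sketched here")
    if len({len(column) for column in columns}) != 1:
        raise ValueError("all columns must have the same number of rows")
    # A single column is multiplied by itself, as in C2 Wages × C2 Wages.
    if len(columns) == 1:
        columns = (columns[0], columns[0])
    return [prod(row) for row in zip(*columns)]


wages = [10.0, 12.5, 8.0]   # stands in for C2 Wages
region = [1, 2, 3]          # stands in for C10 Region
kind = [2, 2, 1]            # stands in for C11 Type

wages_squared = interaction(wages)          # [100.0, 156.25, 64.0]
region_x_type = interaction(region, kind)   # [2, 4, 3]
```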

Dummy: This button is used to create n new (dummy) variables for a factor column containing n levels. Each dummy variable corresponds to a level of the factor column. A case in a dummy column will have the value of 1 if the factor contains the corresponding level in the same row, and 0 otherwise.
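      As a minimal sketch of this coding scheme (not UNISTAT code), the following Python function builds one 0/1 column per level. The function name dummy, the sample data and the choice to order levels by sorting are assumptions made for the illustration.

```python
# Illustrative sketch, not UNISTAT code: one 0/1 column per level of a factor.
def dummy(factor):
    """Return (levels, columns) where columns[k] is the dummy for levels[k]."""
    levels = sorted(set(factor))  # assumption: levels are taken in sorted order
    columns = [[1 if value == level else 0 for value in factor] for level in levels]
    return levels, columns


region = [1, 3, 2, 1, 3]  # a factor column with n = 3 levels
levels, columns = dummy(region)
# levels  is [1, 2, 3]
# columns is [[1, 0, 0, 1, 0], [0, 0, 1, 0, 0], [0, 1, 0, 0, 1]]
```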

      One or more categorical variables can be selected from one of the Variables Available or independent variables lists. A new entry will be created in the independent variables list in the form of Dummy(C10 Region). If the selection contains interactions, dummies will be created in the form of Dummy(C10 Region × C11 Type).

      Sometimes, it may be important to select the interaction terms in a particular order, an order which is different from what they have in the Variables Available list. In this case, you can select the columns for the independent variables list by clicking on the [Variable] button in the desired order before selecting them for interactions or dummies.

      A further dialogue then asks whether to include dummy variables corresponding to all levels of a factor (or an interaction term of factors), or to omit the first level or the last level. Note that this dialogue may have other fields in some procedures. Its purpose is to let you remove the linear dependencies that arise when the all option is selected and a dummy variable is created for each level of a factor. That choice results in an over-parameterised model, since the full set of dummy variables for any factor always adds up to the unity vector. If a model is run in this configuration, the regression algorithm will detect and omit the dummy variables that cause the collinearity as and when they occur.

      Note that if the last dummy variable is omitted for each factor, each factor with i levels will end up contributing (i ‑ 1) degrees of freedom to the model (if no other dependencies exist). Again, if no other dependencies exist, interaction terms will only contribute (i ‑ 1)(j ‑ 1) degrees of freedom in the case of a two-way interaction and (i ‑ 1)(j ‑ 1)(k ‑ 1) in the case of a three-way interaction. You are recommended to think about these issues before running a model with dummy variables, and to omit either the first or the last level in order not to end up with surprising results.

      If you wish to omit levels other than the first or the last (e.g. to apply contrasts), then you are recommended to construct dummy variables as data columns first. In Stand-Alone Mode, you can do this automatically using the Data Processor’s Dummy() function and then select the resulting columns as [Variable]s.
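      The linear dependency and the degrees-of-freedom counts discussed above can be checked with a short Python sketch. It is not UNISTAT code: dummy_columns, interaction_df and the omit argument are hypothetical names, and the sketch assumes no dependencies other than the one created by the full dummy set.

```python
# Illustrative sketch, not UNISTAT code: why a full dummy set is over-parameterised.
from math import prod


def dummy_columns(factor, omit=None):
    """One 0/1 column per level; omit may be None, "first" or "last"."""
    levels = sorted(set(factor))
    if omit == "first":
        levels = levels[1:]
    elif omit == "last":
        levels = levels[:-1]
    return [[1 if value == level else 0 for value in factor] for level in levels]


def interaction_df(*level_counts):
    """Degrees of freedom of an interaction: (i - 1)(j - 1)... per factor."""
    return prod(n - 1 for n in level_counts)


region = [1, 2, 3, 1, 2, 3]  # a factor with i = 3 levels

full = dummy_columns(region)
row_sums = [sum(row) for row in zip(*full)]   # the unity vector: all ones

reduced = dummy_columns(region, omit="last")  # i - 1 = 2 columns remain

two_way = interaction_df(3, 2)       # (3 - 1)(2 - 1) = 2
three_way = interaction_df(3, 2, 4)  # (3 - 1)(2 - 1)(4 - 1) = 6
```

      Because every row of the full set sums to 1, the full set duplicates a constant column; dropping one level per factor removes that dependency.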

Full: This button becomes activated when two or more items are selected from one of the independent variables or Variables Available lists. Dummy variables for each column selected, as well as dummies for all possible interaction terms (up to 3-way) will be created automatically. The number of new independent variables added to the model is determined by the number of distinct values (levels) of the selected columns. For each interaction term, this number is equal to the product of the number of levels in all variables in the term.

      For example, suppose two columns are highlighted, C10 Region and C11 Type. Then the following three terms will be created: Dummy(C10 Region), Dummy(C11 Type), Dummy(C10 Region × C11 Type). If four columns are highlighted, C1, C2, C3, C4, then 14 new dummy terms will be created in the following order: Dummy(C1), Dummy(C2), Dummy(C3), Dummy(C4), Dummy(C1 × C2), Dummy(C1 × C3), Dummy(C1 × C4), Dummy(C2 × C3), Dummy(C2 × C4), Dummy(C3 × C4), Dummy(C1 × C2 × C3), Dummy(C1 × C2 × C4), Dummy(C1 × C3 × C4), Dummy(C2 × C3 × C4). Also suppose that C1, C2, C3, C4 contain 2, 3, 4, 5 levels respectively. Then the total number of variables added to the model will be 239 (14 for the single columns, 71 for the two-way terms and 154 for the three-way terms).
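      The counts in this example can be reproduced with a short Python sketch. It is not UNISTAT code: the function full_terms is a hypothetical name, and the sketch assumes that each term contributes the product of the level counts of its columns, as stated above.

```python
# Illustrative sketch, not UNISTAT code: counting the terms created by [Full].
from itertools import combinations
from math import prod


def full_terms(levels, max_order=3):
    """Return (columns, number_of_dummies) for every term up to max_order-way.

    levels maps a column name to its number of distinct values.
    """
    names = list(levels)
    terms = []
    for order in range(1, max_order + 1):
        for combo in combinations(names, order):
            terms.append((combo, prod(levels[name] for name in combo)))
    return terms


levels = {"C1": 2, "C2": 3, "C3": 4, "C4": 5}
terms = full_terms(levels)

labels = ["Dummy(" + " × ".join(combo) + ")" for combo, _ in terms]
total = sum(count for _, count in terms)

print(len(terms), total)  # 14 239
```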

      WARNING! Ensure that the columns selected as Dummy are categorical variables containing a limited number of distinct values.

Lag/Lead: This button is used to create new variables by shifting the rows of an existing variable up or down. If you highlight some columns and click on the [Lag/Lead] button, these will be transferred to the Variables Selected list as Lag(C1 Label1;0), Lag(C2 Label2;0), etc. In this case, after clicking [Next] a further dialogue will ask for the size of the lags (or leads) for each [Lag/Lead] variable selected.

      You can select the same variable an unlimited number of times. Enter a negative integer to define a selection as a lag and a positive integer for a lead. Leaving an entry as zero means that the selected variable will be included in the model without modification. When, for instance, -2 is entered for a lag, the program will create a new variable internally, starting from the third case of the original column. Therefore this variable will have two observations missing at the end. On the other hand, when 2 is entered for a lead variable, its first observation will correspond to the third row of the data matrix, the first two cases will be defined as missing and the last two cases will be omitted.

      WARNING! Selection of lag/lead variables results in a loss of degrees of freedom. You must ensure that there is a sufficient number of cases (rows) in the data matrix when lags and / or leads are selected.
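      The row shifting can be sketched in Python as follows. This is not UNISTAT code: the function shift is a hypothetical name, None stands in for a missing value, and the direction of the shift follows the wording of the description above literally rather than any verified program behaviour.

```python
# Illustrative sketch, not UNISTAT code: shifting a column by k rows.
def shift(column, k):
    """Shift a column by k rows; None marks a missing value.

    Following the text above: k = -2 starts the new variable from the third
    case of the original column, leaving two missing values at the end;
    k = 2 leaves the first two cases missing and drops the last two cases.
    """
    n = len(column)
    return [column[i - k] if 0 <= i - k < n else None for i in range(n)]


x = [10, 20, 30, 40, 50]

lagged = shift(x, -2)    # [30, 40, 50, None, None]
led = shift(x, 2)        # [None, None, 10, 20, 30]
unchanged = shift(x, 0)  # [10, 20, 30, 40, 50]

# Rows that are complete in both the original and the shifted column.
usable_rows = [row for row in zip(x, lagged) if None not in row]
```

      In this sketch only three of the five rows remain complete after a shift of two, which is the loss of cases the warning above refers to.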

      WARNING! Interpretation of lag/lead variables may not be clear when they are selected along with factors for categorical analysis (see 2.1.2. Categorical Data Analysis) or with the Data Processor’s DataSelect Row function. The use of lag/lead variables is not prevented in such cases, but you should ensure that their effect is unambiguous.