Documentation of EG13PC  (v0.5-pc)            last update: BQ 7/10/1991

Purpose: fitting multivariate binary regression models that allow
-------  more than one class in each cluster, and a different
         regression for each class and for the dependence between
         and within classes.
         Allows the choice between GEE1 and GEE2.

References:
-----------
  Liang, Zeger and Qaqish (1989). "Multivariate Regression Models
  using Generalized Estimating Equations" Technical Report,
  Department of Biostatistics, The Johns Hopkins University,
  School of Hygiene and Public Health.
  Later published in 1992, JRSS-B (with discussion).

  Qaqish and Liang (1990). "Marginal Models for Correlated Binary
  Data with Multiple Classes and more than one Level of Nesting"
  Technical Report, Department of Biostatistics, The Johns Hopkins
  University, School of Hygiene and Public Health.
  Later published in 1992, Biometrics.

Version: 0.5-pc  Beta test version.
-------

Environment: IBM-PC running DOS
-----------
  The program does not require a math processor. However, if one is
  installed in the system it will be used.
  The program was compiled with Turbo Pascal 5.5
  ((C) Borland International)

Necessary files: To run the program the following files are needed
---------------
      EG13PC.EXE : the executable code
      X.DAT      : a data file
      X.CTL      : a control file

Output: X.LST : the output
------

To run the program: From the DOS command line issue:
------------------
      EG13PC X.CTL X.LST

  Notice that the control file X.CTL tells the program what the data
  file is. This allows the use of several control files with the same
  data file.

Data file format: Free format with one record per observation.
----------------
  The variables are:
      Cluster id
      The class number
      The response variable (y = 0/1)
      The regressor(s)

  The record length of the data and control files must be <= 255.

Control file format:
-------------------
The control file can have RECFM F or V, maximum LRECL is 80.

The first and second records are titles that will be printed on the
output file.
The third record is the data file name.

The fourth record contains an integer, the number of classes.
The maximum allowed is 12.

The next record contains an integer, the number of variables that
follow the response in the data file. It is not necessary that all
these variables be used in the regressions. The maximum is 64.

The next record contains two integers, i1 and i2:
      i1 = number of parameters.
      i2 = number of parameters for main effects.
Naturally i1 is greater than or equal to i2. (not checked)
It must be arranged so that the odds-ratio parameters are the last
in the parameter vector. (not checked)

The next record contains a real number, the convergence criterion.
Iteration stops when the sum of the absolute changes in all
parameters between two iterations is less than that number or the
maximum number of iterations is reached, whichever occurs first.

The next record contains an integer, the maximum number of
iterations.

The next record contains an integer i1, say. If i1 = 1 then the
current estimates of the parameters will be printed at each
iteration.

The next record contains an integer, i.
      i = 1 : GEE1 (has a bug, do not use)
      i = 2 : GEE2
(7/10/1991 : The PC PASCAL version has a bug in the GEE1
implementation, so this option should not be used. The bug will be
fixed in the upcoming C version.)

The next record contains an integer i1, say.
If i1 = 1 then the Zhao and Prentice formulae for third and fourth
order moments will be used.
If i1 = 2 then the exact solution will be used for these moments.

The next record is ignored.

The next and following records, as many as there are parameters,
specify labels for the parameters. These will be used to label the
output. Only the first 16 characters will be used.

The next record is ignored.

The next record(s) contain initial values for the parameters.
These may span one or more records.

The next record is ignored.

The next records specify the regressions.
If the number of classes is C, then C + C + {C * (C-1) / 2} records
are required:
      C specifications for the regressions for each class.
      C specifications for the regressions for the within class
        odds ratios.
      C * (C-1) / 2 specifications for the regressions for the
        between class odds ratios.

examples:
      C     number of specifications
      1      2
      2      5
      3      9
      4     14
      5     20
      6     27
      7     35
      8     44

Each regression is specified by a sequence of integers as follows:

      i1 i2 i3 i4 i5 i6 ...

where i1 and i2 are class numbers. To specify the regression
for the main effects for a class set i2 = 0.
i3, i5, ... are the parameter indices.
i4, i6, ... are the regressor indices.

If B is the regression parameter and x is the vector of regressors
in the data file then the regression will be

      B(i3)*x(i4) + B(i5)*x(i6) + ...

If i3 = 0 then that regression is set to 0.

It must be arranged so that the odds-ratio parameters are the last
in the parameter vector. (not checked)

Each parameter should appear at least once in the regression
specifications. The order of specification is not important.

Any additional records will be ignored.

Note: extra text following numbers on the control file is allowed
except on the following:
      record number 3: the data file name
      the record(s) specifying the initial parameter values
      the record(s) specifying the regressions.
This is demonstrated by the example control file.

Current program limits:
--------------
  maximum number of classes = 12
  maximum cluster size = 8
  maximum number of observations: the sum of n + n * (n-1) / 2 over
      all clusters must be <= 1000, where n is the cluster size.
  maximum number of parameters = 10
  maximum number of potential regressors in the input file = 64
  maximum number of iterations that could be specified = 100

Technical notes:
---------------
The values of the regressors used in the regression for the within
and between class associations should be the same for all members of
any given cluster.
The program currently uses the values from the last member in each cluster.
Don't rely on this "feature". This will change in future versions of
the program.

The program does a fair amount of checking on the control file and
the data file. However, it is not an exhaustive check.

The model specification is very flexible. Completely ridiculous
models can be specified. The program has no way of recognizing
these. Care is needed here.

Example control file:
--------------------

-- Title1: example control file --
-- Title2:                      --
COPD1.DAT
2     = number of classes.  Suppose class 1 = P, class 2 = S
6     = dim (x): x1 x2 x3 x4 x5 x6
9 6   = total=9, main effects=6
0.001 = convergence criterion
50    = maximum number of iterations
1     = print current estimates each iteration. 1=yes, 0=no
2     = 1=GEE1, 2=GEE2
2     = exact     2=exact  1=Z&P approx.
labels for beta: these will appear on the output file
1 Intercept
2 Sex (F)
3 Race (B)
4 Age-50
5 Smoker
6 Ex smoker
7 P.P
8 S.S
9 P.S
Initial estimates:
-0.83188 -0.80439 -0.91741 0.03796 1.14924 0.39144 0.93362 0.934 0.934
model specification:
1 0 1 1 2 2 3 3 4 4 5 5 6 6
2 0 1 1 2 2 3 3 4 4 5 5 6 6
1 1 7 1
2 2 8 1
1 2 9 1
-- End of the control File --

The model specified above is:

Main effects:
  For class 1 (P):
    logit(Pr{Y=1}) = B1*x1 + B2*x2 + B3*x3 + B4*x4 + B5*x5 + B6*x6
  For class 2 (S):
    logit(Pr{Y=1}) = B1*x1 + B2*x2 + B3*x3 + B4*x4 + B5*x5 + B6*x6

Odds ratios:
  Within class 1 (P.P):            log(odds ratio) = B7*x1
  Within class 2 (S.S):            log(odds ratio) = B8*x1
  Between classes 1 and 2 (P.S):   log(odds ratio) = B9*x1

Suppose that a model with no association between classes 1 and 2 is
required. Then the last specification record of the control file
should be:

1 2 0

the sixth record becomes

8 6   = total=8, main effects=6

and the label for B9

9 P.S

should be deleted.
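
The record counts, the observation limit and the layout of a
regression specification record described above can be checked by
hand before a run. The following is a minimal Python sketch of those
three checks. It is not part of EG13PC, and the function names
(spec_record_count, moment_rows, parse_spec) are hypothetical,
introduced here only for illustration. It assumes the specification
record holds nothing but blank-separated integers, as required above.

```python
# Hypothetical helpers for checking a control file by hand.
# They are not part of EG13PC; the names are illustrative only.

def spec_record_count(c):
    """Number of regression specification records for c classes:
    c main-effect, c within-class and c*(c-1)/2 between-class."""
    return c + c + c * (c - 1) // 2


def moment_rows(cluster_sizes):
    """Sum of n + n*(n-1)/2 over all clusters.
    The documented program limit for this sum is 1000."""
    return sum(n + n * (n - 1) // 2 for n in cluster_sizes)


def parse_spec(record):
    """Split a record 'i1 i2 i3 i4 i5 i6 ...' into (i1, i2, terms),
    where terms is a list of (parameter index, regressor index)
    pairs. An i3 of 0 means the regression is set to 0, which is
    returned here as an empty list of terms."""
    nums = [int(tok) for tok in record.split()]
    i1, i2, rest = nums[0], nums[1], nums[2:]
    if rest and rest[0] == 0:
        return i1, i2, []
    if len(rest) % 2 != 0:
        raise ValueError("parameter/regressor indices must come in pairs")
    return i1, i2, list(zip(rest[0::2], rest[1::2]))


if __name__ == "__main__":
    print(spec_record_count(2))      # two classes, as in the example file
    print(moment_rows([8, 8, 3]))    # three clusters of sizes 8, 8 and 3
    print(parse_spec("1 2 9 1"))     # between-class record from the example
    print(parse_spec("1 2 0"))       # no association between classes 1 and 2
```

For the example control file, spec_record_count(2) gives 5, which
matches the five records under "model specification:".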