%REDMON USERS GUIDE (TEXT VERSION)
Version 1.0

Prepared by:
   Sean O'Brien
   Michael Schell
   Department of Biostatistics
   University of North Carolina
   Chapel Hill, NC 27514

REDMON6.TXT (5/14/97)


A: OVERVIEW

Isotonic regression is a nonparametric method appropriately used when a dependent response variable is monotonically related to an independent predictor variable. The regression estimate is a step function which reduces the description of n points to L (<= n) level sets. This method yields a model consisting of L more-or-less homogeneous subpopulations. The estimate for each group (an interval in the domain) is equal to the average of the response values for points in the group.

Under isotonic regression, the number of level sets is often large, preventing simple description. The reduced monotonic regression and reduced isotonic regression procedures performed by %REDMON improve the parsimony of such models by reducing the number of level sets. This is accomplished using a backward elimination algorithm to combine groups that do not differ significantly from one another.

The independent variable is assumed to be observed without error. The errors in the dependent variable, estimated by the residuals obtained by subtracting the reduced isotonic fit from the observed values, are assumed to be independent and identically distributed Gaussian with zero mean and constant variance.

REDUCED MONOTONIC REGRESSION VERSUS REDUCED ISOTONIC REGRESSION

Isotonic regression forces the regression estimate to increase or decrease in the direction specified by the user. It is appropriate when the direction of the association is known with certainty. Reduced monotonic regression is a two-sided extension of the reduced isotonic method. The direction of the trend is determined by the data. When the direction is known, the one-sided version is more powerful for detecting differences between adjacent groups.
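The level sets described at the start of this overview can be illustrated with a small SAS sketch of the pooled adjacent violators algorithm (named under DETAILS). This is not the macro's own code. The data set names work.toy and work.levels, the variable names, the six toy observations, and the fixed array size of 100 are all assumptions made for illustration. The sketch handles only the increasing direction, unit weights, and input already sorted by the predictor.

```sas
/* Toy data (hypothetical), already sorted by the predictor x. */
data work.toy;
  input x y;
  datalines;
1 1
2 3
3 2
4 4
5 3
6 5
;
run;

/* Illustrative pooled adjacent violators pass, increasing direction. */
/* Each level set is stored as a running total and a count.           */
data work.levels (keep=level n mean);
  array tot[100] _temporary_;   /* sum of y in each level set (size is an assumption) */
  array cnt[100] _temporary_;   /* number of points in each level set                 */
  set work.toy end=last;

  /* Start a new level set holding the current observation. */
  nlev + 1;
  tot[nlev] = y;
  cnt[nlev] = 1;

  /* Pool backwards while the previous mean exceeds the newest mean. */
  pooled = 1;
  do while (pooled);
    pooled = 0;
    if nlev > 1 then do;
      if tot[nlev-1] / cnt[nlev-1] > tot[nlev] / cnt[nlev] then do;
        tot[nlev-1] = tot[nlev-1] + tot[nlev];
        cnt[nlev-1] = cnt[nlev-1] + cnt[nlev];
        nlev = nlev - 1;
        pooled = 1;
      end;
    end;
  end;

  /* After the last observation, write one row per level set. */
  if last then do level = 1 to nlev;
    n = cnt[level];
    mean = tot[level] / cnt[level];
    output;
  end;
run;

proc print data=work.levels noobs;
run;
```

Traced by hand, the six toy responses pool into four level sets with means 1, 2.5, 3.5 and 5. The reduction step performed by %REDMON would then test whether adjacent sets among these can be combined.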
CHOOSING A SIGNIFICANCE LEVEL

When the predictor and response variables are monotonically related, the appropriate estimate is a step function with at least one step. When the predictor and response variables are unrelated, however, the correct model is a single flat line. %REDMON will choose the flat line model with probability 1-ALPHA under the null. The value ALPHA is the overall type-I error probability. It corresponds to the test H0: no trend versus H1: isotonic or monotonic trend. The user may specify ALPHA using the ALPHA= option or may use the default ALPHA=.05.

The actual number of level sets in the reduced monotonic regression model depends on the data and on the significance criterion ("Significance Level to Stay") used to determine when the elimination algorithm ends. The macro chooses this value automatically such that all groups will be collapsed with probability 1-ALPHA under the null. Because this value is set internally, users do not need to be aware of it. Nonetheless, a short description of how this value is chosen is provided in the details section. Interested users may override the automatic selection of this value using the SLS= option.

References:

   Robertson, T., Wright, F. T., and Dykstra, R. L. (1988), Order-Restricted Statistical Inference, New York: Wiley.

   Schell, M. and Singh, B. (1997), "The Reduced Monotonic Regression Method", JASA 92:128-35.


B: GETTING STARTED

Before %REDMON can be used it has to appear in your SAS program. It is not necessary to re-type the program. Instead use the %INCLUDE statement to read the program from a file. For example, if the macro is stored in the 'c:\' directory use the command:

   %INCLUDE 'c:\redmon.sas';

After the %INCLUDE statement, the program may be invoked wherever a PROC statement could appear. To do so, submit the command %REDMON, followed by arguments which appear in parentheses.
For example:

   %REDMON(DATA=work.mydata, Y=weight, X=height);


ARGUMENTS

Arguments, appearing in parentheses after the word %REDMON, specify the model, request special output, and change defaults. The following table lists them:

   NAME      PURPOSE                                  DEFAULT
   --------  ---------------------------------------  --------------------------
   DATA=     Specify the SAS data set                 use last created dataset
   X=        Specify predictor variable(s)            [required]
   Y=        Specify response variable                [required]
   Z=        Specify by-group variable                no by-groups
   METHOD=   Specify isotonic increasing,             monotonic (2-sided)
             isotonic decreasing, or
             monotonic method
   ALPHA=    Specify target overall type-I            overall alpha = .05
             error level
   SLS=      Specify significance level to stay       corresponds to alpha = .05
             in backward elimination of level sets
   FREQ=     Name a variable containing               no weights
             frequencies
   PLOT=     Request a high-resolution graphics       no plots
             plot and specify location
   OUT=      Request output data sets                 no output data sets


EXPLANATION OF PARAMETERS

DATA=

The DATA= argument specifies the name of the SAS data set containing your variables. If this argument is omitted, %REDMON uses the most recently created data set (_LAST_).

   Data set specified:
      %REDMON(DATA=work.mydata, X=height, Y=weight);

   Data set unspecified:
      %REDMON(X=height, Y=weight);

X=

The X= argument specifies the name of the predictor variable. %REDMON syntax allows more than one predictor to be specified. However, this does not result in a multiple regression model. Instead, %REDMON fits a separate model for each predictor in the X= list.

   Single predictor:
      %REDMON(DATA=work.mydata, X=height, Y=weight);

   Multiple predictors:
      %REDMON(DATA=work.mydata, X=height wingspan shoesize, Y=weight);

Note that X= appears only once and the predictor names are separated by blanks, not commas.

NOTE: %REDMON does not currently handle missing predictor values. Including observations with missing values will yield unpredictable results.

Y=

Y= specifies the name of the response variable.
Only one response variable is allowed.

NOTE: Observations with missing response variables are eliminated from the analysis.

Z=

The Z= argument allows separate models to be fit for observations at each level of a given stratification variable. If more than one Z variable is specified, %REDMON fits separate models for each level formed by cross-classifying them.

   Stratify by gender:
      %REDMON(DATA=work.mydata, Y=weight, X=height, Z=gender);

   Stratify by gender*race:
      %REDMON(DATA=work.mydata, Y=weight, X=height, Z=gender race);

METHOD=

   RECOGNIZED OPTIONS:
      METHOD=up
      METHOD=down
      METHOD=best

%REDMON performs reduced monotonic regression by default. This means that the macro determines the direction of the trend from the data. When the direction of the trend is known, reduced isotonic (antitonic) regression is more appropriate. This one-sided method uses lower critical values than reduced monotonic regression, corresponding to greater power.

The METHOD= argument is used to request isotonic regression with the direction specified. The following values are allowed: 'up' (for increasing trend, often called isotonic), 'down' (for decreasing trend, often called antitonic), and 'best' (for monotonic). When multiple predictors are included, it is possible to specify a different method for each. The first method in the list corresponds to the first predictor, the second method to the second predictor, etc.

   Single predictor:
      %REDMON(DATA=work.mydata, X=height, Y=weight, METHOD=up);

   Multiple predictors:
      %REDMON(DATA=work.mydata, X=height wingspan shoesize, Y=weight, METHOD=up best down);

ALPHA=

Reduced monotonic (isotonic) regression improves the parsimony of the conventional isotonic regression model by combining groups that do not significantly differ. When the predictor and response variables are unrelated, the correct model collapses all groups into a single one. This occurs with probability 1-ALPHA under the null.
The value ALPHA is the type-I error rate for the test H0: no trend versus H1: isotonic or monotonic trend. By default, the target ALPHA = .05. The ALPHA= option is used to specify other values for ALPHA.

   Overall ALPHA specified:
      %REDMON(DATA=work.mydata, X=height, Y=weight, ALPHA=.1);

NOTE: ALPHA values are approximate, not exact. The approximation is accurate for .01 < ALPHA < .10 and sample size 20 < n < 800. (See details.)

SLS=

Level sets are eliminated using a backward elimination algorithm which combines adjacent groups one at a time. The algorithm ends when each group in the model produces F statistics significant at the SLS= level. By default, the SLS= value is chosen internally as a function of the desired overall type-I error probability, i.e., the probability ALPHA that all groups are combined into a single one under the null hypothesis. Unless the user wishes to have direct control over the number of level sets eliminated, this option should not be used. ALPHA= and SLS= should never both be specified.

   SLS specified:
      %REDMON(DATA=work.mydata, X=height, Y=weight, SLS=.001);

NOTE: When SLS= is specified directly, the overall type-I error rate is no longer controlled. The SLS= value will always be smaller than ALPHA since SLS= is a comparison-wise significance level and ALPHA refers to an overall error rate which accounts for multiple comparisons. (See details.)

FREQ=

Like the FREQ statement in PROC REG or PROC LOGISTIC, the FREQ= argument specifies a variable whose values represent frequencies. When this option is used, each observation in the input data set is assumed to represent n observations, where n is the value of the FREQ variable (SAS/STAT Users Guide Version 6). The analysis produced using FREQ= is the same as an analysis produced using a data set that contains n observations in place of each observation in the input data set.

Note that the sample size used for determining SLS is considered to be equal to the sum of the values of the FREQ variable.
Using the ALPHA= option will yield a conservative test due to the tied observations.

   FREQ var named 'frq':
      %REDMON(DATA=work.mydata, X=height, Y=weight, FREQ=frq);

PLOT=

   RECOGNIZED OPTIONS:
      PLOT=screen
      PLOT=FILE
      PLOT=FILE directory

The PLOT= argument requests a high resolution plot to be printed, either to a PostScript file or to the display manager default device. To print to display manager, use the command PLOT=screen. To print to file, use PLOT=file. This creates a file named '_PLOT1.PS'. If such a file exists already, it is overwritten. If multiple plots are printed in a single %REDMON invocation then the files are numbered sequentially, i.e. '_PLOT1.PS', '_PLOT2.PS', etc.

If PLOT=file is used, it is also possible to specify the directory in which to store the files. This is done by including the name of the directory after the keyword 'file'.

   Plot to screen:
      %REDMON(DATA=work.mydata, X=height, Y=weight, PLOT=screen);

   Plot to file:
      %REDMON(DATA=work.mydata, X=height, Y=weight, PLOT=file);

   Plot to file in 'c:\plots\' directory:
      %REDMON(DATA=work.mydata, X=height, Y=weight, PLOT=file c:\plots\);

OUT=

OUT=yes requests that two data sets be output for each model.

_FINAL1 contains one observation for each level set. It provides the sample size, range of predictor values, predicted response, and standard deviation for each level set.

_FIT1 contains one observation for each observation in the input data set. It provides the isotonic fit, reduced isotonic (monotonic) fit and residual for each observation.

If multiple models are specified the files are numbered sequentially, _FIT1, _FIT2, ... and _FINAL1, _FINAL2, ... If files with the same names exist, they are overwritten.

   Request output data sets:
      %REDMON(DATA=work.mydata, X=height, Y=weight, OUT=yes);


DETAILS

Isotonic regression minimizes the sum of squares of deviations from the model to the data under the restriction that the fit is non-decreasing, i.e. E(Y|X = x) is monotonic in X.
This is accomplished using the pooled adjacent violators algorithm (PAVA). %REDMON implements this algorithm in a DATA step.

%REDMON implements the level sets backward elimination algorithm using PROC REG with SELECTION=BACKWARD. L-1 dummy variables are used to identify groups. The model is parameterized such that elimination of a predictor variable corresponds to replacing two adjacent level sets with a single one.

MISSING PREDICTOR AND RESPONSE VALUES

%REDMON does not currently handle missing predictor values. For best results, eliminate observations with missing predictor values in a data step prior to invoking the macro.

Missing response values are allowed. Observations with missing response values are eliminated from the analysis.

ALPHA and SLS

The SLS value chosen by %REDMON to yield a given ALPHA is based on simulation results described in Schell and Singh, 1997. Table 1 of this article provides estimates of SLS for three values of ALPHA (ALPHA=.01, .05, .10) and five sample sizes (n=10, 20, 50, 200, 800). To handle other values of ALPHA and other sample sizes, %REDMON uses interpolation. The interpolation method used by the macro is appropriate for .01 < ALPHA < .10 and 20 < n < 800. When ALPHA and n are outside of this range the same formulas are used. However, no simulations have been conducted to determine their accuracy. For sample sizes where 10 < n < 20 some guidance is given in Table 1 of Schell and Singh, 1997.

MEMORY AND SOFTWARE REQUIREMENTS

%REDMON requires SAS version 6.11 or later. An implementation for older SAS versions is available upon request.

%REDMON's implementation of the pooled adjacent violators algorithm (PAVA) is optimized for the case where the entire data set (predictor, response, and weight variables plus other temporary variables) fits in memory. We have not experienced any problems with this limitation. An implementation for large data sets is available upon request.
The high resolution graphics option was developed on PC SAS version 6.11 for Windows and tested successfully on OS/2 and UNIX platforms.

Comments on how to improve this and other aspects of the program are welcome and appreciated. Please contact one of the authors to request the macro or report bugs.

   Sean O'Brien      sobrien@bios.unc.edu
   Michael Schell    mschell@bios.unc.edu