/*************************************************************** Stata code for Causal Inference: What If by Miguel Hernan & Jamie Robins Date: 10/10/2019 Author: Eleanor Murray For errors contact: ejmurray@bu.edu ***************************************************************/ /*************************************************************** PROGRAM 13.1 Estimating the mean outcome within levels of treatment and confounders: Data from NHEFS Section 13.2 ***************************************************************/ clear use "http://bios.unc.edu/~mhudgens/otago/nhefs.dta" *Estimate the the conditional mean outcome within strata of quitting smoking and covariates, among the uncensored*/ glm wt82_71 qsmk sex race c.age##c.age ib(last).education c.smokeintensity##c.smokeintensity c.smokeyrs##c.smokeyrs ib(last).exercise ib(last).active c.wt71##c.wt71 qsmk##c.smokeintensity predict meanY summarize meanY /*Look at the predicted value for subject ID = 24770*/ list meanY if seqn == 24770 /*Observed mean outcome for comparison */ summarize wt82_71 /*************************************************************** PROGRAM 13.2 Standardizing the mean outcome to the baseline confounders Data from Table 2.2 Section 13.3 ****************************************************************/ clear input str10 ID L A Y "Rheia" 0 0 0 "Kronos" 0 0 1 "Demeter" 0 0 0 "Hades" 0 0 0 "Hestia" 0 1 0 "Poseidon" 0 1 0 "Hera" 0 1 0 "Zeus" 0 1 1 "Artemis" 1 0 1 "Apollo" 1 0 1 "Leto" 1 0 0 "Ares" 1 1 1 "Athena" 1 1 1 "Hephaestus" 1 1 1 "Aphrodite" 1 1 1 "Cyclope" 1 1 1 "Persephone" 1 1 1 "Hermes" 1 1 0 "Hebe" 1 1 0 "Dionysus" 1 1 0 end *i.Data set up for standardization: create 3 copies of each subject* *first, duplicate the dataset and create a variable 'interv' which indicates which copy is the duplicate (interv =1) expand 2, generate(interv) *next, duplicate the original copy (interv = 0) again, and create another variable 'interv2' to indicate the copy expand 2 if interv == 0, generate(interv2) *now, change the value of 'interv' to -1 in one of the copies so that there are unique values of interv for each copy* replace interv = -1 if interv2 ==1 drop interv2 *check that the data has the structure you want: there should be 1566 people in each of the 3 levels of interv* tab interv *two of the copies will be for computing the standardized result* *for these two copies (interv = 0 and interv = 1), set the outcome to missing and force qsmk to either 0 or 1, respectively* *you may need to edit this part of the code for your outcome and exposure variables* replace Y = . if interv != -1 replace A = 0 if interv == 0 replace A = 1 if interv == 1 *check that the data has the structure you want: for interv = -1, some people quit and some do not; for interv = 0 or 1, noone quits or everyone quits, respectively* by interv, sort: summarize A *ii.Estimation in original sample* *Now, we do a parametric regression with the covariates we want to adjust for* *You may need to edit this part of the code for the variables you want.* *Because the copies have missing Y, this will only run the regression in the original copy* *The double hash between A & L creates a regression model with A and L and a product term between A and L* regress Y A##L *Ask Stata for expected values - Stata will give you expected values for all copies, not just the original ones* predict predY, xb *Now ask for a summary of these values by intervention* *These are the standardized outcome estimates: you can subtract them to get the standardized difference* by interv, sort: summarize predY *iii.OPTIONAL: Output standardized point estimates and difference* *The summary from the last command gives you the standardized estimates* *We can stop there, or we can ask Stata to calculate the standardized difference and display all the results in a simple table* *The code below can be used as-is without changing any variable names* *The option "quietly" asks Stata not to display the output of some intermediate calculations* *You can delete this option if you want to see what is happening step-by-step* quietly summarize predY if(interv == -1) matrix input observe = (-1,`r(mean)') quietly summarize predY if(interv == 0) matrix observe = (observe \0,`r(mean)') quietly summarize predY if(interv == 1) matrix observe = (observe \1,`r(mean)') matrix observe = (observe \., observe[3,2]-observe[2,2]) *Add some row/column descriptions and print results to screen* matrix rownames observe = observed E(Y(a=0)) E(Y(a=1)) difference matrix colnames observe = interv value matrix list observe *to interpret these results:* *row 1, column 2, is the observed mean outcome value in our original sample* *row 2, column 2, is the mean outcome value if everyone had not quit smoking* *row 3, column 2, is the mean outcome value if everyone had quit smoking* *row 4, column 2, is the mean difference outcome value if everyone had quit smoking compared to if everyone had not quit smoking* /*************************************************************** PROGRAM 13.3 Standardizing the mean outcome to the baseline confounders: Data from NHEFS Section 13.3 ***************************************************************/ clear use "nhefs.dta" *i.Data set up for standardization: create 3 copies of each subject* *first, duplicate the dataset and create a variable 'interv' which indicates which copy is the duplicate (interv =1) expand 2, generate(interv) *next, duplicate the original copy (interv = 0) again, and create another variable 'interv2' to indicate the copy expand 2 if interv == 0, generate(interv2) *now, change the value of 'interv' to -1 in one of the copies so that there are unique values of interv for each copy* replace interv = -1 if interv2 ==1 drop interv2 *check that the data has the structure you want: there should be 1566 people in each of the 3 levels of interv* tab interv *two of the copies will be for computing the standardized result* *for these two copies (interv = 0 and interv = 1), set the outcome to missing and force qsmk to either 0 or 1, respectively* *you may need to edit this part of the code for your outcome and exposure variables* replace wt82_71 = . if interv != -1 replace qsmk = 0 if interv == 0 replace qsmk = 1 if interv == 1 *check that the data has the structure you want: for interv = -1, some people quit and some do not; for interv = 0 or 1, noone quits or everyone quits, respectively* by interv, sort: summarize qsmk *ii.Estimation in original sample* *Now, we do a parametric regression with the covariates we want to adjust for* *You may need to edit this part of the code for the variables you want.* *Because the copies have missing wt82_71, this will only run the regression in the original copy* regress wt82_71 qsmk sex race c.age##c.age ib(last).education c.smokeintensity##c.smokeintensity c.smokeyrs##c.smokeyrs ib(last).exercise ib(last).active c.wt71##c.wt71 qsmk#c.smokeintensity *Ask Stata for expected values - Stata will give you expected values for all copies, not just the original ones* predict predY, xb *Now ask for a summary of these values by intervention* *These are the standardized outcome estimates: you can subtract them to get the standardized difference* by interv, sort: summarize predY *iii.OPTIONAL: Output standardized point estimates and difference* *The summary from the last command gives you the standardized estimates* *We can stop there, or we can ask Stata to calculate the standardized difference and display all the results in a simple table* *The code below can be used as-is without changing any variable names* *The option "quietly" asks Stata not to display the output of some intermediate calculations* *You can delete this option if you want to see what is happening step-by-step* quietly summarize predY if(interv == -1) matrix input observe = (-1,`r(mean)') quietly summarize predY if(interv == 0) matrix observe = (observe \0,`r(mean)') quietly summarize predY if(interv == 1) matrix observe = (observe \1,`r(mean)') matrix observe = (observe \., observe[3,2]-observe[2,2]) *Add some row/column descriptions and print results to screen* matrix rownames observe = observed E(Y(a=0)) E(Y(a=1)) difference matrix colnames observe = interv value matrix list observe *to interpret these results:* *row 1, column 2, is the observed mean outcome value in our original sample* *row 2, column 2, is the mean outcome value if everyone had not quit smoking* *row 3, column 2, is the mean outcome value if everyone had quit smoking* *row 4, column 2, is the mean difference outcome value if everyone had quit smoking compared to if everyone had not quit smoking* /*************************************************************** PROGRAM 13.4 Computing the 95% confidence interval of the standardized means and their difference: Data from NHEFS Section 13.3 ***************************************************************/ *Run program 13.3 to obtain point estimates, and then the code below* *drop the copies* drop if interv != -1 gen meanY_b =. save nhefs_std, replace capture program drop bootstdz program define bootstdz, rclass u nhefs_std, clear preserve *draw bootstrap sample from original observations* bsample *create copies with each value of qsmk in bootstrap sample* *first, duplicate the dataset and create a variable 'interv' which indicates which copy is the duplicate (interv =1) expand 2, generate(interv_b) *next, duplicate the original copy (interv = 0) again, and create another variable 'interv2' to indicate the copy expand 2 if interv_b == 0, generate(interv2_b) *now, change the value of 'interv' to -1 in one of the copies so that there are unique values of interv for each copy* replace interv_b = -1 if interv2_b ==1 drop interv2_b *two of the copies will be for computing the standardized result* *for these two copies (interv = 0 and interv = 1), set the outcome to missing and force qsmk to either 0 or 1, respectively* replace wt82_71 = . if interv_b != -1 replace qsmk = 0 if interv_b == 0 replace qsmk = 1 if interv_b == 1 *run regression* regress wt82_71 qsmk sex race c.age##c.age ib(last).education c.smokeintensity##c.smokeintensity c.smokeyrs##c.smokeyrs ib(last).exercise ib(last).active c.wt71##c.wt71 qsmk#c.smokeintensity *Ask Stata for expected values - Stata will give you expected values for all copies, not just the original ones* predict predY_b, xb summarize predY_b if interv_b == 0 return scalar boot_0 = r(mean) summarize predY_b if interv_b == 1 return scalar boot_1 = r(mean) return scalar boot_diff = return(boot_1) - return(boot_0) drop meanY_b restore end *Then we use the 'simulate' command to run the bootstraps as many times as we want* *Start with reps(10) to make sure your code runs, and then change to reps(1000) to generate your final CIs* simulate EY_a0=r(boot_0) EY_a1 = r(boot_1) difference = r(boot_diff), reps(10) seed(1): bootstdz / *Next, format the point estimate to allow Stata to calculate our standard errors and confidence intervals matrix pe = observe[2..4, 2]' matrix list pe *Finally, the bstat command generates valid 95% confidence intervals under the normal approximation using our bootstrap results* *The default results use a normal approximation to calcutlate the confidence intervals* *note, n contains the original sample size of your data before censoring* bstat, stat(pe) n(1629)