###

Question regarding the use of propensity scores in conjunction with regression adjustments:

Many of the readings, and several of the discussions in class, have covered matched pairs, blocking, and subclassification as a means to balance the joint distribution of covariates between treatment and control groups. Moreover, we have looked at dependent sample assessment plots to detect interaction effects across matched strata. These methods are relatively transparent compared to the list of algorithms (e.g. nearest neighbor using Mahalanobis; kernel matching with Epanechnikov; etc.) that work with propensity scores to construct matching weights.

My question deals with using propensity scores with regression adjustments either as (a) an additional parameter in the model (i.e. in a fashion similar to using an inverse mills ratio to correct for selection bias on observables), or (b) using the propensity scores to construct weights in a weighted OLS procedure. My initial understanding is that there are several advantages to using propensity scores in conjunction with regression adjustments, for one, the functional form of the confounding covariates can be misspecified in the regression model and it will not bias the treatment effect (right?).

Specifically, I would like to know what the implications are (if any) for the interpretation of the coefficients and standard errors of the confounding covariates when applying matched weights to a regression model. The reason I find the use of propensity score weights in a regression context intriguing is because it potentially (or at least I think) gives the researcher the flexibility of regression modeling while establishing covariate balance and minimizing selection bias. This could be a powerful method to assess interaction effects between treatment/control and a specific covariate. Is this an accurate description of the possibilities or a misread? I’ve only started to get into this literature but I’ve come across reference to Robins, Imbens and “inverse probability weighting.” Any thoughts or comments of the use of propensity scores in conjunction with regression adjustments would be appreciated.

Thanks,

Jeremy

## Comments (26)

## nits2007@GMAIL.COM said

at 4:27 pm on Sep 13, 2008

I have been able to get to the matrix and create the table as shown in pdf, but some how the granova.ds file that Dr. Pruzek send us is not running for me. So if some one could advise as to what the next step should be that would be great. Thanks

## bob pruzek said

at 6:17 pm on Sep 13, 2008

When you type >ls( ), does granova.ds show up as in the environment? If yes, then, if you have DAAG (package) loaded, you will have data pair65. Try

running w/ it. granova.ds(pair65) should do it. IF that works, then go back to the data you've input, and type class(this.data) for whatever you've named it,

and see what that is. If it is not a data frame, make it so, as in data.frame(this.data).

If you get errors at any points, and cannot fix them then copy and past the errors here, as well as the first few lines of the data, and it's class info.

data[1:5,] will show the first few lines.

Keep in touch. Bob

## nits2007@GMAIL.COM said

at 6:45 pm on Sep 13, 2008

> ls()

character(0)

It doesn't show granova.ds in environment.

Thanks for the help. Nitendra

## bob pruzek said

at 8:59 pm on Sep 13, 2008

Then recopy and highlight and paste into the console. From what I sent. Best, Bob

## nits2007@GMAIL.COM said

at 10:47 pm on Sep 13, 2008

Thanks. I was able to run the lead.x2.rosnbm data in granova.ds. Same way when I am trying to add granova.2w function it gives me error at multiple steps and doesn't allow to add that function. Please advise for the next step. Thanks. Nitendra.

## bob pruzek said

at 11:07 pm on Sep 13, 2008

If you could load and run the granova.ds, as you say here, then all you need to do is replicate the copy/paste procedure for granova.2w. Make sure you are using the

FULL WIDTH layout of the page on the console. That may be all you need to do. Best, BP

## nits2007@GMAIL.COM said

at 10:57 am on Sep 14, 2008

I tried to run that part in the console as you said for granova.2w. Doesn't work that way. Is there anything else I could do. Thanks. Nitendra.

## bob pruzek said

at 11:34 am on Sep 14, 2008

Nitendra, You are not giving me enough information. Does granova.2w appear in the environment? Does it run w/ the examples shown in the ?granova.2w documentation?

Does it give an error message if it starts, then does not run? TELL ME MORE. ALWAYS. This is the way of the world in computing. BP

## nits2007@GMAIL.COM said

at 11:42 am on Sep 14, 2008

Sorry about that. This is what it looks like when I run it.

> mtx <- is.data.frame(data.A.B)

Error in inherits(x, "data.frame") : object "data.A.B" not found

> if(!mtx)data.A.B <- data.frame(data.A.B)

Error: object "mtx" not found

> vA <- length(unique(data.A.B[,2]))

Error in unique(data.A.B[, 2]) : object "data.A.B" not found

> vB <- length(unique(data.A.B[,3]))

Error in unique(data.A.B[, 3]) : object "data.A.B" not found

> N <- dim(data.A.B)[1]

Error: object "data.A.B" not found

> rnd2 <- function(x)round(x,2)

> A <- data.A.B[,2]

Error: object "data.A.B" not found

> B <- data.A.B[,3]

Error: object "data.A.B" not found

> yy <- data.A.B[,1]

Error: object "data.A.B" not found

> #Means in each cell

> mns <- tapply(yy, list(A,B), mean)

I have pasted the first few lines of message that I get.

Thanks. Nitendra.

## bob pruzek said

at 12:08 pm on Sep 14, 2008

Nitendra, Ok, now give me the contents of ls( ), when you see what is in the environment. The two granova functions should be there and the lead data, probably

in two forms. What you must recognize is that it is THESE names that you need to use when you try to run a function, and access data. So instead of data.A.B, in your

code here, the exact names of the data you mean to access. Nothing will work w/ what you specify here. And then, once you get to where data are accessed, give me

error messages from the functions. Note that >str(function) gives the arguments for the function (explicitly named). BP

## nits2007@GMAIL.COM said

at 1:04 pm on Sep 14, 2008

> ls()

[1] "granova.ds" "x2".

This is what is shows me.

And when I run x2 it is in the matrix form.

Nitendra.

## bob pruzek said

at 1:45 pm on Sep 14, 2008

All of these commands should now work: dim(x2), colMeans(x2), summary(x2). And more of course. But if the dimension shows 33 x 2, and x2, when printed [> x2 <retrn>] should show EXACTLY the same data matrix I gave to you. Check that out. If it checks, then >granova.ds(x2) should give you what I got. The picture is found in the windows menu of course, but the summary stats will be on the console. When you get this done, you can do, more, such as highlight/copy/paste the granova.2w function, and then

enter the 3 column data matrix I gave you. Before we get to this, inform me as to how these things work. BP

## nits2007@GMAIL.COM said

at 5:49 pm on Sep 14, 2008

I have been able to run granova.ds(x2) with the picture and the summary stats. If you could please help me proceed with the granova.2w.

This is what is in my environment.

> ls()

[1] "granova.ds" "rnd2" "x2"

Thanks. Nitendra.

## bob pruzek said

at 12:08 am on Sep 15, 2008

Nitendra, As I said above, now you need to use the granova.2w function that I put in my latest pdf, and then

... you can do, more, such as highlight/copy/paste the granova.2w function, and then enter the 3 column data matrix I gave you.

I'm hoping this will work. But let me know. BP

## bob pruzek said

at 12:10 am on Sep 15, 2008

That is, the pdf I sent on Friday...see the .2w function there.

## harryxkn@... said

at 12:41 am on Sep 16, 2008

Professor Pruzek: I did not get what zzza meant in the pdf leadbloodpsa (lm(formula = Lead.in.blood ~ (Treatmt.Contrl * zzza) + I(Treatmt.Contrl^2) + I(zzza^2))), could you please explain? Thanks!

## bob pruzek said

at 2:49 pm on Sep 16, 2008

It's no wonder you did not get the zzza idea; I inadvertantly missed the code that defined it.

Suppose

> aabb= granova.2w(data.A.B=lead.psa.A.B.df)$Child.ef # i.e. the child effect that is output by granova.2w

Then we note that these are ordered from smallest to largest (I'll discuss in class further), so we need to reorder them

as they were for the original data. The function names(aabb) selects the 33 column names, but they are in quotes (character), so

we need to convert those 'names' , which are numbers, with as.numeric, thus to get the labels without quotes. At this point

we simply use function order, which gives us the child effects, properly ordered (but only 33, when what we need is two of each, for the

two children in each 'block' (i.e. for each pair in a 33 x 2 layout). So we take the vector aabb, of length 33, and repeat each value twice, using

the rep function, as shown here. So zzza is just a reordered and doubled version of the Child.effects from granova.2w.

zzza=rep(aabb[order(as.numeric(names(aabb)))],ea=2)

With zzza used as a variable in function lm, we get the result I print in the pdf I mailed to everyone.

Remember to see help for each function here (and in each discussion of R code), and see if you can

reproduce what I've done, starting from my aabb line here, finishing w/ the lm analysis. Discuss in class

tomorrow.

## bob pruzek said

at 9:20 pm on Sep 29, 2008

Question regarding the use of propensity scores in conjunction with regression adjustments:

Jeremy, please put your questions at the BOTTOM...else it is hard to answer w/ q. in site.

Following your questions above,

--------------------------------------- my answer is brief as I have only 2-4 minutes before having to finish packing to leave for the west! BP

The problem here is that you raise deep questions that mostly attend topics we cover later in the course. In the short run, do not think about regression apps w/ PSA

because there is a great loss of effectiveness of the method when propensity scores are used as if they were covariates themselves. They are NOT, even though one

can act as if they are....and a number of people ignore this advice. Read anything you can find by Rosenbaum (my pdf's) or online (including Statistical Science articles).

You will find more edification in those papers (especially in these early weeks) than in Robins. The concept of inverse weighting comes much later, as do some of Imbens'

other ideas. Think in terms of the examples I've written about, e.g. the Passa apps, also the various slide-show pdfs. You can probe more deeply on all of this later, and

think in terms of pointed questions (starting from applications) about particular treatment comparisons especially.

Hope this helps. BP

## Yi Sun said

at 9:13 am on Sep 30, 2008

Hi, all,

I think that the reason why we can't put the PSAscore in model is of the collinearity. We get the the PSAscore from all the covariates we have so that the PSAscore will be highly correalted to the integration of all the covariates we used. If we put PSAscore in the model, our model will suffered from collinearity and all coefficients estimations will not be stable. (R square will be high but many independent variables will be non significant, i.e the variance for some variable will be inflated). Am I right? Prof Pruzek?

## bob pruzek said

at 4:25 am on Oct 2, 2008

Re: collinearity. In fact, Prop. scores are often included in regression models; there is rarely a problem w/ collinearity.

I'm not sure where your premise came from. But in general I find this approach NOT to be persuasive for a different reason:

That is that if you look at distribution overlaps (as in my CSDA talk slides, and others), you will often find that the two treatment groups can only be compared "in the middle" of the PS distribution...the ends are ruled out to the extent that PS scores do not both extend to the full range (one for the other). Remember, PSA is SELF-CRITICAL (as Rubin commented in the first paper you read of his). It will tell you if or where treatment comparisons cannot be empirically supported.

----re: Harry's comment above:

this is an interesting idea, and bears discussion. Let's plan on that next week!

BP (2 blocks from Waikiki beach!)

## harryxkn@... said

at 11:22 pm on Oct 9, 2008

Hi Professor Pruzek,

I was reading T.e. Love's pdf document, and saw that he talked about a research done by Potosky which ""first take a large set of covariates to form the PS, then use PS and a subset of those covariates in the model for outcomes. I could not download the original paper, I am wondering if using PS in regression for direct covariate adjustment will also be influenced by(or influence) possible incomplete PS overlap. Even if treatment and control group have good PS overlap, I am not clear about the benefits of "use PS and a subset of those covariates in the model for outcomes", is it because this subset of covariates are particularly significant, or is it because they want to regress within PS strata, or is it to "correct for selection bias on observables" as mentioned by Jeremy if I did not misunderstand.

Thank you,

Kuangnan

## Jeremy said

at 8:25 am on Oct 10, 2008

Just an attempt at clarification for my own sake given that I initiated the question:

If the “mechanism” that sorts your unit of analysis into a treatment and control group is not randomization (which in observational studies it is not) AND that mechanism is also highly correlated with your outcome of interest then selection bias is a potential confounder. You can not determine if the mechanism or the treatment is responsible for the outcome. When you have variables in your data that approximate that mechanism, you can model it (i.e. you can model the non-random processes that sort units into treatment and control groups IF you observe the process in your data).

In many cases researchers are also interested in how the mechanism affects the outcome as well as the treatment. In that case endogeneity is a possible concern. In the context of regression that means the correlation between treatment and mechanism will render the residuals to be autocorrelated. Theoretically this is a potential problem because causal direction is not clearly identified. Approaches similar in style (but perhaps not spirit) to that of propensity score have been developed to address these sorts of issues.

This is my rough understanding of the issue given my background. So if my initial question seems obtuse it could be because it is framed in the econometric paradigm that Prof. Pruzek mentioned.

Jeremy

## bob pruzek said

at 10:59 am on Oct 10, 2008

re: Jeremy's note: There are just two points I want to make here (besides noting that in your first sentence, your "...is also highly correlated with your outcome of interest...", the word 'highly' is too strong; drop it and see that the logic works):

1. When we try to tease out causal effects in observational studies, it is the rare situation (generally) when we have time series data (except for before/after). Your reference to endogeneity and autocorrelation (terms that arise mostly in econometric applications) seem almost always to concern time series data. I shall pay relatively little attention to time series questions in this course simply because we have plenty on our plate without TS.

2. Causal analyses are OFTEN done with no reference to propensity scores. I'd like to discuss some of this, but since it usually involves structural equation modeling (SEM), that takes us far afield (and is often LESS basic) than our focus: PSA.

Try whenever possible to provide at least one example, or good reference to a well-described example, when you have questions. Best, BP

## Jeremy said

at 9:00 pm on Oct 14, 2008

Below are four issues that I have run into using PSAgraphics. I’m sure the workarounds are straightforward to those adept at R, so any recommended code would be greatly appreciated.

1. How to “merge” the vector “groups” to the data.frame “ice.” Ice has missing values.

>fit<-glm( wb~ schpoor+ schdrop + schblk+ d_educ + AFQT81 + oc79sei + dadsei + pnhb14 + newspaper + faminc78 + medianinc + momwork14 + siblings + wprents14, family=binomial, data=ice)

>pscore<-fit$fitted

>groups<-cut(pscore, quantile(pscore, seq(0,1,1/5)), include.lowest=T, labels=F)

> dim(ice)

[1] 12686 50

> length(groups)

[1] 9096

> box.psa(schblk, wb, groups, pts=F)

Error in data.frame(continuous, treatment, strata) :

arguments imply differing number of rows: 12686, 9096

2. Not sure about this one. The function works fine until I specify “balance=T”.

> box.psa(pnhb14, wb, groups, pts=F, balance=T)

Press <enter> for bar chart...

Error in ks.test(continuous[treatment == unique(treatment)[1] & strat.f == :

not enough 'y' data

3. Again this looks like a missing data issue. Note that I was able to create the strata “groups” via subclassification in Stata and then bring that data into R. The main question I have is how to run PSA functions that allow/ignore missing data.

> circ.psa(lnwage90, wb, groups)

Error in FUN(X[[1L]], ...) : missing observations in cov/cor

4. Not sure about this one either. Circ.psa works fine also until I specify “summary=T”

> circ.psa(pnhb14, wb, groups, summary=T)

Error in response[, 3:4] : incorrect number of dimensions

Also, Prof Pruzek if we have a chance would you mind discussing the balancing methods used in PSAgraphics (e.g. Kolmogorov-Smirnov test, and the permutation routine evoked by bal.ma.psa).

## bob pruzek said

at 9:23 pm on Oct 14, 2008

Jeremy, Your data set is by all my developed standards not just large, but huge. (That's two this week like this; the other has 168k records (and we've got the functions working, but NO missing data.)

None of the functions in PSAgraphics were designed to accommodate missingness. It is assumed that you take care of this outside of this package. But as a practical matter, w/ so many records available

to you, I'd like to know whether it would be viable for you to eliminate records with missing values, and then proceed. Do you have the information available to say how many records would remain if you went w/ an analysis based on complete records? Could you get the info easily?

As to how to handle missingness in your case, if you are to try some kind of imputation, there are lots of ways within R (and outside it) to proceed. I'm hoping to have Prof. Yurcel (new Prof. in Biostat, who

specializes in missing data issues) to pay us a visit (and I'll be doing a PSA presentation for him too, in return). Let's talk about some of this in class, or outside, depending on details. BP

## bob pruzek said

at 9:24 pm on Oct 14, 2008

PS. I'll definitely get to the balance questions you mention re: PSAg, but we are 2-3 weeks from that point just now.

You don't have permission to comment on this page.