William Dunlap wrote: > In modelling functions some people like to use > a weight of 0 to drop an observation instead of > using a subset value of FALSE. E.g., > weights=c(0,1,1,...) > instead of > subset=c(FALSE, TRUE, TRUE, ...) > to drop the first observation. > > lm() and summary.lm() appear to treat these in the > same way, decrementing the number of degrees of > freedom for each dropped observation. However, > predict.lm() does not treat them the same. It > doesn't seem to diminish the df to account for the > 0-weighted observations. E.g., the last printout > from the following script is as follows, where > predw is the prediction from the fit that used > 0-weights and preds is from using FALSE's in the > subset argument. Is this difference proper?
Nice catch. The issue is that the subset fit and the zero-weighted fit are not completely the same. Notice that the residuals vector has different length in the two analyses. With a simplified setup: > length(lm(y~1,weights=w)$residuals) [1] 10 > length(lm(y~1,subset=-1)$residuals) [1] 9 > w [1] 0 1 1 1 1 1 1 1 1 1 This in turn is what confuses predict.lm because it gets n and residual df from length(object$residuals). summary.lm() uses NROW(Qr$qr), and I suppose that predict.lm should follow suit. > > predw preds > $fit $fit > fit lwr upr fit lwr upr > 1 1.544302 1.389254 1.699350 1 1.544302 1.194879 1.893724 > 2 1.935504 1.719482 2.151526 2 1.935504 1.448667 2.422341 > > $se.fit $se.fit > 1 2 1 2 > 0.06723657 0.09367810 0.1097969 0.1529757 > > $df $df > [1] 8 [1] 3 > > $residual.scale $residual.scale > [1] 0.1031362 [1] 0.1684207 > > ### start of script ### > data <- data.frame(x=1:10, y=log(1:10)) > fitw <- lm(data=data, y~x, weights=as.numeric(x<=5)) > fits <- lm(data=data, y~x, subset=x<=5) > fitw$df.residual == 3 && fits$df.residual == 3 # TRUE > identical(coef(fitw), coef(fits)) # TRUE > > sumw <- summary(fitw) > sums <- summary(fits) > identical(sumw$df, sums$df) # TRUE > > predw <- predict(fitw, interval="confidence", > se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5))) > preds <- predict(fits, interval="confidence", > se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5))) > all.equal(predw, preds) # lots of differences > > sideBySide <- function (a, b, argNames) > { > # print objects a and b side by side > oldWidth <- options(width = getOption("width")/2 - 4) > on.exit(options(oldWidth)) > if (missing(argNames)) { > argNames <- c(deparse(substitute(a))[1], > deparse(substitute(b))[1]) > } > pa <- capture.output(print(a)) > pb <- capture.output(print(b)) > nlines <- max(length(pa), length(pb)) > length(pa) <- nlines > length(pb) <- nlines > pb[is.na(pb)] <- "" > pa[is.na(pa)] <- "" > retval <- cbind(pa, pb, deparse.level = 0) > dimnames(retval) <- list(rep("",nrow(retval)), argNames) > noquote(retval) > } > > # lwr, upr, se.fit, df, residual.scale differ > sideBySide(predw, preds) > > Bill Dunlap > Spotfire, TIBCO Software > wdunlap tibco.com > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel -- Peter Dalgaard Center for Statistics, Copenhagen Business School Phone: (+45)38153501 Email: pd....@cbs.dk Priv: pda...@gmail.com ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel