[Rd] prcomp with previously scaled data: predict with 'newdata' wrong

Jari Oksanen Wed, 23 May 2012 03:51:01 -0700

Hello folks,

it may be regarded as a user error to scale() your data prior to prcomp() 
instead of using its 'scale.' argument. However, users may well do this, and 
it seems a legitimate thing to do, but in that case predict() with 
'newdata' can give wrong results:


x <- scale(USArrests)
sol <- prcomp(x)
all.equal(predict(sol), predict(sol, newdata=x))
## [1] "Mean relative difference: 0.9033485"

Predicting with the same data gives different results than the original PCA of 
the data.
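
For contrast, here is a minimal sketch of the route through the 'scale.' 
argument, where prcomp() does the scaling itself and predict() is given the 
raw data. The name 'sol2' is only illustrative. The two sets of scores are 
expected to agree here:

```r
# Let prcomp() do the scaling, so that the centre and scale it stores
# describe the raw data.
sol2 <- prcomp(USArrests, scale. = TRUE)

# New data are passed in raw form; predict() applies the stored centre
# and scale before rotating.
all.equal(predict(sol2), predict(sol2, newdata = USArrests))
```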

The reason for this behaviour seems to be in these first lines of 
stats:::prcomp.default():

    x <- scale(x, center = center, scale = scale.)
    cen <- attr(x, "scaled:center")
    sc <- attr(x, "scaled:scale")

If input data 'x' have a 'scaled:scale' attribute, it will be retained if 
scale() is called with argument "scale = FALSE", as is the case with the 
default options in prcomp(). So scale(scale(x, scale = TRUE), scale = FALSE) 
will have the 'scaled:center' of the outer scale() (i.e., numerical zero), 
but the 'scaled:scale' of the inner scale(). 
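
The attributes can be inspected directly. This is only a sketch for looking 
at what the nested call leaves behind. The name 'y' is illustrative, and what 
the second attribute shows may depend on the R version in use:

```r
x <- scale(USArrests)          # inner call: centres and scales
y <- scale(x, scale = FALSE)   # outer call: centres only, as in prcomp() defaults

# Centre recorded by the outer call: column means of already centred data,
# so zero up to rounding error.
attr(y, "scaled:center")

# Scale attribute found on 'y', if any. Per the description above this is
# the one left over from the inner call; compare it with the column
# standard deviations of the raw data.
attr(y, "scaled:scale")
sapply(USArrests, sd)
```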

The function princomp() finds the 'scale' directly instead of looking at the 
attributes of the input data, and works as expected:

sol <- princomp(x)
all.equal(predict(sol), predict(sol, newdata=x))
## [1] TRUE

I don't have any nifty solution to this -- only checking the 'scale.' argument 
and acting accordingly:

sc <- if (scale.) attr(x, "scaled:scale") else FALSE
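
To show where such a check would sit, here is a rough sketch built around a 
hypothetical helper, prep_scale(), that mimics the first lines of 
prcomp.default() quoted above. It is not the actual R source, and the 
handling of a numeric 'scale.' is an added assumption, since a plain 
if (scale.) expects a single logical value:

```r
# Hypothetical stand-in for the first lines of stats:::prcomp.default(),
# with the suggested check added. Not the actual R source.
prep_scale <- function(x, center = TRUE, scale. = FALSE) {
  x <- scale(x, center = center, scale = scale.)
  cen <- attr(x, "scaled:center")
  # Assumption: scaling counts as requested when 'scale.' is TRUE or numeric.
  wants_scale <- is.numeric(scale.) || isTRUE(scale.)
  # Only trust the 'scaled:scale' attribute if this call asked for scaling.
  sc <- if (wants_scale) attr(x, "scaled:scale") else FALSE
  list(center = if (is.null(cen)) FALSE else cen, scale = sc)
}

x <- scale(USArrests)
prep_scale(x)$scale                          # scaling not requested here
prep_scale(USArrests, scale. = TRUE)$scale   # scaling requested here
```

With this check a stale 'scaled:scale' attribute on previously scaled input 
is ignored unless the call itself asks for scaling.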

Cheers, Jari Oksanen


______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
