It's not so obvious to me that this is an artifact. What prcomp() says is that some of the eigenvectors have a lot of "activity" in some relatively narrow ranges of SNPs (on the same chromosome, perhaps?). If something artificial is going on, I could imagine effects not so much of centering columns but maybe of either (a) imputing zero for missing values, or (b) the scaling by sqrt(pi*(1-pi)) implicitly requiring Hardy-Weinberg equilibrium, so if your data are all 0 or 2 (aa or AA) there will be overdispersion.
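A rough sketch of how one might check (a) and (b) per SNP, assuming a genotype matrix coded 0/1/2 with NA for missing values. The names `geno` and `snp_diagnostics` are made up for illustration and the data are simulated, so treat this as a starting point rather than a recipe. The allele frequency estimate is the one used in the posted code, and 2*p*(1-p) is the genotype variance expected under Hardy-Weinberg equilibrium.

```r
# Per-SNP diagnostics for a genotype matrix coded 0/1/2 (ids in rows, SNPs in columns).
# `geno` and `snp_diagnostics` are hypothetical names, not from the original post.
snp_diagnostics <- function(geno) {
  n_obs     <- colSums(!is.na(geno))              # non-missing genotypes per SNP
  miss_frac <- 1 - n_obs / nrow(geno)             # share that would be imputed as zero
  # same shrunken allele frequency estimate as in the posted code
  p         <- (1 + colSums(geno, na.rm = TRUE)) / (2 + 2 * n_obs)
  obs_var   <- apply(geno, 2, var, na.rm = TRUE)  # observed genotype variance
  hwe_var   <- 2 * p * (1 - p)                    # variance expected under Hardy-Weinberg
  data.frame(miss_frac = miss_frac, p = p, var_ratio = obs_var / hwe_var)
}

# Simulated example: 50 SNPs in Hardy-Weinberg equilibrium, followed by
# 10 SNPs where only the homozygotes 0 and 2 occur.
set.seed(1)
n    <- 340
hwe  <- matrix(rbinom(n * 50, size = 2, prob = 0.3), nrow = n)
homo <- matrix(2 * rbinom(n * 10, size = 1, prob = 0.3), nrow = n)
geno <- cbind(hwe, homo)
geno[sample(length(geno), 200)] <- NA             # sprinkle in some missing values

d <- snp_diagnostics(geno)
```

Plotting `d$var_ratio` and `d$miss_frac` against the SNP index (e.g. `plot(d$var_ratio)`) should show whether the "strange" region coincides with a run of SNPs that are far from a ratio of 1 or that have many missing values. In the simulated data the homozygote-only SNPs come out with a ratio near 2.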
However, it could also be a biological effect. Are your ids by any chance from the same pedigree? If so, you might be seeing something like the effect of a crossover event in a distant ancestor. (Talk to a geneticist, I just "play one on TV".)

To investigate further, you could look at the individual scores and see who has extreme values on components 2-4, and then go back and see if there is something peculiar about their SNPs in the "strange" region.

Of course, you might have stumbled upon a bug in R, but I doubt it.

Happy hunting!

-pd

On Oct 3, 2013, at 11:41 , Hermann Norpois wrote:

> Hello,
>
> I did a pca with over 200000 snps for 340 observations (ids). If I plot the
> eigenvectors (called rotation in prcomp) 2, 3 and 4 (e.g. plot
> (rotation[,2])) I see a strange "column" in my data (see attachment). I
> suggest it is an artefact (but of what?).
>
> Suggestion:
> I used prcomp this way: prcomp (mat), where mat is a matrix with the column
> means already subtracted followed by a normalisation procedure (see below
> for details). Is that okay? Or does prcomp repeat subtraction steps?
>
> Originally my approach was driven by the idea to compute a covariation
> matrix followed by the use of eigen, but the covariation matrix was too huge
> to handle. So I switched to prcomp.
>
> As I guess that the "columns" in my plots reflect some artefact production
> I hope to get some help. For the case that my use of prcomp was not okay,
> could you please give me instructions how to use it - including with the
> normalisation procedure that I need to include before doing a pca.
>
> Thanks
> Hermann
>
> #
> # mat: matrix with genotypes coded as 0,1 and 2 (columns); IDs (observations) as rows.
> #
> prcomp.snp <- function (mat)
> {
>   m <- ncol (mat)
>   n <- nrow (mat)
>   snp.namen <- colnames (mat)
>   for (i in 1:m)
>   {
>     # snps in columns
>     ui <- mat[,i]
>     n <- length (which (!is.na(ui)))
>     # see methods Price et al. as correction
>     pi <- (1+ sum(ui, na.rm=TRUE))/(2+2*n)
>
>     # subtract mean
>     ui <- ui - mean (ui, na.rm=TRUE)
>     # NAs set to zero
>     ui[is.na(ui)] <- 0
>     # normalisation of the genotype for each ID (important normalisation step)
>     ui <- ui/ (sqrt (pi*(1-pi)))
>     # fill matrix with ui
>     mat[,i] <- ui
>   }
>   mat <- prcomp (mat)
>   return (mat)
> }
> <rotplot.png>______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd....@cbs.dk  Priv: pda...@gmail.com