Thanks for answering. I already started hunting. But my first question was whether I used prcomp correctly (and at the moment this is my most important point). As far as I understood, your answer is yes. Is that correct?
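A quick way to check the centering question on a small example is sketched below. It relies on the documented defaults of prcomp() (center = TRUE, scale. = FALSE); the simulated matrix and all object names are made up for illustration and are not the real SNP data.

```r
# Simulated genotypes (hypothetical data): 20 ids in rows,
# 50 SNPs in columns, coded 0/1/2.
set.seed(1)
geno <- matrix(rbinom(20 * 50, size = 2, prob = 0.3), nrow = 20)

# Subtract the column means by hand, as done before calling prcomp().
centered <- sweep(geno, 2, colMeans(geno))

# prcomp() centers by default, so it subtracts the (now zero) column
# means a second time; center = FALSE skips that step.
pca.default  <- prcomp(centered)
pca.nocenter <- prcomp(centered, center = FALSE)

# The standard deviations should agree, and the leading eigenvectors
# should agree up to sign.
sdev.ok <- isTRUE(all.equal(pca.default$sdev, pca.nocenter$sdev))
rot.ok  <- isTRUE(all.equal(abs(pca.default$rotation[, 1:5]),
                            abs(pca.nocenter$rotation[, 1:5])))
```

If both flags come out TRUE, the repeated centering inside prcomp() is harmless for a matrix whose column means are already zero.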
I am puzzled by the fact that these "columns" are more or less in the middle of my SNP data.

> However, it could also be a biological effect. Are your ids by any chance
> from the same pedigree? If so, you might be seeing something like the
> effect of a crossover event in a distant ancestor.

No, there is no such pedigree scheme. Things like this are ruled out by IBD measurement. (Further, the data is checked by an EIGENSTRAT analysis.)

> (b) the scaling by sqrt(pi*(1-pi)) implicitly requiring Hardy-Weinberg
> equilibrium, so if your data are all 0 or 2 (aa or AA) there will be
> overdispersion.

This is a good point. But why do I find such effects in the "middle" of my data?

Thanks
Hermann

2013/10/3 peter dalgaard <pda...@gmail.com>

> It's not so obvious to me that this is an artifact. What prcomp() says is
> that some of the eigenvectors have a lot of "activity" in some relatively
> narrow ranges of SNPs (on the same chromosome, perhaps?). If something
> artificial is going on, I could imagine effects not so much of centering
> columns but maybe one of
> (a) imputing zero for missing values
> (b) the scaling by sqrt(pi*(1-pi)) implicitly requiring Hardy-Weinberg
> equilibrium, so if your data are all 0 or 2 (aa or AA) there will be
> overdispersion.
>
> However, it could also be a biological effect. Are your ids by any chance
> from the same pedigree? If so, you might be seeing something like the
> effect of a crossover event in a distant ancestor. (Talk to a geneticist, I
> just "play one on TV".)
>
> To investigate further, you could go looking at the individual scores and
> see who is having extreme values on component 2-4 and then go back and see
> if there is something peculiar about their SNPs in the "strange" region.
>
> Of course, you might have stumbled upon a bug in R, but I doubt so.
>
> Happy hunting!
>
> -pd
>
>
> On Oct 3, 2013, at 11:41 , Hermann Norpois wrote:
>
> > Hello,
> >
> > I did a pca with over 200000 snps for 340 observations (ids).
> > If I plot the eigenvectors (called rotation in prcomp) 2, 3 and 4 (e.g.
> > plot (rotation[,2])) I see a strange "column" in my data (see
> > attachment). I suspect it is an artefact (but of what?).
> >
> > Suggestion:
> > I used prcomp this way: prcomp (mat), where mat is a matrix with the
> > column means already subtracted followed by a normalisation procedure
> > (see below for details). Is that okay? Or does prcomp repeat subtraction
> > steps?
> >
> > Originally my approach was driven by the idea to compute a covariance
> > matrix followed by the use of eigen, but the covariance matrix was too
> > huge to handle. So I switched to prcomp.
> >
> > As I guess that the "columns" in my plots reflect some artefact, I hope
> > to get some help. In case my use of prcomp was not okay, could you
> > please give me instructions how to use it - including the normalisation
> > procedure that I need to apply before doing a pca.
> >
> > Thanks
> > Hermann
> >
> > #
> > # mat: matrix with genotypes coded as 0,1 and 2 (columns); IDs
> > # (observations) as rows.
> > #
> > prcomp.snp <- function (mat)
> > {
> >   m <- ncol (mat)
> >   n <- nrow (mat)
> >   snp.namen <- colnames (mat)
> >   for (i in 1:m)
> >   {
> >     # snps in columns
> >     ui <- mat[,i]
> >     n <- length (which (!is.na(ui)))
> >     # see methods Price et al.
> >     # as correction
> >     pi <- (1+ sum(ui, na.rm=TRUE))/(2+2*n)
> >
> >     # subtract mean
> >     ui <- ui - mean (ui, na.rm=TRUE)
> >     # NAs set to zero
> >     ui[is.na(ui)] <- 0
> >     # normalisation of the genotype for each ID
> >     # important normalisation step
> >     ui <- ui/ (sqrt (pi*(1-pi)))
> >     # fill matrix with ui
> >     mat[,i] <- ui
> >   }
> >   mat <- prcomp (mat)
> >   return (mat)
> > }
> >
> > <rotplot.png>______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> --
> Peter Dalgaard, Professor
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd....@cbs.dk Priv: pda...@gmail.com
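
The suggestion quoted above, to look at the individual scores on components 2-4 and then go back to the SNPs in the "strange" region, could be sketched roughly as follows. This is only an illustration on simulated data: the matrix, the id labels, the cutoff of 3 standard deviations and the choice of 10 SNPs are all assumptions, not values from the real analysis.

```r
# Simulated genotypes (hypothetical data): 30 ids in rows,
# 100 SNPs in columns, coded 0/1/2.
set.seed(2)
geno <- matrix(rbinom(30 * 100, size = 2, prob = 0.4), nrow = 30)
rownames(geno) <- paste0("id", seq_len(nrow(geno)))  # made-up id labels

pca <- prcomp(geno)

# Scores of each id on components 2 to 4 (prcomp stores scores in $x).
scores <- pca$x[, 2:4]

# Standardise each component and flag ids lying more than 3 standard
# deviations from the centre on any of them (the cutoff is arbitrary).
z <- scale(scores)
extreme.ids <- rownames(z)[apply(abs(z) > 3, 1, any)]

# Indices of the 10 SNPs with the largest absolute loading on
# component 2, i.e. candidates for the "column" seen in the plot.
top.snps <- order(abs(pca$rotation[, 2]), decreasing = TRUE)[1:10]

# Genotypes of the flagged ids at those SNPs, for inspection by eye.
suspect <- geno[extreme.ids, top.snps, drop = FALSE]
```

On real data one would then compare the genotypes in suspect with those of the remaining ids at the same SNPs.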