Moshe Olshansky <m_olshansky <at> yahoo.com> writes:

>
> Hi Misha,
>
> Since PCA is a linear procedure and you have only 6000 observations, you do not need 68000 variables. Using
> any 6000 of your variables so that the resulting 6000x6000 matrix is non-singular will do. You can choose
> these 6000 variables (columns) randomly, hoping that the resulting matrix is non-singular (and
> checking for this). Alternatively, you can try something like choosing one "nice" column, then choosing
> the second one which is the most nearly orthogonal to the first one (kind of Gram-Schmidt), then choosing the
> third one which is most nearly orthogonal to the first two, etc. (I am not sure how much roundoff may be a problem;
> try doing this using higher precision if you can). Note that you do not need to load the entire 6000x68000
> matrix into memory (you can load several thousand columns, process them and discard them).
> Anyway, you will end up with a 6000x6000 matrix, i.e. 36,000,000 entries, which can fit into memory, and you
> can perform the usual PCA on this matrix.
>
> Good luck!
>
> Moshe.
>
> P.S. I am curious to see what other people think.
>

I think this will give you *a* principal component analysis, but it won't give you *the* principal component analysis, in the sense that the first principal component accounts for a certain proportion of the total variance, etc. If you try this, you will see that each random sample of variables gives different eigenvalues, different proportions of eigenvalues, and a different sum of all eigenvalues, just as you would expect for different data sets.
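
To illustrate the point about the sum of the eigenvalues, here is a rough sketch in plain Python. It is not the analysis from this thread: the data are synthetic, the sizes (20 observations, 200 variables, 20 kept) are toy stand-ins for 6000, 68000 and 6000, and it assumes a PCA on the covariance matrix (unscaled variables). It relies only on the fact that the eigenvalues of a covariance matrix sum to its trace, i.e. to the sum of the variances of the variables, so no eigendecomposition is needed to compare the totals.

```python
import random
import statistics


def total_variance(data, cols):
    """Sum of the sample variances of the chosen columns.

    This is the trace of the covariance matrix of those columns, and
    therefore the sum of all eigenvalues of a covariance-based PCA.
    """
    return sum(statistics.variance([row[j] for row in data]) for j in cols)


rng = random.Random(1)
n_obs, n_vars, n_keep = 20, 200, 20  # toy stand-ins for 6000, 68000, 6000

# Synthetic data (an assumption for illustration): every variable has its
# own scale, so the columns differ in variance.
scales = [rng.uniform(0.5, 5.0) for _ in range(n_vars)]
data = [[rng.gauss(0.0, s) for s in scales] for _ in range(n_obs)]

# Total variance of the full data set and of two random column subsets.
full = total_variance(data, range(n_vars))
subset_a = total_variance(data, rng.sample(range(n_vars), n_keep))
subset_b = total_variance(data, rng.sample(range(n_vars), n_keep))

print(f"all {n_vars} variables:  {full:.1f}")
print(f"random subset A of {n_keep}: {subset_a:.1f}")
print(f"random subset B of {n_keep}: {subset_b:.1f}")
```

The two subsets give totals that differ from each other and from the full data, so "proportion of total variance explained" refers to a different total each time. With a correlation-based PCA the total would simply equal the number of variables kept, but the individual eigenvalues would still change from subset to subset.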
I even failed to create the raw data matrix of dimensions 68000 x 6000 (Error: cannot allocate vector of size 3.0 Gb).

Cheers, Jari Oksanen

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.