Hi Moshe,

Your idea sounds reasonable to me. It seems analogous to having a system of linear equations with more unknowns than equations - there should be many solutions, so there is no single "exact" PCA solution.
My plan (* = dot product):

1. Pick the first "nice" vector to be the longest - that is, the one for which x1 * x1 is maximal.
2. For all second vectors x2 ~= x1, compute (x2 * x1)^2 / (x1 * x1) and pick the minimum as my second vector.
3. For all third vectors x3 ~= x2 ~= x1, compute (x3 * x1)^2 / (x1 * x1) + (x3 * x2)^2 / (x2 * x2) and pick the minimum as my third vector.
4. And so on until we have 6000 vectors.
5. Perform PCA on the resulting 6000x6000 matrix.

What do you think?

Moshe Olshansky-2 wrote:
>
> Hi Misha,
>
> Since PCA is a linear procedure and you have only 6000 observations, you
> do not need 68800 variables. Using any 6000 of your variables such that
> the resulting 6000x6000 matrix is non-singular will do. You can choose
> these 6000 variables (columns) randomly, hoping that the resulting matrix
> is non-singular (and checking for this). Alternatively, you can try
> something like choosing one "nice" column, then choosing the second one
> which is most nearly orthogonal to the first one (a kind of Gram-Schmidt),
> then choosing the third one which is most nearly orthogonal to the first
> two, etc. (I am not sure how much roundoff may be a problem - try doing
> this using higher precision if you can). Note that you do not need to load
> the entire 6000x68800 matrix into memory (you can load several thousand
> columns, process them, and discard them).
> Anyway, you will end up with a 6000x6000 matrix, i.e. 36,000,000 entries,
> which can fit into memory, and you can perform the usual PCA on this
> matrix.
>
> Good luck!
>
> Moshe.
>
> P.S. I am curious to see what other people think.
>
> --- On Fri, 21/8/09, misha680 <mk144...@bcm.edu> wrote:
>
>> From: misha680 <mk144...@bcm.edu>
>> Subject: [R] Principle components analysis on a large dataset
>> To: r-help@r-project.org
>> Received: Friday, 21 August, 2009, 10:45 AM
>>
>> Dear Sirs:
>>
>> Please pardon me, I am very new to R. I have been using
>> MATLAB.
>> I was wondering if R would allow me to do principal
>> components analysis on a very large dataset.
>>
>> Specifically, our dataset has 68800 variables and around
>> 6000 observations. MATLAB gives "out of memory" errors. I have
>> also tried doing princomp in pieces, but this does not seem to
>> quite work for our approach.
>>
>> Anything that might help would be much appreciated. If anyone has
>> had experience doing this in R, much appreciated.
>>
>> Thank you
>> Misha
>

--
View this message in context: http://www.nabble.com/Principle-components-analysis-on-a-large-dataset-tp25072510p25085859.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
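P.S. The greedy column-selection plan above can be sketched in a few lines. This is a minimal NumPy illustration (Python rather than R, purely for compactness - the same logic ports directly to R); `select_columns` and the toy matrix sizes are my own choices, not from the thread. The trick is that the step-k score, sum over chosen s of (x * s)^2 / (s * s), can be updated incrementally as each new column is chosen, so each step costs only one pass over the remaining columns:

```python
import numpy as np

def select_columns(X, k):
    """Greedily pick k columns of X: longest column first, then at each
    step the column whose squared projections onto the already-chosen
    columns are smallest (the plan's steps 1-4)."""
    n, p = X.shape
    norms = (X * X).sum(axis=0)            # x_j * x_j for every column j
    chosen = [int(np.argmax(norms))]       # step 1: the longest column
    # score[j] accumulates sum_s (x_j * x_s)^2 / (x_s * x_s) over chosen s
    score = np.zeros(p)
    for _ in range(k - 1):
        s = X[:, chosen[-1]]
        score += (X.T @ s) ** 2 / (s @ s)  # add the newest column's term
        score[chosen] = np.inf             # never re-pick a chosen column
        chosen.append(int(np.argmin(score)))
    return chosen

# Toy stand-in for the 6000 x 68800 problem: 6 observations, 40 variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 40))
cols = select_columns(X, 6)
G = X[:, cols].T @ X[:, cols]              # the 6x6 matrix for step 5
# Expect 6 distinct columns and a full-rank (hence non-singular) G.
print(len(set(cols)), np.linalg.matrix_rank(G))
```

For the real problem, `X` would never be held in memory at once: as Moshe notes, the dot products against the chosen columns can be computed a few thousand columns at a time, keeping only the running `score` vector.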