On Mon, 8 Aug 2016, Ellis, Alicia M wrote:
I have a large dataset with ~500,000 columns and 1264 rows. Each column
represents the percent methylation at a given location in the genome.
I need to run 500,000 linear models for each of 4 predictors of interest
in the form of:
Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... + covariate9
...and save only the pvalue for the predictor
The original methylation data file had methylation sites as row labels
and the individuals as columns so I read the data in chunks and
transposed it so I now have 5 csv files (chunks) with columns
representing methylation sites and rows as individuals.
I was able to get results for all of the regressions by running each
chunk of methylation data separately on our supercomputer using the code
below.
This sounds like a problem for my old laptop, not a supercomputer.
You might want to review the algebra and geometry of least squares.
In particular, covariate1 ... covariate9 form the same 1264 x 9 matrix for
every problem, IIUC. So you can compute the QR decomposition of that
matrix (augmented with a column of ones for the intercept) *once* and use
it in all the problems.
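
A minimal sketch of that first step, on simulated data (the sizes other
than 1264 and all object names here are made up for illustration). It
decomposes the covariate matrix once and takes residuals for many
response columns in a single qr.resid() call, then checks one column
against lm.fit():

```r
set.seed(1)
n <- 1264                                   # individuals, as in the question
Z <- cbind(1, matrix(rnorm(n * 9), n, 9))   # column of ones + 9 simulated covariates
Y <- matrix(rnorm(n * 50), n, 50)           # 50 simulated methylation sites

qrZ <- qr(Z)              # decompose the covariate matrix once
RY  <- qr.resid(qrZ, Y)   # residuals of all 50 columns in one call

# Compare with a conventional fit for the first site
fit <- lm.fit(Z, Y[, 1])
max_diff <- max(abs(RY[, 1] - fit$residuals))
```

max_diff should be zero up to rounding error. With the real data, Y would
be one chunk of methylation columns at a time.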
Using that decomposition, find the residuals for the regressands and for
the `predictor1' (etc.) regressors. The rest is simple least squares. You
compute the correlation coefficient of the residuals of a regressand and
those of a regressor, for each combination. Make a table of critical
values for the p-value(s) you require - remember to get the degrees of
freedom right (i.e. account for the covariates). These correlations of
residuals are the partial correlations given the covariates, and a test on
one of them is algebraically equivalent to the test on the regression
coefficient for the corresponding regressand and regressor in a model that
also includes those 9 covariates.
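
To make the equivalence concrete, here is a rough sketch on simulated
data (all object names and the small number of sites are assumptions for
the demo). Instead of a table of critical values it converts each
partial correlation to a t statistic and a p-value directly, then
compares the first one with what lm() reports for the full model:

```r
set.seed(2)
n <- 1264; k <- 9; m <- 20               # individuals, covariates, sites (m kept small)
covs <- matrix(rnorm(n * k), n, k)       # simulated covariates
x    <- rnorm(n) + covs[, 1]             # simulated predictor1, correlated with a covariate
Y    <- matrix(rnorm(n * m), n, m) + 0.05 * x   # simulated methylation values

qrZ <- qr(cbind(1, covs))                # intercept + covariates, decomposed once
rx  <- qr.resid(qrZ, x)                  # residuals of the predictor
RY  <- qr.resid(qrZ, Y)                  # residuals of every site

# Partial correlation of each site with the predictor, given the covariates
# (residuals have mean zero because the intercept is in the decomposition)
r  <- drop(crossprod(RY, rx)) / sqrt(colSums(RY^2) * sum(rx^2))
df <- n - k - 2                          # n minus intercept, 9 covariates and the predictor
tstat <- r * sqrt(df / (1 - r^2))
pval  <- 2 * pt(-abs(tstat), df)         # two-sided p-values, one per site

# Full model for the first site, for comparison
p_lm <- coef(summary(lm(Y[, 1] ~ x + covs)))["x", "Pr(>|t|)"]
```

pval[1] and p_lm should agree up to rounding error. For each of the other
predictors only rx changes; RY can be reused.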
See:
?qr
?lm.fit
HTH,
Chuck