On Mon, 8 Aug 2016, Ellis, Alicia M wrote:

I have a large dataset with ~500,000 columns and 1264 rows. Each column represents the percent methylation at a given location in the genome. I need to run 500,000 linear models for each of 4 predictors of interest in the form of:

Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... + covariate9
...and save only the p-value for the predictor

The original methylation data file had methylation sites as row labels and individuals as columns, so I read the data in chunks and transposed it. I now have 5 csv files (chunks) with methylation sites as columns and individuals as rows.

I was able to get results for all of the regressions by running each chunk of methylation data separately on our supercomputer using the code below.

This sounds like a problem for my old laptop, not a supercomputer.

You might want to review the algebra and geometry of least squares.

In particular, covariate1 ... covariate9 are the same 1264 x 9 matrix for every problem IIUC. So, you can compute the QR decomposition for that matrix (plus the column of ones for the `intercept') *once* and use it in all the problems.
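
A minimal sketch of that first step, on simulated data rather than the real files. The names covars, meth, Z and qrZ are made up for illustration, and the sizes are shrunk so it runs in a moment. It relies on qr.resid() accepting a matrix of responses, so every site is residualized in one call:

```r
set.seed(1)
n <- 200                                  # individuals (1264 in the real data)
covars <- matrix(rnorm(n * 9), n, 9)      # stand-in for covariate1 ... covariate9
meth   <- matrix(rnorm(n * 50), n, 50)    # stand-in for 50 methylation sites

Z   <- cbind(1, covars)                   # column of ones (intercept) + covariates
qrZ <- qr(Z)                              # the decomposition, computed once

# Residuals of every site on intercept + covariates, all columns at once
res_meth <- qr.resid(qrZ, meth)

# Sanity check: one column against an ordinary lm() fit
check    <- residuals(lm(meth[, 1] ~ covars))
max_diff <- max(abs(res_meth[, 1] - check))
```

If things are set up correctly, max_diff should be zero up to rounding error.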

Using that decomposition, find the residuals for the regressands and for the `predictor1' (etc.) regressors. The rest is simple least squares: compute the correlation coefficient between the residuals of a regressand and those of a regressor, for each combination. Make a table of critical values for the p-value(s) you require - remember to get the degrees of freedom right (i.e. account for the covariates). These correlations of residuals are the partial correlations given the covariates, and a test on one of them is algebraically equivalent to the test on the regression coefficient for the corresponding regressand and regressor in a model that also includes those 9 covariates.
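
Roughly, and again on simulated data with made-up names (covars, pred, meth), the whole calculation might look like the sketch below. Instead of a table of critical values it converts each partial correlation to a t statistic and calls pt() directly, which amounts to the same test. The residual degrees of freedom are n minus the intercept, the 9 covariates and the predictor:

```r
set.seed(2)
n <- 200; k <- 9; m <- 50                 # individuals, covariates, sites
covars <- matrix(rnorm(n * k), n, k)      # stand-in for covariate1 ... covariate9
pred   <- rnorm(n)                        # stand-in for predictor1
meth   <- matrix(rnorm(n * m), n, m)      # stand-in for the methylation sites
meth[, 1] <- meth[, 1] + 0.3 * pred       # give one site a real effect

qrZ <- qr(cbind(1, covars))               # intercept + covariates, factored once
ry  <- qr.resid(qrZ, meth)                # residuals of every regressand
rx  <- qr.resid(qrZ, pred)                # residuals of the regressor

# Partial correlations; the residuals already have mean zero because
# the intercept is part of the decomposition
r <- drop(crossprod(ry, rx)) / sqrt(colSums(ry^2) * sum(rx^2))

df    <- n - (k + 2)                      # intercept + 9 covariates + predictor
tstat <- r * sqrt(df / (1 - r^2))
pval  <- 2 * pt(-abs(tstat), df)          # one p-value per site

# Check the first site against the full regression
fit  <- summary(lm(meth[, 1] ~ pred + covars))
p_lm <- fit$coefficients["pred", "Pr(>|t|)"]
```

pval[1] and p_lm should agree up to rounding. For the real data, rx would be recomputed for each of the 4 predictors while qrZ and ry are reused.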

See:

 ?qr
 ?lm.fit

HTH,

Chuck

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
