Thanks for the reply. Sorry to be a pain, but could you perhaps explain what you mean by "you can center each SNP variable at its mean to make the interaction term uncorrelated with the main effects"? I'm coming from a web-based programming background, so statistical terminology often doesn't click with me.
Also, I have never heard of a score test before, but some googling has turned up the Lagrange multiplier test. Is this the one you meant?

Thanks again,
Davy.

On 12 March 2012 20:04, Thomas Lumley <tlum...@uw.edu> wrote:
> On Tue, Mar 13, 2012 at 7:11 AM, Davy <davykavan...@gmail.com> wrote:
> > Dear useRs,
> >
> > First off, sorry about the long post. I figured it's better to give
> > context to get good answers (I hope!). Some time ago I wrote an R
> > function that gets all pairwise interactions of the variables in a
> > data frame. This worked fine at the time, but now a colleague would
> > like me to do the same with a much larger dataset. They don't know how
> > many variables they will have in the end, but they are guessing
> > approximately 2,500-3,000 variables with ~2,000 observations for each.
> > My function below is far too slow for this (4 minutes for 100
> > variables). At the bottom of this post I have included timings for
> > various numbers of variables and total numbers of interactions. I also
> > have the results of calling Rprof() on the 100-variable run, so if
> > anyone wants to take a look at it, let me know. I don't want to make a
> > long post any longer than it needs to be.
> >
> > What I'd like to know is whether there is anything I can do to speed
> > this function up. I tried going directly to glm.fit, but as far as I
> > understood, for that to be useful the computation of the design
> > matrices and the other set-up work would need to be the same for each
> > model, which is not the case for my analysis, although perhaps I am
> > wrong about this. I also took a look at speedglm, but it seems to be
> > aimed at the analysis of single large datasets; my problem is, in
> > effect, many small ones, so I'm not sure that package can help.
> >
> > Any ideas on how to make this run faster would be greatly appreciated.
> > I am planning to parallelise the analysis in the end. I don't know how
> > many CPUs I will have access to, but I'd say no more than 8. In
> > addition, I have been advised that the best way to account for
> > multiple testing here is a permutation test, which in this context
> > becomes almost infeasible, since I would have to run each interaction
> > ~10,000 times.
> >
> > Thanks in advance,
> > Cheers,
> > Davy.
> >
> > getInteractions2 = function(data, fSNPcol, ccCol)
> > {
> >   # fSNPcol is the number of the column that contains the first SNP
> >   # ccCol is the number of the column that contains the outcome variable
> >   require(lmtest)
> >   snps = names(data)[-1:-(fSNPcol - 1)]
> >   names(data)[ccCol] = "PHENOTYPE"
> >   terms = as.data.frame(t(combn(snps, 2)))   # all pairs of SNP names
> >   attach(data)
> >   pval = c()
> >   for (i in 1:nrow(terms)) {
> >     v1 = get(as.character(terms$V1[i]))
> >     v2 = get(as.character(terms$V2[i]))
> >     fit1 = glm(PHENOTYPE ~ v1 + v2, family = "binomial")
> >     fit2 = glm(PHENOTYPE ~ v1 + v2 + I(v1 * v2), family = "binomial")
> >     a = lrtest(fit1, fit2)
> >     pval = c(pval, a[2, "Pr(>Chisq)"])
> >   }
> >   detach(data)
> >   results = cbind(terms, pval)
> >   return(results)
> > }
> >
> > The table below shows the system.time results for increasing numbers
> > of variables being passed through the function.
> > n is the number of variables, and Ints is the number of pairwise
> > interactions given by that number of variables.
> >
> >          n  Ints user.self sys.self elapsed
> > time    10    45      1.20     0.00    1.30
> > time    15   105      3.40     0.00    3.43
> > time    20   190      6.62     0.00    6.85
> > ...
> > time    90  4005    178.04     0.07  195.84
> > time    95  4465    199.97     0.13  218.30
> > time   100  4950    221.15     0.08  242.18
> >
> > Some code to reproduce a data frame, in case you want to look at the
> > timings or the Rprof() results. Please don't run this unless your
> > machine is very fast, or you're prepared to wait about 15-20 minutes.
> >
> > df = data.frame(paste("sid", 1:2000, sep = ""), rbinom(2000, 1, .5))
> > gtypes = matrix(nrow = 2000, ncol = 3000)
> > gtypes = apply(gtypes, 2, function(x) sample(0:2, 2000, replace = TRUE))
> > snps = paste("rs", 1000:3999, sep = "")
> > df = cbind(df, gtypes)
> > names(df) = c("sid", "status", snps)
> >
> > times = c()
> > for (i in seq(10, 100, by = 5)) {
> >   if (i == 100) Rprof()
> >   time = system.time(pvals <- getInteractions2(df[, 1:i], 3, 2))
> >   print(time)
> >   times = rbind(times, time)
> >   if (i == 100) Rprof(NULL)
> > }
> >
> > numI = function(n) ((n^2) - n) / 2
> > timings = cbind(seq(10, 100, by = 5), sapply(seq(10, 100, by = 5), numI), times)
>
> The first step is to use glm.fit rather than glm. This leads to simpler
> as well as faster code in your context -- you feed it a design matrix
> rather than a formula. That's probably a factor of two or so.
>
> Using score tests rather than likelihood ratio tests would also save
> some time, but again only a factor of two or so.
>
> It looks as though you have SNP data. If the SNPs are not in linkage
> disequilibrium and the main effects are fairly small, you can center
> each SNP variable at its mean to make the interaction term uncorrelated
> with the main effects, and prescreen by just testing for correlation
> between the interaction variable and the phenotype. Then go back and
> fit the models for the few pairs that look promising.
>
> Note that in general you *cannot* do a permutation test to handle
> multiple testing for interactions
> [http://www.ncbi.nlm.nih.gov/pubmed/20384625]. For the permutation null
> hypothesis to be true you need additional structure, such as zero main
> effects. You can do a parametric bootstrap, which is similar. Or just
> choose a threshold based on the expected number of false positives,
> rather than trying to get strong control of Type I error.
>
>    -thomas
>
> --
> Thomas Lumley
> Professor of Biostatistics
> University of Auckland

--
David Kavanagh
Nephrology Research, Centre of Public Health,
Queen's University Belfast,
A floor, Tower Block, City Hospital,
Lisburn Road, BT9 7AB, Belfast, United Kingdom
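
To make the glm.fit suggestion concrete, here is a rough sketch of what it
might look like. This is illustrative only -- fit_pair and the variable
names are not from the thread; it assumes snp1 and snp2 are numeric 0/1/2
vectors and y is the 0/1 phenotype. The point is that building the two
design matrices by hand skips the formula and model.matrix() machinery
that glm() redoes for every pair, and the likelihood ratio test comes
straight from the deviances:

# Sketch: fit main-effects and interaction models for one SNP pair
# via glm.fit, and return the likelihood ratio test p-value.
fit_pair = function(snp1, snp2, y)
{
  X0 = cbind(1, snp1, snp2)       # intercept + main effects only
  X1 = cbind(X0, snp1 * snp2)     # same, plus the interaction column
  f0 = glm.fit(X0, y, family = binomial())
  f1 = glm.fit(X1, y, family = binomial())
  # LR test: drop in deviance between nested models, 1 df
  pchisq(f0$deviance - f1$deviance, df = 1, lower.tail = FALSE)
}

# e.g. fit_pair(df$rs1000, df$rs1001, df$status)

If you stay with glm() objects for the final fits, I believe recent
versions of R also expose a score test -- the Lagrange multiplier test
asked about above -- via anova(fit1, fit2, test = "Rao").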
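
And the centering-plus-prescreening idea might look something like the
sketch below. Again this is only an illustration: prescreen, G, y and the
1e-4 cutoff are made-up names and values, and G is assumed to be an n x p
numeric matrix of 0/1/2 genotypes with y the 0/1 phenotype.

# Sketch: center each SNP column, then screen pairs by correlating the
# product term with the phenotype; only survivors get a full glm fit.
prescreen = function(G, y, cutoff = 1e-4)
{
  Gc = scale(G, center = TRUE, scale = FALSE)   # subtract each column mean
  pairs = t(combn(ncol(Gc), 2))                 # all column pairs
  pvals = numeric(nrow(pairs))
  for (k in 1:nrow(pairs)) {
    z = Gc[, pairs[k, 1]] * Gc[, pairs[k, 2]]   # centered interaction term
    pvals[k] = cor.test(z, y)$p.value           # cheap correlation screen
  }
  # candidate pairs to refit properly with glm()/lrtest()
  cbind(pairs, pvals)[pvals < cutoff, , drop = FALSE]
}

The centering step is also what the first question above is asking about:
"centering at its mean" just means replacing each SNP column x with
x - mean(x). If the two SNPs are independent (no LD), the product of two
centered SNPs is uncorrelated with each SNP on its own, which is what
makes the cheap correlation screen a reasonable stand-in for the full
model fit.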