Hello, I made a terribly inefficient code which runs forever but it does run.
library(dplyr) library(splitstackshape) datalist = list() files <- list.files("/WEIGHTS1/Retina", pattern=".RDat", ignore.case=T) for(i in files) { a<-get(load(i)) names <- rownames(a) data <- as.data.frame(cbind(names,a)) rownames(data) <- NULL dd=na.omit(concat.split.multiple(data = data, split.cols = c("names"), seps = ":")) dd=select(dd,names_1,blup,names_3,names_4) colnames(dd)=c("rsid","weight","ref_allele","eff_allele") dd$WGT<-i datalist[[i]] <- dd # add it to your list } big_data = do.call(rbind, datalist) There is 17345 RDat files this loop has to go through. And each file has approximately 10,000 lines. All RDat files can be downloaded from here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115828 and they are compressed in this file: GSE115828_retina_TWAS_wgts.tar.gz . And subset of 3 of those .RDat files is here: https://github.com/montenegrina/sample For one of those files, say i="retina.ENSG00000135776.wgt.RDat" dd looks like this: > head(dd) rsid weight ref_allele eff_allele 1: rs72763981 9.376766e-09 C G 2: rs144383755 -2.093346e-09 A G 3: rs1925717 1.511376e-08 T C 4: rs61827307 -1.625302e-08 C A 5: rs61827308 -1.625302e-08 G C 6: rs199623136 -9.128354e-10 GC G WGT 1: retina.ENSG00000135776.wgt.RDat 2: retina.ENSG00000135776.wgt.RDat 3: retina.ENSG00000135776.wgt.RDat 4: retina.ENSG00000135776.wgt.RDat 5: retina.ENSG00000135776.wgt.RDat 6: retina.ENSG00000135776.wgt.RDat so on attempt to parallelize this I did this: library(parallel) library(data.table) library(foreach) library(doSNOW) n <- parallel::detectCores() cl <- parallel::makeCluster(n, type = "SOCK") doSNOW::registerDoSNOW(cl) files <- list.files("/WEIGHTS1/Retina", pattern=".RDat", ignore.case=T) lst_out <- foreach::foreach(i = seq_along(files), .packages = c("data.table") ) %dopar% { a <- get(load(files[i])) names <- rownames(a) data <- data.table(names, a["blup"]) nm1 <- c("rsid", "ref_allele", "eff_allele") data[, (nm1) := tstrsplit(names, ":")[-2]] return(data[, .(rsid, weight = blup, ref_allele, eff_allele)][, WGT := files[i]][]) } parallel::stopCluster(cl) big_data <- rbindlist(lst_out) I am getting this Error: Error in { : task 7 failed - "object 'blup' not found" > parallel::stopCluster(cl) Can you please advise, Ana ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.