Hello, For basic linear regression lm() does the job well, for datasets that are larger than memory biglm() seems to work.
I'm working on a parallel implementation of multiple linear regression for datasets that are too large for memory. Currently I am working over least squares: calculating: t(X) %*% X and t(X) %*% y separately in parallel on each node This generates a small nxn matrix and an n vector which can then be brought together, summed, and solved for the linear regression result: ( solve(t(X) %*% X) ) %*% ( t(X) %*% y ) Except that this can have problems with t(X) %*% X being singular. My understanding is that both lm() and biglm() deal with this via using QR decomposition, but I can't figure out how this technique would be combined with intermediate summary results which is required for the parallelism technique to work. Does anyone have any pointers how this might be accomplished? Thanks, Caleb ------- # Bigger example # Introduce artificial colinearity to test case P = cbind(USArrests[2]*2, USArrests) names(P) = c("Introduced", "Murder", "Assault", "UrbanPop", "Rape") # Split the data into partitions to be calculated separately # In the real case the full data would have been too large and # each instance would have received just a chunk of data. P1 = P[1:25,] P2 = P[26:50] # Partition 1 - calculated on host 1 P1_X = as.matrix(cbind(1, P1[,c("UrbanPop", "Assault", "Rape", "Introduced")])) P1_A = t(P1_X) %*% P1_X P1_b = t(P1_X) %*% P1[,"Murder"] # Partition 2 - calculated on host 2 P2_X = as.matrix(cbind(1, P2[,c("UrbanPop", "Assault", "Rape", "Introduced")])) P2_A = t(P2_X) %*% P2_X P2_b = t(P2_X) %*% P2[,"Murder"] # repeat for partitions 3:N # ... # collect data from the all the partitions, this is the key for # the parallelism technique, I don't know how to do this step if # I was somehow doing QR decomposition. A = P1_A + P2_A # ... + P3_A ... + PN_A b = P1_b + P2_b # ... + P3_b ... + PN_b # calculate regression, this fails because of sigularity solve(A) %*% b # If I exclude the introduced column it works, but I'm not # sure how this would be generalized. solve(A[-5,-5]) %*% b[-5] # Compare to lm() lm(Murder~UrbanPop+Assault+Rape+Introduced, P) ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.