Taking it back - no need for all.x = T, all.y = F

On Tue, Dec 22, 2015 at 3:56 PM, Dimitri Liakhovitski
<dimitri.liakhovit...@gmail.com> wrote:
> Actually, the correct merge line should be:
> my.merge <- merge(myinfo, mydata, by="version", all.x = T, all.y = F)
>
> On Tue, Dec 22, 2015 at 3:50 PM, Dimitri Liakhovitski
> <dimitri.liakhovit...@gmail.com> wrote:
>> You are right, guys, merge is working. Somehow I was under the
>> erroneous impression that because the second data frame (myinfo)
>> contains no column 'myid' merge will not work.
>> Below is the cleaner code and comparison:
>>
>> #########################################
>> ### Example with smaller data frames
>> #########################################
>>
>> set.seed(123)
>> mydata <- data.frame(myid = 1001:1020,
>>                      version = sample(1:10, 20, replace = T))
>> head(mydata)
>> table(mydata$version)
>>
>> set.seed(12)
>> myinfo <- data.frame(version = sort(rep(1:10, 5)), a = rnorm(50), b =
>> rnorm(50), c = rnorm(50), d = rnorm(50))
>> head(myinfo, 40)
>> table(myinfo$version)
>>
>> ###----------------------------------------
>> ### METHOD 1 - Looping through each id of mydata and grabbing
>> ### all columns of myinfo for the corresponding 'version':
>>
>>
>> # Create placeholder list for the results:
>> result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
>> length(result)
>> (result)[1:3]
>>
>>
>> # Looping through each element of 'result':
>> for(i in 1:length(result)){
>>   id <- result[[i]]$myid
>>   result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
>>   result[[i]]$myid <- id
>>   result[[i]] <- result[[i]][c(6, 1:5)]
>> }
>> result <- do.call(rbind, result)
>> result.order <- arrange(result, myid, version, a, b, c, d)
>> head(result.order) # This is the desired result
>>
>> ###----------------------------------------
>> ### METHOD 2 - merge
>>
>> my.merge <- merge(myinfo, mydata, by="version")
>> names(my.merge)
>> result2 <- my.merge[,c("myid", "version", "a", "b", "c", "d")]
>> names(result2)
>> result2.order <- arrange(result2, myid, version, a, b, c, d)
>> dim(result2.order)
>> head(result2.order)
>>
>> # Same result?
>> all.equal(result.order, result2.order)
>>
>> On Tue, Dec 22, 2015 at 3:34 PM, Dimitri Liakhovitski
>> <dimitri.liakhovit...@gmail.com> wrote:
>>> I know I am overwriting.
>>> merge doesn't solve it because each version in mydata is given to more
>>> than one id. Hence, I thought I can't merge by version.
>>> I am not sure how to answer the question about "the problem".
>>> I described the current state and the desired state. If possible, I'd
>>> like to get from the current state to the desired state faster than
>>> when using a loop.
>>>
>>> On Tue, Dec 22, 2015 at 2:26 PM, jim holtman <jholt...@gmail.com> wrote:
>>>> You seem to be saving 'myid' and then overwriting it with the last
>>>> statement:
>>>>
>>>> result[[i]] <- result[[i]][c(5, 1:4)]
>>>>
>>>> Why doesn't 'merge' work for you? I tried it on your data, and seem to get
>>>> back the same number of rows; may not be in the same order, but the content
>>>> looks the same, and it does have 'myid' on it.
>>>>
>>>>
>>>> Jim Holtman
>>>> Data Munger Guru
>>>>
>>>> What is the problem that you are trying to solve?
>>>> Tell me what you want to do, not how you want to do it.
>>>>
>>>> On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski
>>>> <dimitri.liakhovit...@gmail.com> wrote:
>>>>>
>>>>> Hello!
>>>>> I have a solution for my task that is based on a loop. However, it's
>>>>> too slow for my real-life problem that is much larger in scope.
>>>>> However, I cannot use merge. Any advice on how to do it faster?
>>>>> Thanks a lot for any hint on how to speed it up!
>>>>>
>>>>> # I have 'mydata' data frame:
>>>>> set.seed(123)
>>>>> mydata <- data.frame(myid = 1001:1100,
>>>>>                      version = sample(1:20, 100, replace = T))
>>>>> head(mydata)
>>>>> table(mydata$version)
>>>>>
>>>>> # I have 'myinfo' data frame that contains information for each 'version':
>>>>> set.seed(12)
>>>>> myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b =
>>>>> rnorm(60),
>>>>>                      c = rnorm(60), d = rnorm(60))
>>>>> head(myinfo, 40)
>>>>>
>>>>> ### MY SOLUTION WITH A LOOP:
>>>>> ### Looping through each id of mydata and grabbing
>>>>> ### all columns from 'myinfo' for the corresponding 'version':
>>>>>
>>>>> # 1. Creating placeholder list for the results:
>>>>> result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
>>>>> length(result)
>>>>> (result)[1:3]
>>>>>
>>>>>
>>>>> # 2. Looping through each element of 'result':
>>>>> for(i in 1:length(result)){
>>>>>   id <- result[[i]]$myid
>>>>>   result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
>>>>>   result[[i]]$myid <- id
>>>>>   result[[i]] <- result[[i]][c(5, 1:4)]
>>>>> }
>>>>> result <- do.call(rbind, result)
>>>>> head(result) # This is the desired result
>>>>>
>>>>> --
>>>>> Dimitri Liakhovitski
>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Dimitri Liakhovitski
>>
>>
>>
>> --
>> Dimitri Liakhovitski
>
>
>
> --
> Dimitri Liakhovitski
--
Dimitri Liakhovitski
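
A minimal sketch of why the plain merge is enough, using tiny made-up frames rather than the data from the thread (the values, and the names inner, left and loop, are illustrative assumptions). By default merge() keeps only the 'version' values found in both frames, which matches what the loop builds; all.x = TRUE would additionally keep 'myinfo' rows for versions that no id uses, with myid set to NA.

```r
# Hypothetical toy data: versions 3 and 5 are not assigned to any id
mydata <- data.frame(myid = 1001:1004, version = c(2L, 1L, 2L, 4L))
myinfo <- data.frame(version = rep(1:5, each = 2),
                     a = seq(0.1, 1.0, by = 0.1))

# Default merge: inner join on 'version'
inner <- merge(myinfo, mydata, by = "version")

# all.x = TRUE: also keeps 'myinfo' rows for unused versions, with myid = NA
left <- merge(myinfo, mydata, by = "version", all.x = TRUE)

# Loop-style result: for each id, grab all 'myinfo' rows of its version
loop <- do.call(rbind, lapply(seq_len(nrow(mydata)), function(i) {
  rows <- myinfo[myinfo$version == mydata$version[i], ]
  cbind(myid = mydata$myid[i], rows)
}))

# Put both results in the same column and row order before comparing
cols <- c("myid", "version", "a")
loop_sorted  <- loop[order(loop$myid, loop$version, loop$a), cols]
inner_sorted <- inner[order(inner$myid, inner$version, inner$a), cols]
rownames(loop_sorted)  <- NULL
rownames(inner_sorted) <- NULL

all.equal(loop_sorted, inner_sorted)  # TRUE: plain merge matches the loop
nrow(inner)                           # 8
sum(is.na(left$myid))                 # 4 extra rows that the loop never produces
```

When every version in 'myinfo' is used by at least one id, both merge calls return the same rows; the difference only shows up when some versions are unused.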