Actually, the correct merge line should be: my.merge <- merge(myinfo, mydata, by="version", all.x = T, all.y = F)
On Tue, Dec 22, 2015 at 3:50 PM, Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com> wrote: > You are right, guys, merge is working. Somehow I was under the > erroneous impression that because the second data frame (myinfo) > contains no column 'myid' merge will not work. > Below is the cleaner code and comparison: > > ######################################### > ### Example with smaller data frames > ######################################### > > set.seed(123) > mydata <- data.frame(myid = 1001:1020, > version = sample(1:10, 20, replace = T)) > head(mydata) > table(mydata$version) > > set.seed(12) > myinfo <- data.frame(version = sort(rep(1:10, 5)), a = rnorm(50), b = > rnorm(50), c = rnorm(50), d = rnorm(50)) > head(myinfo, 40) > table(myinfo$version) > > ###---------------------------------------- > ### METHOD 1 - Looping through each id of mydata and grabbing > ### all columns of myinfo for the corresponding 'version': > > > # Create placeholder list for the results: > result <- split(mydata[c("myid", "version")], f = list(mydata$myid)) > length(result) > (result)[1:3] > > > # Looping through each element of 'result': > for(i in 1:length(result)){ > id <- result[[i]]$myid > result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ] > result[[i]]$myid <- id > result[[i]] <- result[[i]][c(6, 1:5)] > } > result <- do.call(rbind, result) > result.order <- arrange(result, myid, version, a, b, c, d) > head(result.order) # This is the desired result > > ###---------------------------------------- > ### METHOD 2 - merge > > my.merge <- merge(myinfo, mydata, by="version") > names(my.merge) > result2 <- my.merge[,c("myid", "version", "a", "b", "c", "d")] > names(result2) > result2.order <- arrange(result2, myid, version, a, b, c, d) > dim(result2.order) > head(result2.order) > > # Same result? > all.equal(result.order, result2.order) > > On Tue, Dec 22, 2015 at 3:34 PM, Dimitri Liakhovitski > <dimitri.liakhovit...@gmail.com> wrote: >> I know I am overwriting. >> merge doesn't solve it because each version in mydata is given to more >> than one id. Hence, I thought I can't merge by version. >> I am not sure how to answer the question about "the problem". >> I described the current state and the desired state. If possible, I'd >> like to get from the current state to the desired state faster than >> when using a loop. >> >> On Tue, Dec 22, 2015 at 2:26 PM, jim holtman <jholt...@gmail.com> wrote: >>> You seem to be saving 'myid' and then overwriting it with the last >>> statement: >>> >>> result[[i]] <- result[[i]][c(5, 1:4)] >>> >>> Why doesn't 'merge' work for you? I tried it on your data, and seem to get >>> back the same number of rows; may not be in the same order, but the content >>> looks the same, and it does have 'myid' on it. >>> >>> >>> Jim Holtman >>> Data Munger Guru >>> >>> What is the problem that you are trying to solve? >>> Tell me what you want to do, not how you want to do it. >>> >>> On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski >>> <dimitri.liakhovit...@gmail.com> wrote: >>>> >>>> Hello! >>>> I have a solution for my task that is based on a loop. However, it's >>>> too slow for my real-life problem that is much larger in scope. >>>> However, I cannot use merge. Any advice on how to do it faster? >>>> Thanks a lot for any hint on how to speed it up! >>>> >>>> # I have 'mydata' data frame: >>>> set.seed(123) >>>> mydata <- data.frame(myid = 1001:1100, >>>> version = sample(1:20, 100, replace = T)) >>>> head(mydata) >>>> table(mydata$version) >>>> >>>> # I have 'myinfo' data frame that contains information for each 'version': >>>> set.seed(12) >>>> myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b = >>>> rnorm(60), >>>> c = rnorm(60), d = rnorm(60)) >>>> head(myinfo, 40) >>>> >>>> ### MY SOLUTION WITH A LOOP: >>>> ### Looping through each id of mydata and grabbing >>>> ### all columns from 'myinfo' for the corresponding 'version': >>>> >>>> # 1. Creating placeholder list for the results: >>>> result <- split(mydata[c("myid", "version")], f = list(mydata$myid)) >>>> length(result) >>>> (result)[1:3] >>>> >>>> >>>> # 2. Looping through each element of 'result': >>>> for(i in 1:length(result)){ >>>> id <- result[[i]]$myid >>>> result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ] >>>> result[[i]]$myid <- id >>>> result[[i]] <- result[[i]][c(5, 1:4)] >>>> } >>>> result <- do.call(rbind, result) >>>> head(result) # This is the desired result >>>> >>>> -- >>>> Dimitri Liakhovitski >>>> >>>> ______________________________________________ >>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide >>>> http://www.R-project.org/posting-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>> >>> >> >> >> >> -- >> Dimitri Liakhovitski > > > > -- > Dimitri Liakhovitski -- Dimitri Liakhovitski ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.