> This is the exact solution I came up with ...

"exact", really?

Is the time-consuming part the initial merge

   DFm <- merge(DF1, DF2, by=c("X.DATE", "X.TIME"), all=TRUE)

or the postprocessing to turn runs of NAs into the last non-NA
value in the column

   while(any(is.na(DFm))){
      if (any(is.na(DFm[1,]))) stop("Complete first row required!")
      ind <- which(is.na(DFm), arr.ind=TRUE)
      prind <- matrix(c(ind[,"row"]-1, ind[,"col"]), ncol=2)
      DFm[is.na(DFm)] <- DFm[prind]
   }

If it is the latter, you may get better results from applying zoo::na.locf()
to each non-key column of DFm.  E.g.,

   library(zoo)
   f2 <- function(DFm) {
      # columns 1 and 2 are the merge keys (X.DATE, X.TIME)
      for(i in 3:length(DFm)) {
         DFm[[i]] <- na.locf(DFm[[i]])
      }
      DFm
   }

f2(DFm) gives the same result as Blaser's algorithm

   f1 <- function (DFm) {
      while (any(is.na(DFm))) {
         if (any(is.na(DFm[1, ])))
            stop("Complete first row required!")
         ind <- which(is.na(DFm), arr.ind = TRUE)
         prind <- matrix(c(ind[, "row"] - 1, ind[, "col"]), ncol = 2)
         DFm[is.na(DFm)] <- DFm[prind]
      }
      DFm
   }

If there are not a huge number of columns I would guess that f2() would be much
faster.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf
> Of Adeel - SafeGreenCapital
> Sent: Thursday, May 23, 2013 5:54 AM
> To: 'Blaser Nello'; r-help@r-project.org
> Subject: Re: [R] adding rows without loops
>
> Thank you Blaser:
>
> This is the exact solution I came up with but when comparing 8M rows even on
> an 8G machine, one runs out of memory. To run this effectively, I have to
> break the DF into smaller DFs, loop through them and then do a massive
> rmerge at the end. That's what takes 8+ hours to compute.
>
> Even the bigmemory package is causing OOM issues.
>
> -----Original Message-----
> From: Blaser Nello [mailto:nbla...@ispm.unibe.ch]
> Sent: Thursday, May 23, 2013 12:15 AM
> To: Adeel Amin; r-help@r-project.org
> Subject: RE: [R] adding rows without loops
>
> Merge should do the trick.
> How to best use it will depend on what you
> want to do with the data after.
> The following is an example of what you could do. This will perform
> best if the rows are missing at random and do not cluster.
>
> DF1 <- data.frame(X.DATE=rep(01052007, 7), X.TIME=c(2:5,7:9)*100,
>                   VALUE=c(37, 42, 45, 45, 45, 42, 45), VALUE2=c(29,24,28,27,35,32,32))
> DF2 <- data.frame(X.DATE=rep(01052007, 7), X.TIME=c(2:8)*100,
>                   VALUE=c(37, 42, 45, 45, 45, 42, 45), VALUE2=c(29,24,28,27,35,32,32))
>
> DFm <- merge(DF1, DF2, by=c("X.DATE", "X.TIME"), all=TRUE)
>
> while(any(is.na(DFm))){
>   if (any(is.na(DFm[1,]))) stop("Complete first row required!")
>   ind <- which(is.na(DFm), arr.ind=TRUE)
>   prind <- matrix(c(ind[,"row"]-1, ind[,"col"]), ncol=2)
>   DFm[is.na(DFm)] <- DFm[prind]
> }
> DFm
>
> Best,
> Nello
>
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
> On Behalf Of Adeel Amin
> Sent: Thursday, 23 May 2013 07:01
> To: r-help@r-project.org
> Subject: [R] adding rows without loops
>
> I'm comparing a variety of datasets with over 4M rows. I've solved this
> problem 5 different ways using a for/while loop but the processing time
> is murder (over 8 hours doing this row by row per data set). As such
> I'm trying to find whether this solution is possible without a loop or
> one in which the processing time is much faster.
>
> Each dataset is a time series as such:
>
> DF1:
>
>     X.DATE X.TIME VALUE VALUE2
> 1 01052007   0200    37     29
> 2 01052007   0300    42     24
> 3 01052007   0400    45     28
> 4 01052007   0500    45     27
> 5 01052007   0700    45     35
> 6 01052007   0800    42     32
> 7 01052007   0900    45     32
> ...
> ...
> ...
> n
>
> DF2
>
>     X.DATE X.TIME VALUE VALUE2
> 1 01052007   0200    37     29
> 2 01052007   0300    42     24
> 3 01052007   0400    45     28
> 4 01052007   0500    45     27
> 5 01052007   0600    45     35
> 6 01052007   0700    42     32
> 7 01052007   0800    45     32
>
> ...
> ...
> n+4000
>
> In other words there are 4000 more rows in DF2 than DF1, thus the
> datasets are of unequal length.
>
> I'm trying to ensure that all dataframes have the same number of X.DATE
> and X.TIME entries. Where they are missing, I'd like to insert a new
> row.
>
> In the above example, when comparing DF2 to DF1, the 01052007 0600
> entry is missing in DF1. The solution would add a row to DF1 at the
> appropriate index.
>
> so new dataframe would be
>
>     X.DATE X.TIME VALUE VALUE2
> 1 01052007   0200    37     29
> 2 01052007   0300    42     24
> 3 01052007   0400    45     28
> 4 01052007   0500    45     27
> 5 01052007   0600    45     27
> 6 01052007   0700    45     35
> 7 01052007   0800    42     32
> 8 01052007   0900    45     32
>
> Value and Value2 would be the same as row 4.
>
> Of course this is simple to accomplish using a row by row analysis but
> with 4M rows the processing time destroying and rebinding the
> datasets is very time consuming and I believe highly un-R'ish. What am
> I missing?
>
> Thanks!
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
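
For reference, the approach discussed in this thread (merge on the key columns, then carry the last non-NA value forward) might be sketched in base R as below. This is only an illustration under stated assumptions, not code from any of the posters: the helper name locf() and the object names keys and filled are made up here, the keys are stored as character strings so the leading zeros survive, and the key columns are assumed to sort chronologically as strings (true within a single date, not necessarily across dates in this date format). zoo::na.locf() could be used in place of the helper.

```r
# Hypothetical helper: last observation carried forward, vectorised.
# Leading NAs (nothing to carry forward yet) are left as NA.
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of the latest non-NA so far
  idx[idx == 0] <- NA                      # 0 means no non-NA value seen yet
  x[idx]
}

key <- c("X.DATE", "X.TIME")

# Sample data from the thread, with keys kept as character strings.
DF1 <- data.frame(X.DATE = "01052007",
                  X.TIME = c("0200", "0300", "0400", "0500",
                             "0700", "0800", "0900"),
                  VALUE  = c(37, 42, 45, 45, 45, 42, 45),
                  VALUE2 = c(29, 24, 28, 27, 35, 32, 32),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(X.DATE = "01052007",
                  X.TIME = c("0200", "0300", "0400", "0500",
                             "0600", "0700", "0800"),
                  VALUE  = c(37, 42, 45, 45, 45, 42, 45),
                  VALUE2 = c(29, 24, 28, 27, 35, 32, 32),
                  stringsAsFactors = FALSE)

# Every date/time stamp that occurs in either data set.
keys <- unique(rbind(DF1[key], DF2[key]))

# Left join: one row per time stamp, NA where DF1 had no observation.
# merge() sorts the result on the key columns by default.
filled <- merge(keys, DF1, by = key, all.x = TRUE)

# Fill each non-key column with one vectorised pass, no row-by-row loop.
for (col in setdiff(names(filled), key)) {
  filled[[col]] <- locf(filled[[col]])
}
filled
```

Merging only the key columns of DF2, rather than the whole data frame, keeps the intermediate result to one copy of DF1's value columns, which may help with the memory problems mentioned above.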