> -----Original Message----- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of Ali Salekfard > Sent: Wednesday, December 29, 2010 6:25 AM > To: r-help@r-project.org > Subject: Re: [R] Removing rows with earlier dates > > Thanks to everyone. Joshua's response seemed the most concise > one, but it > used up so much memory that my R just gave error. I checked the other > replies and all in all I came up with this, and thought to > share it with > others and get comments. > > My structure was as follows: > > ACCOUNT RULE DATE > A1 xxxx 2010-01-01 > A2 xxxx 2007-05-01 > A2 xxxx 2007-05-01 > A2 xxxx 2005-05-01 > A2 xxxx 2005-05-01 > A1 xxxx 2009-01-01
This printout is not really sufficient to tell us what is in your dataset. E.g., I tried to convert it to a data.frame with the following code my.mapping.Date <- read.table(header=TRUE, colClasses=c("character","character","Date"), textConnection(" ACCOUNT RULE DATE A1 Rule1 2010-01-01 A2 Rule2 2007-05-01 A2 Rule3 2007-05-01 A2 Rule4 2005-05-01 A2 Rule5 2005-05-01 A1 Rule6 2009-01-01") ) and your processing code failed in the as.Date(a,"%Y-%m-%d") step because tapply() corrupts things of class Date (it turns them into integers). tapply() often has problems dealing with nontrivial data classes. If I read in the DATE column as character data then your code doesn't crash. (I did not try it with the default factors for all columns.) my.mapping.character <- read.table(header=TRUE, colClasses=c("character","character","character"), textConnection(" ACCOUNT RULE DATE A1 Rule1 2010-01-01 A2 Rule2 2007-05-01 A2 Rule3 2007-05-01 A2 Rule4 2005-05-01 A2 Rule5 2005-05-01 A1 Rule6 2009-01-01") ) f0 <- function (my.mapping) { # your code converted to a function so it doesn't # overwrite its input and so it can be easily compared # with other functions. a <- tapply(my.mapping$DATE, my.mapping$ACCOUNT, max) a <- data.frame(ACCOUNT = names(a), DT = as.Date(a, "%Y-%m-%d")) my.mapping <- merge(x = my.mapping, y = a, by.x = "ACCOUNT", by.y = "ACCOUNT") my.mapping <- cbind(my.mapping, TAKE = my.mapping$DATE == my.mapping$DT) my.mapping <- my.mapping[my.mapping$TAKE == TRUE, ] my.mapping } > f0(my.mapping.character) ACCOUNT RULE DATE DT TAKE 1 A1 Rule1 2010-01-01 2010-01-01 TRUE 3 A2 Rule2 2007-05-01 2007-05-01 TRUE 4 A2 Rule3 2007-05-01 2007-05-01 TRUE In your original post you wrote > What I would like to do is to create a data frame > with only the most recent rule for each account. but your code gives 2 rules for account A2, because there is a tie in the dates. Is that what you want? It makes thinks much simpler for R-helpers if a request for help includes details how how to make a typical input object and exactly what is wanted to be done. In the runs-based approach I suggested, ties are broken by the original order of the file. Returning all rules for the maximum date would be more complicated using this approach. isLastInRun <- function (x, ...) { retval <- c(x[-1] != x[-length(x)], TRUE) for (y in list(...)) { stopifnot(length(x) == length(y)) retval <- retval | c(x[-1] != x[-length(x)], TRUE) } retval } f2 <- function(data) { o <- order(data[, "ACCOUNT"], data[, "DATE"]) tmp <- logical(length(o)) tmp[o] <- isLastInRun(data[o, "ACCOUNT"]) data[tmp,] } f2() works on either class of DATE column. It returns the same class of DATE as the input class, because it just returns a subset of the rows of the original data.frame. The row names/numbers in the output show which rows of the input were selected. > f2(my.mapping.Date) ACCOUNT RULE DATE 1 A1 Rule1 2010-01-01 3 A2 Rule3 2007-05-01 > f2(my.mapping.character) ACCOUNT RULE DATE 1 A1 Rule1 2010-01-01 3 A2 Rule3 2007-05-01 I generated a random dataset with 1 million rows (and c. 2.3 rules/account) with gen <- function (n, Date = FALSE) { set.seed(1) d <- data.frame(stringsAsFactors = FALSE, ACCOUNT = paste(sep = "", "A", sample(floor(n/2), size = n, replace = TRUE)), RULE = paste(sep = "", "Rule", 1:n), DATE = sprintf("%04d-%02d-%02d", sample(1995:2010, size = n, replace = TRUE), sample(1:12, size = n, replace = TRUE), sample(1:28, size = n, replace = TRUE)) ) if (Date) { d$DATE <- as.Date(d$DATE) } d } d6 <- gen(n=10^6, Date=FALSE) and got the following processing times > system.time(r0 <- f0(d6)) user system elapsed 79.96 0.36 73.94 > system.time(r2 <- f2(d6)) user system elapsed 19.81 0.02 18.18 For n=10^5, the times were 4.47 for f0 and 1.23 for f2. The outputs differed only in that f0 returned all rules for the last date for an account and f2 returned only the last rule. If you want all rules for the last date (and don't mind the failure if your dataset contains Date columns) then you may prefer f0. Otherwise you may prefer f2 for its speed. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com > > The most efficient solution I came across involves the > following steps: > > 1. Find the latest date for each account, and convert it to a > data frame: > > a<-tapply(my.mapping$DATE,my.mapping$ACCOUNT,max) > a<-data.frame(ACCOUNT=names(a),DT=as.Date(a,"%Y-%m-%d")) > 2. merge the set with the original data > > my.mapping<-merge(x=my.mapping,y=a,by.x="ACCOUNT",by.y="ACCOUNT") > > 3. Create a take column, which is to confirm if the date of > the row is the > maximum date for the account. > my.mapping<-cbind(my.mapping,TAKE=my.mapping$DATE==my.mapping$DT) > 4. Filter out all lines except those with TAKE==TRUE. > > my.mapping<-my.mapping[my.mapping$TAKE==TRUE,] > The running time for my whole list was 4.5 sec which is far > better than any > other ways I tried. Let me have your thoughts on that. > > Ali > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.