Yes, if the Special_Dates are not sorted then f4 needs to sort them. Perhaps closestValue should just sort its vec argument.
I didn't realize that the output should not have any duplicate entries. I thought it should have the same number of rows as the input A. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com > -----Original Message----- > From: Rui Barradas [mailto:ruipbarra...@sapo.pt] > Sent: Sunday, August 19, 2012 3:54 PM > To: William Dunlap; Francesco > Cc: r-help > Subject: Re: [R] merging and obtaining the nearest value > > Hello, > > You're right, your solution is much faster, but it doesn't remove > duplicates. > When I ran f4() with larger datasets it poduced an error, > > Error in findInterval(x, vec) : 'vec' must be sorted non-decreasingly > > So here they all are. > > f1 <- function(A, B){ > m <- merge(A, B) > result <- do.call( rbind, lapply(split(m, list(m$DATE, m$TYPE)), > function(x){ > if(nrow(x)){ > a <- abs(x$DATE - x$Special_Date) > x[which.min(a), ] }}) ) > result$Difference <- result$DATE - result$Special_Date > result$Special_Date <- NULL > rownames(result) <- seq_len(nrow(result)) > result > } > > closestValue <- function (x, vec) > { > # for each value in x, find closest value in vec. > # Break ties by using highest. > # Assume vec is sorted. > intervalNo <- findInterval(x, vec) > lowerValue <- vec[pmax(1, intervalNo)] > upperValue <- vec[pmin(length(vec), intervalNo+1)] > ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue) > } > f4 <- function (A, B) { > A$TYPE <- as.factor(A$TYPE) > uA <- levels(A$TYPE) > As <- split(A$DATE, A$TYPE) > B <- B[order(B$TYPE, B$Special_Date), ] > Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA)) > closest <- numeric(nrow(A)) > split(closest, A$TYPE) <- mapply(closestValue, As, Bs) > A$Difference <- A$DATE - closest > A > } > > # Test data, not many types > nA <- 1e3 > nB <- 1e4 > set.seed(1) > ta <- sample(LETTERS, nA, TRUE); da <- sample(1e2, nA, TRUE) > tb <- sample(LETTERS, nB, TRUE); db <- sample(nB, nB, TRUE) > > aa <- data.frame(TYPE = ta, DATE = da) > bb <- data.frame(TYPE = tb, Special_Date = db) > > t1 <- system.time(r1 <- f1(aa, bb)) > t4 <- system.time(r4 <- f4(aa, bb)) > rbind(t1 = t1, t4 = t4) > > sum( duplicated(r4) ) # 165 > > Rui Barradas > Em 19-08-2012 22:58, William Dunlap escreveu: > > And the following, f4, uses the same algorithm as f2 but codes > > it somewhat more efficiently. It uses the same closestValue() > > function. > > f4 <- function (A, B) { > > A$TYPE <- as.factor(A$TYPE) > > uA <- levels(A$TYPE) > > As <- split(A$DATE, A$TYPE) > > Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA)) > > closest <- numeric(nrow(A)) > > split(closest, A$TYPE) <- mapply(closestValue, As, Bs) > > A$Difference <- A$DATE - closest > > A > > } > > > > Bill Dunlap > > Spotfire, TIBCO Software > > wdunlap tibco.com > > > > > >> -----Original Message----- > >> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On > Behalf > >> Of William Dunlap > >> Sent: Sunday, August 19, 2012 1:49 PM > >> To: Francesco; r-help@r-project.org > >> Subject: Re: [R] merging and obtaining the nearest value > >> > >> The following, f2(A,B), should do well with lots of rows in A and B > >> as long as the number of types is not huge. > >> > >> f2 <- function(A, B) { > >> types <- as.character(unique(A$TYPE)) > >> result <- numeric(nrow(A)) > >> Bs <- split(B$Special_Date, B$TYPE) > >> for(type in types) { > >> w <- A$TYPE == type > >> # can omit the sort() below if you know that B$Special_Date is > >> sorted. > >> result[w] <- closestValue(A$DATE[w], sort(Bs[[type]])) > >> } > >> A$Difference <- A$DATE - result > >> A > >> } > >> > >> closestValue <- function (x, vec) > >> { > >> # for each value in x, find closest value in vec. > >> # Break ties by using highest. > >> # Assume vec is sorted. > >> intervalNo <- findInterval(x, vec) > >> lowerValue <- vec[pmax(1, intervalNo)] > >> upperValue <- vec[pmin(length(vec), intervalNo+1)] > >> ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue) > >> } > >> > >> Bill Dunlap > >> Spotfire, TIBCO Software > >> wdunlap tibco.com > >> > >> > >>> -----Original Message----- > >>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] > >>> On > Behalf > >>> Of William Dunlap > >>> Sent: Sunday, August 19, 2012 9:43 AM > >>> To: Francesco; r-help@r-project.org > >>> Subject: Re: [R] merging and obtaining the nearest value > >>> > >>> How many different types are there? Just a handful or many thousands? > >>> For this sort of problem it is often handy to write a function which > >>> generates > >>> datasets of the sort you are thinking of but parameterized by the > >>> number of rows, levels, etc., so you can see how the execution time > >>> varies with these things. > >>> > >>> If there are just a few types, try looping over types and using > >>> findInterval > >>> to see where A$Date fits into the sequence of B$Special_Date. > >>> > >>> > >>> Bill Dunlap > >>> Spotfire, TIBCO Software > >>> wdunlap tibco.com > >>> > >>> > >>>> -----Original Message----- > >>>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] > >>>> On > >> Behalf > >>>> Of Francesco > >>>> Sent: Sunday, August 19, 2012 4:01 AM > >>>> To: r-help@r-project.org > >>>> Subject: Re: [R] merging and obtaining the nearest value > >>>> > >>>> Dear Riu, Many thanks for your suggestion > >>>> > >>>> However these are just simplified examples... in reality the dataset A > >>>> contains millions of observations and B several thousands of rows... > >>>> Could I still use a modified form of your suggestion? > >>>> > >>>> Thanks > >>>> > >>>> On 19 August 2012 12:51, Rui Barradas <ruipbarra...@sapo.pt> wrote: > >>>>> Hello, > >>>>> > >>>>> Try the following. > >>>>> > >>>>> > >>>>> A <- read.table(text=" > >>>>> > >>>>> TYPE DATE > >>>>> A 2 > >>>>> A 5 > >>>>> A 20 > >>>>> B 10 > >>>>> B 2 > >>>>> ", header = TRUE) > >>>>> > >>>>> > >>>>> B <- read.table(text=" > >>>>> > >>>>> TYPE Special_Date > >>>>> A 2 > >>>>> A 6 > >>>>> A 20 > >>>>> A 22 > >>>>> B 5 > >>>>> B 6 > >>>>> ", header = TRUE) > >>>>> > >>>>> result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, > >>>>> m$TYPE)), > >>>>> function(x){ > >>>>> a <- abs(x$DATE - x$Special_Date) > >>>>> if(nrow(x)) x[which(min(a) == a), ] }) ) > >>>>> result$Difference <- result$DATE - result$Special_Date > >>>>> result$Special_Date <- NULL > >>>>> rownames(result) <- seq_len(nrow(result)) > >>>>> result > >>>>> > >>>>> > >>>>> Also, it's a good practice to post data examples using dput(). For > >>>>> instance, > >>>>> > >>>>> dput(A) > >>>>> structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A", > >>>>> "B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = > >>>>> c("TYPE", > >>>>> "DATE"), class = "data.frame", row.names = c(NA, -5L)) > >>>>> > >>>>> Now all we have to do is run the statement A <- structure(... etc...) to > >>>>> have an exact copy of the data example. > >>>>> Anyway, your example with input and the wanted result was very welcome. > >>>>> > >>>>> Hope this helps, > >>>>> > >>>>> Rui Barradas > >>>>> > >>>>> Em 19-08-2012 11:10, Francesco escreveu: > >>>>>> Dear R-help > >>>>>> > >>>>>> Î would like to know if there is a short solution in R for this > >>>>>> merging problem... > >>>>>> > >>>>>> Let say I have a dataset A as: > >>>>>> > >>>>>> TYPE DATE > >>>>>> A 2 > >>>>>> A 5 > >>>>>> A 20 > >>>>>> B 10 > >>>>>> B 2 > >>>>>> > >>>>>> (there can be duplicates for the same type and date) > >>>>>> > >>>>>> and I have another dataset B as : > >>>>>> > >>>>>> TYPE Special_Date > >>>>>> A 2 > >>>>>> A 6 > >>>>>> A 20 > >>>>>> A 22 > >>>>>> B 5 > >>>>>> B 6 > >>>>>> > >>>>>> The question is : I would like to obtain the difference between the > >>>>>> date of each observation in A and the closest special date in B with > >>>>>> the same type. In case of ties I would take the latest date of the > >>>>>> two. > >>>>>> > >>>>>> For example I would obtain here > >>>>>> > >>>>>> TYPE DATE Difference > >>>>>> A 2 0=2-2 > >>>>>> A 5 -1=5-6 > >>>>>> A 20 0=20-20 > >>>>>> B 10 +4=10-6 > >>>>>> B 2 -3=2-5 > >>>>>> > >>>>>> Do you know how to (simply?) obtain this in R? > >>>>>> > >>>>>> Many thanks! > >>>>>> Best Regards > >>>>>> > >>>>>> ______________________________________________ > >>>>>> R-help@r-project.org mailing list > >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help > >>>>>> PLEASE do read the posting guide > >>>>>> http://www.R-project.org/posting-guide.html > >>>>>> and provide commented, minimal, self-contained, reproducible code. > >>>>> > >>>> ______________________________________________ > >>>> R-help@r-project.org mailing list > >>>> https://stat.ethz.ch/mailman/listinfo/r-help > >>>> PLEASE do read the posting guide > >>>> http://www.R-project.org/posting-guide.html > >>>> and provide commented, minimal, self-contained, reproducible code. > >>> ______________________________________________ > >>> R-help@r-project.org mailing list > >>> https://stat.ethz.ch/mailman/listinfo/r-help > >>> PLEASE do read the posting guide > >>> http://www.R-project.org/posting-guide.html > >>> and provide commented, minimal, self-contained, reproducible code. > >> ______________________________________________ > >> R-help@r-project.org mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-help > >> PLEASE do read the posting guide > >> http://www.R-project.org/posting-guide.html > >> and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.