Re: [R] Fastest way to compare a single value with all values in one column of a data frame

Dimitri Liakhovitski Fri, 01 Feb 2013 14:08:43 -0800

I've compared the solutions.

*Solution 1:*
myf <- function( df1, df2 ){
  cond <- df2$a > min(df1$a)
  if( cond )
  {
    idx <- which( df1$a == min(df1$a) )
    df1[idx, ] <- df2[1, ]
  }
  df1
}


# On a larger example,
set.seed(4530)
tst <- data.frame(item = 1:1000,a = rnorm(1000),b = rnorm(1000)) # large data frame
u<-tst
system.time(
for(i in 1:100000){
  y<-data.frame(item=(1000+i),a=rnorm(1),b=rnorm(1)) # small data frame, every time new
  u <- myf(u, y)
})

Took me about 31.90 sec
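
A possible variant of Solution 1, offered as an untimed sketch rather than a measured result: which.min() returns the position of the first minimum, so column a is scanned once instead of twice (min() and then ==). The name myf2 is made up for illustration. Unlike myf, it replaces only the first row when several rows tie for the minimum, which should not matter if the values in a are never identical.

```r
# Hypothetical variant of myf(): locate the minimum once with which.min().
myf2 <- function(df1, df2) {
  idx <- which.min(df1$a)          # position of the first smallest value in a
  if (df2$a[1] > df1$a[idx]) {     # same test as y$a > min(x$a)
    df1[idx, ] <- df2[1, ]         # overwrite that row with the new row
  }
  df1
}

# Small check on the data from the original question
x <- data.frame(item = letters[1:5], a = 1:5, b = 11:15, stringsAsFactors = FALSE)
y <- data.frame(item = "f", a = 3, b = 10, stringsAsFactors = FALSE)
res <- myf2(x, y)
```

Whether this is actually faster inside the 100000-iteration loop would have to be checked with system.time(), since the data frame row assignment may dominate the cost.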

*Solution 2:*
set.seed(4530)
x <- data.frame(item = 1:1000,a = rnorm(1000),b = rnorm(1000)) # large data frame
u <- x # work on a copy, as in Solution 1
system.time(
for(i in 1:100000){
  y<-data.frame(item=(1000+i),a=rnorm(1),b=rnorm(1)) # small data frame, every time new
  u[intersect(which(u$a < y$a),which.min(u$a)),] <- y
})

The solution is correct (despite warnings) but took longer - about 48.84 sec.
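
A guess about the warnings, not something verified in this thread: when y$a is not larger than the minimum, intersect() returns an empty index, and the one-row y is then assigned to zero rows. A minimal sketch that skips the assignment in that case (replace_min is a hypothetical helper name, and the small data below are invented for the check):

```r
# Sketch: perform the replacement only when the index is non-empty.
replace_min <- function(u, y) {     # hypothetical helper name
  idx <- intersect(which(u$a < y$a), which.min(u$a))
  if (length(idx) > 0) {            # empty when y$a is not above the minimum
    u[idx, ] <- y
  }
  u
}

# Invented example data
u <- data.frame(item = 1:5, a = c(0.5, -1.2, 2.0, 0.1, 1.5), b = 0)
y_hi <- data.frame(item = 6, a = 0.0, b = 1)    # above the minimum: replaces row 2
y_lo <- data.frame(item = 7, a = -5.0, b = 1)   # below the minimum: no change
u_hi <- replace_min(u, y_hi)
u_lo <- replace_min(u, y_lo)
```

The guard only avoids the empty assignment; it does not change which row is replaced.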

Dimitri


On Wed, Jan 30, 2013 at 3:27 PM, Dimitri Liakhovitski <
dimitri.liakhovit...@gmail.com> wrote:

> In reality, the values in a will not be integers but numeric. They will never
> be identical, but they could be pretty close - I don't know how many decimal
> places matter.
> Dimitri
>
>  On Wed, Jan 30, 2013 at 2:06 PM, arun <smartpink...@yahoo.com> wrote:
>
>> Hi,
>> Any chance x$a to have the same number repeated?
>>
>> If `Item` and `a` are unique,  I guess both the solutions should work.
>>
>> set.seed(1851)
>> x<- data.frame(item=sample(letters[1:20],20,replace=F),a=sample(1:45,20,replace=F),b=sample(20:50,20,replace=F),stringsAsFactors=F)
>> y<- data.frame(item="z",a=3,b=10,stringsAsFactors=F)
>>
>> x[intersect(which(x$a < y$a),which.min(x$a)),]
>>  #  item a  b
>> #17    c 1 48
>>  x[x$a==min(x$a[x$a<y$a]),]
>> #   item a  b
>> #17    c 1 48
>> #or
>>
>> x[x$a%in%min(x$a[x$a<y$a]),]
>> #   item a  b
>> #17    c 1 48
>>
>> x[x$a%in%min(x$a[x$a<y$a]),]<-y
>>
>> tail(x)
>> #   item  a  b
>> #15    q 45 30
>> #16    g 10 23
>> #17    z  3 10
>> #18    r 15 39
>> #19    l 18 45
>> #20    t 35 33
>>
>> # However, if the `item` column is unique but `a` is not, then the problem I
>> # mentioned previously arises.
>> set.seed(1851)
>> x1<- data.frame(item=sample(letters[1:20],20,replace=F),a=sample(1:10,20,replace=T),b=sample(20:50,20,replace=F),stringsAsFactors=F)
>> y1<- data.frame(item="z",a=3,b=10,stringsAsFactors=F)
>>
>>
>> x1[intersect(which(x1$a < y1$a),which.min(x1$a)),]
>>  # item a  b
>> #3    s 1 41
>> x1[x1$a==min(x1$a[x1$a<y1$a]),]
>>  #  item a  b
>> #3     s 1 41
>> #11    h 1 46
>> #17    c 1 48
>> x1[x1$a==min(x1$a[x1$a<y1$a]),]<- y1
>> A.K.
>>
>>
>> ________________________________
>> From: Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com>
>> To: arun <smartpink...@yahoo.com>
>> Cc: R help <r-help@r-project.org>; Jessica Streicher <
>> j.streic...@micromata.de>
>> Sent: Wednesday, January 30, 2013 1:49 PM
>> Subject: Re: [R] Fastest way to compare a single value with all values in
>> one column of a data frame
>>
>>
>> Sorry - I should have clarified:
>> My identifiers (in column "item") will always be unique. In other words,
>> one entry in column "item" will never be repeated - neither in x nor in y.
>> Dimitri
>>
>>
>> On Wed, Jan 30, 2013 at 1:27 PM, Dimitri Liakhovitski <
>> dimitri.liakhovit...@gmail.com> wrote:
>>
>> Thank you, everyone! I'll try to test those different approaches. Really
>> appreciate your help!
>> >Dimitri
>> >
>> >
>> >On Wed, Jan 30, 2013 at 11:03 AM, arun <smartpink...@yahoo.com> wrote:
>> >
>> >HI,
>> >>
>> >>Sorry, my previous solution doesn't work.
>> >>This should work for your dataset:
>> >>set.seed(1851)
>> >>x<- data.frame(item=sample(letters[1:5],20,replace=TRUE),a=sample(1:15,20,replace=TRUE),b=sample(20:30,20,replace=TRUE),stringsAsFactors=F)
>> >>y<- data.frame(item="f",a=3,b=10,stringsAsFactors=F)
>> >> x[x$a%in%which.min(x[x$a<y$a,]$a),]<- y #if there are multiple minimum values
>> >>
>> >>set.seed(1241)
>> >>x1<- data.frame(item=sample(letters[1:10],1e4,replace=TRUE),a=sample(1:30,1e4,replace=TRUE),b=sample(1:100,1e4,replace=TRUE),stringsAsFactors=F)
>> >>y1<- data.frame(item="f",a=3,b=10,stringsAsFactors=F)
>> >>length(x1$a[x1$a==1])
>> >>#[1] 330
>> >> system.time({x1[x1$a%in%which.min(x1[x1$a<y1$a,]$a),]<- y1})
>> >>#   user  system elapsed
>> >> # 0.000   0.000   0.001
>> >>length(x1$a[x1$a==1])
>> >>#[1] 0
>> >>
>> >>
>> >># For some reason, it does not work when the number of tied minimum values exceeds some threshold
>> >>
>> >>set.seed(1241)
>> >>x1<- data.frame(item=sample(letters[1:10],1e5,replace=TRUE),a=sample(1:30,1e5,replace=TRUE),b=sample(1:100,1e5,replace=TRUE),stringsAsFactors=F)
>> >>y1<- data.frame(item="f",a=3,b=10,stringsAsFactors=F)
>> >>length(x1$a[x1$a==1])
>> >>#[1] 3404
>> >>x1[x1$a%in%which.min(x1[x1$a<y1$a,]$a),]<- y1
>> >> length(x1$a[x1$a==1])
>> >>#[1] 3404 #not getting replaced
>> >>
>> >>#However, if I try:
>> >>set.seed(1241)
>> >> x1<- data.frame(item=sample(letters[1:10],1e6,replace=TRUE),a=sample(1:5000,1e6,replace=TRUE),b=sample(1:100,1e6,replace=TRUE),stringsAsFactors=F)
>> >> y1<- data.frame(item="f",a=3,b=10,stringsAsFactors=F)
>> >> length(x1$a[x1$a==1])
>> >>#[1] 208
>> >> system.time(x1[x1$a%in%which.min(x1[x1$a<y1$a,]$a),]<- y1)
>> >>#user  system elapsed
>> >> # 0.124   0.016   0.138
>> >>  length(x1$a[x1$a==1])
>> >>#[1] 0
>> >>
>> >>
>> >>#Tried Jessica's solution:
>> >>set.seed(1851)
>> >> x<- data.frame(item=sample(letters[1:5],20,replace=TRUE),a=sample(1:15,20,replace=TRUE),b=sample(20:30,20,replace=TRUE),stringsAsFactors=F)
>> >> y<- data.frame(item="f",a=3,b=10,stringsAsFactors=F)
>> >> x[intersect(which(x$a < y$a),which.min(x$a)),] <- y
>> >>
>> >> x
>> >>#   item  a  b
>> >>#1     a  8 25
>> >>#2     a 10 26
>> >>#3     f  3 10 #replaced
>> >>#4     e 15 26
>> >>#5     b 13 20
>> >>#6     a  5 23
>> >>#7     d  4 29
>> >>#8     e  2 24
>> >>#9     c  7 30
>> >>#10    e 14 24
>> >>#11    d  2 20
>> >>#12    e 10 21
>> >>#13    c 13 27
>> >>#14    d 12 23
>> >>#15    b 11 26
>> >>#16    e  5 22
>> >>#17    c  1 26  #it is not replaced
>> >>#18    a  8 21
>> >>#19    e 10 26
>> >>#20    c  2 22
>> >>
>> >>
>> >>
>> >>
>> >>A.K.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>----- Original Message -----
>> >>From: Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com>
>> >>To: r-help <r-help@r-project.org>
>> >>Cc:
>> >>Sent: Tuesday, January 29, 2013 4:11 PM
>> >>Subject: [R] Fastest way to compare a single value with all values in
>> one column of a data frame
>> >>
>> >>
>> >>Hello!
>> >>
>> >>I have a large data frame x:
>> >>x<-data.frame(item=letters[1:5],a=1:5,b=11:15)  # in actuality, x has 1000 rows
>> >>x$item<-as.character(x$item)
>> >>I also have a small data frame y with just 1 row:
>> >>y<-data.frame(item="f",a=3,b=10)
>> >>y$item<-as.character(y$item)
>> >>
>> >>I have to decide if y$a is larger than the smallest of all the values in
>> >>x$a. If it is, I want y to replace the whole row in x that has the lowest
>> >>value in column a.
>> >>This is how I'd do it.
>> >>
>> >>if(y$a>min(x$a)){
>> >>  whichmin<-which(x$a==min(x$a))
>> >>  x[whichmin,]<-y[1,]
>> >>}
>> >>
>> >>
>> >>I am wondering if there is a faster way of doing it. What would be the
>> >>fastest possible way? I'd have to do it, unfortunately, many-many times.
>> >>
>> >>Thank you very much!
>> >>
>> >>--
>> >>Dimitri Liakhovitski
>> >>
>> >>gfk.com <http://marketfusionanalytics.com/>
>> >>
>> >>    [[alternative HTML version deleted]]
>> >>
>> >>______________________________________________
>> >>R-help@r-project.org mailing list
>> >>https://stat.ethz.ch/mailman/listinfo/r-help
>> >>PLEASE do read the posting guide
>> >>http://www.R-project.org/posting-guide.html
>> >>and provide commented, minimal, self-contained, reproducible code.
>> >>
>> >>



-- 
Dimitri Liakhovitski
gfk.com <http://marketfusionanalytics.com/>

