Hi all,

I think I have figured out the problem. I have been a MATLAB user, so in all my code I kept everything in as.matrix format, and unique is much slower on a matrix than on a plain vector.
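
For anyone who wants to compare the two cases, here is a minimal sketch. It assumes a made-up vector in place of my real data, and the names `x_vec`, `x_mat`, and `x_back` are hypothetical. On a matrix, `unique()` dispatches to `unique.matrix`, which looks for unique rows rather than unique values. How large the gap is may depend on the R version, so no timings are claimed here.

```r
# Hypothetical stand-in for the real data (much smaller than 13.5M rows).
set.seed(1)
x_vec <- sample(100000, size = 1e6, replace = TRUE)  # plain vector
x_mat <- as.matrix(x_vec)                            # k-by-1 matrix, MATLAB style

# unique() on a vector dispatches to unique.default
t_vec <- system.time(u_vec <- unique(x_vec))

# unique() on a matrix dispatches to unique.matrix (unique rows)
t_mat <- system.time(u_mat <- unique(x_mat))

# Compare the elapsed times on your own machine
print(rbind(vector = t_vec, matrix = t_mat)[, "elapsed"])

# Both give the same values; the matrix result keeps its one-column shape
print(identical(u_vec, as.vector(u_mat)))

# Taking the single column turns the matrix back into a plain vector
x_back <- x_mat[, 1]
print(identical(x_back, x_vec))
```

If the data arrives as a one-column matrix, `x_mat[, 1]` or `as.vector(x_mat)` gives the plain vector before calling `unique()`.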
I tried dropping the as.matrix conversion, and now unique takes just a few seconds, as do the other computations.

Thanks a lot Duncan, Steve, David, and Douglas. Hopefully this case can also help future MATLAB-to-R users who get stuck in the MATLAB way of thinking.

Gang

On Mon, Jun 21, 2010 at 7:01 PM, Douglas Bates <ba...@stat.wisc.edu> wrote:
> On Mon, Jun 21, 2010 at 8:38 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>>
>> On Jun 21, 2010, at 9:18 PM, Duncan Murdoch wrote:
>>
>>> On 21/06/2010 9:06 PM, G FANG wrote:
>>>>
>>>> Hi,
>>>>
>>>> I want to get the unique set from a large numeric k by 1 vector, k is
>>>> in tens of millions
>>>>
>>>> when I used the matlab function unique, it takes less than 10 secs
>>>>
>>>> but when I tried to use the unique in R with similar CPU and memory,
>>>> it is not done in minutes
>>>>
>>>> I am wondering, am I using the function in the right way?
>>>>
>>>> dim(cntxtn)
>>>> [1] 13584763 1
>>>> uniqueCntxt = unique(cntxtn); # this is taking really long
>>>
>>> What type is cntxtn? If I do that sort of thing on a numeric vector, it's
>>> quite fast:
>>>
>>> > x <- sample(100000, size=13584763, replace=T)
>>> > system.time(unique(x))
>>>    user  system elapsed
>>>    3.61    0.14    3.75
>>
>> If it's a factor, it could be as simple as:
>>
>> levels(cntxtn) # since the work of "unique-ification" has already been
>> done.
>
> Not quite. When you generate a factor, as you do in your example, the
> levels correspond to the unique values of the original vector. But
> when you take a subset of a factor the levels are preserved intact,
> even if some of those levels do not occur in the subset. This is why
> there are unusual arguments with names like drop.unused.levels in
> functions like model.frame. It is also a subtle difference in the
> behavior of factor(x) and as.factor(x) when x is already a factor.
>
> > ff <- factor(sample.int(200, 1000, replace = TRUE))
> > ff1 <- ff[1:40]
> > length(levels(ff))
> [1] 199
> > length(levels(ff1))
> [1] 199
> > length(levels(as.factor(ff1)))
> [1] 199
> > length(levels(factor(ff1)))
> [1] 34
>
>> > x <- factor(sample(100000, size=13584763, replace=T))
>> > system.time(levels(x))
>>    user  system elapsed
>>       0       0       0
>> > system.time(y <- levels(x))
>>    user  system elapsed
>>       0       0       0
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.