Re: [R] which rows are duplicates?

Wacek Kusnierczyk Tue, 31 Mar 2009 05:44:10 -0700

Dimitris Rizopoulos wrote:
>> another approach (maybe a bit cleaner) seems to be:
>>
>> data <- data.frame(x=sample(1:2, 5, replace=TRUE), y=sample(1:2, 5,
>> replace = TRUE))
>>
>> vals <- do.call('paste', c(data, sep = '\r'))
>> data$class <- match(vals, unique(vals))
>> data
>>
>>
>> I have tried benchmarking it.
>
> sorry, I wanted to write: I have *not* tried benchmarking it.



    # dummy data frame, just integers
    n = 100; m = 100
    data = as.data.frame(
        matrix(nrow=n, ncol=m,
            sample(n, m*n, replace=TRUE)))

    # run a simple benchmark (needs the rbenchmark package)
    library(rbenchmark)
    benchmark(
        replications=100, 
        order='elapsed', 
        columns=c('test', 'elapsed'),
        waku=local({
            rows = do.call('paste', c(data, sep='\r'))
            data$class = with(
                rle(sort(rows)),
                rep(1:length(values), lengths)[rank(rows)] ) }),
        diri=local({
            values = do.call('paste', c(data, sep='\r'))
            data$class = match(values, unique(values)) }) )

        #  test elapsed
        # 2 diri    0.43
        # 1 waku    0.52


dimitris' match-based version is comparable in speed for m=n=100 (and
even faster for n >> m), but the code is way cleaner, and the class ids
are now better sorted (numbered in order of first appearance).  that's
collaborative problem solving ;)
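
to make the remark about the ordering of the class ids concrete, here is a minimal sketch on a tiny hand-made data frame. the data frame `d` and the names `ids_sorted` and `ids_first` are made up for illustration and are not from the thread, and the sketch assumes no cell contains '\r'. both approaches group the rows identically; they should differ only in how the groups are numbered.

```r
# tiny hand-made example; d, ids_sorted and ids_first are illustrative names
d <- data.frame(x = c(2, 1, 2, 1, 2), y = c(1, 1, 1, 2, 1))

# one string key per row; assumes no cell contains '\r'
rows <- do.call('paste', c(d, sep = '\r'))

# rle/rank approach: ids follow the sorted order of the row keys
# (tied ranks are averaged, and indexing truncates them back into the run)
ids_sorted <- with(
    rle(sort(rows)),
    rep(seq_along(values), lengths)[rank(rows)])

# match/unique approach: ids follow the order of first appearance
ids_first <- match(rows, unique(rows))

# same grouping, different numbering
cbind(d, ids_sorted, ids_first)
```

with this data, ids_sorted comes out as 3 1 3 2 3 and ids_first as 1 2 1 3 1: rows 1, 3 and 5 share a class either way, only the labels differ.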

best,
vQ

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
