Re: [R] Deleting duplicate rows in a matrix at random

Magnus Torfason Thu, 03 Jun 2010 11:06:08 -0700

> I need to remove all but one of each [row in a matrix], which
> must be chosen at random.

This request (included in full at the bottom) has been unanswered for a while, but I had the same problem and ended up writing a function to solve it. I call it "duplicated.random()" and it does exactly the same thing as the "duplicated()" function, except that the choice of which of the duplicated observations gets a FALSE in the result is random, rather than always being the first. There is no way to specify any distribution probabilities; each duplicated observation is equally likely to be chosen.

The implementation works by permuting the original using "sample()", then running "duplicated()", and finally reversing the permutation on the result. So the randomization should have "similar properties" as sample(), probably including reproducibility by setting the random seed (although I haven't tested that explicitly).
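The "reversing the permutation" step relies on order() returning the inverse of a permutation. A minimal sketch of just that step, using toy data and an arbitrary seed (neither is from the original problem):

```r
# Toy data; the seed value is arbitrary.
set.seed(42)
x <- c("a", "b", "a", "c", "b")

p      <- sample.int(length(x))  # random permutation of 1..n
x_perm <- x[p]                   # x_perm[i] is x[p[i]]
inv    <- order(p)               # inverse permutation: p[inv] is 1..n

# Indexing by inv puts every element back in its original position.
stopifnot(identical(p[inv], seq_along(x)))
stopifnot(identical(x_perm[inv], x))
```

Any per-element result computed on the permuted data (such as the output of duplicated()) can be mapped back to the original order in the same way.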

The function and some test code are included below. It handles vectors and matrices for now, but adding other data structures that are handled correctly by duplicated() should be a simple matter of ensuring that the indexing is handled correctly in the permutation process. If anyone makes any improvements to the function, I'd be grateful to be notified.


#############################################################

# This function returns a logical vector, the elements of which
# are FALSE, unless there are duplicated values in x, in which
# case all but one element is TRUE (for each set of duplicates).
# The only difference between this function and the duplicated()
# function is that rather than always returning FALSE for the first
# instance of a duplicated value, the choice of instance is random.
duplicated.random = function(x, incomparables = FALSE, ...)
{
    if ( is.vector(x) )
    {
        # sample.int() also behaves correctly when length(x) is 0
        permutation = sample.int(length(x))
        x.perm      = x[permutation]
        result.perm = duplicated(x.perm, incomparables, ...)
        result      = result.perm[order(permutation)]
        return(result)
    }
    else if ( is.matrix(x) )
    {
        permutation = sample.int(nrow(x))
        # drop = FALSE keeps a one-row matrix from collapsing to a vector
        x.perm      = x[permutation, , drop = FALSE]
        result.perm = duplicated(x.perm, incomparables, ...)
        result      = result.perm[order(permutation)]
        return(result)
    }
    else
    {
        stop(paste("duplicated.random() only supports vectors",
                "and matrices for now."))
    }
}

#############################################################
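
As one possible extension along the lines mentioned above, the same permutation trick could be applied to the rows of a data frame. This is only a sketch: the helper name duplicated_random_df and the toy data are hypothetical, and it relies on duplicated() having a data frame method.

```r
# Hypothetical helper (not part of the function above): same permutation
# trick, applied to the rows of a data frame.
duplicated_random_df <- function(df, ...) {
    stopifnot(is.data.frame(df))
    permutation <- sample.int(nrow(df))
    df.perm     <- df[permutation, , drop = FALSE]   # permute rows
    result.perm <- duplicated(df.perm, ...)          # data frame method
    result.perm[order(permutation)]                  # back to original order
}

# Toy data with two duplicated rows.
df <- data.frame(g = c("a", "b", "a", "b", "c"),
                 v = c(1, 2, 1, 2, 3))
r <- duplicated_random_df(df)
df[!r, ]   # one row per distinct (g, v) combination
```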

# Test code for vector case
x = sample(1:5, 10, replace = TRUE)
d = duplicated(x)
r = duplicated.random(x)
cbind(x,d,r)
x[!d]
x[!r]

# Test code for matrix case
x = matrix(sample(1:2, 30, replace = TRUE), ncol=3)
d = duplicated(x)
r = duplicated.random(x)
cbind(x,d,r)

#############################################################
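
For the original question, the duplicates are in the gene titles rather than in whole rows, so the title column would be passed as a vector and the result used to index the rows. The sketch below also checks the seed reproducibility mentioned earlier. It is self-contained, so it repeats a condensed copy of the vector branch under the hypothetical name dup_random; the rows are made up (gene titles from the question, with invented duplicates and p-values), and the seed is arbitrary.

```r
# Condensed copy of the vector branch of duplicated.random(), repeated
# here only so that the example runs on its own.
dup_random <- function(x) {
    permutation <- sample.int(length(x))
    duplicated(x[permutation])[order(permutation)]
}

# Made-up rows: gene titles from the original question, with invented
# duplicates and p-values.
genes <- data.frame(
    title = c("KCTD12", "UNC93A", "KCTD12", "CDKN3", "UNC93A", "KCTD12"),
    p     = c(4e-22, 1e-21, 3e-20, 1e-21, 5e-19, 2e-18)
)

set.seed(123)   # arbitrary seed
r1 <- dup_random(genes$title)
set.seed(123)
r2 <- dup_random(genes$title)

stopifnot(identical(r1, r2))   # same seed, same choice of rows
genes[!r1, ]                   # one randomly chosen row per gene title
```

Without the set.seed() calls, repeated runs may keep a different row for each duplicated title.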


On 3/24/2010 11:44 AM, jeff.m.ewers wrote:


Hello,

I am relatively new to R, and I've run into a problem formatting my data for
input into the package RankAggreg.

I have a matrix of gene titles and P-values (weights) in two columns:

KCTD12  4.06904E-22
UNC93A  9.91852E-22
CDKN3   1.24695E-21
CLEC2B  4.71759E-21
DAB2    1.12062E-20
HSPB1   1.23125E-20
...

The data contains many, many duplicate gene titles, and I need to remove all
but one of each, which must be chosen at random. I have looked for quite
some time, and I've been unable to find a way to do this. Any help would be
greatly appreciated!

Thanks,

Jeff


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
