> I need to remove all but one of each [row in a matrix], which
> must be chosen at random.

This request (included in full at the bottom) has gone unanswered for a while, but I had the same problem and ended up writing a function to solve it. I call it "duplicated.random()". It does exactly what the "duplicated()" function does, except that the observation in each set of duplicates that gets a FALSE in the result is chosen at random rather than always being the first. There is no way to specify selection probabilities; each observation in a set of duplicates is equally likely to be chosen.

The implementation permutes the original using "sample()", runs "duplicated()" on the permuted data, and finally reverses the permutation on the result. So the randomization should have properties similar to those of sample(), probably including reproducibility by setting the random seed (although I haven't tested that explicitly).
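
To make the permute / duplicated() / un-permute idea concrete, here is a minimal sketch on a small made-up vector. The variable names (perm, inv, dup) and the seed value are only for illustration; the key point is that order() applied to a permutation gives its inverse, so indexing with it puts the flags back in the original order.

```r
# Minimal sketch, base R only; the data and names are illustrative assumptions.
set.seed(42)                      # arbitrary seed, only to make the run repeatable
x    = c("a", "b", "a", "c", "b")
perm = sample.int(length(x))      # random ordering of the positions 1..n
inv  = order(perm)                # inverse permutation: perm[inv] is 1..n

# Shuffling and then un-shuffling recovers the original vector.
print(identical(x[perm][inv], x))

# Flag duplicates in the shuffled vector, then map the flags back
# to the original positions.
dup = duplicated(x[perm])[inv]

# One randomly chosen element of each distinct value is left unflagged.
print(cbind(x, dup))
print(x[!dup])
```

Which of the two "a" entries (and which of the two "b" entries) survives depends on the permutation drawn, but exactly one of each distinct value is always kept.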

The function and some test code are included below. It handles vectors and matrices for now, but adding other data structures that are handled correctly by duplicated() should be a simple matter of ensuring that the indexing is handled correctly in the permutation process. If anyone makes any improvements to the function, I'd be grateful to be notified.
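
As an illustration of what such an extension might look like, here is a rough sketch for data frames. The name duplicated.random.df is hypothetical and not part of the function below; it assumes that duplicated() on a data frame compares whole rows, and it only changes how the rows are indexed during the permutation.

```r
# Hypothetical data frame variant (a sketch, not part of the original function).
duplicated.random.df = function(x, ...)
{
    stopifnot(is.data.frame(x))
    permutation = sample.int(nrow(x))
    # drop = FALSE keeps a data frame even if x has a single column
    x.perm      = x[permutation, , drop = FALSE]
    result.perm = duplicated(x.perm, ...)
    # order(permutation) is the inverse permutation
    result.perm[order(permutation)]
}

# Made-up example: rows 1 and 2 are identical, as are rows 3 and 4.
df = data.frame(a = c(1, 1, 2, 2, 1),
                b = c("x", "x", "y", "y", "z"))
r  = duplicated.random.df(df)
print(cbind(df, r))
print(df[!r, ])
```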

#############################################################

# This function returns a logical vector, the elements of which
# are FALSE, unless there are duplicated values in x, in which
# case all but one element is TRUE (for each set of duplicates).
# The only difference between this function and the duplicated()
# function is that rather than always returning FALSE for the first
# instance of a duplicated value, the choice of instance is random.
duplicated.random = function(x, incomparables = FALSE, ...)
{
    if ( is.vector(x) )
    {
        # sample.int() also handles zero-length x; sample(0) would return 0
        permutation = sample.int(length(x))
        x.perm      = x[permutation]
        result.perm = duplicated(x.perm, incomparables, ...)
        result      = result.perm[order(permutation)]
        return(result)
    }
    else if ( is.matrix(x) )
    {
        permutation = sample.int(nrow(x))
        # drop = FALSE keeps x.perm a matrix even with one row or column
        x.perm      = x[permutation, , drop = FALSE]
        result.perm = duplicated(x.perm, incomparables, ...)
        result      = result.perm[order(permutation)]
        return(result)
    }
    else
    {
        stop(paste("duplicated.random() only supports vectors and",
                "matrices for now."))
    }
}

#############################################################

# Test code for vector case
x = sample(1:5, 10, replace = TRUE)
d = duplicated(x)
r = duplicated.random(x)
cbind(x,d,r)
x[!d]
x[!r]

# Test code for matrix case
x = matrix(sample(1:2, 30, replace = TRUE), ncol = 3)
d = duplicated(x)
r = duplicated.random(x)
cbind(x,d,r)

#############################################################
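
For the situation in the original request (many rows sharing a gene title, keep one at random), the same idea can be applied to the title column alone. The sketch below is self-contained: random.dups is a hypothetical helper that repeats the permutation trick, and the weights are made up purely for illustration (only the gene titles come from the request). It also checks the seed behaviour by running twice with the same seed.

```r
# Hypothetical helper: TRUE for every element to drop, leaving one randomly
# chosen element per distinct key (permute, duplicated(), un-permute).
random.dups = function(key)
{
    permutation = sample.int(length(key))
    duplicated(key[permutation])[order(permutation)]
}

# Made-up example data; the weights are NOT real P-values.
genes = data.frame(
    title  = c("KCTD12", "UNC93A", "KCTD12", "CDKN3", "UNC93A", "KCTD12"),
    weight = c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06),
    stringsAsFactors = FALSE
)

set.seed(1)                       # arbitrary seed
first.run  = genes[!random.dups(genes$title), ]

set.seed(1)                       # same seed again
second.run = genes[!random.dups(genes$title), ]

print(first.run)
print(identical(first.run, second.run))
```

With the same seed the two runs pick the same rows; without resetting the seed, repeated calls may keep different rows for the duplicated titles.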


On 3/24/2010 11:44 AM, jeff.m.ewers wrote:

Hello,

I am relatively new to R, and I've run into a problem formatting my data for
input into the package RankAggreg.

I have a matrix of gene titles and P-values (weights) in two columns:

KCTD12  4.06904E-22
UNC93A  9.91852E-22
CDKN3   1.24695E-21
CLEC2B  4.71759E-21
DAB2    1.12062E-20
HSPB1   1.23125E-20
...

The data contains many, many duplicate gene titles, and I need to remove all
but one of each, which must be chosen at random. I have looked for quite
some time, and I've been unable to find a way to do this. Any help would be
greatly appreciated!

Thanks,

Jeff

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
