On May 18, 2012, at 6:37 AM, Axel Urbiz wrote:

Would I be able to accomplish the same if x.sample was created from x
instead of x.sorted. The problem is that in my real problem, I have to sort with respect to many variables and thus keep the sample indexes consistent
across variables. So I need to first take the sample and then sort it
with respect to potentially any variable.

Either of the strategies I suggested should be generalizable to many sort criteria. It should be possible to work with indices that point back to the source of the sample. (Feel free to post an example. At the moment your requirements seem a bit vague.)

--
David.

Thanks again,
Axel.

On Thu, May 17, 2012 at 1:43 PM, Petr Savicky <savi...@cs.cas.cz> wrote:

On Thu, May 17, 2012 at 06:45:52AM -0400, Axel Urbiz wrote:
Dear List,

Is there a way I can sort a sample based on a sort index constructed from the data from which the sample is taken? Basically, I need to take 'many' samples from the same source data and sort them. This can be very time consuming for long vectors. Is there any way I can sort the data only
once
initially, and use that sort order for the samples?

I believe that idea is what is implemented in tree-based classifiers, so the data is sorted only once initially and that sort order is used for
the
child nodes.


set.seed(12345)
x <- sample(0:100, 10)
x.order <- order(x)
x.sorted <- x[x.order]

sample.ind <- sample(1:length(x), 5, replace = TRUE) #sample 1/2 size
with replacement
x.sample <- x[sample.ind]

x.sample.sorted <-   #??? (without sorting again)

Hi.

Formally, it is possible to avoid sorting using tabulate() and rep().
However, i am not sure, whether this approach is more efficient.

set.seed(12345)
x <- sample(0:100, 10)
x.order <- order(x)
x.sorted <- x[x.order]

sample.ind <- sample(1:length(x), 5, replace = TRUE) #sample 1/2 size
with replacement

 x.sample <- x.sorted[sample.ind]
freq <- tabulate(sample.ind, nbins=length(x))
x.sample.sorted <- rep(x.sorted, times=freq)

identical(sort(x.sample), x.sample.sorted) # [1] TRUE

Note that x.sample is created from x.sorted in order to make x.sample
and x.sample.sorted consistent. Since sample.ind has random order, the
distributions of x[sample.ind] and x.sorted[sample.ind] are the same.

Computing the frequencies of indices, whose range is known in advance,
can be done in linear time, so theoretically more efficiently than
sorting. However, only a test may determine, what is more efficient in
your situation.

Hope this helps.

Petr Savicky.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to