Rui,

I just want to discuss this point you made for now:
        2.
        Systematic use of apply(., 1, mean).
        rowMeans (and colMeans) are much faster.

You are, of course, technically correct. I am considering an alternative aspect, 
in which you are teaching a programming language as a small set of "commands" 
that can do quite a bit in various legal combinations. These are often reserved 
words and primitives like "if" or operators like "<-" and so on.

You can then discuss various built-in functionality that extends your abilities, 
such as the array of built-in functions in common use. I see these in several 
categories. Some can be examined in R at the prompt to see the R code that 
implements them. You can learn how such a function works and perhaps cannibalize 
it to make your own code that keeps the helpful parts and avoids functionality 
you do not need.
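
For example, typing a function's name without parentheses at the prompt prints 
its definition when the function is written in R; sd() is one such case, and 
prints something like:

sd
# function (x, na.rm = FALSE)
# sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
#     na.rm = na.rm))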

The other kind is relatively hidden, either as a .Internal or as a function 
that, after some processing, calls .Internal(...). These can be written in 
various (usually compiled) languages such as some version of C, or pulled from 
libraries such as FORTRAN mathematical routines. They are sometimes faster 
partly because they are not being read and interpreted, and perhaps because 
they use cute programming tricks.
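
mean() is an instance of the hidden kind: in current R versions, the visible 
wrapper mean.default() does some argument checking and its final line hands 
off to compiled code.

body(mean.default)   # ends in .Internal(mean(x)), the jump into C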

So, as I see it, whatever teacher or book or course the OP is using, it is 
presumably teaching ways to approach problems at a level where the aim is to 
learn about ideas. The goal is not just solving one problem, but adding to your 
tool bag for additional problems. And, in R, some of those ideas revolve around 
how to do implicit loops and, in some sense, vectorized operations.

So, if I understood it, the assignment involved taking some number of 
equal-sized vectors and, instead of working on them one at a time, assembling 
them into a collection. Besides generalized lists, R supports data frames, 
implemented as a list of vectors (or, actually, of other things too), and 
matrices, implemented as vectors with attributes, along with some other 
structures.
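
A quick sketch of both representations at the prompt:

v1 <- 1:4; v2 <- 5:8
m <- rbind(v1, v2)        # a matrix: one vector plus attributes
attributes(m)             # just dim (and, from rbind, dimnames)
d <- data.frame(v1, v2)   # a data frame: a list of equal-length columns
is.list(d)                # TRUE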

The assignment seems to ask them to assemble the vectors into the rows of a 
matrix object and then use something that operates on one row at a time, as 
already discussed. Clearly, someone has already made implementations that let 
you avoid calling apply() directly and get the same result faster, using 
rowMeans. But note that although a few other row-wise operations are supplied 
this way, such as rowSums, in the general case there are many possible 
functions with no pre-built equivalent.
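
The equivalence is easy to verify on a throwaway matrix:

m <- matrix(rnorm(20), nrow = 4)
all.equal(apply(m, 1, mean), rowMeans(m))   # TRUE: same result, faster path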

If I want something like a rowMinMax, or something exotic where I treat each 
row as the coefficients of a polynomial to evaluate at some given value of x, 
then a less efficient solution can easily be put together using something like 
apply(); or, with other data structures, using cousins like 
lapply/sapply/vapply; and there are quite a few other methods in packages, 
including functions like pmap and interesting tools in dplyr such as rowwise() 
that make per-row calculations easy.
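
For instance (rowMinMax is a made-up name here, and the polynomial rows are 
purely illustrative):

m <- matrix(rnorm(12), nrow = 3)
rowMinMax <- function(mat) t(apply(mat, 1, range))  # min and max per row
rowMinMax(m)

x <- 2   # treat each row as coefficients c0, c1, c2, ... and evaluate at x
apply(m, 1, function(coefs) sum(coefs * x^(seq_along(coefs) - 1)))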

There are usually many ways to do something in R, as in many languages, and a 
solution that is highly efficient in one situation is not always the best or 
most efficient in another. An example is how some forms of looping are better 
suited to cases where the number of items is not known in advance. Conversely, 
a compiled language will often optimize away a loop over a fixed small number 
of items, such as an index running from 1 to 2 to calculate the squares of the 
index and store them in an array of two. The loop may be replaced by two direct 
function calls, or the calculation may even be done at compile time, with a 
data structure holding the results inserted so that nothing major happens at 
run time.
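
A tiny sketch of the case a loop handles naturally, where the number of 
iterations is not known in advance:

n <- 0
repeat {                       # how many draws until one exceeds 0.95?
  n <- n + 1
  if (runif(1) > 0.95) break
}
n   # no fixed-size result could have been preallocated here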

In this case, the OP may actually have a small, fixed number of such results, 
and possibly there are faster ways to do the calculation. But if the point is 
to learn how to do arbitrary calculations where the size of the parts may vary, 
then this is reasonable. Of course, if the samples being used are not all the 
same length, this method can fail badly while other methods still work well.
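
For example, a list plus sapply() copes with unequal lengths where the matrix 
route would recycle or error:

samples <- list(a = rnorm(5), b = rnorm(8), c = rnorm(3))
sapply(samples, mean)                # one mean per vector, lengths differ
# matrix(unlist(samples), nrow = 3)  # would recycle with a warning -- wrong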

I note, relatedly, that one of us here has expressed a wish to keep their code 
as close to base R as possible rather than jumping right away into the 
tidyverse. I can respect that, even though I personally prefer whatever works 
for me faster and more easily. At least, that is true when I am prototyping or 
doing something one-off. If the code turns out to be used a lot and is slow, I 
might reconsider. But, having said that, I suspect some packages may allow 
faster code than the basics. Of course, things like the built-in native pipe |> 
can change things, so a package built around a slower pipe may no longer ...
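
For what it's worth, a minimal sketch of the two pipes (assuming the magrittr 
package is installed):

1:10 |> mean()    # native pipe, R >= 4.1, rewritten by the parser itself
library(magrittr)
1:10 %>% mean()   # magrittr pipe, an ordinary function call at run time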

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Rui Barradas
Sent: Saturday, March 15, 2025 5:28 PM
To: Kevin Zembower <ke...@zembower.org>; r-help@r-project.org
Subject: Re: [R] What don't I understand about sample()?

Hello,

I have been following this thread and though answers have been given, 
some of them address R coding in general, not necessarily the sample() 
function. Here are some random notes I think the OP could use, prompted 
by the text linked to, chap3.pdf.

1.
Throughout the text, assignments use the equal sign instead of the left 
arrow. The left arrow is generally considered more idiomatic and there 
is an important difference between the two; see help("assignOps").


time.mean = with(CommuteAtlanta, mean(Time))
B = 1000
n = nrow(CommuteAtlanta)

# This should be used, not the above.
time.mean <- with(CommuteAtlanta, mean(Time))
B <- 1000
n <- nrow(CommuteAtlanta)
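
# One place the difference bites: inside a function call, '=' names an
# argument while '<-' assigns in the calling environment. A sketch to try
# at a fresh prompt, assuming no x exists yet.
median(x = 1:10)   # x is only the argument name; no x is created
exists("x")        # FALSE
median(x <- 1:10)  # assigns x in the workspace as a side effect
exists("x")        # TRUE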


2.
Systematic use of apply(., 1, mean).
rowMeans (and colMeans) are much faster.


boot.statistics <- apply(boot.samples, 1, mean)
boot.statistics <- rowMeans(boot.samples)
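
# The difference is easy to time; figures vary by machine, but the gap is
# typically large (illustrative sketch only).
boot.samples <- matrix(rnorm(1000 * 500), 1000, 500)
system.time(apply(boot.samples, 1, mean))   # interpreted loop over rows
system.time(rowMeans(boot.samples))         # compiled, much faster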


3.
The first confidence interval computation seems awkward. I had never 
seen this way of computing a CI95.
Moreover, it's plain common sense to keep the results at full precision 
and round for display purposes only.
And the normal intervals are computed in a more usual way later in the 
text, see sections 1.2 and 1.3.


me <- ceiling(10 * 2 * time.se)/10
round(time.mean, 1) + c(-1, 1) * me

# Straightforward.
normal_ci95 <- time.mean + c(-1, 1) * 2 * time.se
normal_ci95
round(normal_ci95, 1)

# section 1.2, function boot.mean
interval = mean(x) + c(-1,1)*2*se

# section 1.3
with(students, mean(Height) + c(-1, 1) * 2 * sd(result))



4.
In section 1.2 there is a bootstrap function boot.mean().
The function could be improved to let users pass a conf.level of their 
choice.
And why force the function user to always have the plot displayed? Base 
functions hist() and barplot() have an argument 'plot' with default TRUE, 
allowing the user to choose.

The following seems more user friendly.


boot.mean <- function(x, B, binwidth = NULL, conf.level = 0.95,
                      plot = TRUE) {
   require(ggplot2)
   n <- length(x)
   boot.samples <- matrix(sample(x, size = n*B, replace = TRUE), B, n)
   boot.statistics <- rowMeans(boot.samples)
   se <- sd(boot.statistics)
   if (is.null(binwidth))
     binwidth <- diff(range(boot.statistics))/30

   p <- ggplot(data.frame(x = boot.statistics), aes(x = x)) +
     geom_histogram(aes(y = after_stat(density)), binwidth = binwidth) +
     geom_density(color = "red")

   alpha <- 1 - (1 - conf.level)/2
   interval <- mean(x) + c(-1, 1) * qnorm(alpha) * se

   if (plot) {
     plot(p)
   }

   list(boot.statistics = boot.statistics, interval = interval, se = se,
        plot = p)
}
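
Called, for example, as (CommuteAtlanta as in the linked text):

res <- with(CommuteAtlanta,
            boot.mean(Time, B = 1000, conf.level = 0.99, plot = FALSE))
res$interval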


Hope this helps,

Rui Barradas



On 15/03/2025 at 17:28, Kevin Zembower via R-help wrote:
> Hi, Richard, thanks for replying. I should have mentioned the third
> edition, which we're using. The data file didn't change between the
> second and third editions, and the data on Body Mass Gain was the same
> as in the first edition, although the first edition data file contained
> additional variables.
> 
> According to my text, the BMGain was measured in grams. Thanks for
> pointing out that my statement of the problem lacked crucial
> information.
> 
> The matrix in my example comes from an example in
> https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where the author
> created a bootstrap example with a matrix that consisted of one row for
> every sample in the bootstrap, and one column for each mean in the
> original data. This allowed him to find the mean for each row to create
> the bootstrap statistics.
> 
> The only need for the tidyverse is to use the read_csv() function. I'm
> regrettably lazy in not determining which of the multiple packages in
> the tidyverse collection provides read_csv(), and just load the whole thing.
> 
> Thanks, again, for helping me to further understand R and this problem.
> 
> -Kevin
> 
> On Sat, 2025-03-15 at 12:00 +0100, r-help-requ...@r-project.org wrote:
>> Not having the book (and which of the three editions are you using?),
>> I downloaded the data and played with it for a bit.
>> dotchart() showed the Dark and Light conditions looked quite
>> different, but also showed that there are not very many cases.
>> After trying t.test, it occurred to me that I did not know whether
>> "BMGain" means gain in *grams* or gain in *percent*.
>> Reflection told me that for a growth experiment, percent made more
> >> sense, which reminded me of one of my first
>> student advising experiences, where I said "never give the computer
>> percentages; let IT calculate the percentages
>> from the baseline and outcome, because once you've thrown away
>> information, the computer can't magically get it back."
>> In particular, in the real world I'd be worried about the possibility
>> that there was some confounding going on, so I would
>> much rather have initial weight and final weight as variables.
>> If BMGain is an absolute measure, the p value for a t test is teeny
>> tiny.
>> If BMGain is a percentage, the p value for a sensible t test is about
>> 0.03.
>>
>> A permutation test went like this.
>> is.light <- d$Group == "Light"
>> is.dark <- d$Group == "Dark"
>> score <- function (g) mean(g[is.light]) - mean(g[is.dark])
>> base.score <- score(d$BMGain)
>> perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
>> sum(perm.scores >= base.score) / length(perm.scores)
>>
>> I don't actually see where matrix() comes into it, still less
>> anything
>> in the tidyverse.
>>
> 
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
