Re: [Rd] Return/print standard error in t.test()

2019-02-22 Thread Martin Maechler
> Thomas J Leeper 
> on Thu, 21 Feb 2019 22:21:21 + writes:

> Hi John,
> Thanks for your reply. Of course I could write a package and of course I
> would find that trivial to do. The point is this is a main entry point to 
R
> for probably (at this point) hundreds of thousands of students. I’d like
> them to be able to get a basic quantity of interest from a t-test without
> four subsequent function calls.

> I also don’t really see the point about the object class, given we’re
> talking S3. print() doesn’t have to print everything in the object (see
> e.g., print.lm() ), so there should be little harm in returning additional
> information when relevant. Leaving the print() method unchanged and simply
> returning the SE as an additional element should affect almost nothing.

> I’m all for continuity and conservative development, but we also should 
aim
> to make R as useful and usable as possible. This seems like a nice simple
> way to do that.

I agree with both John Fox and Thomas Leeper
(well, with a subset of their union ;-)

John made the point [and Rui nicely showed how to implement it]
that *printing* such a standard error in addition to the other
stats needs potentially more changes via a specialized print() method.

Also, R in general has refused for good reasons to behave like
old batch stats software and printing all possibly interesting
output.
.. and I'd really like us to stay in that tradition and hence
*not* print such extra numbers.

Also, if your students learn slightly more about stats, they are
hopefully taught that the t.test is a simple special case of
linear (gaussian) regression, and you can teach them the
corresponding

   summary(lm( .. ))

which gives identical t-stat and p-value and does print SEs.

OTOH, I agree with Thomas (and IIRC earlier correspondents on
this issue) that it does seem natural to return the SE here, as
it is crucially used in the formula for the t-stat anyway, and
people can use it to easily compute confidence intervals (or
p-values if really desired) for other levels than 95% / 5% ..

So adding another component to the list returned by t.test()
seems fine to me, and hopefully saves us future e-mails on the topic
  [... well almost surely there will be those asking us to
   change the print() method too, but we'll survive that.]

Martin


> On Thu, 21 Feb 2019 at 21:51 Fox, John  wrote:

>> Dear Thomas,
>> 
>> it is, unfortunately, not that simple. t.test() returns an object of 
class
>> "htest" and not all such objects have standard errors. I'm not entirely
>> sure what the point is since it's easy to compute the standard error of 
the
>> difference from the information in the object (adapting an example from
>> ?t.test):
>> 
>> > (res <- t.test(1:10, y = c(7:20)))
>> 
>> Welch Two Sample t-test
>> 
>> data:  1:10 and c(7:20)
>> t = -5.4349, df = 21.982, p-value = 1.855e-05
>> alternative hypothesis: true difference in means is not equal to 0
>> 95 percent confidence interval:
>> -11.052802  -4.947198
>> sample estimates:
>> mean of x mean of y
>> 5.5  13.5
>> 
>> > as.vector(abs(diff(res$estimate)/res$statistic)) # SE
>> [1] 1.47196
>> > class(res)
>> [1] "htest"
>> 
>> and if you really want to print the SE as a matter of course, you could
>> always write your own wrapper for t.test() that returns an object of 
class,
>> say, "t.test" for which you can provide a print() method. Much of the
>> advantage of working in a statistical computing environment like R (or
>> Stata, for that matter) is that you can make things work the way you 
like.
>> 
>> Best,
>> John
>> 
>> -
>> John Fox, Professor Emeritus
>> McMaster University
>> Hamilton, Ontario, Canada
>> Web: http::/socserv.mcmaster.ca/jfox
>> 
>> > On Feb 21, 2019, at 3:57 PM, Thomas J. Leeper 
>> wrote:
>> >
>> > A recent thread on Twitter [1] by a Stata user highlighted that 
t.test()
>> > does not return or print the standard error of the mean difference,
>> despite
>> > it being calculated by the function.
>> >
>> > I know this isn’t the kind of change that’s likely to be made but could
>> we
>> > at least return the SE even if the print() method isn’t updated? Or,
>> > better, update the print() method to display this as well?
>> >
>> > Best,
>> > Thomas
>> >
>> > [1]
>> > https://twitter.com/amandayagan/status/1098314654470819840?s=21
>> > --
>> >
>> > Thomas J. Leeper
>> > http://www.thomasleeper.com
>> >
>> >   [[alternative HTML version deleted]]
>> >
>> > __
>> > R-devel@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
   

Re: [Rd] Proposed patch for ?Extract

2019-02-22 Thread Martin Maechler
> Marc Schwartz via R-devel 
> on Thu, 21 Feb 2019 14:39:45 +0100 writes:

 >   Hi,

> In follow up to the thread on R-Help yesterday:
>https://stat.ethz.ch/pipermail/r-help/2019-February/461725.html

> I am attaching a proposed patch against the trunk version of
> Extract.Rd, with wording added to the "Matrices and arrays" section, to
> note that indexing these object by factors behaves in a manner
> consistent with vectors.
> Regards,
> Marc Schwartz

Seems reasonable to me: some redundancy is allowed even in the
reference manuals (:= union of all help pages).
{good useRs knowning that 'matrix' and 'arrays' are atomic
 vectors, too .. as was mentioned.}

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Return/print standard error in t.test()

2019-02-22 Thread peter dalgaard
It's not a problem per se to put additional information into class htest 
objects (hey, it's S3 after all...) and there is a precedent in chisq.test 
which returns $observed and $expected.

Getting such information printed by print.htest is more tricky, although it 
might be possible to (ab)use the $estimate slot. 

The further question is whether one would really want to do that (change the 
output and/or modify the current return values), at the risk of affecting a 
rather large bundle of existing scripts, books, lecture notes, etc. I don't 
think that I would want to do that for the case of the s.e.d., although I'll 
admit that there is another thing that has always been a bit of an eyesore to 
me: We give a confidence interval but not the corresponding point estimate 
(i.e. the _difference_ of the means).

It might be better to simply start over and write a new function. In the 
process one might address other things that people have been asking for, like 
calculations based on the sample mean and SDs (which would useful for dealing 
with published summaries and textbook examples). Oh, and a formula interface 
for the one-sample test.

-pd

> On 21 Feb 2019, at 22:51 , Fox, John  wrote:
> 
> Dear Thomas,
> 
> it is, unfortunately, not that simple. t.test() returns an object of class 
> "htest" and not all such objects have standard errors. I'm not entirely sure 
> what the point is since it's easy to compute the standard error of the 
> difference from the information in the object (adapting an example from 
> ?t.test):
> 
>> (res <- t.test(1:10, y = c(7:20)))
> 
>   Welch Two Sample t-test
> 
> data:  1:10 and c(7:20)
> t = -5.4349, df = 21.982, p-value = 1.855e-05
> alternative hypothesis: true difference in means is not equal to 0
> 95 percent confidence interval:
> -11.052802  -4.947198
> sample estimates:
> mean of x mean of y 
>  5.5  13.5 
> 
>> as.vector(abs(diff(res$estimate)/res$statistic)) # SE
> [1] 1.47196
>> class(res)
> [1] "htest"
> 
> and if you really want to print the SE as a matter of course, you could 
> always write your own wrapper for t.test() that returns an object of class, 
> say, "t.test" for which you can provide a print() method. Much of the 
> advantage of working in a statistical computing environment like R (or Stata, 
> for that matter) is that you can make things work the way you like.
> 
> Best,
> John
> 
>  -
>  John Fox, Professor Emeritus
>  McMaster University
>  Hamilton, Ontario, Canada
>  Web: http::/socserv.mcmaster.ca/jfox
> 
>> On Feb 21, 2019, at 3:57 PM, Thomas J. Leeper  wrote:
>> 
>> A recent thread on Twitter [1] by a Stata user highlighted that t.test()
>> does not return or print the standard error of the mean difference, despite
>> it being calculated by the function.
>> 
>> I know this isn’t the kind of change that’s likely to be made but could we
>> at least return the SE even if the print() method isn’t updated? Or,
>> better, update the print() method to display this as well?
>> 
>> Best,
>> Thomas
>> 
>> [1]
>> https://twitter.com/amandayagan/status/1098314654470819840?s=21
>> -- 
>> 
>> Thomas J. Leeper
>> http://www.thomasleeper.com
>> 
>>  [[alternative HTML version deleted]]
>> 
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] model.matrix.default() silently ignores bad contrasts.arg

2019-02-22 Thread Martin Maechler
> Ben Bolker 
> on Thu, 21 Feb 2019 08:18:51 -0500 writes:

> On Thu, Feb 21, 2019 at 7:49 AM Fox, John  wrote:
>> 
>> Dear Ben,
>> 
>> Perhaps I'm missing the point, but contrasts.arg is documented to be a 
list. From ?model.matrix: "contrasts.arg: A list, whose entries are values 
(numeric matrices or character strings naming functions) to be used as 
replacement values for the contrasts replacement function and whose names are 
the names of columns of data containing factors."

> I absolutely agree that this is not a bug/behaves as documented (I
> could have said that more clearly).  It's just that (for reasons I
> attempted to explain) this is a really easy mistake to make.

>> This isn't entirely accurate because a function also works as a named 
element of the list (in addition to a character string naming a function and a 
contrast matrix), as your example demonstrates, but nowhere that I'm aware of 
is it suggested that a non-list should work.
>> 
>> It certainly would be an improvement if specifying contrast.arg as a 
non-list generated an error or warning message, and it at least arguably would 
be convenient to allow a general contrast specification such as 
contrasts.arg-"contr.sum", but I don't see a bug here.

> I agree.  That's what my patch does (throws a warning message if
> contrasts.arg is non-NULL and not a list).

I currently do think this is a good idea... "even though" I'm 99%
sure that this will make work for package maintainers and others
whose code may suddenly show warnings.
I hope they would know better than suppressWarnings(.) ...

I see a version of the patch using old style indentation which
makes the diff even "considerably" smaller -- no need to submit
this different, though --
and I plan to test that a bit, and commit eventually to R-devel,
possibly in a 5 days or so. 

Thank you Ben for the suggestion and patch !
Martin

> cheers
> Ben Bolker

>> Best,
>> John
>> 
>> -
>> John Fox, Professor Emeritus
>> McMaster University
>> Hamilton, Ontario, Canada
>> Web: http::/socserv.mcmaster.ca/jfox
>> 
>> > On Feb 20, 2019, at 7:14 PM, Ben Bolker  wrote:
>> >
>> > An lme4 user pointed out  that
>> > passing contrasts as a string or symbol to [g]lmer (which would work if
>> > we were using `contrasts<-` to set contrasts on a factor variable) is
>> > *silently ignored*. This goes back to model.matrix(), and seems bad
>> > (this is a very easy mistake to make, because of the multitude of ways
>> > to specify contrasts for factors in R  - e.g. options(contrasts=...);
>> > setting contrasts on the specific factors; passing contrasts as a list
>> > to the model function ... )
>> >
>> > The relevant code is here:
>> >
>> > 
https://github.com/wch/r-source/blob/trunk/src/library/stats/R/models.R#L578-L603
>> >
>> > The following code shows the problem: a plain-vanilla model.matrix()
>> > call with no contrasts argument, followed by two wrong contrasts
>> > arguments, followed by a correct contrasts argument.
>> >
>> > data(cbpp, package="lme4")
>> > mf1 <- model.matrix(~period, data=cbpp)
>> > mf2 <- model.matrix(~period, contrasts.arg="contr.sum", data=cbpp)
>> > all.equal(mf1,mf2) ## TRUE
>> > mf3 <- model.matrix(~period, contrasts.arg=contr.sum, data=cbpp)
>> > all.equal(mf1,mf3)  ## TRUE
>> > mf4 <- model.matrix(~period, contrasts.arg=list(period=contr.sum),
>> > data=cbpp)
>> > isTRUE(all.equal(mf1,mf4))  ## FALSE
>> >
>> >
>> >  I've attached a potential patch for this, which is IMO the mildest
>> > possible case (if contrasts.arg is non-NULL and not a list, it produces
>> > a warning).  I haven't been able to test it because of some mysterious
>> > issues I'm having with re-making R properly ...
>> >
>> >  Thoughts?  Should I submit this as a bug report/patch?
>> >
>> >  cheers
>> >   Ben Bolker
>>
>> > __

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] model.matrix.default() silently ignores bad contrasts.arg

2019-02-22 Thread Fox, John
Dear Martin and Ben,

I agree that a warning is a good idea (and perhaps that wasn't clear in my 
response to Ben's post). 

Also, it would be nice to correct the omission in the help file, which as far 
as I could see doesn't mention that a contrast-generating function (as opposed 
to its quoted name) can be an element of the contrasts.arg list.

Best,
 John

> -Original Message-
> From: Martin Maechler [mailto:maech...@stat.math.ethz.ch]
> Sent: Friday, February 22, 2019 11:50 AM
> To: Ben Bolker 
> Cc: Fox, John ; r-devel@r-project.org
> Subject: Re: [Rd] model.matrix.default() silently ignores bad contrasts.arg
> 
> > Ben Bolker
> > on Thu, 21 Feb 2019 08:18:51 -0500 writes:
> 
> > On Thu, Feb 21, 2019 at 7:49 AM Fox, John  wrote:
> >>
> >> Dear Ben,
> >>
> >> Perhaps I'm missing the point, but contrasts.arg is documented to be a
> list. From ?model.matrix: "contrasts.arg: A list, whose entries are values
> (numeric matrices or character strings naming functions) to be used as
> replacement values for the contrasts replacement function and whose
> names are the names of columns of data containing factors."
> 
> > I absolutely agree that this is not a bug/behaves as documented (I
> > could have said that more clearly).  It's just that (for reasons I
> > attempted to explain) this is a really easy mistake to make.
> 
> >> This isn't entirely accurate because a function also works as a named
> element of the list (in addition to a character string naming a function and a
> contrast matrix), as your example demonstrates, but nowhere that I'm
> aware of is it suggested that a non-list should work.
> >>
> >> It certainly would be an improvement if specifying contrast.arg as a 
> non-
> list generated an error or warning message, and it at least arguably would be
> convenient to allow a general contrast specification such as contrasts.arg-
> "contr.sum", but I don't see a bug here.
> 
> > I agree.  That's what my patch does (throws a warning message if
> > contrasts.arg is non-NULL and not a list).
> 
> I currently do think this is a good idea... "even though" I'm 99% sure that 
> this
> will make work for package maintainers and others whose code may
> suddenly show warnings.
> I hope they would know better than suppressWarnings(.) ...
> 
> I see a version of the patch using old style indentation which makes the diff
> even "considerably" smaller -- no need to submit this different, though --
> and I plan to test that a bit, and commit eventually to R-devel, possibly in 
> a 5
> days or so.
> 
> Thank you Ben for the suggestion and patch !
> Martin
> 
> > cheers
> > Ben Bolker
> 
> >> Best,
> >> John
> >>
> >> -
> >> John Fox, Professor Emeritus
> >> McMaster University
> >> Hamilton, Ontario, Canada
> >> Web: http::/socserv.mcmaster.ca/jfox
> >>
> >> > On Feb 20, 2019, at 7:14 PM, Ben Bolker  wrote:
> >> >
> >> > An lme4 user pointed out
>  that
> >> > passing contrasts as a string or symbol to [g]lmer (which would work 
> if
> >> > we were using `contrasts<-` to set contrasts on a factor variable) is
> >> > *silently ignored*. This goes back to model.matrix(), and seems bad
> >> > (this is a very easy mistake to make, because of the multitude of 
> ways
> >> > to specify contrasts for factors in R  - e.g. options(contrasts=...);
> >> > setting contrasts on the specific factors; passing contrasts as a 
> list
> >> > to the model function ... )
> >> >
> >> > The relevant code is here:
> >> >
> >> > https://github.com/wch/r-
> source/blob/trunk/src/library/stats/R/models.R#L578-L603
> >> >
> >> > The following code shows the problem: a plain-vanilla model.matrix()
> >> > call with no contrasts argument, followed by two wrong contrasts
> >> > arguments, followed by a correct contrasts argument.
> >> >
> >> > data(cbpp, package="lme4")
> >> > mf1 <- model.matrix(~period, data=cbpp)
> >> > mf2 <- model.matrix(~period, contrasts.arg="contr.sum", data=cbpp)
> >> > all.equal(mf1,mf2) ## TRUE
> >> > mf3 <- model.matrix(~period, contrasts.arg=contr.sum, data=cbpp)
> >> > all.equal(mf1,mf3)  ## TRUE
> >> > mf4 <- model.matrix(~period, contrasts.arg=list(period=contr.sum),
> >> > data=cbpp)
> >> > isTRUE(all.equal(mf1,mf4))  ## FALSE
> >> >
> >> >
> >> >  I've attached a potential patch for this, which is IMO the mildest
> >> > possible case (if contrasts.arg is non-NULL and not a list, it 
> produces
> >> > a warning).  I haven't been able to test it because of some 
> mysterious
> >> > issues I'm having with re-making R properly ...
> >> >
> >> >  Thoughts?  Should I submit this as a bug report/patch?
> >> >
> >> >  cheers
> >> >   Ben Bolker
> >>
> >> >

Re: [Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases

2019-02-22 Thread Tomas Kalibera

On 2/20/19 7:55 PM, Toby Hocking wrote:

Update: I have observed that stringi::stri_sub is linear time complexity,
and it computes the same thing as base::substring. figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R

To me this is a clear indication of a bug in substring, but again it would
be nice to have some feedback/confirmation before posting on bugzilla.

Also this suggests a fix -- just need to copy whatever stringi::stri_sub is
doing.


Thanks for the report, I am working on a patch that will address this.

I confirm there is a lot of potential for speedup. On my system,

'N=20; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)'

spends 96% time in checking if the string is ascii and 3% in strlen(); 
if we take advantage of the pre-computed value in the ASCII bit, the 
speed up is about 40x. Of course, with micro-benchmarks, any performance 
limitation can be arbitrarily inflated, users cannot expect to see these 
or any close speedups in applications as a result of the patch. The 
patch is going to do other easy optimizations that will not complicate 
the code, including avoiding the strlen() call (taking advantage of 
pre-computed length of R character object).


Best
Tomas





On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking  wrote:


Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
yesterday)

I believe that I have found another bug, this time in the substring
function. The use case that I am concerned with is when there is a single
(character scalar) text/subject, and many substrings to extract. For example

substring("", 1:4, 1:4)

or more generally,

N=1000
substring(paste(rep("A", N), collapse=""), 1:N, 1:N)

The problem I observe is that the time complexity is quadratic in N, as
shown on this figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R

I expected the time complexity to be linear in N.

The example above may seem contrived/trivial, but it is indeed relevant to
a number of packages (rex, rematch2, namedCapture) which provide functions
that use gregexpr and then substring to extract the text in the captured
sub-patterns. The figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
shows the issue: these packages have quadratic time complexity, whereas
other packages (and the gregexpr function with perl=TRUE after applying the
patch discussed yesterday) have linear time complexity. I believe the
problem is the substring function. Source for this figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R

I suspect that a fix can be accomplished by optimizing the implementation
of substring, for the special case when the text/subject is a single
element (character scalar). Right now I notice that the substring R code
uses rep_len so that the text/subject which is passed to the C code is a
character vector with the same length as the number of substrings to
extract. Maybe the C code is calling strlen for each of these (identical)
text/subject elements?

Anyway, it would be useful to have some feedback to make sure this is
indeed a bug before I post on bugzilla. (btw thanks Martin for signing me
up for an account)

Toby


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel