[Rd] Documentation for sd (stats) + suggestion

2019-02-19 Thread PatrickT
I cannot file suggestions on Bugzilla, so I am writing here.

As far as I can tell, the manual help page for ``sd``

?sd

does not explicitly mention that the formula for the standard deviation is
the so-called "Bessel-corrected" formula (divide by n-1 rather than n).

I suggest it should be stated near the top.

I would also suggest (feature request!) that either

 - a population standard deviation function, e.g. ``sdp`` or ``sd.p``, be
made available (that would be my preference)

or

 - the current ``sd`` be extended to accept a ``population=FALSE`` or
``sample=TRUE`` argument.

Same for the variance. Excel, Calc, etc. offer these.
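
For concreteness, here is a minimal sketch of what such functions could
look like (the names ``var.p`` and ``sd.p`` are purely illustrative;
nothing like this exists in base R today):

var.p <- function(x, na.rm = FALSE) {
    if (na.rm) x <- x[!is.na(x)]
    n <- length(x)
    var(x) * (n - 1) / n    # undo the Bessel correction
}
sd.p <- function(x, na.rm = FALSE) sqrt(var.p(x, na.rm))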

Motivation: I encourage my students to use R rather than Python (which has
picked up big time in recent years) on the grounds that R is easier to get
started with and is specialized for statistics. But then there is no
"population" formula for the standard deviation. And ``mode`` is not the
mode they expect... (btw, I suggest adding a ``modes`` function to the
core.) These are all things a beginner will look for in statistical software.

Thanks for listening. And thanks for the great work!




Re: [Rd] Documentation for sd (stats) + suggestion

2019-02-19 Thread S Ellison
> As far as I can tell, the manual help page for ``sd``
> 
> ?sd
> 
> does not explicitly mention that the formula for the standard deviation is
> the so-called "Bessel-corrected" formula (divide by n-1 rather than n).

See Details, where it says:

"Like 'var' this uses denominator n - 1."






[Rd] mle (stat4) crashing due to singular Hessian in covariance matrix calculation

2019-02-19 Thread Francisco Matorras

Hi, R developers.
When running mle inside a loop I found some nasty behavior. From time to
time, my model had a degenerate minimum and the loop just crashed. I
tracked it down to the "vcov <- if (length(coef)) solve(oout$hessian)"
line, the Hessian being singular.
Note that the minimum reached was good; it just did not make sense to
calculate the covariance matrix as the inverse of a singular Hessian. In
my case I am just interested in the value of the log-likelihood. For my
application, I patched it easily in a local version of mle by removing
this call, since I am not using vcov at all, but I wonder if it can be
improved in the official release. I can imagine two simple solutions:
either including the vcov calculation as an option, or avoiding the call
to solve if the Hessian is singular (setting vcov to NA). I am willing to
write a few lines of code if you think it is worthwhile.
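
For illustration, here is one way the second option could look, written
against the line quoted above (oout and coef are the surrounding
stats4::mle internals, so this is patch context, not standalone code):

vcov <- if (length(coef)) {
    tryCatch(solve(oout$hessian),    # invert when possible ...
             error = function(e)     # ... otherwise fill with NAs
                 matrix(NA_real_, nrow(oout$hessian), ncol(oout$hessian)))
} else matrix(numeric(0), 0L, 0L)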


regards

Francisco Matorras
Instituto de Física de Cantabria
Universidad de Cantabria



Re: [Rd] code for sum function

2019-02-19 Thread Paul Gilbert

(I didn't see anyone else answer this, so ...)

You can probably find the R code in src/main/ but I'm not sure. You are
talking about a very simple calculation, so it seems unlikely that the
algorithm is the cause of the difference. I have done much more
complicated things and usually get agreement to machine precision. There
are four possibilities I can think of that could cause (small) differences.


0/ Your code is wrong, but that seems unlikely for such a simple
calculation.


1/ You are summing a very large number of numbers, in which case the
running sum can become very large compared to the numbers being added;
then things can get a bit funny (see the short demonstration after this
list).


2/ You are using single precision in Fortran rather than double. Double
is needed for all floating point numbers you use!


3/ You have not zeroed the double precision numbers in Fortran. (Some
compilers do not do this automatically; you have to specify it.) Then
if you accidentally put singles, like a constant 0.0 rather than a
constant 0.0D+0, into a double, you will have small junk in the
lower-precision part.


(I am assuming you are talking about a sum of reals, not integer or 
complex.)
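
A short R demonstration of point 1/ (plain double arithmetic; Reduce
forces element-by-element accumulation so that R's own sum() internals
do not hide the effect):

x <- c(1e16, rep(1, 100))
(1e16 + 1) == 1e16           # TRUE: the 1 is absorbed by the huge value
fwd <- Reduce(`+`, x)        # big number first: each +1 is rounded away
bwd <- Reduce(`+`, rev(x))   # the ones accumulate first, then add 1e16
fwd == bwd                   # FALSE: fwd is 1e16, bwd is 1e16 + 100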


HTH,
Paul Gilbert

On 2/14/19 2:08 PM, Rampal Etienne wrote:

Hello,

I am trying to write FORTRAN code to do the same as some R code I have. 
I get (small) differences when using the sum function in R. I know there 
are numerical routines to improve precision, but I have not been able to 
figure out what algorithm R is using. Does anyone know this? Or where 
can I find the code for the sum function?


Regards,

Rampal Etienne



Re: [Rd] code for sum function

2019-02-19 Thread William Dunlap via R-devel
The algorithm does make a difference.  You can use Kahan's summation
algorithm (https://en.wikipedia.org/wiki/Kahan_summation_algorithm) to
reduce the error compared to the naive summation algorithm.  E.g., in R
code:

naiveSum <- function(x) {
    s <- 0.0
    for (xi in x) s <- s + xi
    s
}

kahanSum <- function(x) {
    s <- 0.0
    c <- 0.0 # running compensation for lost low-order bits
    for (xi in x) {
        y <- xi - c
        t <- s + y # low-order bits of y may be lost here
        c <- (t - s) - y
        s <- t
    }
    s
}

> rSum <- vapply(c(1:20,10^(2:7)), function(n) sum(rep(1/7,n)), 0)
> rNaiveSum <- vapply(c(1:20,10^(2:7)), function(n) naiveSum(rep(1/7,n)), 0)
> rKahanSum <- vapply(c(1:20,10^(2:7)), function(n) kahanSum(rep(1/7,n)), 0)
>
> table(rSum == rNaiveSum)

FALSE  TRUE
   21     5
> table(rSum == rKahanSum)

FALSE  TRUE
    3    23


Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Tue, Feb 19, 2019 at 10:36 AM Paul Gilbert  wrote:

> (I didn't see anyone else answer this, so ...)
>
> You can probably find the R code in src/main/ but I'm not sure. You are
> talking about a very simple calculation, so it seems unlikely that the
> algorithm is the cause of the difference. I have done much more
> complicated things and usually get agreement to machine precision. There
> are four possibilities I can think of that could cause (small) differences.
>
> 0/ Your code is wrong, but that seems unlikely for such a simple
> calculation.
>
> 1/ You are summing a very large number of numbers, in which case the
> running sum can become very large compared to the numbers being added;
> then things can get a bit funny.
>
> 2/ You are using single precision in Fortran rather than double. Double
> is needed for all floating point numbers you use!
>
> 3/ You have not zeroed the double precision numbers in Fortran. (Some
> compilers do not do this automatically; you have to specify it.) Then
> if you accidentally put singles, like a constant 0.0 rather than a
> constant 0.0D+0, into a double, you will have small junk in the
> lower-precision part.
>
> (I am assuming you are talking about a sum of reals, not integer or
> complex.)
>
> HTH,
> Paul Gilbert
>
> On 2/14/19 2:08 PM, Rampal Etienne wrote:
> > Hello,
> >
> > I am trying to write FORTRAN code to do the same as some R code I have.
> > I get (small) differences when using the sum function in R. I know there
> > are numerical routines to improve precision, but I have not been able to
> > figure out what algorithm R is using. Does anyone know this? Or where
> > can I find the code for the sum function?
> >
> > Regards,
> >
> > Rampal Etienne
> >


[Rd] bias issue in sample() (PR 17494)

2019-02-19 Thread Tierney, Luke


Before the next release we really should sort out the bias issue in
sample() reported by Ottoboni and Stark in
https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
filed as a bug report by Duncan Murdoch at
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.

Here are two examples of bad behavior through current R-devel:

 set.seed(123)
 m <- (2/5) * 2^32
 x <- sample(m, 1000000, replace = TRUE)
 table(x %% 2, x > m / 2)
 ##
 ##      FALSE   TRUE
 ##   0 300620 198792
 ##   1 200196 300392

 table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
 ##
 ##      0      1
 ## 429054 570946

I committed a modification to R_unif_index to address this by
generating random bits (blocks of 16) and rejection sampling, but for
now this is only enabled if the environment variable R_NEW_SAMPLE is
set before the first call.
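
For anyone who wants the gist without reading the C, the idea sketched in
R (an illustration of the approach only, not the actual R_unif_index code,
which pulls bits from the generator directly and in blocks):

rbits <- function(bits) {
    v <- 0
    for (b in seq_len(bits)) v <- 2 * v + (runif(1) > 0.5)
    v
}
unif_index <- function(dn) {
    bits <- ceiling(log2(dn))    # smallest bit count covering 0:(dn - 1)
    repeat {                     # rejection loop: expected < 2 draws
        dv <- rbits(bits)
        if (dv < dn) return(dv)  # uniform on 0, 1, ..., dn - 1
    }
}
## an unbiased sample()-style draw from 1:m is then 1 + unif_index(m)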

Some things still needed:

- someone to look over the change and see if there are any issues
- adjustment of RNGkind to allow the old behavior to be selected
- make the new behavior the default
- adjust documentation
- ???

Unfortunately I don't have enough free cycles to do this, but I can
help if someone else can take the lead.

There are two other places I found that might suffer from the same
issue, in walker_ProbSampleReplace (pointed out by O & S) and in
src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
have done that for walker_ProbSampleReplace, but the wilcox change
might need adjusting to support the standalone math library and I
don't feel confident enough I'd get that right.

Best,

luke


-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu



[Rd] patch for gregexpr(perl=TRUE)

2019-02-19 Thread Toby Hocking
Hi all,

Several people have noticed that gregexpr is very slow for large subject
strings when perl=TRUE is specified.
- https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings
- http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html
- https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html

I figured out the issue, which is fixed by changing one line of code in
src/main/grep.c -- there is a strlen call which is currently inside the
while loop over matches, and the patch hoists it out, before the loop:
https://github.com/tdhock/namedCapture-article/blob/master/linear-time-gregexpr-perl.patch

I made some figures that show the quadratic time complexity before
applying the patch and the linear time complexity after:
https://github.com/tdhock/namedCapture-article#19-feb-2019
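
If you want to see the effect without building the figures, a rough check
(the sizes are arbitrary; on an unpatched R the elapsed time roughly
quadruples when n doubles, on a patched R it roughly doubles):

for (n in c(1, 2, 4) * 1e5) {
    subject <- paste(rep("ab", n), collapse = "")
    print(system.time(gregexpr("ab", subject, perl = TRUE))["elapsed"])
}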

I would have posted a bug report on bugs.r-project.org but I do not have an
account. So can an R-devel person please either (1) accept this patch, or
(2) give me an account so I can post the patch on the bug tracker?

Finally, I would like to mention that Bill Dunlap noticed a similar
problem (time complexity quadratic in subject size) for strsplit with
perl=TRUE. My patch does NOT fix that, but I suspect a similar fix could
work, because I see that strlen is also called in a while loop in
do_strsplit.

Thanks
Toby Dylan Hocking



Re: [Rd] code for sum function

2019-02-19 Thread Ben Bolker


This SO question may be of interest:

https://stackoverflow.com/questions/38589705/difference-between-rs-sum-and-armadillos-accu/

  which points out that sum() isn't doing anything fancy *except* using
extended-precision registers when available.  (Using Kahan's algorithm
does come at a computational cost ...)
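
A quick way to see that extra precision at work (on builds where the
C-level sum accumulates in long double, e.g. typical x86 ones):

x <- rep(1/7, 1e6)
sum(x) == Reduce(`+`, x)  # typically FALSE: Reduce rounds to double at
                          # every step, sum() rounds only once at the end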

On 2019-02-19 2:08 p.m., William Dunlap via R-devel wrote:
> The algorithm does make a difference.  You can use Kahan's summation
> algorithm (https://en.wikipedia.org/wiki/Kahan_summation_algorithm) to
> reduce the error compared to the naive summation algorithm.  E.g., in R
> code:
> 
> naiveSum <- function(x) {
>     s <- 0.0
>     for (xi in x) s <- s + xi
>     s
> }
> 
> kahanSum <- function(x) {
>     s <- 0.0
>     c <- 0.0 # running compensation for lost low-order bits
>     for (xi in x) {
>         y <- xi - c
>         t <- s + y # low-order bits of y may be lost here
>         c <- (t - s) - y
>         s <- t
>     }
>     s
> }
> 
>> rSum <- vapply(c(1:20,10^(2:7)), function(n) sum(rep(1/7,n)), 0)
>> rNaiveSum <- vapply(c(1:20,10^(2:7)), function(n) naiveSum(rep(1/7,n)), 0)
>> rKahanSum <- vapply(c(1:20,10^(2:7)), function(n) kahanSum(rep(1/7,n)), 0)
>>
>> table(rSum == rNaiveSum)
> 
> FALSE  TRUE
>    21     5
>> table(rSum == rKahanSum)
> 
> FALSE  TRUE
>     3    23
> 
> 
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
> 
> 
> On Tue, Feb 19, 2019 at 10:36 AM Paul Gilbert  wrote:
> 
>> (I didn't see anyone else answer this, so ...)
>>
>> You can probably find the R code in src/main/ but I'm not sure. You are
>> talking about a very simple calculation, so it seems unlikely that the
>> algorithm is the cause of the difference. I have done much more
>> complicated things and usually get agreement to machine precision. There
>> are four possibilities I can think of that could cause (small) differences.
>>
>> 0/ Your code is wrong, but that seems unlikely for such a simple
>> calculation.
>>
>> 1/ You are summing a very large number of numbers, in which case the
>> running sum can become very large compared to the numbers being added;
>> then things can get a bit funny.
>>
>> 2/ You are using single precision in Fortran rather than double. Double
>> is needed for all floating point numbers you use!
>>
>> 3/ You have not zeroed the double precision numbers in Fortran. (Some
>> compilers do not do this automatically; you have to specify it.) Then
>> if you accidentally put singles, like a constant 0.0 rather than a
>> constant 0.0D+0, into a double, you will have small junk in the
>> lower-precision part.
>>
>> (I am assuming you are talking about a sum of reals, not integer or
>> complex.)
>>
>> HTH,
>> Paul Gilbert
>>
>> On 2/14/19 2:08 PM, Rampal Etienne wrote:
>>> Hello,
>>>
>>> I am trying to write FORTRAN code to do the same as some R code I have.
>>> I get (small) differences when using the sum function in R. I know there
>>> are numerical routines to improve precision, but I have not been able to
>>> figure out what algorithm R is using. Does anyone know this? Or where
>>> can I find the code for the sum function?
>>>
>>> Regards,
>>>
>>> Rampal Etienne
>>>



Re: [Rd] mle (stat4) crashing due to singular Hessian in covariance matrix calculation

2019-02-19 Thread Ben Bolker


  I don't know if this will get much response from the R developers;
they might just recommend that you protect your mle() call with try() or
tryCatch() to stop it from breaking your loop.  Alternatively, you could
try the mle2() function in the bbmle package, which started out long ago
as a slightly more flexible and robust version of stats4::mle(); I don't
remember/can't promise that it handles fits with singular Hessians, but
I'm guessing it does ...
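
A self-contained sketch of the tryCatch approach (toy model; a
singular-Hessian failure inside mle() would land in the error branch):

library(stats4)
set.seed(1)
x <- rnorm(50, mean = 2, sd = 1)
nll <- function(mu) -sum(dnorm(x, mean = mu, sd = 1, log = TRUE))
fit <- tryCatch(mle(nll, start = list(mu = 0)),
                error = function(e) NULL)
if (!is.null(fit)) logLik(fit)  # a loop can continue either way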

  cheers
   Ben Bolker


On 2019-02-19 12:02 p.m., Francisco Matorras wrote:
> Hi, R developers.
> When running mle inside a loop I found some nasty behavior. From time to
> time, my model had a degenerate minimum and the loop just crashed. I
> tracked it down to the "vcov <- if (length(coef)) solve(oout$hessian)"
> line, the Hessian being singular.
> Note that the minimum reached was good; it just did not make sense to
> calculate the covariance matrix as the inverse of a singular Hessian. In
> my case I am just interested in the value of the log-likelihood. For my
> application, I patched it easily in a local version of mle by removing
> this call, since I am not using vcov at all, but I wonder if it can be
> improved in the official release. I can imagine two simple solutions:
> either including the vcov calculation as an option, or avoiding the call
> to solve if the Hessian is singular (setting vcov to NA). I am willing
> to write a few lines of code if you think it is worthwhile.
> 
> regards
> 
> Francisco Matorras
> Instituto de Física de Cantabria
> Universidad de Cantabria
> 
> 



Re: [Rd] Documentation for sd (stats) + suggestion

2019-02-19 Thread Dario Strbenac
Good day,

It is implemented by the CRAN package multicon; the function is named
popsd. But it does seem like something R should provide without creating
a package dependency.
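
A quick usage sketch (assuming multicon is installed from CRAN; the
base-R rescaling of sd is shown for comparison):

## install.packages("multicon")
library(multicon)
x <- c(1, 2, 3, 4)
popsd(x)                                   # divides by n
sd(x) * sqrt((length(x) - 1) / length(x))  # same value from base R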

--
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia


Re: [Rd] bias issue in sample() (PR 17494)

2019-02-19 Thread Gabriel Becker
Luke,

I'm happy to help with this. It's great to see this get tackled (I've
cc'ed Kelli Ottoboni, who helped flag this issue).

I can prepare a patch for the RNGkind related stuff and the doc update.
I can prepare a patch for the RNGkind related stuff and the doc update.

As for ???, what are your (and others') thoughts about the possibility of
a) a reproducibility API which takes either an R version (or perhaps a
date) and sets the RNGkind to the default for that version/date, and/or
b) modifying sessionInfo to capture (and display) the RNGkind in effect?
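
On (a), note that RNGversion() already exists and might be the natural
hook; a sketch of how it could behave (extending it to cover the sample()
change is an assumption here, not current functionality):

RNGversion("3.5.0")  # hypothetically: restore all RNG defaults as of
                     # R 3.5.0, including the old sample() behavior
RNGkind()            # inspect the kinds currently in effect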

Best,
~G


On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke wrote:

>
> Before the next release we really should sort out the bias issue in
> sample() reported by Ottoboni and Stark in
> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
> filed as a bug report by Duncan Murdoch at
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
>
> Here are two examples of bad behavior through current R-devel:
>
>  set.seed(123)
>  m <- (2/5) * 2^32
>  x <- sample(m, 1000000, replace = TRUE)
>  table(x %% 2, x > m / 2)
>  ##
>  ##      FALSE   TRUE
>  ##   0 300620 198792
>  ##   1 200196 300392
>
>  table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
>  ##
>  ##      0      1
>  ## 429054 570946
>
> I committed a modification to R_unif_index to address this by
> generating random bits (blocks of 16) and rejection sampling, but for
> now this is only enabled if the environment variable R_NEW_SAMPLE is
> set before the first call.
>
> Some things still needed:
>
> - someone to look over the change and see if there are any issues
> - adjustment of RNGkind to allow the old behavior to be selected
> - make the new behavior the default
> - adjust documentation
> - ???
>
> Unfortunately I don't have enough free cycles to do this, but I can
> help if someone else can take the lead.
>
> There are two other places I found that might suffer from the same
> issue, in walker_ProbSampleReplace (pointed out by O & S) and in
> src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
> have done that for walker_ProbSampleReplace, but the wilcox change
> might need adjusting to support the standalone math library and I
> don't feel confident enough I'd get that right.
>
> Best,
>
> luke
>
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:  319-335-3386
> Department of Statistics and        Fax:    319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
> Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
>
