[Rd] Documentation for sd (stats) + suggestion
I cannot file suggestions on Bugzilla, so I am writing here. As far as I can tell, the manual help page for ``sd`` (?sd) does not explicitly mention that the formula used for the standard deviation is the so-called "Bessel-corrected" formula (divide by n-1 rather than n). I suggest it be stated near the top.

I would also suggest (feature request!) that either

- a population standard deviation function, e.g. ``sdp`` or ``sd.p``, be made available (that would be my preference), or
- the current ``sd`` be extended to accept a ``population=FALSE`` or ``sample=TRUE`` argument.

Same for the variance. Excel, Calc, etc. offer these.

Motivation: I encourage my students to use R rather than Python (which has picked up big time in recent years) on the grounds that it is easier to get started with and is specialized in statistics. But then there is no "population" formula for the standard deviation. And ``mode`` is not the mode they expect... (btw, I suggest adding a ``modes`` function to the core.) All things a beginner will look for in stats software.

Thanks for listening. And thanks for the great work!
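For illustration, a minimal sketch of the requested helper could look like the following (``sd.p`` is only the name suggested above, not an existing base R function; it simply uses denominator n instead of n-1):

sd.p <- function(x, na.rm = FALSE) {
  # population standard deviation: divide by n rather than n - 1
  if (na.rm) x <- x[!is.na(x)]
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / n)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sd(x)    # sample SD, denominator n - 1 (about 2.138)
sd.p(x)  # population SD, denominator n (exactly 2)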
Re: [Rd] Documentation for sd (stats) + suggestion
> As far as I can tell, the manual help page for ``sd``
>
>    ?sd
>
> does not explicitly mention that the formula for the standard deviation is
> the so-called "Bessel-corrected" formula (divide by n-1 rather than n).

See Details, where it says

   "Details: Like 'var' this uses denominator n - 1."
[Rd] mle (stats4) crashing due to singular Hessian in covariance matrix calculation
Hi, R developers.

When running mle inside a loop I found a nasty behavior. From time to time, my model had a degenerate minimum and the loop just crashed. I tracked it down to the "vcov <- if (length(coef)) solve(oout$hessian)" line, the Hessian being singular.

Note that the minimum reached was good; it just did not make sense to calculate the covariance matrix as the inverse of a singular Hessian. In my case I am just interested in the value of the log-likelihood. For my application, I patched it easily in a local version of mle by removing this call, since I am not using vcov at all, but I wonder if it can be improved in the official release. I can imagine two simple solutions: either including the vcov calculation as an option, or avoiding the call to solve if the Hessian is singular (setting vcov to NA). I am willing to write a few lines of code if you think it is worth it.

regards

Francisco Matorras
Instituto de Física de Cantabria
Universidad de Cantabria
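A minimal sketch of the "set vcov to NA when the Hessian is singular" idea from the message above (illustration only; safe_vcov is a made-up name, not the actual stats4::mle internals):

# Guard the inversion of a possibly singular Hessian, returning an NA
# matrix instead of stopping with an error.
safe_vcov <- function(hessian) {
  tryCatch(solve(hessian),
           error = function(e) {
             warning("Hessian is singular; returning NA covariance matrix")
             matrix(NA_real_, nrow(hessian), ncol(hessian))
           })
}

H <- matrix(c(1, 1, 1, 1), 2, 2)  # singular 2 x 2 Hessian
safe_vcov(H)                      # NA matrix plus a warning, no crash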
Re: [Rd] code for sum function
(I didn't see anyone else answer this, so ...)

You can probably find the R code in src/main/ but I'm not sure. You are talking about a very simple calculation, so it seems unlikely that the algorithm is the cause of the difference. I have done much more complicated things and usually get machine precision comparisons. There are four possibilities I can think of that could cause (small) differences.

0/ Your code is wrong, but that seems unlikely on such a simple calculation.

1/ You are summing a very large number of numbers, in which case the sum can become very large compared to the numbers being added, and then things can get a bit funny.

2/ You are using single precision in Fortran rather than double. Double is needed for all floating point numbers you use!

3/ You have not zeroed the double precision numbers in Fortran. (Some compilers do not do this automatically and you have to specify it.) Then if you accidentally put singles, like a constant 0.0 rather than a constant 0.0D+0, into a double you will have small junk in the lower precision part.

(I am assuming you are talking about a sum of reals, not integer or complex.)

HTH,
Paul Gilbert

On 2/14/19 2:08 PM, Rampal Etienne wrote:
> Hello,
>
> I am trying to write FORTRAN code to do the same as some R code I have.
> I get (small) differences when using the sum function in R. I know there
> are numerical routines to improve precision, but I have not been able to
> figure out what algorithm R is using. Does anyone know this? Or where
> can I find the code for the sum function?
>
> Regards,
>
> Rampal Etienne
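A small sketch illustrating point 1/ above (it assumes a platform where R's sum() accumulates in extended precision, which is typical on x86; the exact results may differ elsewhere):

big   <- 1e16          # spacing between adjacent doubles here is 2, so +1 is lost
small <- rep(1, 10)

(big + sum(small)) - big          # 10: the small parts are summed first
Reduce(`+`, c(big, small)) - big  # typically 0: each +1 rounds away in double precision
sum(c(big, small)) - big          # typically 10: long-double accumulator keeps the low bits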
Re: [Rd] code for sum function
The algorithm does make a difference. You can use Kahan's summation algorithm (https://en.wikipedia.org/wiki/Kahan_summation_algorithm) to reduce the error compared to the naive summation algorithm. E.g., in R code:

naiveSum <- function(x) {
   s <- 0.0
   for(xi in x) s <- s + xi
   s
}

kahanSum <- function (x)
{
   s <- 0.0
   c <- 0.0 # running compensation for lost low-order bits
   for(xi in x) {
      y <- xi - c
      t <- s + y # low-order bits of y may be lost here
      c <- (t - s) - y
      s <- t
   }
   s
}

> rSum <- vapply(c(1:20,10^(2:7)), function(n) sum(rep(1/7,n)), 0)
> rNaiveSum <- vapply(c(1:20,10^(2:7)), function(n) naiveSum(rep(1/7,n)), 0)
> rKahanSum <- vapply(c(1:20,10^(2:7)), function(n) kahanSum(rep(1/7,n)), 0)
>
> table(rSum == rNaiveSum)

FALSE  TRUE
   21     5

> table(rSum == rKahanSum)

FALSE  TRUE
    3    23

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Feb 19, 2019 at 10:36 AM Paul Gilbert wrote:

> (I didn't see anyone else answer this, so ...)
>
> You can probably find the R code in src/main/ but I'm not sure. You are
> talking about a very simple calculation, so it seems unlikely that the
> algorithm is the cause of the difference. I have done much more
> complicated things and usually get machine precision comparisons. There
> are four possibilities I can think of that could cause (small) differences.
>
> 0/ Your code is wrong, but that seems unlikely on such a simple
> calculation.
>
> 1/ You are summing a very large number of numbers, in which case the sum
> can become very large compared to the numbers being added, then things can
> get a bit funny.
>
> 2/ You are using single precision in Fortran rather than double. Double
> is needed for all floating point numbers you use!
>
> 3/ You have not zeroed the double precision numbers in Fortran. (Some
> compilers do not do this automatically and you have to specify it.) Then
> if you accidentally put singles, like a constant 0.0 rather than a
> constant 0.0D+0, into a double you will have small junk in the lower
> precision part.
>
> (I am assuming you are talking about a sum of reals, not integer or
> complex.)
>
> HTH,
> Paul Gilbert
>
> On 2/14/19 2:08 PM, Rampal Etienne wrote:
> > Hello,
> >
> > I am trying to write FORTRAN code to do the same as some R code I have.
> > I get (small) differences when using the sum function in R. I know there
> > are numerical routines to improve precision, but I have not been able to
> > figure out what algorithm R is using. Does anyone know this? Or where
> > can I find the code for the sum function?
> >
> > Regards,
> >
> > Rampal Etienne
[Rd] bias issue in sample() (PR 17494)
Before the next release we really should sort out the bias issue in sample() reported by Ottoboni and Stark in https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and filed as a bug report by Duncan Murdoch at https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.

Here are two examples of bad behavior through current R-devel:

set.seed(123)
m <- (2/5) * 2^32
x <- sample(m, 1000000, replace = TRUE)
table(x %% 2, x > m / 2)
##
##      FALSE   TRUE
##   0 300620 198792
##   1 200196 300392

table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
##
##      0      1
## 429054 570946

I committed a modification to R_unif_index to address this by generating random bits (blocks of 16) and rejection sampling, but for now this is only enabled if the environment variable R_NEW_SAMPLE is set before the first call.

Some things still needed:

- someone to look over the change and see if there are any issues
- adjustment of RNGkind to allow the old behavior to be selected
- make the new behavior the default
- adjust documentation
- ???

Unfortunately I don't have enough free cycles to do this, but I can help if someone else can take the lead.

There are two other places I found that might suffer from the same issue, in walker_ProbSampleReplace (pointed out by O & S) and in src/nmath/wilcox.c. Both can be addressed by using R_unif_index. I have done that for walker_ProbSampleReplace, but the wilcox change might need adjusting to support the standalone math library and I don't feel confident enough I'd get that right.

Best,

luke

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                    Phone: 319-335-3386
Department of Statistics and          Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                    email: luke-tier...@uiowa.edu
Iowa City, IA 52242                   WWW:   http://www.stat.uiowa.edu
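For readers following along, here is a rough R-level sketch of the rejection-sampling idea described above (draw enough random bits for the range and reject out-of-range values); it is an illustration only, not the actual C code in R_unif_index, and sample_unbiased is a made-up name:

# Uniform draw on 1..n: draw an integer from the smallest power of two >= n
# built from independent random bits, and reject values outside the range.
sample_unbiased <- function(n) {
  bits <- ceiling(log2(n))                  # number of random bits needed
  repeat {
    v <- sum(rbinom(bits, 1, 0.5) * 2^(seq_len(bits) - 1))  # 0 .. 2^bits - 1
    if (v < n) return(v + 1)                # accept only in-range values
  }
}

set.seed(1)
table(replicate(10000, sample_unbiased(7)))  # roughly uniform over 1..7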
[Rd] patch for gregexpr(perl=TRUE)
Hi all,

Several people have noticed that gregexpr is very slow for large subject strings when perl=TRUE is specified.

- https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings
- http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html
- https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html

I figured out the issue, which is fixed by changing 1 line of code in src/main/grep.c -- there is a strlen function call which is currently inside of the while loop over matches, and the patch moves it before the loop.

https://github.com/tdhock/namedCapture-article/blob/master/linear-time-gregexpr-perl.patch

I made some figures that show the quadratic time complexity before applying the patch, and the linear time complexity after applying the patch:

https://github.com/tdhock/namedCapture-article#19-feb-2019

I would have posted a bug report on bugs.r-project.org but I do not have an account. So can an R-devel person please either (1) accept this patch, or (2) give me an account so I can post the patch on the bug tracker?

Finally I would like to mention that Bill Dunlap noticed a similar problem (time complexity which is quadratic in subject size) for strsplit with perl=TRUE. My patch does NOT fix that, but I suspect that a similar fix could be accomplished (because I see that strlen is being called in a while loop in do_strsplit as well).

Thanks
Toby Dylan Hocking
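As a quick way to reproduce the reported behavior (a rough sketch, not taken from the patch or the linked article), one can time gregexpr(perl=TRUE) on subject strings of growing length and check how the elapsed time scales; on an unpatched R the time grows roughly quadratically:

# Time gregexpr(perl = TRUE) on increasingly long subject strings.
timings <- sapply(c(1e4, 2e4, 4e4, 8e4), function(N) {
  subject <- paste(rep("ab", N), collapse = "")
  system.time(gregexpr("a", subject, perl = TRUE))[["elapsed"]]
})
timings  # successive ratios near 4x per doubling suggest O(N^2) behavior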
Re: [Rd] code for sum function
This SO question may be of interest:

https://stackoverflow.com/questions/38589705/difference-between-rs-sum-and-armadillos-accu/

which points out that sum() isn't doing anything fancy *except* using extended-precision registers when available. (Using Kahan's algorithm does come at a computational cost ...)

On 2019-02-19 2:08 p.m., William Dunlap via R-devel wrote:
> The algorithm does make a difference. You can use Kahan's summation
> algorithm (https://en.wikipedia.org/wiki/Kahan_summation_algorithm) to
> reduce the error compared to the naive summation algorithm. E.g., in R
> code:
>
> naiveSum <- function(x) {
>    s <- 0.0
>    for(xi in x) s <- s + xi
>    s
> }
>
> kahanSum <- function (x)
> {
>    s <- 0.0
>    c <- 0.0 # running compensation for lost low-order bits
>    for(xi in x) {
>       y <- xi - c
>       t <- s + y # low-order bits of y may be lost here
>       c <- (t - s) - y
>       s <- t
>    }
>    s
> }
>
>> rSum <- vapply(c(1:20,10^(2:7)), function(n) sum(rep(1/7,n)), 0)
>> rNaiveSum <- vapply(c(1:20,10^(2:7)), function(n) naiveSum(rep(1/7,n)), 0)
>> rKahanSum <- vapply(c(1:20,10^(2:7)), function(n) kahanSum(rep(1/7,n)), 0)
>>
>> table(rSum == rNaiveSum)
>
> FALSE  TRUE
>    21     5
>
>> table(rSum == rKahanSum)
>
> FALSE  TRUE
>     3    23
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Tue, Feb 19, 2019 at 10:36 AM Paul Gilbert wrote:
>
>> (I didn't see anyone else answer this, so ...)
>>
>> You can probably find the R code in src/main/ but I'm not sure. You are
>> talking about a very simple calculation, so it seems unlikely that the
>> algorithm is the cause of the difference. I have done much more
>> complicated things and usually get machine precision comparisons. There
>> are four possibilities I can think of that could cause (small) differences.
>>
>> 0/ Your code is wrong, but that seems unlikely on such a simple
>> calculation.
>>
>> 1/ You are summing a very large number of numbers, in which case the sum
>> can become very large compared to the numbers being added, then things can
>> get a bit funny.
>>
>> 2/ You are using single precision in Fortran rather than double. Double
>> is needed for all floating point numbers you use!
>>
>> 3/ You have not zeroed the double precision numbers in Fortran. (Some
>> compilers do not do this automatically and you have to specify it.) Then
>> if you accidentally put singles, like a constant 0.0 rather than a
>> constant 0.0D+0, into a double you will have small junk in the lower
>> precision part.
>>
>> (I am assuming you are talking about a sum of reals, not integer or
>> complex.)
>>
>> HTH,
>> Paul Gilbert
>>
>> On 2/14/19 2:08 PM, Rampal Etienne wrote:
>>> Hello,
>>>
>>> I am trying to write FORTRAN code to do the same as some R code I have.
>>> I get (small) differences when using the sum function in R. I know there
>>> are numerical routines to improve precision, but I have not been able to
>>> figure out what algorithm R is using. Does anyone know this? Or where
>>> can I find the code for the sum function?
>>>
>>> Regards,
>>>
>>> Rampal Etienne
Re: [Rd] mle (stats4) crashing due to singular Hessian in covariance matrix calculation
I don't know if this will get much response from the R developers; they might just recommend that you protect your mle() call in a try() or tryCatch() to stop it from breaking your loop.

Alternatively, you could try the mle2() function in the bbmle package, which started out long ago as a slightly more flexible and robust version of stats4::mle(); I don't remember/can't promise that it handles fits with singular Hessians, but I'm guessing it does ...

cheers
Ben Bolker

On 2019-02-19 12:02 p.m., Francisco Matorras wrote:
> Hi, R developers.
>
> When running mle inside a loop I found a nasty behavior. From time to
> time, my model had a degenerate minimum and the loop just crashed. I
> tracked it down to the "vcov <- if (length(coef)) solve(oout$hessian)"
> line, the Hessian being singular.
>
> Note that the minimum reached was good; it just did not make sense to
> calculate the covariance matrix as the inverse of a singular Hessian. In
> my case I am just interested in the value of the log-likelihood. For my
> application, I patched it easily in a local version of mle by removing
> this call since I am not using vcov at all, but I wonder if it can be
> improved in the official release. I can imagine two simple solutions:
> either including the vcov calculation as an option, or avoiding the call
> to solve if the Hessian is singular (setting vcov to NA). I am willing to
> write a few lines of code if you think it is worth it.
>
> regards
>
> Francisco Matorras
> Instituto de Física de Cantabria
> Universidad de Cantabria
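Following up on the try()/tryCatch() suggestion above, a minimal sketch of protecting the mle() call inside a loop; the negative log-likelihood and the data here are made up purely for illustration:

library(stats4)

# Toy negative log-likelihood for a normal mean, fit over several data sets.
set.seed(42)
datasets <- replicate(5, rnorm(20), simplify = FALSE)

loglik_values <- sapply(datasets, function(y) {
  nll <- function(mu) -sum(dnorm(y, mean = mu, sd = 1, log = TRUE))
  fit <- tryCatch(mle(nll, start = list(mu = 0)),
                  error = function(e) NULL)  # keep the loop alive on failure
  if (is.null(fit)) NA_real_ else as.numeric(logLik(fit))
})
loglik_values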
Re: [Rd] Documentation for sd (stats) + suggestion
Good day,

It is implemented in the CRAN package multicon; the function is named popsd. But it does seem like something R should provide without creating a package dependency.

--
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
Re: [Rd] bias issue in sample() (PR 17494)
Luke,

I'm happy to help with this. It's great to see this get tackled (I've cc'ed Kelli Ottoboni, who helped flag this issue).

I can prepare a patch for the RNGkind related stuff and the doc update.

As for ???, what are your (and others') thoughts about the possibility of

a) a reproducibility API which takes either an R version (or maybe alternatively a date) and sets the RNGkind to the default for that version/date, and/or

b) sessionInfo being modified to capture (and display) the RNGkind in effect.

Best,
~G

On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke wrote:
>
> Before the next release we really should sort out the bias issue in
> sample() reported by Ottoboni and Stark in
> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
> filed as a bug report by Duncan Murdoch at
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
>
> Here are two examples of bad behavior through current R-devel:
>
> set.seed(123)
> m <- (2/5) * 2^32
> x <- sample(m, 1000000, replace = TRUE)
> table(x %% 2, x > m / 2)
> ##
> ##      FALSE   TRUE
> ##   0 300620 198792
> ##   1 200196 300392
>
> table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
> ##
> ##      0      1
> ## 429054 570946
>
> I committed a modification to R_unif_index to address this by
> generating random bits (blocks of 16) and rejection sampling, but for
> now this is only enabled if the environment variable R_NEW_SAMPLE is
> set before the first call.
>
> Some things still needed:
>
> - someone to look over the change and see if there are any issues
> - adjustment of RNGkind to allow the old behavior to be selected
> - make the new behavior the default
> - adjust documentation
> - ???
>
> Unfortunately I don't have enough free cycles to do this, but I can
> help if someone else can take the lead.
>
> There are two other places I found that might suffer from the same
> issue, in walker_ProbSampleReplace (pointed out by O & S) and in
> src/nmath/wilcox.c. Both can be addressed by using R_unif_index. I
> have done that for walker_ProbSampleReplace, but the wilcox change
> might need adjusting to support the standalone math library and I
> don't feel confident enough I'd get that right.
>
> Best,
>
> luke
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                    Phone: 319-335-3386
> Department of Statistics and          Fax:   319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                    email: luke-tier...@uiowa.edu
> Iowa City, IA 52242                   WWW:   http://www.stat.uiowa.edu
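Regarding (a) and (b), a small hedged sketch of what capturing and restoring RNG settings might look like, using only functions that already exist in base R (RNGkind and RNGversion); the exact output depends on the R version in use:

# Capture the RNG settings in effect (something sessionInfo() could also
# report, per suggestion (b)):
RNGkind()
## e.g. "Mersenne-Twister" "Inversion"  (and, in later R versions, the sample kind)

# Base R already has a version-based hook that suggestion (a) could build on:
RNGversion("3.5.0")  # restore the RNG defaults of R 3.5.0
set.seed(123)
sample(10)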