Re: [Rd] Use of C++ in Packages

2019-04-24 Thread John Mount
I appreciate the writing on this.

However, I definitely think there is a huge difference between "use with care" 
and "don't use". They are simply not the same statement.

> On Mar 29, 2019, at 10:15 AM, Simon Urbanek  
> wrote:
> 
> Jim,
> 
> I think the main point of Tomas' post was to alert R users to the fact that 
> there are very serious issues that you have to understand when interfacing R 
> from C++. Using C++ code from R is fine; in many cases you only want to 
> access R data, use some library, or compute in C++ and return results. Such 
> use-cases are completely fine in C++ as they don't need to trigger the issues 
> mentioned, and it should be made clear that this was not what Tomas' blog was 
> about.
> 
> I agree with Tomas that it is safer to advise against using C++ to call the 
> R API, since C++ may give a false impression that you don't need to know what 
> you're doing. Note that it is possible to avoid longjmps by using 
> R_ExecWithCleanup(), which can catch any longjmps from the called function. So 
> if you know what you're doing, you can make things work. I think the issue 
> here is not necessarily a lack of tools, it is a lack of knowledge - which is 
> why I think Tomas' post is so important.
> 
> Cheers,
> Simon
> 
> 
>> On Mar 29, 2019, at 11:19 AM, Jim Hester  wrote:
>> 
>> First, thank you to Tomas for writing his recent post[0] on the R
>> developer blog. It raised important issues in interfacing R's C API
>> and C++ code.
>> 
>> However I do _not_ think the conclusion reached in the post is helpful
>>> don’t use C++ to interface with R
>> 
>> There are now more than 1,600 packages on CRAN using C++; the time is
>> long past when that type of warning is going to be useful to the R
>> community.
>> 
>> These same issues will also occur with any newer language (such as
>> Rust or Julia[1]) which uses RAII to manage resources and tries to
>> interface with R. It doesn't seem a productive way forward for R to
>> say it can't interface with these languages without first doing
>> expensive copies into an intermediate heap.
>> 
>> The advice to avoid C++ is also antithetical to John Chambers' vision of
>> first S and then R as an interface language (from Extending R [2]):
>> 
>>> The *interface* principle has always been central to R and to S
>>> before. An interface to subroutines was _the_ way to extend the first
>>> version of S. Subroutine interfaces have continued to be central to R.
>> 
>> The book also has extensive sections on both C++ (via Rcpp) and Julia,
>> so clearly John thinks these are legitimate ways to extend R.
>> 
>> So if 'don't use C++' is not realistic and the current R API does not
>> allow safe use of C++ exceptions what are the alternatives?
>> 
>> One thing we could do is look at how this is handled in other languages
>> written in C which also use longjmp for errors.
>> 
>> Lua is one example: it provides an alternative interface,
>> lua_pcall[3] and lua_cpcall[4], which wrap a normal Lua call and return
>> an error code rather than long jumping. These interfaces can then be safely
>> wrapped by RAII / exception-based languages.
>> 
>> This alternative error code interface is not just useful for C++, but
>> also for resource cleanup in C: it is currently non-trivial to handle
>> cleanup in all the possible cases where a longjmp can occur (interrupts,
>> warnings, custom conditions, timeouts, any allocation, etc.), even with R
>> finalizers.
>> 
>> It is past time for R to consider a non-jumpy C interface, so it can
>> continue to be used as an effective interface to programming routines
>> in the years to come.
>> 
>> [0]: 
>> https://developer.r-project.org/Blog/public/2019/03/28/use-of-c---in-packages/
>> [1]: https://github.com/JuliaLang/julia/issues/28606
>> [2]: https://doi.org/10.1201/9781315381305
>> [3]: http://www.lua.org/manual/5.1/manual.html#lua_pcall
>> [4]: http://www.lua.org/manual/5.1/manual.html#lua_cpcall
>> 

---
John Mount
http://www.win-vector.com/
Our book: Practical Data Science with R http://www.manning.com/zumel/






Re: [Rd] should base R have a piping operator ?

2019-10-05 Thread John Mount
Actually, base R already has a pipe fairly close to the one you describe: ->.;

iris ->.; head(.) ->.; dim(.)
# [1] 6 5

I've called it the Bizarro pipe 
( http://www.win-vector.com/blog/2016/12/magrittrs-doppelganger/ ), and for 
some reason we chickened out and didn't spend time on it in the dot pipe paper 
( https://journal.r-project.org/archive/2018/RJ-2018-042/index.html ).

For documentation, the Bizarro pipe has the advantage that one can work out how 
it works from the application itself, without reference to a defining function.
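
To make that concrete (a small sketch of the point above, using only the example already shown), a Bizarro pipeline is nothing more than a sequence of right-assignments to the dot, so it can be stepped through and inspected in the console one statement at a time:

iris ->.;       # same as: . <- iris
head(.) ->.;    # same as: . <- head(.)
dim(.)
# [1] 6 5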

> On Oct 5, 2019, at 7:34 AM, Ant F  wrote:
> 
> Dear R-devel,
> 
> The most popular piping operator sits in the package `magrittr` and is used
> by a huge amount of users, and imported /reexported by more and more
> packages too.
> 
> Many workflows don't even make much sense without pipes nowadays, so the
> examples in the doc will use pipes, as do the README, vignettes etc. I
> believe base R could have a piping operator so packages can use a pipe in
> their code or doc and stay dependency free.
> 
> I don't suggest an operator based on complex heuristics; instead I suggest
> a very simple and fast one (>10 times faster than magrittr in my tests):
> 
> `%.%` <- function (e1, e2) {
>  eval(substitute(e2), envir = list(. = e1), enclos = parent.frame())
> }
> 
> iris %.% head(.) %.% dim(.)
> #> [1] 6 5
> 
> The difference with magrittr is that the dots must all be explicit (which
> sits with the choice of the name), and that special magrittr features such
> as assignment in place and building functions with `. %>% head() %>% dim()`
> are not supported.
> 
> Edge cases are not surprising:
> 
> ```
> x <- "a"
> x %.% quote(.)
> #> .
> x %.% substitute(.)
> #> [1] "a"
> 
> f1 <- function(y) function() eval(quote(y))
> f2 <- x %.% f1(.)
> f2()
> #> [1] "a"
> ```
> 
> Looking forward to your thoughts on this,
> 
> Antoine
> 

---
John Mount
http://www.win-vector.com/
Our book: Practical Data Science with R
https://www.manning.com/books/practical-data-science-with-r-second-edition







Re: [Rd] should base R have a piping operator ?

2019-10-05 Thread John Mount
Many of those issues can be dealt with by introducing curly braces:

compose <- function(f, g) { function(x) g(f(x)) }
plus1 <- function(x) x + 1
plus2 <- { plus1 ->.; compose(., plus1) }
plus2(5)
# [1] 7

And a lot of that is the point to note: we may not all agree on which cases are 
corner cases, or on how each of them should be handled.
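
Another way to see the same lazy-evaluation point (a sketch of my own; compose_forced() is a hypothetical variant, not something Antoine proposed): if compose() forces its arguments, the later re-binding of `.` no longer matters.

compose_forced <- function(f, g) { force(f); force(g); function(x) g(f(x)) }
plus1 <- function(x) x + 1
plus1 ->.; compose_forced(., plus1) ->.; . -> plus2
plus2(5)
# [1] 7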

> On Oct 5, 2019, at 8:48 AM, Ant F  wrote:
> 
> Hi John,
> 
> Thanks, but the Bizarro pipe comes with many flaws though:
> * It's not a single operator
> * It has a different precedence
> * It cannot be used in a subcall
> * The variable assigned to must be on the right
> * It doesn't trigger indentation when continuing on a new line
> * It creates/overwrites a `.` variable in the workspace.
> 
> And it doesn't deal gracefully with some lazy-evaluation edge cases such as:
> 
> compose <- function(f, g) { function(x) g(f(x)) }
> plus1   <- function(x) x + 1
> 
> plus2 <- plus1 %.% compose(.,plus1)
> plus2(5)
> #> [1] 7
> 
> plus1 ->.; compose(.,plus1) -> .; . -> plus2
> plus2(5)
> #> Error: C stack usage  15923776 is too close to the limit
> 
> What I propose, on the other hand, can always substitute for any existing 
> proper pipe in its standard use, as long as the dot is made explicit.
> 
> Best regards,
> 
> Antoine
> 
> 
> 
> On Sat, Oct 5, 2019 at 16:59, John Mount wrote:
> Actually, base R already has a pipe fairly close to the one you describe: ->.;
> 
>   iris ->.; head(.) ->.; dim(.)
>   # [1] 6 5
> 
> I've called it the Bizarro pipe 
> ( http://www.win-vector.com/blog/2016/12/magrittrs-doppelganger/ ), and for 
> some reason we chickened out and didn't spend time on it in the dot pipe 
> paper ( https://journal.r-project.org/archive/2018/RJ-2018-042/index.html ).
> 
> For documentation, the Bizarro pipe has the advantage that one can work out how 
> it works from the application itself, without reference to a defining function.
> 
>> On Oct 5, 2019, at 7:34 AM, Ant F wrote:
>> 
>> Dear R-devel,
>> 
>> The most popular piping operator sits in the package `magrittr` and is used
>> by a huge amount of users, and imported /reexported by more and more
>> packages too.
>> 
>> Many workflows don't even make much sense without pipes nowadays, so the
>> examples in the doc will use pipes, as do the README, vignettes etc. I
>> believe base R could have a piping operator so packages can use a pipe in
>> their code or doc and stay dependency free.
>> 
>> I don't suggest an operator based on complex heuristics; instead I suggest
>> a very simple and fast one (>10 times faster than magrittr in my tests):
>> 
>> `%.%` <- function (e1, e2) {
>>  eval(substitute(e2), envir = list(. = e1), enclos = parent.frame())
>> }
>> 
>> iris %.% head(.) %.% dim(.)
>> #> [1] 6 5
>> 
>> The difference with magrittr is that the dots must all be explicit (which
>> sits with the choice of the name), and that special magrittr features such
>> as assignment in place and building functions with `. %>% head() %>% dim()`
>> are not supported.
>> 
>> Edge cases are not surprising:
>> 
>> ```
>> x <- "a"
>> x %.% quote(.)
>> #> .
>> x %.% substitute(.)
>> #> [1] "a"
>> 
>> f1 <- function(y) function() eval(quote(y))
>> f2 <- x %.% f1(.)
>> f2()
>> #> [1] "a"
>> ```
>> 
>> Looking forward to your thoughts on this,
>> 
>> Antoine
>> 
> 
> ---
> John Mount
> http://www.win-vector.com/
> Our book: Practical Data Science with R
> https://www.manning.com/books/practical-data-science-with-r-second-edition
> 
> 
> 
> 

---
John Mount
http://www.win-vector.com/
Our book: Practical Data Science with R
https://www.manning.com/books/practical-data-science-with-r-second-edition







Re: [Rd] should base R have a piping operator ?

2019-10-06 Thread John Mount
Except for the isolation provided by local(), R pretty much already has the 
parsing transformation you mention.


as.list(parse(text="

iris ->.; 
  group_by(., Species) ->.; 
  summarize(., mean_sl = mean(Sepal.Length)) ->.;
  filter(., mean_sl > 5)

"))

#> [[1]]
#> . <- iris
#> 
#> [[2]]
#> . <- group_by(., Species)
#> 
#> [[3]]
#> . <- summarize(., mean_sl = mean(Sepal.Length))
#> 
#> [[4]]
#> filter(., mean_sl > 5)


Created on 2019-10-06 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)

> On Oct 5, 2019, at 4:50 PM, Gabriel Becker  wrote:
> 
> 
> iris %>% group_by(Species) %>% summarize(mean_sl = mean(Sepal.Length)) %>%
> filter(mean_sl > 5)
> 
> 
> were *parsed* as, for example, into
> 
> local({
>. = group_by(iris, Species)
> 
>._tmp2 = summarize(., mean_sl = mean(Sepal.Length))
> 
>filter(., mean_sl > 5)
>   })
> 
> 
> 
> 
> Then debugging (once you knew that) would be much easier but behavior
> would be the same as it is now. There could even be some sort of
> step-through-pipe debugger added at that point as well for additional
> convenience.
> 
> There is some minor precedent for that type of transformative parsing:
> 
>> expr = parse(text = "5 -> x")
> 
>> expr
> 
> expression(5 -> x)
> 
>> expr[[1]]
> 
> x <- 5
> 
> 
> Though that's a much more minor transformation.



Re: [Rd] [R] choose(n, k) as n approaches k

2020-01-14 Thread John Mount


At the risk of throwing oil on a fire: if we are talking about fractional 
values of choose(), doesn't it make sense to look to the gamma function for the 
correct analytic continuation?  In particular, k < 0 may not imply the function 
should evaluate to zero until we get k <= -1.

Example:

``` r
choose(5, 4)
#> [1] 5

gchoose <- function(n, k) { 
  gamma(n+1)/(gamma(n+1-k) * gamma(k+1))
}

gchoose(5, 4)
#> [1] 5
gchoose(5, 0)
#> [1] 1
gchoose(5, -0.5)
#> [1] 0.2351727
```
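
A related sketch (my own variant, not a proposal for the base implementation): doing the same computation on the log scale with lgamma() avoids the overflow that gamma() hits for larger n, while agreeing with gchoose() above on the small cases:

``` r
lgchoose <- function(n, k) {
  exp(lgamma(n + 1) - lgamma(n + 1 - k) - lgamma(k + 1))
}

lgchoose(5, 4)
#> [1] 5
lgchoose(5, -0.5)
#> [1] 0.2351727
gchoose(1000, 500)    # gamma(1001) overflows to Inf, so this is NaN
#> [1] NaN
lgchoose(1000, 500)   # roughly 2.7e+299, in line with choose(1000, 500)
```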

> On Jan 14, 2020, at 10:20 AM, peter dalgaard  wrote:
> 
> OK, I see what you mean. But in those cases, we don't get the catastrophic 
> failures from the 
> 
>if (k <  0) return 0.;
>if (k == 0) return 1.;
>/* else: k >= 1 */
> 
> part, because at that point k is sure to be integer, possibly after rounding. 
> 
> It is when n-k is approximately but not exactly zero and we should return 1, 
> that we either return 0 (negative case) or n (positive case; because the 
> n(n-1)(n-2)... product has at least one factor). In the other cases, we get 1 
> or n(n-1)(n-2)...(n-k+1) which if n is near-integer gets rounded to produce 
> an integer, due to the
> 
>return R_IS_INT(n) ? R_forceint(r) : r;
> 
> part.
> 
> -pd
> 
> 
> 
>> On 14 Jan 2020, at 17:02 , Duncan Murdoch  wrote:
>> 
>> On 14/01/2020 10:50 a.m., peter dalgaard wrote:
>>>> On 14 Jan 2020, at 16:21 , Duncan Murdoch  wrote:
>>>> 
>>>> On 14/01/2020 10:07 a.m., peter dalgaard wrote:
>>>>> Yep, that looks wrong (probably want to continue discussion over on 
>>>>> R-devel)
>>>>> I think the culprit is here (in src/nmath/choose.c)
>>>>> if (k < k_small_max) {
>>>>>int j;
>>>>>if(n-k < k && n >= 0 && R_IS_INT(n)) k = n-k; /* <- Symmetry */
>>>>>if (k <  0) return 0.;
>>>>>if (k == 0) return 1.;
>>>>>/* else: k >= 1 */
>>>>> if n is a near-integer, then k can become non-integer and negative. In 
>>>>> your case,
>>>>> n == 4 - 1e-7
>>>>> k == 4
>>>>> n - k == -1e-7 < 4
>>>>> n >= 0
>>>>> R_IS_INT(n) = TRUE (relative diff < 1e-7 is allowed)
>>>>> so k gets set to
>>>>> n - k == -1e-7
>>>>> which is less than 0, so we return 0. However, as you point out, 1 would 
>>>>> be more reasonable and in accordance with the limit as n -> 4, e.g.
>>>>>> factorial(4 - 1e-10)/factorial(1e-10)/factorial(4) -1
>>>>> [1] -9.289025e-11
>>>>> I guess that the fix could be as simple as replacing n by R_forceint(n) 
>>>>> in the k = n - k step.
>>>> 
>>>> I think that would break symmetry:  you want choose(n, k) to equal 
>>>> choose(n, n-k) when n is very close to an integer.  So I'd suggest the 
>>>> replacement whenever R_IS_INT(n) is true.
>>>> 
>>> But choose() very deliberately ensures that k is integer, so choose(n, n-k) 
>>> is ill-defined for non-integer n.
>> 
>> That's only true if there's a big difference.  I'd be worried about cases 
>> where n and k are close to integers (within 1e-7).  In those cases, k is 
>> silently rounded to integer.  As I read your suggestion, n would only be 
>> rounded to integer if k > n-k.  I think both n and k should be rounded to 
>> integer in this near-integer situation, regardless of the value of k.
>> 
>> I believe that lchoose(n, k) already does this.
>> 
>> Duncan Murdoch
>> 
>>>double r, k0 = k;
>>>k = R_forceint(k);
>>> ...
>>>if (fabs(k - k0) > 1e-7)
>>>MATHLIB_WARNING2(_("'k' (%.2f) must be integer, rounded to %.0f"), 
>>> k0, k);
>>> 
>>>> Duncan Murdoch
>>>> 
>>>>> -pd
>>>>>> On 14 Jan 2020, at 00:33 , Wright, Erik Scott  wrote:
>>>>>> 
>>>>>> This struck me as incorrect:
>>>>>> 
>>>>>>> choose(3.99, 4)
>>>>>> [1] 0.979
>>>>>> choose(3.9999999, 4)
>>>>>> [1] 0
>>>>>>> choose(4, 4)
>>>>>> [1] 1
>>>>>> choose(4.0000001, 4)
>>>>>> [1] 4
>>>>>> choose(4.000001, 4)
>>>>>> [1] 1.000002
>>>>>> 
>>>>>> Should

Re: [Rd] A bug understanding F relative to FALSE?

2020-01-17 Thread John Mount
From help(F): TRUE and FALSE are reserved words denoting logical constants in 
the R language, whereas T and F are global variables whose initial values are 
set to these.
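
A small sketch of the practical consequence (nothing here beyond base R): because F is an ordinary variable it can be shadowed, while the reserved word FALSE cannot, which is why the parser tags F as SYMBOL and FALSE as NUM_CONST.

F <- 1          # allowed: F is just a global variable
c(F, FALSE)
# [1] 1 0
rm(F)           # removing it restores the default binding from package base
# TRUE <- 1 would fail: "invalid (do_set) left-hand side to assignment"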

> On Jan 15, 2020, at 6:13 AM, IAGO GINÉ VÁZQUEZ  wrote:
> 
> Hi all,
> 
> Is the next behaviour suitable?
> 
> identical(F,FALSE)
> 
> ## [1] TRUE
> 
> utils::getParseData(parse(text = "c(F,FALSE)", keep.source = TRUE))
> 
> ##    line1 col1 line2 col2 id parent                token terminal  text
> ## 14     1    1     1   10 14      0                 expr    FALSE
> ## 1      1    1     1    1  1      3 SYMBOL_FUNCTION_CALL     TRUE     c
> ## 3      1    1     1    1  3     14                 expr    FALSE
> ## 2      1    2     1    2  2     14                  '('     TRUE     (
> ## 4      1    3     1    3  4      6               SYMBOL     TRUE     F
> ## 6      1    3     1    3  6     14                 expr    FALSE
> ## 5      1    4     1    4  5     14                  ','     TRUE     ,
> ## 9      1    5     1    9  9     10            NUM_CONST     TRUE FALSE
> ## 10     1    5     1    9 10     14                 expr    FALSE
> ## 11     1   10     1   10 11     14                  ')'     TRUE     )
> 
> I would expect the token for F to be the same as the token for FALSE.
> 
> 
> Thank you!
> 
> Iago
> 
> 

---
John Mount
http://www.win-vector.com/
Our book: Practical Data Science with R
http://practicaldatascience.com








Re: [Rd] Request: tools::md5sum should accept connections and finally in-memory objects

2020-05-01 Thread John Mount
Perhaps use the digest package? Isn't "R" really the R packages?
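
For example (a minimal sketch using the digest package's existing interface, not anything in base R):

# install.packages("digest")
library(digest)
x <- runif(10)
digest(x, algo = "md5")   # MD5 of the serialized in-memory object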

> On May 1, 2020, at 2:00 PM, Dénes Tóth  wrote:
> 
> 
> AFAIK there is no hashing utility in base R which can create hash digests of 
> arbitrary R objects. However, as also described by Henrik Bengtsson in [1], 
> we have tools::md5sum() which calculates MD5 hashes of files. Calculating 
> hashes of in-memory objects is a very common task in several areas, as 
> demonstrated by the popularity of the 'digest' package (~850,000 
> downloads/month).
> 
> Upon the inspection of the relevant files in the R-source (e.g., [2] and 
> [3]), it seems all building blocks have already been implemented so that 
> hashing should not be restricted to files. I would like to ask:
> 
> 1) Why is md5_buffer unused?:
> In src/library/tools/src/md5.c [see 2], md5_buffer is implemented which seems 
> to be the counterpart of md5_stream for non-file inputs:
> 
> ---
> #ifdef UNUSED
> /* Compute MD5 message digest for LEN bytes beginning at BUFFER.  The
>   result is always in little endian byte order, so that a byte-wise
>   output yields to the wanted ASCII representation of the message
>   digest.  */
> static void *
> md5_buffer (const char *buffer, size_t len, void *resblock)
> {
>  struct md5_ctx ctx;
> 
>  /* Initialize the computation context.  */
>  md5_init_ctx (&ctx);
> 
>  /* Process whole buffer but last len % 64 bytes.  */
>  md5_process_bytes (buffer, len, &ctx);
> 
>  /* Put result in desired memory area.  */
>  return md5_finish_ctx (&ctx, resblock);
> }
> #endif
> ---
> 
> 2) How can the R-community help so that this feature becomes available in 
> package 'tools'?
> 
> Suggestions:
> As a first step, it would be great if tools::md5sum would support connections 
> (credit goes to Henrik for the idea). E.g., instead of the signature 
> tools::md5sum(files), we could have tools::md5sum(files, conn = NULL), which 
> would allow:
> 
> x <- runif(10)
> tools::md5sum(conn = rawConnection(serialize(x, NULL)))
> 
> To avoid the inconsistency between 'files' (which computes the hash digests 
> in a vectorized manner, that is, one for each file) and 'conn' (which expects 
> a single connection), and to make it easier to extend the hashing for other 
> algorithms without changing the main R interface, a more involved solution 
> would be to introduce tools::hash and tools::hashes, in a similar vein to 
> digest::digest and digest::getVDigest.
> 
> Regards,
> Denes
> 
> 
> [1]: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/21
> [2]: 
> https://github.com/wch/r-source/blob/5a156a0865362bb8381dcd69ac335f5174a4f60c/src/library/tools/src/md5.c#L172
> [3]: 
> https://github.com/wch/r-source/blob/5a156a0865362bb8381dcd69ac335f5174a4f60c/src/library/tools/src/Rmd5.c#L27
> 

---
John Mount
http://www.win-vector.com/
Our book: Practical Data Science with R
http://practicaldatascience.com








Re: [Rd] Request: tools::md5sum should accept connections and finally in-memory objects

2020-05-01 Thread John Mount


> So yes, if one wants to use all the utilities or the various algos that the 
> digest package provides, one should install and load it. But if one can live 
> with MD5 hashes, why not use the built-in R function? (Well, without 
> serializing an object to a file, calling tools::md5sum, and then cleaning up 
> the file.)

Doesn't that assume that the serialization method is deterministic? Is that a 
documented property of the serialization tools?
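
For reference, the file-based workaround under discussion looks roughly like this (a sketch; note the resulting hash depends on the serialization format and version, which is part of my concern):

x <- runif(10)
tf <- tempfile()
saveRDS(x, tf)
tools::md5sum(tf)   # hash of the serialized file, not of x per se
unlink(tf)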



[Rd] summary.default rounding on numeric seems inconsistent with other R behaviors

2016-08-19 Thread John Mount
I was wondering if it would make sense to change the default behavior of the 
following:

summary(1L)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15560   15560   15560   15560   15560   15560 

summary.default on numeric values rounds the values themselves (not just their 
presentation) to getOption("digits") - 3L (i.e., four) significant digits by 
default, making those values surprising and less suitable for further 
calculation.  Summaries of matrix and data.frame objects do not do so.

It seems it would be nice to have x = 1L; summary(x)[['Min.']] == min(x) 
evaluate to TRUE.  I know one can alter the behavior by changing the global 
“digits” option, but I don’t know what other impacts that might have.  Ideally 
I would think summary.default should not round its values at all, but should 
use digits to control presentation (by overriding print and such).  Even in 
presentation, the rounding without switching to scientific notation (such as 
1.556e+4) is a bit surprising (I understand rounding and scientific notation 
are two different presentation issues, but new users are very confused when 
something that appears to be an integer has been rounded).

Example:

summary(data.frame(x=1))
##        x
##  Min.   :1  
##  1st Qu.:1  
##  Median :1  
##  Mean   :1  
##  3rd Qu.:1  
##  Max.   :1  
summary(1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15560   15560   15560   15560   15560   15560 
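
Here is the check I have in mind, sketched (behavior as of the R versions current at the time of writing; the particular value of x is only for illustration):

x <- 15555.5
s <- summary(x)
s[["Min."]] == min(x)
## [1] FALSE
s[["Min."]]
## [1] 15560
signif(min(x), getOption("digits") - 3L)   # what summary() currently stores
## [1] 15560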

I have a (bit whiny) polemic trying to explain the pain point here: 
http://www.win-vector.com/blog/2016/08/my-criticism-of-r-numeric-summary/ 
(I am not trying to be rude; rather, I am trying to emphasize why this can be 
confusing to new users).



---
John Mount
http://www.win-vector.com/
Our book: Practical Data Science with R http://www.manning.com/zumel/





Re: [Rd] summary.default rounding on numeric seems inconsistent with other R behaviors

2016-08-24 Thread John Mount

> On Aug 24, 2016, at 2:36 AM, Martin Maechler  
> wrote:
> 
>>>>>> 
> 
> [Talking to myself .. ;-)]
> Yes, but that's the tough part to change.
> 
> This thread's topic is really only about changing summary.default(),
> and I have started testing such a change now, and that does seem
> very sensible:
> 
> - No rounding in summary.default(),  but
> - (almost) back-compatible rounding in its print() method.
> 
> My current plan is to commit this to R-devel in a day or so,
> unless unforeseen issues emerge.
> 
> Martin
> 


That is potentially a very good outcome.  Thank you so much for producing and 
testing a patch.

---
John Mount
http://www.win-vector.com/
Our book: Practical Data Science with R http://www.manning.com/zumel/






Re: [Rd] RFC: (in-principle) native unquoting for standard evaluation

2017-03-18 Thread John Mount
Obviously we have to be sensitive about language and parser changes.  However, 
I think Jonathan Carroll has identified a feature that is well understood in 
the Lisp world (quoting and unquoting) and that is missing from the R language.  
Some symbol that indirects to a value (only once! -- we don’t want chaining 
here; and there are issues around quoting it back out) inside eval would be 
very valuable.  Obviously it is most useful where non-standard evaluation is 
emphasized (plotting, formulas, and dplyr being the examples that I can 
immediately think of).
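
For comparison, base R's bquote() already supplies a one-shot unquote of roughly this kind (a small sketch, not a claim that it covers the proposal):

col <- as.name("Sepal.Length")
bquote(y ~ .(col))
# y ~ Sepal.Length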

---
John Mount
http://www.win-vector.com/
Our book: Practical Data Science with R
https://www.manning.com/books/practical-data-science-with-r





[Rd] lm() takes weights from formula environment

2020-08-09 Thread John Mount
I know programmers can reason this out from R's lazy parameter evaluation 
rules PLUS the explicit match.call()/eval() that lm() does to work with the 
passed-in formula and data frame. But, from a statistical user's point of view, 
this seems to be counter-productive. At best it works as if the user were 
passing in the name of the weights variable instead of its values (I know this 
is the obvious consequence of NSE).

lm() takes instance weights from the formula environment. Usually that 
environment is the interactive environment or a close child of the interactive 
environment and we are lucky enough to have no intervening name collisions so 
we don't have a problem. However it makes programming over formulas for lm() a 
bit tricky. Here is an example of the issue.

Is there any recommended discussion on this and how to work around it? In my 
own work I explicitly set the formula environment and put the weights in that 
environment.


d <- data.frame(x = 1:3, y = c(3, 3, 4))
w <- c(1, 5, 1)

# works
lm(y ~ x, data = d, weights = w)  

# fails, as weights are taken from the formula environment
fn <- function() {  # deliberately set up a formula with a bad value in its environment
  w <- c(-1, -1, -1, -1)  # bad weights
  f <- as.formula(y ~ x)  # captures the bad weights via the as.formula(env = parent.frame()) default
  return(f)
}
lm(fn(), data = d, weights = w)
# Error in model.frame.default(formula = fn(), data = d, weights = w,
#   drop.unused.levels = TRUE) :
#   variable lengths differ (found for '(weights)')
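
The workaround I mention above, sketched explicitly (a fresh environment, child of the base environment, holding only the weights):

f <- y ~ x
e <- new.env(parent = baseenv())
assign("w", w, envir = e)
environment(f) <- e
lm(f, data = d, weights = w)   # now finds w via the formula environment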



Re: [Rd] lm() takes weights from formula environment

2020-08-09 Thread John Mount
Doesn't this preclude "y ~ ." style notations?
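
Concretely (a small sketch of the concern; d2 is just an illustrative name): once the weights column lives in the data frame, the dot formula picks it up as a predictor.

d2 <- data.frame(x = 1:3, y = c(3, 3, 4), w = c(1, 5, 1))
lm(y ~ ., data = d2, weights = w)   # 'w' now also enters the model as a regressor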

> On Aug 9, 2020, at 11:56 AM, Duncan Murdoch  wrote:
> 
> This is fairly clearly documented in ?lm:
> 
> "All of weights, subset and offset are evaluated in the same way as variables 
> in formula, that is first in data and then in the environment of formula."
> 
> There are lots of possible places to look for weights, but this seems to me 
> like a pretty sensible search order.  In most cases the environment of the 
> formula will have a parent environment chain that eventually leads to the 
> global environment, so (with no conflicts) your strategy of defining w there 
> will sometimes work, but looks pretty unreliable.
> 
> When you say you want to work around this search order, I think the obvious 
> way is to add your w vector to your d dataframe.  That way it is guaranteed 
> to be found even if there's a conflicting variable in the formula 
> environment, or the global environment.
> 
> Duncan Murdoch
> 
> On 09/08/2020 2:13 p.m., John Mount wrote:
>> I know programmers can reason this out from R's lazy parameter 
>> evaluation rules PLUS the explicit match.call()/eval() lm() does to work 
>> with the passed in formula and data frame. But, from a statistical user 
>> point of view this seems to be counter-productive. At best it works as if 
>> the user is passing in the name of the weights variable instead of values (I 
>> know this is the obvious consequence of NSE).
>> lm() takes instance weights from the formula environment. Usually that 
>> environment is the interactive environment or a close child of the 
>> interactive environment and we are lucky enough to have no intervening name 
>> collisions so we don't have a problem. However it makes programming over 
>> formulas for lm() a bit tricky. Here is an example of the issue.
>> Is there any recommended discussion on this and how to work around it? In my 
>> own work I explicitly set the formula environment and put the weights in 
>> that environment.
>> d <- data.frame(x = 1:3, y = c(3, 3, 4))
>> w <- c(1, 5, 1)
>> # works
>> lm(y ~ x, data = d, weights = w)
>> # fails, as weights are taken from the formula environment
>> fn <- function() {  # deliberately set up formula with bad value in 
>> environment
>>   w <- c(-1, -1, -1, -1)  # bad weights
>>   f <- as.formula(y ~ x)  # captures bad weights with as.formula(env = 
>> parent.frame()) default
>>   return(f)
>> }
>> lm(fn(), data = d, weights = w)
>> # Error in model.frame.default(formula = fn(), data = d, weights = w, 
>> drop.unused.levels = TRUE) :
>> #   variable lengths differ (found for '(weights)')
> 



Re: [Rd] lm() takes weights from formula environment

2020-08-10 Thread John Mount
I wish I had started with "I am disappointed that lm() doesn't continue its 
search for weights into the calling environment" or "the fact that lm() looks 
only in the formula environment and data frame for weights doesn't seem 
consistent with how other values are treated."

But I did not. So I do apologize both for that and for the negative tone on my part.


Simplified example:

d <- data.frame(x = 1:3, y = c(1, 2, 1))
w <- c(1, 10, 1)
f <- as.formula(y ~ x)
lm(f, data = d, weights = w)  # works

# fails
environment(f) <- baseenv()
lm(f, data = d, weights = w)
# Error in eval(extras, data, env) : object 'w' not found


> On Aug 9, 2020, at 11:56 AM, Duncan Murdoch  wrote:
> 
> This is fairly clearly documented in ?lm:
> 



Re: [Rd] lm() takes weights from formula environment

2020-08-10 Thread John Mount
Thank you for your suggestion. I do know how to work around the issue.  I 
usually build a fresh environment as a child of the base environment and then 
insert the weights there. I was just trying to provide an example of the issue.

emptyenv() cannot be used, as an enclosing environment is needed for the eval 
(it errors out, even if weights are not used, with "could not find function list").

For some applications one doesn't want the formula to have a non-trivial 
environment with respect to serialization.  Nina Zumel wrote about reference 
leaks in lm()/glm() and a good part of that was environments other than 
global/base (such as those formed when building a formula in a function) 
capturing references to unrelated structures.
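
A sketch of that serialization concern (make_formula() and the size check are mine, for illustration only): a formula built inside a function keeps that function's evaluation environment alive, so anything large in it travels with the formula.

make_formula <- function() {
  big <- runif(1e6)   # unrelated data, captured via the formula's environment
  y ~ x
}
f <- make_formula()
length(serialize(f, NULL))   # megabytes, dominated by 'big'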



> On Aug 10, 2020, at 11:34 AM, Duncan Murdoch  wrote:
> 
> On 10/08/2020 1:42 p.m., John Mount wrote:
>> I wish I had started with "I am disappointed that lm() doesn't continue its 
>> search for weights into the calling environment" or "the fact that lm() 
>> looks only in the formula environment and data frame for weights doesn't 
>> seem consistent with how other values are treated."
> 
> Normally searching is done automatically by following a chain of 
> environments.  It's easy to add something to the head of the chain (e.g. 
> data), it's hard to add something in the middle or at the end (because the 
> chain ends with emptyenv(), which is not allowed to have a parent).
> 
> So I'd suggest using
> 
> environment(f) <- environment()
> 
> before calling lm() if you want the calling environment to be in the search.  
> Setting it to baseenv() doesn't really make sense, unless you want to disable 
> all searches except in data, in which case emptyenv() would make more sense 
> (but I haven't tried it, so it might break something).
> 
> Duncan Murdoch
> 
>> But I did not. So I do apologize for both that and for negative tone on my 
>> part.
>> Simplified example:
>> d <- data.frame(x = 1:3, y = c(1, 2, 1))
>> w <- c(1, 10, 1)
>> f <- as.formula(y ~ x)
>> lm(f, data = d, weights = w)  # works
>> # fails
>> environment(f) <- baseenv()
>> lm(f, data = d, weights = w)
>> # Error in eval(extras, data, env) : object 'w' not found
>>> On Aug 9, 2020, at 11:56 AM, Duncan Murdoch  
>>> wrote:
>>> 
>>> This is fairly clearly documented in ?lm:
>>> 
> 



Re: [Rd] lm() takes weights from formula environment

2020-08-10 Thread John Mount
Forgot the url: 
https://win-vector.com/2014/05/30/trimming-the-fat-from-glm-models-in-r/

On Aug 10, 2020, at 11:50 AM, John Mount wrote:

Thank you for your suggestion. I do know how to work around the issue.  I 
usually build a fresh environment as a child of base-environment and then 
insert the weights there. I was just trying to provide an example of the issue.

emptyenv() can not be used, as it is needed for the eval (errors out even if 
weights are not used with "could not find function list").

For some applications one doesn't want the formula to have a non-trivial 
environment with respect to serialization.  Nina Zumel wrote about reference 
leaks in lm()/glm() and a good part of that was environments other than 
global/base (such as those formed when building a formula in a function) 
capturing references to unrelated structures.



On Aug 10, 2020, at 11:34 AM, Duncan Murdoch wrote:

On 10/08/2020 1:42 p.m., John Mount wrote:
I wish I had started with "I am disappointed that lm() doesn't continue its 
search for weights into the calling environment" or "the fact that lm() looks 
only in the formula environment and data frame for weights doesn't seem 
consistent with how other values are treated."

Normally searching is done automatically by following a chain of environments.  
It's easy to add something to the head of the chain (e.g. data), it's hard to 
add something in the middle or at the end (because the chain ends with 
emptyenv(), which is not allowed to have a parent).

So I'd suggest using

environment(f) <- environment()

before calling lm() if you want the calling environment to be in the search.  
Setting it to baseenv() doesn't really make sense, unless you want to disable 
all searches except in data, in which case emptyenv() would make more sense 
(but I haven't tried it, so it might break something).

Duncan Murdoch

But I did not. So I do apologize for both that and for negative tone on my part.
Simplified example:
d <- data.frame(x = 1:3, y = c(1, 2, 1))
w <- c(1, 10, 1)
f <- as.formula(y ~ x)
lm(f, data = d, weights = w)  # works
# fails
environment(f) <- baseenv()
lm(f, data = d, weights = w)
# Error in eval(extras, data, env) : object 'w' not found
On Aug 9, 2020, at 11:56 AM, Duncan Murdoch wrote:

This is fairly clearly documented in ?lm:







[Rd] I would suggest stats::glm() should set "converged" to FALSE in the return value in a few more situations.

2020-08-16 Thread John Mount
I would suggest stats::glm() should set "converged" to FALSE in the return 
value in a few more situations. I believe the current returned converged == 
TRUE can be needlessly misleading when the algorithm has clearly failed (and 
the algo even issued a warning, but the returned structure claims all is well).

In particular there are pathological inputs which cause the residual deviance 
to exceed the null deviance (even with intercept present, and no offset). I 
know we can't catch all cases, and for non-intercept ( ~ 0 +) situations this 
residual check may not apply.

Below is an input showing the effect on current R running on a 10.15.6 Mac (R 
from CRAN, no change to BLAS or such).



R.version.string
#> [1] "R version 4.0.2 (2020-06-22)"

R.version$platform
#> [1] "x86_64-apple-darwin17.0"

d <- data.frame(
  x1 = c(-20.3, -7.147, -7.101, -5.205, -5.166, -5.032, -2.787, -1.362, 1.637, 
15.16),
  y = c(0, 1, 0, 1, 1, 1, 1, 0, 1, 1))
w <- 10 * d$y + 1

m <- glm(
  y ~ x1,
  data = d,
  weights = w,
  family = binomial())
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# We do get a warning

m$converged
#> [1] TRUE

# notice residual deviance is greater than NULL deviance
m$null.deviance
#> [1] 80.16141
m$deviance
#> [1] 216.2619

# also preds are all 1.
predict(m, type='response')
#>  1  2  3  4  5  6  7  8  9 10
#>  1  1  1  1  1  1  1  1  1  1

# would suggest as a fitting step if m$null.deviance < m$deviance
# then set m$converged to FALSE (saving the user remembering such
# an inspection on their own).
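
# As a stopgap at the user level, the proposed check can be wrapped up like
# this (my own helper name, not part of stats):

check_glm_converged <- function(m) {
  isTRUE(m$converged) && (is.null(m$null.deviance) || m$deviance <= m$null.deviance)
}
check_glm_converged(m)
#> [1] FALSE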




Re: [Rd] New pipe operator

2020-12-05 Thread John Mount
The :: is a case that we worked to get right with wrapr dot-pipe. I shared 
notes on this S3/S4 pipe in the R journal 
https://journal.r-project.org/archive/2018/RJ-2018-042/index.html

library(magrittr)
packageVersion("magrittr")
# [1] ‘2.0.1’
5 %>% base::sin
# Error in .::base : unused argument (sin)

library(wrapr)
5 %.>% base::sin
# [1] -0.9589243


On Dec 5, 2020, at 10:08 AM, Gabor Grothendieck wrote:

The construct utils::head is not that common, but bare functions are
very common, and making the common case harder so that
the uncommon case is slightly easier is not desirable.

Also it is trivial to write this which does work:

mtcars %>% (utils::head)

On Sat, Dec 5, 2020 at 11:59 AM Hugh Parsonage wrote:

I'm surprised by the aversion to

mtcars |> nrow

over

mtcars |> nrow()

and I think the decision to disallow the former should be
reconsidered.  The pipe operator is only going to be used when the rhs
is a function, so there is no ambiguity with omitting the parentheses.
If it's disallowed, it becomes inconsistent with other treatments like
sapply(mtcars, typeof) where sapply(mtcars, typeof()) would just be
noise.  I'm not sure why this decision was taken

If the only issue is with the double (and triple) colon operator, then
ideally `mtcars |> base::head` should resolve to `base::head(mtcars)`
-- in other words, demote the precedence of |>

Obviously (looking at the R-Syntax branch) this decision was
considered, put into place, then dropped, but I can't see why
precisely.

Best,


Hugh.







On Sat, 5 Dec 2020 at 04:07, Deepayan Sarkar wrote:

On Fri, Dec 4, 2020 at 7:35 PM Duncan Murdoch wrote:

On 04/12/2020 8:13 a.m., Hiroaki Yutani wrote:
 Error: function '::' not supported in RHS call of a pipe

To me, this error looks much more friendly than magrittr's error.
Some people have gotten too used to specifying functions without (). This
is OK until they use `::`, but when they need to use it, it takes
hours to figure out why

mtcars %>% base::head
#> Error in .::base : unused argument (head)

won't work but

mtcars %>% head

works. I think this is too harsh a lesson for ordinary R users to
learn that `::` is a function. I've been wanting magrittr to drop the
support for a function name without () to avoid this confusion,
so I would very much welcome the new pipe operator's behavior.
Thank you to all the developers who implemented this!

I agree, it's an improvement on the corresponding magrittr error.

I think the semantics of not evaluating the RHS, but treating the pipe
as purely syntactical is a good decision.

I'm not sure I like the recommended way to pipe into a particular argument:

  mtcars |> subset(cyl == 4) |> \(d) lm(mpg ~ disp, data = d)

or

  mtcars |> subset(cyl == 4) |> function(d) lm(mpg ~ disp, data = d)

both of which are equivalent to

  mtcars |> subset(cyl == 4) |> (function(d) lm(mpg ~ disp, data = d))()

It's tempting to suggest it should allow something like

  mtcars |> subset(cyl == 4) |> lm(mpg ~ disp, data = .)

Which is really not that far off from

mtcars |> subset(cyl == 4) |> \(.) lm(mpg ~ disp, data = .)

once you get used to it.

One consequence of the implementation is that it's not clear how
multiple occurrences of the placeholder would be interpreted. With
magrittr,

sort(runif(10)) %>% ecdf(.)(.)
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

This is probably what you would expect, if you expect it to work at all, and not

ecdf(sort(runif(10)))(sort(runif(10)))

There would be no such ambiguity with anonymous functions

sort(runif(10)) |> \(.) ecdf(.)(.)

-Deepayan

which would be expanded to something equivalent to the other versions:
but that makes it quite a bit more complicated.  (Maybe _ or \. should
be used instead of ., since those are not legal variable names.)

I don't think there should be an attempt to copy magrittr's special
casing of how . is used in determining whether to also include the
previous value as first argument.

Duncan Murdoch



Best,
Hiroaki Yutani

On Fri, Dec 4, 2020 at 20:51, Duncan Murdoch wrote:

Just saw this on the R-devel news:


R now provides a simple native pipe syntax ‘|>’ as well as a shorthand
notation for creating functions, e.g. ‘\(x) x + 1’ is parsed as
‘function(x) x + 1’. The pipe implementation as a syntax transformation
was motivated by suggestions from Jim Hester and Lionel Henry. These
features are experimental and may change prior to release.


This is a good addition; by using "|>" instead of "%>%" there should be
a chance to get operator precedence right.  That said, the ?Syntax help
topic hasn't been updated, so I'm not sure where it fits in.

There are some choices that take a little getting used to:

mtcars |> head
Error: The pipe operator requires a function call or an anonymous
function expression as RHS

(I need to say mtcars |> head() inst