On Fri, May 5, 2017 at 1:00 PM, Antonin Klima <anton...@idi.ntnu.no> wrote:
> Dear Sir or Madam,
>
> I am in 2nd year of my PhD in bioinformatics, after taking my Master’s in 
> computer science, and have been using R heavily during my PhD. As such, I 
> have put together a list of certain features in R that, in my opinion, would 
> be beneficial to add, or could be improved. The first two are already 
> implemented in packages, but because they are implemented as user-defined 
> operators, their usefulness is greatly restricted.

Why do you think being implemented in a contributed package restricts
the usefulness of a feature?

> I hope you will find my suggestions interesting. If you find time, I will 
> welcome any feedback as to whether you find the suggestions useful, or why 
> you do not think they should be implemented. I would also welcome it if 
> you could point out any features I might be unaware of that solve the 
> issues I have described below.
>
> 1) piping
> Currently available in the package magrittr, piping makes code more 
> readable by having the line start at its natural starting point and follow 
> with the functions that are applied, in order. The readability of several 
> nested calls, each with a number of parameters, is almost zero; it is 
> nearly easier to reinvent the solution than to read it. A pipeline, in 
> comparison, is very straightforward, especially together with point (2).

You may be surprised to learn that not everyone thinks pipes are a
good idea. Personally I see some advantages, but there is also a big
downside, which is that they mess up the call stack and make tracking
down errors via traceback() more difficult.

There is a simple alternative to pipes already built into R that
gives you some of the advantages of %>% without messing up the call
stack. Using Hadley's famous "little bunny foo foo" example:

foo_foo <- little_bunny()

## nesting (it is rough)
bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ),
  on = head
)

## magrittr
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)

## regular R assignment
foo_foo -> .
  hop(., through = forest) -> .
  scoop(., up = field_mice) -> .
  bop(., on = head)

This is more limited than magrittr's %>%, but it gives you a lot of
the advantages without the disadvantages.

>
> The package nevertheless works rather well here; the shortcomings of 
> piping not being native are not quite as severe as in point (2). Still, an 
> intuitive symbol such as | would be helpful, and it sometimes bothers me 
> that I have to parenthesize anonymous functions, which would probably not 
> be required with a native pipe operator, much as it is not required in, 
> e.g., lapply. That is,
> 1:5 %>% function(x) x+2
> should be totally fine.

That seems pretty small-potatoes to me.
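For what it's worth, R 4.1 later added a native pipe operator, |>, and it still requires anonymous functions to be wrapped in parentheses and called. A minimal sketch (add2 is an invented helper, not part of any API):

```r
# Assumes R >= 4.1, which introduced the native pipe |> and the \(x)
# lambda shorthand. add2 is an invented helper for illustration.
add2 <- function(x) x + 2

res1 <- 1:5 |> add2()          # equivalent to add2(1:5)
res2 <- 1:5 |> (\(x) x + 2)()  # anonymous functions still need wrapping
res1                           # 3 4 5 6 7
```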

>
> 2) currying
> Currently available in the package Curry. The idea is that, having a 
> function such as foo = function(x, y) x+y, one would like to write, for 
> example, lapply(1:5, foo(3)), and have the interpreter figure out that 
> foo(3) cannot produce a value result, but it can still give a function 
> result - a function of y. This would indeed be most useful for the various 
> apply functions, rather than writing function(x) foo(3,x).

You can already do

lapply(1:5, foo, y = 3)

(assuming that foo has an argument named "y"; the elements of 1:5 then
go to foo's first remaining argument, x)
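If you really want a reusable partial-application helper without a package, one is easy to sketch in base R (curry, foo, and add3 here are illustrative names, not an established API):

```r
# Minimal partial-application helper (illustrative sketch, not a standard API).
curry <- function(f, ...) {
  fixed <- list(...)                              # arguments fixed now
  function(...) do.call(f, c(fixed, list(...)))   # rest supplied at call time
}

foo  <- function(x, y) x + y
add3 <- curry(foo, x = 3)      # a function of y only
sapply(1:5, add3)              # 4 5 6 7 8
```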

I'm stopping here since I don't have anything useful to say about your
subsequent points.

Best,
Ista

>
> I suggest that currying would make the code easier to write and more 
> readable, especially when using apply functions. One might imagine that 
> there could be some confusion over such a feature, especially among people 
> unfamiliar with functional programming, although R already takes functions 
> as first-class arguments, so it could be just fine. One could also address 
> it with special syntax, such as $foo(3) [$foo(x=3)], for partial 
> application. The current currying package has very limited usefulness, as, 
> being limited by the user-defined operator framework, it only rarely 
> contributes to less code / more readability. Compare for yourself:
> $foo(x=3) vs foo %<% 3
> goo = function(a,b,c)
> $goo(b=3) vs goo %><% list(b=3)
>
> Moreover, one would often like currying to have the highest precedence. 
> For example, when piping:
> data %>% foo %>% foo1 %<% 3
> if one wants to do data %>% foo %>% $foo1(x=3)
>
> 3) Code executable only when running the script itself
> Whereas the first two suggestions are somewhat stealing from Haskell and the 
> like, this suggestion would be stealing from Python. I’m building quite a 
> complicated pipeline, using S4 classes. After defining the class and its 
> methods, I also define how to build the class to my liking, based on my 
> input data, using the various now-defined methods. So I end up with a list 
> of command-line arguments to process, and the way to create the class 
> instance based on them. If I write this in the class file, however, the 
> code ends up running whenever the file is sourced from the next step in 
> the pipeline, which needs the previous class definitions.
>
> A feature such as the Pythonic "if __name__ == '__main__'" would thus be 
> useful. As it is, I had to create run scripts as separate files. Which is 
> actually not so terrible, given that the class and its methods often span 
> a few hundred lines, but still.
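A commonly used base-R approximation of that guard checks whether the file is being run at top level rather than source()d; a sketch (the body of the if block is hypothetical):

```r
# Run a block only when this file is executed directly (e.g. via Rscript),
# not when another file source()s it: source() adds frames to the call
# stack, so sys.nframe() is then greater than zero.
running_as_script <- sys.nframe() == 0L && !interactive()

if (running_as_script) {
  args <- commandArgs(trailingOnly = TRUE)
  # ... build the class instance from the command-line arguments here ...
}
```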
>
> 4) non-exported global variables
> I also find it a shortcoming that I seem to be unable to create constants 
> that do not get passed on to files that source the class definition. That 
> is, if class1 features a global constant CONSTANT=3, then if class2 
> sources class1, it will also include the constant. This 1) clutters the 
> namespace when running the code interactively, and 2) potentially 
> overwrites constants in case of a name clash. Some kind of 
> export/non-export variable syntax, symbolic import, or namespace would be 
> useful. I know that if I converted it to a package I would get at least 
> something like a namespace, but still.
>
> I understand that, in general, the variable cannot simply be left 
> unimported, as the functions will generally rely on it (otherwise it 
> would not need to be there). But one could consider hiding it in an 
> implicit per-file namespace, for example.
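Base R can already get most of the way there by sourcing the file into its own environment rather than the global workspace; a sketch using a temporary file as a stand-in for the class1 file:

```r
# Write a stand-in for the class1 file (its real path is hypothetical here).
class1_file <- tempfile(fileext = ".R")
writeLines("CONSTANT <- 3", class1_file)

# Source it into a dedicated environment instead of the global workspace.
class1 <- new.env()
sys.source(class1_file, envir = class1)

class1$CONSTANT   # 3: still reachable, but namespaced under class1
```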
>
> 5) S4 methods with same name, for different classes
> Say I have an S4 class called datasetSingle, and another S4 class called 
> datasetMulti, which gathers up a number of datasetSingle classes, and adds 
> some extra functionality on top. The datasetSingle class may have a method 
> replicates, which returns a named vector assigning a replicate number to 
> each experiment name of the dataset. But I would also like to have a 
> method with the same name for the datasetMulti class, which returns a data 
> frame, or list, covering the replicate numbers for all the datasets 
> included.
>
> But then, I need to call setGeneric for the method. But if I call 
> setGeneric before each of the two implementations, the second call will 
> reset the generic, losing the definition of "replicates" for 
> datasetSingle. Skipping this in the code for 
> datasetMulti means that 1) I have to remember that I had the function defined 
> for datasetSingle, 2) if I remove the function or change its name in 
> datasetSingle, I now have to change the datasetMulti class file too. 
> Moreover, if I would like a different generic signature for the 
> datasetMulti version, I have to change it not in the datasetMulti class 
> file, but in the datasetSingle file, where it might not make much sense. 
> In this case, I wanted to have another argument, "datasets", so that the 
> method would return the replicates only for the datasets specified, rather 
> than for all of them.
>
> I made a wrapper that could circumvent the first issue, but the second issue 
> is not easy to circumvent.
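A common convention that sidesteps the clobbering is to guard each setGeneric call, so whichever file is sourced first creates the generic and later files leave it alone. A sketch with minimal stand-ins for the two classes (the slot and the method bodies are invented):

```r
library(methods)

# Minimal stand-ins for the poster's two classes.
setClass("datasetSingle", slots = c(n = "numeric"))
setClass("datasetMulti",  slots = c(n = "numeric"))

# class1's file: create the generic only if it does not exist yet.
if (!isGeneric("replicates"))
  setGeneric("replicates", function(object, ...) standardGeneric("replicates"))
setMethod("replicates", "datasetSingle", function(object, ...) object@n)

# class2's file can repeat the same guarded call without resetting the
# generic, so the datasetSingle method survives.
if (!isGeneric("replicates"))
  setGeneric("replicates", function(object, ...) standardGeneric("replicates"))
setMethod("replicates", "datasetMulti", function(object, ...) object@n * 2)

replicates(new("datasetSingle", n = 2))  # 2
replicates(new("datasetMulti",  n = 2))  # 4
```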
>
> 6) Many parameters freeze S4 method calls
> If I specify roughly six or more parameters for an S4 method, I often get 
> a "freeze" on the method call. The process eats up a lot of memory before 
> entering the call, after which it executes the call as normal (if it has 
> not run out of memory and I have not run out of patience). Subsequent 
> calls of the method do not incur this overhead. The amount of memory this 
> can take can be in the gigabytes, and the time in minutes. I suspect this 
> might be due to an entry being generated in the method dispatch table for 
> each accepted signature. It can be circumvented, but it sure isn't 
> behaviour one would expect.
>
> 7) Default values for S4 methods
> It would seem that it is not possible to set up default parameter values 
> for an S4 method in the usual way of definition = function(x, y=5). I 
> resorted to making class unions with "missing" in the signatures, with the 
> method body starting with if(missing(param)) param=DEFAULT_VALUE, but that 
> certainly does not improve readability or ease of coding.
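For what it's worth, defaults can be given in the generic's formal arguments, and a method definition may restate (or override) them. A minimal sketch with invented names (foo, x, y):

```r
library(methods)

# Defaults stated in the generic's formals apply to its methods; a method
# may restate or override them. foo, x and y are invented names.
setGeneric("foo", function(x, y = 5) standardGeneric("foo"))
setMethod("foo", "numeric", function(x, y = 5) x + y)

foo(1)         # 6, uses the default y = 5
foo(1, y = 2)  # 3
```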
>
>
> Thank you for your time if you have finished reading thus far. :) Looking 
> forward to any answer.
>
> Yours Sincerely,
> Antonin Klima
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
