On Fri, May 5, 2017 at 1:00 PM, Antonin Klima <[email protected]> wrote:
> Dear Sir or Madam,
>
> I am in the 2nd year of my PhD in bioinformatics, after taking my Master’s in
> computer science, and have been using R heavily during my PhD. As such, I
> have put together a list of certain features in R that, in my opinion, would
> be beneficial to add, or could be improved. The first two are already
> implemented in packages, but because they are implemented as user-defined
> operators, their usefulness is greatly restricted.
Why do you think being implemented in a contributed package restricts
the usefulness of a feature?
> I hope you will find my suggestions interesting. If you find time, I
> will welcome any feedback as to whether you find the suggestions
> useful, or why you do not think they should be implemented. I would
> also welcome pointers to any features I might be unaware of that
> might solve the issues I have pointed out below.
>
> 1) piping
> Currently available in package magrittr, piping makes the code more
> readable by having the line start at its natural starting point, followed
> by the functions that are applied, in order. The readability of several
> nested calls with a number of parameters each is almost zero; it’s almost
> as if one would need to come up with the solution oneself. A pipeline, in
> comparison, is very straightforward, especially together with point (2).
You may be surprised to learn that not everyone thinks pipes are a
good idea. Personally I see some advantages, but there is also a big
downside, which is that they mess up the call stack and make tracking
down errors via traceback() more difficult.
There is a simple alternative to pipes already built in to R that
gives you some of the advantages of %>% without messing up the call
stack. Using Hadley's famous "little bunny foo foo" example:
foo_foo <- little_bunny()
## nesting (it is rough)
bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ),
  on = head
)
## magrittr
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)
## regular R assignment
foo_foo -> .
hop(., through = forest) -> .
scoop(., up = field_mice) -> .
bop(., on = head)
This is more limited than magrittr's %>%, but it gives you a lot of
the advantages without the disadvantages.
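For anyone who wants to try this, here is a minimal runnable sketch of the
same idea. The helper functions add_n, scale_by and clamp are hypothetical
stand-ins of my own (the bunny functions above are not defined anywhere),
chosen only so that each step has a visible result:

```r
# Hypothetical stand-ins for hop/scoop/bop; not from any package.
add_n    <- function(x, n) x + n
scale_by <- function(x, by) x * by
clamp    <- function(x, upper) pmin(x, upper)

# Nested calls: have to be read inside-out.
nested <- clamp(scale_by(add_n(1:5, n = 1), by = 10), upper = 40)

# Right-assignment into `.`: read top to bottom, one ordinary call per line.
1:5 -> .
add_n(., n = 1) -> .
scale_by(., by = 10) -> .
clamp(., upper = 40) -> chained

# Both forms compute the same thing.
stopifnot(identical(nested, chained))
```

One caveat: `.` is an ordinary variable, so it stays in the workspace
afterwards (ls() hides it unless called with all.names = TRUE) and keeps
the last value assigned to it until it is overwritten.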
>
> The package here works rather well; the shortcomings of piping not being
> native are not quite as severe as in point (2). Nevertheless, an
> intuitive symbol such as | would be helpful, and it sometimes bothers me that
> I have to parenthesize anonymous functions, which would probably not be
> required with a native pipe operator, much like it is not required in, e.g.,
> lapply. That is,
> 1:5 %>% function(x) x+2
> should be totally fine
That seems pretty small-potatoes to me.
>
> 2) currying
> Currently available in package Curry. The idea is that, having a function
> such as foo = function(x, y) x+y, one would like to write for example
> lapply(1:5, foo(3)), and have the interpreter figure out that, although
> foo(3) cannot produce a value, it can still produce a function: a function
> of y. This would indeed be most useful for the various apply functions,
> rather than writing function(x) foo(3,x).
You can already do
lapply(1:5, foo, x = 3)
(lapply passes extra named arguments on to foo, so this fixes the first
argument, "x", and each element of 1:5 is matched to "y")
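To make that concrete, here is a small sketch using the foo from the quoted
text. lapply() passes any extra arguments on to the function, and the name
of the fixed argument decides which parameter the list elements end up
matched to. The function bar is a hypothetical addition of mine, included
only because subtraction makes the difference visible:

```r
foo <- function(x, y) x + y   # as defined in the quoted message

# Fix x = 3; each element of 1:5 is matched to the remaining argument, y.
fixed_x <- lapply(1:5, foo, x = 3)

# Same result written with an explicit anonymous function.
explicit <- lapply(1:5, function(y) foo(3, y))
stopifnot(identical(fixed_x, explicit))

# With a non-commutative function the choice of name matters.
bar <- function(x, y) x - y
from_ten <- unlist(lapply(1:3, bar, x = 10))  # 10 - 1, 10 - 2, 10 - 3
less_ten <- unlist(lapply(1:3, bar, y = 10))  # 1 - 10, 2 - 10, 3 - 10
```

This only covers fixing arguments in a call to an apply-style function; it
does not give you a standalone curried function to pass around.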
I'm stopping here since I don't have anything useful to say about your
subsequent points.
Best,
Ista
>
> I suggest that currying would make the code easier to write, and more
> readable, especially when using apply functions. One might imagine that there
> could be some confusion with such a feature, especially from people
> unfamiliar with functional programming, although R already accepts functions
> as ordinary arguments, so it could be just fine. But one could address it
> with special syntax, such as $foo(3) [$foo(x=3)] for partial application.
> The current currying package has very limited usefulness because, being
> limited by the user-defined operator framework, it can only rarely lead to
> less code or more readability. Compare for yourself:
> $foo(x=3) vs foo %<% 3
> goo = function(a,b,c)
> $goo(b=3) vs goo %><% list(b=3)
>
> Moreover, one would often like currying to have highest priority. For
> example, when piping:
> data %>% foo %>% foo1 %<% 3
> if one wants to do data %>% foo %>% $foo(x=3)
>
> 3) Code executable only when running the script itself
> Whereas the first two suggestions are somewhat stealing from Haskell and the
> like, this suggestion would be stealing from Python. I’m building quite a
> complicated pipeline, using S4 classes. After defining the class and its
> methods, I also define how to build the class to my liking, based on my
> input data, using various now-defined methods. So I end up having a list of
> command line arguments to process, and the way to create the class instance
> based on them. If I write it to the class file, however, I end up running the
> code when it is sourced from the next step in the pipeline, that needs the
> previous class definitions.
>
> A feature such as Python’s if __name__ == "__main__" would thus be useful. As
> it is, I had to create run scripts as separate files, which is actually not
> so terrible, given that the class and its methods often span a few hundred
> lines, but still.
>
> 4) non-exported global variables
> I also find it a shortcoming that I seem to be unable to create constants that
> would not get passed to files that source the class definition. That is, if
> class1 features global constant CONSTANT=3, then if class2 sources class1, it
> will also include the constant. This 1) clutters the namespace when running
> the code interactively, 2) potentially overwrites the constants in case of
> a name clash. Some kind of export/nonexport variable syntax, or symbolic import,
> or namespace would be useful. I know if I converted it to a package I would
> get at least something like a namespace, but still.
>
> I understand that the variable cannot just not be imported, in general, as
> the functions will generally rely on it (otherwise it wouldn’t have to be
> there). But one could consider hiding it in an implicit namespace for the
> file, for example.
>
> 5) S4 methods with same name, for different classes
> Say I have an S4 class called datasetSingle, and another S4 class called
> datasetMulti, which gathers up a number of datasetSingle classes, and adds
> some extra functionality on top. The datasetSingle class may have a method
> replicates, that returns a named vector assigning replicate number to
> experiment names of the dataset. But I would also like to have a function
> with the same name for the datasetMulti class, that returns a data frame,
> or list, covering replicate numbers for all the datasets included.
>
> But then, I need to setGeneric for the method. But if I set generic before
> both implementations, I will reset the generic in the second call, losing the
> definition for “replicates” for datasetSingle. Skipping this in the code for
> datasetMulti means that 1) I have to remember that I had the function defined
> for datasetSingle, 2) if I remove the function or change its name in
> datasetSingle, I now have to change the datasetMulti class file too.
> Moreover, if I would like to have a different generic for the datasetMulti
> version, I have to change it not in datasetMulti class file, but in the
> datasetSingle file, where it might not make much sense. In this case, I
> wanted to have another argument “datasets”, which would return the replicates
> only for the datasets specified, rather than for all.
>
> I made a wrapper that could circumvent the first issue, but the second issue
> is not easy to circumvent.
>
> 6) Many parameters freeze S4 method calls
> If I specify more than about 6 parameters for an S4 method, I would often get a
> “freeze” on the method call. The process would eat up a lot of memory before
> going into the call, upon which it would execute the call as normal (if it
> didn’t run out of memory or I didn’t run out of patience). Subsequent calls
> of the method would not include this overhead. The amount of memory this
> could take could be in gigabytes, and the time in minutes. I suspect this
> might be due to generating an entry in call table for each accepted
> signature. It can be circumvented, but sure isn’t a behaviour one would
> expect.
>
> 7) Default values for S4 methods
> It would seem that it is not possible to set up default parameters for an S4
> method in the usual way of definition = function (x, y=5). I resorted to making
> class unions with “missing” for signatures on the call, with the call
> starting with if(missing(param)) param=DEFAULT_VALUE, but it certainly does
> not improve readability or ease of coding.
>
>
> Thank you for your time if you have finished reading thus far. :) Looking
> forward to any answer.
>
> Yours Sincerely,
> Antonin Klima
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel