[Rd] A few suggestions and perspectives from a PhD student

Antonin Klima Fri, 05 May 2017 10:04:11 -0700

Dear Sir or Madam,

I am in the 2nd year of my PhD in bioinformatics, after taking my Master’s in 
computer science, and have been using R heavily during my PhD. As such, I have 
put together a list of features in R that, in my opinion, would be beneficial 
to add, or could be improved. The first two are already implemented in 
packages, but because they are implemented as user-defined operators, their 
usefulness is greatly restricted. I hope you will find my suggestions 
interesting. If you find time, I would welcome any feedback as to whether you 
find the suggestions useful, or why you do not think they should be 
implemented. I would also welcome pointers to any features I might be unaware 
of that might solve the issues I have pointed out below.


1) piping
Currently available in package magrittr, piping makes the code more readable 
by having the line start at its natural starting point, followed by the 
functions that are applied, in order. Several nested calls with a number of 
parameters each are almost unreadable; it is almost as if one needed to come 
up with the solution oneself. A pipeline in comparison is very 
straightforward, especially together with point (2).

The package works rather well, and the shortcomings of piping not being native 
are not quite as severe as in point (2). Nevertheless, an intuitive symbol such 
as | would be helpful, and it sometimes bothers me that I have to parenthesize 
an anonymous function, which would probably not be required with a native pipe 
operator, much like it is not required in e.g. lapply. That is,
1:5 %>% function(x) x+2
should be totally fine
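
To illustrate the parenthesization point with base R only, here is a minimal
sketch. The operator `%|%` is hypothetical (the name is made up for this
example and it is not how magrittr is implemented); it simply applies its
right-hand side to its left-hand side. With such a plain definition the
anonymous function needs no parentheses, because the parser takes everything
after function(x) as the function body.

```r
# Hypothetical minimal pipe: apply the function on the right to the value on the left.
`%|%` <- function(lhs, f) f(lhs)

# Nested form: has to be read from the inside out.
nested <- sum(sqrt(abs(c(-4, 9, -16))))

# Piped form: reads left to right, in the order the functions are applied.
piped <- c(-4, 9, -16) %|% abs %|% sqrt %|% sum

# Unparenthesized anonymous function as the last step of the chain.
shifted <- 1:5 %|% function(x) x + 2
```

Because the function body extends as far to the right as possible, an
unparenthesized anonymous function can only be the last step of such a chain;
any further `%|%` after it would become part of the body.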

2) currying
Currently available in package curry. The idea is that, having a function such 
as foo = function(x, y) x+y, one would like to write for example 
lapply(1:5, foo(3)), and have the interpreter figure out that, while foo(3) 
does not produce a value, it can still produce a function: a function of y. 
This would be most useful for the various apply functions, rather than writing 
function(x) foo(3,x).

I suggest that currying would make the code easier to write, and more readable, 
especially when using apply functions. One might imagine that such a feature 
could cause some confusion, especially for people unfamiliar with functional 
programming, although R already treats functions as first-class values that 
can be passed as arguments, so it could be just fine. But one could address it 
with special syntax, such as $foo(3) [$foo(x=3)] for partial application. The 
current currying package has very limited usefulness as, being limited by the 
user-defined operator framework, it can only rarely contribute to less 
code/more readability. Compare for yourself:
$foo(x=3) vs foo %<% 3
goo = function(a,b,c)
$goo(b=3) vs goo %><% list(b=3)

Moreover, one would often like currying to have the highest precedence. For 
example, when piping:
data %>% foo %>% foo1 %<% 3
if one wants to do data %>% foo %>% $foo(x=3)
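
For comparison, the partial application discussed in this point can be
approximated in plain R with a closure. This is only a sketch: the helper
`partial` is a hypothetical name introduced here, not a function from base R
or from the curry package, and it does not provide the proposed $foo(...)
syntax.

```r
foo <- function(x, y) x + y

# Hypothetical helper: fix some arguments now, supply the remaining ones later.
partial <- function(f, ...) {
  fixed <- list(...)
  function(...) do.call(f, c(fixed, list(...)))
}

# Plays the role of foo(3) in the text: a function of the remaining argument y.
add3 <- partial(foo, 3)
res <- lapply(1:5, add3)

# Plays the role of the proposed $goo(b=3): b is fixed by name.
goo <- function(a, b, c) a * b + c
goo_b3 <- partial(goo, b = 3)
val <- goo_b3(2, 1)  # same as goo(a = 2, b = 3, c = 1)
```

Since the helper is an ordinary function call, it also binds tighter than any
`%...%` operator, which is the precedence one would want inside a pipeline.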

3) Code executable only when running the script itself
Whereas the first two suggestions are somewhat stealing from Haskell and the 
like, this suggestion would be stealing from Python. I’m building quite a 
complicated pipeline, using S4 classes. After defining the class and its 
methods, I also define how to build the class to my liking, based on my input 
data, using the various now-defined methods. So I end up having a list of 
command line arguments to process, and the way to create the class instance 
based on them. If I write this into the class file, however, the code also runs 
when the file is sourced from the next step in the pipeline, which needs the 
previous class definitions.

A feature such as Python’s if __name__ == "__main__" would thus be useful. As 
it is, I had to create run scripts as separate files. That is actually not so 
terrible, given that the class and its methods often span a few hundred lines, 
but still.
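
One idiom that comes close, sketched here under the assumption that the file
is either run with Rscript or loaded with source(): at top level sys.nframe()
is 0, while inside a source() call it is greater than 0. The function `main`
and its argument handling are placeholders for this example, and other front
ends may behave differently.

```r
# Hypothetical entry point; the argument handling is only a placeholder.
main <- function(args = commandArgs(trailingOnly = TRUE)) {
  message("building class instance from ", length(args), " argument(s)")
  invisible(length(args))
}

# sys.nframe() is 0 at top level (e.g. under Rscript) and greater than 0
# while the file is being evaluated inside a source() call.
if (sys.nframe() == 0L) {
  main()
}
```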

4) non-exported global variables
I also miss a way to create constants that do not get passed on to files that 
source the class definition. That is, if class1 features a global constant 
CONSTANT=3, then if class2 sources class1, it will also include the constant. 
This 1) clutters the namespace when running the code interactively, and 2) 
potentially overwrites constants in case of a name clash. Some kind of 
export/non-export variable syntax, or symbolic import, or namespace would be 
useful. I know that if I converted it to a package I would get at least 
something like a namespace, but still.

I understand that the variable cannot just not be imported, in general, as the 
functions will generally rely on it (otherwise it wouldn’t have to be there). 
But one could consider hiding it in an implicit namespace for the file, for 
example.
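
A rough approximation of such a per-file namespace, short of building a
package, is to evaluate the file in its own environment and copy out only what
should be visible. The sketch below writes a stand-in for the class1 file to a
temporary location; the file contents and the name `scale_by_constant` are
made up for illustration. S4 class and generic registration may need extra
care (for instance the `where` argument of setClass), which this sketch does
not cover.

```r
# Stand-in for the class1 file: one constant and one function that relies on it.
class1_file <- tempfile(fileext = ".R")
writeLines(c(
  "CONSTANT <- 3",
  "scale_by_constant <- function(x) x * CONSTANT"
), class1_file)

# Evaluate the file in its own environment instead of the global one.
class1 <- new.env()
sys.source(class1_file, envir = class1)

# "Export" only the function; CONSTANT stays inside class1.
scale_by_constant <- class1$scale_by_constant
doubled <- scale_by_constant(2)
```

The function still finds CONSTANT because its enclosing environment is
class1, so a different CONSTANT defined by the calling file would not clash
with it.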

5) S4 methods with same name, for different classes
Say I have an S4 class called datasetSingle, and another S4 class called 
datasetMulti, which gathers up a number of datasetSingle objects, and adds some 
extra functionality on top. The datasetSingle class may have a method 
replicates, that returns a named vector assigning replicate numbers to the 
experiment names of the dataset. But I would also like to have a method with 
the same name for the datasetMulti class, that returns a data frame, or list, 
covering replicate numbers for all the datasets included.

But then, I need to call setGeneric for the method. If I call setGeneric before 
both implementations, the second call resets the generic, losing the 
definition of “replicates” for datasetSingle. Skipping the call in the code 
for datasetMulti means that 1) I have to remember that I had the function 
defined for datasetSingle, and 2) if I remove the function or change its name 
in datasetSingle, I now have to change the datasetMulti class file too. 
Moreover, if I would like to have a different generic for the datasetMulti 
version, I have to change it not in the datasetMulti class file, but in the 
datasetSingle file, where it might not make much sense. In this case, I wanted 
to have another argument “datasets”, so that the method would return the 
replicates only for the datasets specified, rather than for all.

I made a wrapper that could circumvent the first issue, but the second issue is 
not easy to circumvent.
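
One arrangement that may ease both issues, sketched with made-up slot names
(`reps`, `sets`): each class file guards its setGeneric call with isGeneric(),
so whichever file is loaded first creates the generic and the other leaves it
alone, and the generic carries `...` so that the datasetMulti method can add
its own “datasets” argument.

```r
library(methods)

setClass("datasetSingle", representation(reps = "integer"))
setClass("datasetMulti", representation(sets = "list"))

# --- would live in the datasetSingle file ---
if (!isGeneric("replicates")) {
  setGeneric("replicates", function(object, ...) standardGeneric("replicates"))
}
setMethod("replicates", "datasetSingle", function(object, ...) object@reps)

# --- would live in the datasetMulti file; the same guard keeps the generic intact ---
if (!isGeneric("replicates")) {
  setGeneric("replicates", function(object, ...) standardGeneric("replicates"))
}
setMethod("replicates", "datasetMulti",
          function(object, datasets = names(object@sets), ...) {
            # Replicates for the selected datasets only, as a named list.
            lapply(object@sets[datasets], replicates)
          })

s1 <- new("datasetSingle", reps = c(expA = 1L, expB = 2L))
multi <- new("datasetMulti", sets = list(d1 = s1, d2 = s1))
all_reps <- replicates(multi)
some_reps <- replicates(multi, datasets = "d2")
```

The guard has to be repeated in every file that defines a method, which is a
convention rather than something the language enforces.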

6) Many parameters freeze S4 method calls
If I specify more than about 6 parameters for an S4 method, I often get a 
“freeze” on the method call. The process eats up a lot of memory before 
going into the call, after which it executes the call as normal (if it 
didn’t run out of memory or I didn’t run out of patience). Subsequent calls of 
the method do not incur this overhead. The amount of memory this can take can 
be in gigabytes, and the time in minutes. I suspect this might be due to 
generating an entry in the call table for each accepted signature. It can be 
circumvented, but it sure isn’t a behaviour one would expect.
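
One possible workaround, assuming the overhead really comes from dispatching
on many arguments (which is only a suspicion above), is to restrict dispatch
with the `signature` argument of setGeneric. The class `step` and the generic
`runStep` below are made up for illustration.

```r
library(methods)

setClass("step", representation(id = "character"))

# Dispatch only on `object`; the remaining arguments are passed through
# without taking part in method selection.
setGeneric(
  "runStep",
  function(object, a, b, c, d, e, f, g) standardGeneric("runStep"),
  signature = "object"
)

setMethod("runStep", "step", function(object, a, b, c, d, e, f, g) {
  paste(object@id, a + b + c + d + e + f + g)
})

out <- runStep(new("step", id = "s1"), 1, 2, 3, 4, 5, 6, 7)
```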

7) Default values for S4 methods
It would seem that it is not possible to set up default parameters for an S4 
method in the usual way of definition = function (x, y=5). I resorted to making 
class unions with “missing” for the signatures, with the method body starting 
with if(missing(param)) param=DEFAULT_VALUE, but it certainly does not improve 
readability or ease of coding.
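
One arrangement in which a default does work, as far as I can tell, is to keep
the optional argument out of the generic and let the method add it in place of
`...`. This is a sketch with made-up names (`point`, `shiftBy`, `by`), not a
general answer for arguments that take part in dispatch.

```r
library(methods)

setClass("point", representation(x = "numeric"))

# The generic only knows about `object`; everything else travels through `...`.
setGeneric("shiftBy", function(object, ...) standardGeneric("shiftBy"))

# `by` is not in the generic, so it is matched inside the method and its
# default applies when the caller leaves it out.
setMethod("shiftBy", "point", function(object, by = 5, ...) object@x + by)

p <- new("point", x = 1)
with_default <- shiftBy(p)
with_value <- shiftBy(p, by = 2)
```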


Thank you for your time if you have finished reading thus far. :) Looking 
forward to any answer.

Yours Sincerely,
Antonin Klima

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
