On Fri, May 5, 2017 at 1:00 PM, Antonin Klima <anton...@idi.ntnu.no> wrote:
> Dear Sir or Madam,
>
> I am in the 2nd year of my PhD in bioinformatics, after taking my Master's in
> computer science, and have been using R heavily during my PhD. As such, I have
> put together a list of certain features in R that, in my opinion, would be
> beneficial to add, or could be improved. The first two are already implemented
> in packages, but since they are implemented as user-defined operators, their
> usefulness is greatly restricted.
Why do you think being implemented in a contributed package restricts the
usefulness of a feature?

> I hope you will find my suggestions interesting. If you find time, I will
> welcome any feedback as to whether you find the suggestions useful, or why
> you do not think they should be implemented. I will also welcome it if you
> enlighten me about any features I might be unaware of that might solve the
> issues I have pointed out below.
>
> 1) piping
> Currently available in the package magrittr, piping makes code more readable
> by having the line start at its natural starting point and then following
> with the functions that are applied, in order. The readability of several
> nested calls with a number of parameters each is almost zero; it is almost as
> if one had to work out the solution oneself. A pipeline, in comparison, is
> very straightforward, especially together with point (2).

You may be surprised to learn that not everyone thinks pipes are a good idea.
Personally I see some advantages, but there is also a big downside, which is
that they mess up the call stack and make tracking down errors via traceback()
more difficult.

There is a simple alternative to pipes already built into R that gives you
some of the advantages of %>% without messing up the call stack. Using
Hadley's famous "little bunny foo foo" example:

foo_foo <- little_bunny()

## nesting (it is rough)
bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ),
  on = head
)

## magrittr
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)

## regular R assignment
foo_foo -> .
hop(., through = forest) -> .
scoop(., up = field_mice) -> .
bop(., on = head)

This is more limited than magrittr's %>%, but it gives you a lot of the
advantages without the disadvantages.

> The package works rather well here nevertheless; the shortcomings of piping
> not being native are not quite as severe as in point (2). Still, an intuitive
> symbol such as | would be helpful, and it sometimes bothers me that I have to
> parenthesize anonymous functions, which would probably not be required with a
> native pipe operator, much as it is not required in e.g. lapply. That is,
> 1:5 %>% function(x) x+2
> should be totally fine.

That seems pretty small potatoes to me.

> 2) currying
> Currently available in the package Curry. The idea is that, given a function
> such as foo = function(x, y) x+y, one would like to write, for example,
> lapply(1:5, foo(3)) and have the interpreter figure out that foo(3) does not
> yield a value, but can still yield a function - a function of y. This would
> indeed be most useful for the various apply functions, rather than having to
> write function(x) foo(3,x).

You can already do lapply(1:5, foo, x = 3): whichever argument you name is
fixed, and each element of 1:5 is passed to foo as the remaining argument
(y here).

I'm stopping here since I don't have anything useful to say about your
subsequent points.

Best,
Ista

> I suggest that currying would make the code easier to write, and more
> readable, especially when using apply functions. One might imagine that there
> could be some confusion with such a feature, especially among people
> unfamiliar with functional programming, although R already treats functions
> as first-class objects that can be passed as arguments, so it could be just
> fine. But one could address it with special syntax, such as $foo(3)
> [$foo(x=3)] for partial application. The current currying package has very
> limited usefulness, as, being limited by the user-defined operator framework,
> it can only rarely contribute to less code / more readability.
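To flesh out the base-R angle on this point: partial application can already
be had with an ordinary closure. A minimal sketch (the partial helper below is
made up for illustration; it is not an existing base function):

foo <- function(x, y) x + y

## a tiny partial-application helper: fix some arguments now by name,
## get back a function that takes the remaining arguments later
partial <- function(f, ...) {
  fixed <- list(...)
  function(...) do.call(f, c(fixed, list(...)))
}

foo3 <- partial(foo, x = 3)   # behaves like function(y) foo(3, y)
foo3(2)                       # 5
lapply(1:5, foo3)             # same as lapply(1:5, function(y) foo(3, y))

This is not as compact as the proposed $foo(x=3) syntax, but it works with the
apply family and with pipes without introducing any new operators.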
> Compare for yourself:
> $foo(x=3) vs foo %<% 3
> goo = function(a,b,c)
> $goo(b=3) vs goo %><% list(b=3)
>
> Moreover, one would often like currying to have the highest priority. For
> example, when piping:
> data %>% foo %>% foo1 %<% 3
> when what one really wants is data %>% foo %>% $foo1(x=3)
>
> 3) Code executable only when running the script itself
> Whereas the first two suggestions borrow somewhat from Haskell and the like,
> this suggestion would borrow from Python. I am building quite a complicated
> pipeline using S4 classes. After defining a class and its methods, I also
> define how to build the class instance to my liking, based on my input data,
> using the methods just defined. So I end up with a list of command-line
> arguments to process and the code that creates the class instance from them.
> If I write this into the class file, however, the code also runs whenever the
> file is sourced from the next step in the pipeline, which needs the previous
> class definitions.
>
> A feature such as the Pythonic if __name__ == "__main__" would thus be
> useful. As it is, I had to create the run scripts as separate files. That is
> actually not so terrible, given that a class and its methods often span a few
> hundred lines, but still.
>
> 4) non-exported global variables
> I also find it a shortcoming that I seem to be unable to create constants
> that would not get passed on to files that source the class definition. That
> is, if class1 defines a global constant CONSTANT=3, and class2 sources
> class1, then class2 also picks up the constant. This 1) clutters the
> namespace when running the code interactively, and 2) potentially overwrites
> constants in case of a name clash. Some kind of export/non-export syntax for
> variables, or symbolic imports, or namespaces would be useful. I know that if
> I converted the code to a package I would get at least something like a
> namespace, but still.
>
> I understand that the variable cannot simply not be imported, in general, as
> the functions will generally rely on it (otherwise it would not have to be
> there). But one could consider hiding it in an implicit per-file namespace,
> for example.
>
> 5) S4 methods with the same name, for different classes
> Say I have an S4 class called datasetSingle, and another S4 class called
> datasetMulti, which gathers up a number of datasetSingle objects and adds
> some extra functionality on top. The datasetSingle class may have a method
> replicates that returns a named vector assigning a replicate number to each
> experiment name in the dataset. But I would also like to have a function with
> the same name for the datasetMulti class, returning a data frame, or a list,
> covering the replicate numbers for all the included datasets.
>
> But then I need to call setGeneric for the method. If I call setGeneric
> before both implementations, the second call resets the generic, losing the
> definition of "replicates" for datasetSingle. Skipping the call in the code
> for datasetMulti means that 1) I have to remember that I had the function
> defined for datasetSingle, and 2) if I remove the function or change its name
> in datasetSingle, I now have to change the datasetMulti class file too.
> Moreover, if I would like the datasetMulti version to take a different set of
> arguments, I have to change the generic not in the datasetMulti class file
> but in the datasetSingle file, where it might not make much sense. In this
> case, I wanted to have another argument "datasets", so that the method would
> return the replicates only for the datasets specified, rather than for all of
> them.
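A common way to sidestep the resetting problem is to guard the setGeneric()
call with isGeneric() and let the generic carry `...` so that individual
methods can add their own arguments. A rough sketch, with made-up slot names
and contents standing in for the real classes:

library(methods)

## stand-ins for the classes described above
setClass("datasetSingle", slots = c(reps = "numeric"))
setClass("datasetMulti",  slots = c(sets = "list"))

## create the generic only if it does not already exist, so sourcing either
## class file (in any order, or repeatedly) never wipes out existing methods
if (!isGeneric("replicates")) {
  setGeneric("replicates", function(object, ...) standardGeneric("replicates"))
}

setMethod("replicates", "datasetSingle", function(object, ...) {
  object@reps
})

## because the generic has '...', this method can add its own 'datasets'
## argument without touching the datasetSingle file
setMethod("replicates", "datasetMulti", function(object, datasets = NULL, ...) {
  res <- lapply(object@sets, replicates)        # dispatches per element
  if (!is.null(datasets)) res <- res[datasets]  # assumes a named list of sets
  res
})

d1 <- new("datasetSingle", reps = c(expA = 1, expB = 2))
d2 <- new("datasetSingle", reps = c(expC = 1))
m  <- new("datasetMulti",  sets = list(s1 = d1, s2 = d2))

replicates(d1)                  # named vector for a single dataset
replicates(m)                   # list covering all datasets
replicates(m, datasets = "s1")  # only the requested dataset

This does not remove the coupling entirely (the guard still has to appear in
both files), but it makes the files order-independent and avoids redefining
the generic when both are sourced.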
> I made a wrapper that could circumvent the first issue, but the second issue
> is not easy to circumvent.
>
> 6) Many parameters freeze S4 method calls
> If I specify roughly six or more parameters for an S4 method, I would often
> get a "freeze" on the method call. The process would eat up a lot of memory
> before entering the call, after which it would execute the call as normal (if
> it did not run out of memory and I did not run out of patience). Subsequent
> calls of the method would not incur this overhead. The amount of memory this
> could take could be in the gigabytes, and the time in minutes. I suspect this
> might be due to an entry being generated in the method dispatch table for
> each accepted signature. It can be worked around, but it certainly is not the
> behaviour one would expect.
>
> 7) Default values for S4 methods
> It seems that it is not possible to set default parameter values for an S4
> method in the usual way, i.e. definition = function(x, y = 5). I resorted to
> making class unions with "missing" in the method signatures, with the method
> body starting with if (missing(param)) param <- DEFAULT_VALUE, but that
> certainly does not improve readability or ease of coding.
>
> Thank you for your time if you have finished reading this far. :) Looking
> forward to any answer.
>
> Yours Sincerely,
> Antonin Klima

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel