On 2015-03-26 07:48, Patrick Connolly wrote:

On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:

...

|> Well...  Opinions may perhaps differ, but apart from '%>%' being
|> butt-ugly it's also fairly slow:

Beauty, it is said, is in the eye of the beholder.  I'm impressed by
the way using %>% reduces or eliminates complicated nested brackets.

I didn't dispute whether '%>%' may be useful -- I just pointed out that it is slow. However, it is only part of the problem: 'filter()' and 'select()', although aesthetically pleasing, also seem to be slow:

> all.states <- data.frame(state.x77, Name = rownames(state.x77))
>
> f1 <- function()
+     all.states[all.states$Frost > 150, c("Name", "Frost")]
>
> f2 <- function()
+     subset(all.states, Frost > 150, select = c("Name", "Frost"))
>
> f3 <- function() {
+     filt <- subset(all.states, Frost > 150)
+     subset(filt, select = c("Name", "Frost"))
+ }
>
> f4 <- function()
+     all.states %>% subset(Frost > 150) %>%
+         subset(select = c("Name", "Frost"))
>
> f5 <- function()
+     select(filter(all.states, Frost > 150), Name, Frost)
>
> f6 <- function()
+     all.states %>% filter(Frost > 150) %>% select(Name, Frost)
>
> mb <- microbenchmark(
+     f1(), f2(), f3(), f4(), f5(), f6(),
+     times = 1000L
+ )
> print(mb, signif = 3L)
Unit: microseconds
 expr min   lq      mean median   uq  max neval   cld
 f1() 115  124  134.8812    129  134 1500  1000 a
 f2() 128  141  147.4694    145  151 1520  1000 a
 f3() 303  328  344.3175    338  348 1740  1000  b
 f4() 458  494  518.0830    510  523 1890  1000   c
 f5() 806  848  887.7270    875  894 3510  1000    d
 f6() 971 1010 1056.5659   1040 1060 3110  1000     e

So, using '%>%' but leaving 'filter()' and 'select()' out of the equation, as in 'f4()', is only half as bad as the "full" 'dplyr' idiom in 'f6()'. In this case, since we're talking microseconds, the speed-up is negligible, but that *is* beside the point.

In this tiny example it's not obvious, but it's very clear if the
objective is to sort the dataframe by three or four columns, do
various lots of aggregation, and then return a largish number of
consecutive columns, omitting the rest.  It's very easy to see what's
going on without the need for intermediate objects.
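To make the comparison concrete, here is a sketch of the kind of multi-step chain meant here, written against the 'all.states' data from the benchmark (the columns come from 'state.x77'; the particular filtering and ordering steps are made up for illustration):

```r
library(dplyr)

all.states <- data.frame(state.x77, Name = rownames(state.x77))

## The piped version reads top-to-bottom:
piped <- all.states %>%
    filter(Frost > 100) %>%
    arrange(desc(Income), Population) %>%
    select(Name, Population, Income, Frost)

## The nested equivalent reads inside-out:
nested <- select(arrange(filter(all.states, Frost > 100),
                         desc(Income), Population),
                 Name, Population, Income, Frost)

identical(piped, nested)  # TRUE -- same result, different readability
```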

Why are you opposed to using intermediate objects? In this case, as can be seen from 'f3()', it will also have the benefit of being faster than either '%>%' or the "full" 'dplyr' idiom.
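For comparison, here is a sketch of the same selection as 'f3()'/'f6()' written with a named intermediate object, which is both fast and easy to step through:

```r
all.states <- data.frame(state.x77, Name = rownames(state.x77))

## Step 1: keep only the high-frost states (intermediate object).
frosty <- all.states[all.states$Frost > 150, ]

## Step 2: keep only the two columns of interest.
result <- frosty[, c("Name", "Frost")]
```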

|> [...]

It's no surprise that instructing a computer in something closer to
human language is an order of magnitude slower.

Certainly not true, at least for compiled languages. In any case, judging from off-list correspondence, it definitely came as a surprise to some R users...

Given that '%>%' is so heavily marketed through 'dplyr', where the latter is said to provide "blazing fast performance for in-memory data by writing key pieces in C++" and "a fast, consistent tool for working with data frame like objects, both in memory and out of memory", I don't think it's far-fetched to expect that it should be more performant than base R.

I'm sure you'd get something even quicker using machine code.

Don't be ridiculous.  We're mainly discussing

all.states[all.states$Frost > 150, c("Name", "Frost")]

vs.

all.states %>% filter(Frost > 150) %>% select(Name, Frost)

i.e., pure R code.

I spend 3 or 4 orders of magnitude more time writing code than running it.

You and me both.  But that doesn't mean speed is of no or little importance.

It's much more important to me to be able to read and modify than
it is to have it run at optimum speed.

Good for you. But surely, if this is your goal, nothing beats intermediate objects. And like I said, it may still be faster than the 'dplyr' idiom.

|> Of course, this doesn't matter for interactive one-off use.  But
|> lately I've seen examples of the '%>%' operator creeping into
|> functions in packages.

That could indicate that %>% is seductively easy to use.  It's
probably true that there are places where it should be done the hard
way.

We all know how easy it is to write ugly and sluggish code in R. But 'foo[i,j]' is neither ugly nor sluggish and certainly not "the hard way."

|>  However, it would be nice to see a fast pipe operator as part of
|> base R.

Heck, it doesn't even have to be fast as long as it's a bit more elegant than '%>%'.
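For what it's worth, a user-defined pipe need not involve much machinery at all.  A minimal sketch (the operator name '%|>%' is hypothetical, and unlike magrittr's '%>%' there is no special handling of '.', so the left-hand side can only become the single argument of the right-hand side):

```r
## A minimal pipe: the left-hand side is passed as the (only)
## argument to the right-hand side function.
`%|>%` <- function(lhs, rhs) rhs(lhs)

c(1, 4, 9) %|>% sqrt %|>% sum   # same as sum(sqrt(c(1, 4, 9))), i.e. 6
```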


Henric Winell


______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.