On 2015-03-26 07:48, Patrick Connolly wrote:

On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:

...

|> Well...  Opinions may perhaps differ, but apart from '%>%' being
|> butt-ugly it's also fairly slow:

Beauty, it is said, is in the eye of the beholder.  I'm impressed by
the way using %>% reduces or eliminates complicated nested brackets.

I didn't dispute whether '%>%' may be useful -- I just pointed out that it is slow. However, it is only part of the problem: 'filter()' and 'select()', although aesthetically pleasing, also seem to be slow:

> all.states <- data.frame(state.x77, Name = rownames(state.x77))
>
> f1 <- function()
+     all.states[all.states$Frost > 150, c("Name", "Frost")]
>
> f2 <- function()
+     subset(all.states, Frost > 150, select = c("Name", "Frost"))
>
> f3 <- function() {
+     filt <- subset(all.states, Frost > 150)
+     subset(filt, select = c("Name", "Frost"))
+ }
>
> f4 <- function()
+     all.states %>% subset(Frost > 150) %>%
+         subset(select = c("Name", "Frost"))
>
> f5 <- function()
+     select(filter(all.states, Frost > 150), Name, Frost)
>
> f6 <- function()
+     all.states %>% filter(Frost > 150) %>% select(Name, Frost)
>
> mb <- microbenchmark(
+     f1(), f2(), f3(), f4(), f5(), f6(),
+     times = 1000L
+ )
> print(mb, signif = 3L)
Unit: microseconds
 expr min   lq      mean median   uq  max neval   cld
 f1() 115  124  134.8812    129  134 1500  1000 a
 f2() 128  141  147.4694    145  151 1520  1000 a
 f3() 303  328  344.3175    338  348 1740  1000  b
 f4() 458  494  518.0830    510  523 1890  1000   c
 f5() 806  848  887.7270    875  894 3510  1000    d
 f6() 971 1010 1056.5659   1040 1060 3110  1000     e

So, using '%>%' but leaving 'filter()' and 'select()' out of the equation, as in 'f4()', is only half as bad as the "full" 'dplyr' idiom in 'f6()'. In this case, since we're talking microseconds, the speed-up is negligible, but that *is* beside the point.

In this tiny example it's not obvious, but it's very clear if the
objective is to sort the dataframe by three or four columns, do
various lots of aggregation, and then return a largish number of
consecutive columns, omitting the rest.  It's very easy to see what's
going on without the need for intermediate objects.
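To make the comparison concrete, here is a sketch of the kind of multi-step chain meant here, written against the 'all.states' data from the benchmark (the columns come from 'state.x77'; the particular filtering and ordering steps are made up for illustration):

```r
library(dplyr)

all.states <- data.frame(state.x77, Name = rownames(state.x77))

## The piped version reads top-to-bottom:
piped <- all.states %>%
    filter(Frost > 100) %>%
    arrange(desc(Income), Population) %>%
    select(Name, Population, Income, Frost)

## The nested equivalent reads inside-out:
nested <- select(arrange(filter(all.states, Frost > 100),
                         desc(Income), Population),
                 Name, Population, Income, Frost)

identical(piped, nested)  # TRUE -- same result, different readability
```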

Why are you opposed to using intermediate objects? In this case, as can be seen from 'f3()', it will also have the benefit of being faster than either '%>%' or the "full" 'dplyr' idiom.
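For comparison, here is a sketch of the same selection as 'f3()'/'f6()' written with a named intermediate object, which is both fast and easy to step through:

```r
all.states <- data.frame(state.x77, Name = rownames(state.x77))

## Step 1: keep only the high-frost states (intermediate object).
frosty <- all.states[all.states$Frost > 150, ]

## Step 2: keep only the two columns of interest.
result <- frosty[, c("Name", "Frost")]
```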

|> [...]

It's no surprise that instructing a computer in something closer to
human language is an order of magnitude slower.

Certainly not true, at least for compiled languages. In any case, judging from off-list correspondence, it definitely came as a surprise to some R users...

Given that '%>%' is so heavily marketed through 'dplyr', where the latter is said to provide "blazing fast performance for in-memory data by writing key pieces in C++" and "a fast, consistent tool for working with data frame like objects, both in memory and out of memory", I don't think it's far-fetched to expect that it should be more performant than base R.

I'm sure you'd get something even quicker using machine code.

Don't be ridiculous.  We're mainly discussing

all.states[all.states$Frost > 150, c("Name", "Frost")]

vs.

all.states %>% filter(Frost > 150) %>% select(Name, Frost)

i.e., pure R code.

I spend 3 or 4 orders of magnitude more time writing code than running it.

You and me both.  But that doesn't mean speed is of no or little importance.

It's much more important to me to be able to read and modify than
it is to have it run at optimum speed.

Good for you. But surely, if this is your goal, nothing beats intermediate objects. And like I said, it may still be faster than the 'dplyr' idiom.

|> Of course, this doesn't matter for interactive one-off use.  But
|> lately I've seen examples of the '%>%' operator creeping into
|> functions in packages.

That could indicate that %>% is seductively easy to use.  It's
probably true that there are places where it should be done the hard
way.

We all know how easy it is to write ugly and sluggish code in R. But 'foo[i,j]' is neither ugly nor sluggish and certainly not "the hard way."

|>  However, it would be nice to see a fast pipe operator as part of
|> base R.

Heck, it doesn't even have to be fast as long as it's a bit more elegant than '%>%'.
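For what it's worth, a user-defined pipe need not involve much machinery at all.  A minimal sketch (the operator name '%|>%' is hypothetical, and unlike magrittr's '%>%' there is no special handling of '.', so the left-hand side can only become the single argument of the right-hand side):

```r
## A minimal pipe: the left-hand side is passed as the (only)
## argument to the right-hand side function.
`%|>%` <- function(lhs, rhs) rhs(lhs)

c(1, 4, 9) %|>% sqrt %|>% sum   # same as sum(sqrt(c(1, 4, 9))), i.e. 6
```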


Henric Winell


______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.