(CCing the list because I definitely think it would be useful to have this in the archives.)

Thanks so much! It's clear that I'm going to need to read this email plus references quite a few times before things sink in. I guess I'll have to change my point of view entirely for this language. I understand much of the theory behind functional languages and have made much use of vectorization in programming in the past, but I tended to only do that when absolutely necessary to avoid complexity. I guess I really will have to fundamentally change my programming style and turn that theory into practice.

Thanks again everyone!

Cheers,
Thomas

On 10/30/2014 12:58 PM, Dennis Murphy wrote:
Hi:

You're making several fundamental mistakes that people who move to R
from a conventional programming language often make. The ones that are
immediately apparent are:

(i) Initializing empty objects.
(ii) Populating the object one element at a time.
(iii) An over-reliance on for loops to do the job.

One of the most important features of R is the concept of
vectorization, which basically means "program the entire object, not
its pieces". Item (ii) usually guarantees slow processing because R
uses lazy evaluation, which means that in cases like yours, the memory
used in each iteration is accumulated until the loop is complete,
which in turn means that iterations will get progressively longer as R
has to search harder for memory. One quick fix to (ii) is to
pre-allocate the required space, but the better fix is to use
vectorization whenever possible, because then the actual looping takes
place at the compiler level rather than inside the interpreter - the
former is generally much faster.

If I read your code properly, you may have meant to do something like this:

numbers <- sample(seq(30000), 50000, replace = TRUE)
values <- sample(seq(10), 50000, replace = TRUE)
L <- split(numbers, values)
sapply(L, length)    # returns the length of each of the ten
subvectors in the list

The time it takes to create the list on my system:

system.time({
+ numbers <- sample(seq(30000), 50000, replace = TRUE)
+ values <- sample(seq(10), 50000, replace = TRUE)
+ L <- split(numbers, values)  })
    user  system elapsed
       0       0       0

The length of the numeric vectors associated with each value:

sapply(L, length)
    1    2    3    4    5    6    7    8    9   10
4976 4968 4912 5111 5009 5074 4987 4950 5166 4847

Notice the following features of my code:

* no object initialization
* no loops


More comments inline.

On Thu, Oct 30, 2014 at 8:17 AM, Thomas Nyberg <tomnyb...@gmail.com> wrote:
Hello,

I want to do the following: Given a set of (number, value) pairs, I want to
create a list l so that l[[toString(number)]] returns the vector of values
associated to that number. It is hundreds of times slower than the
equivalent that I would write in python. I'm pretty new to R so I bet I'm
using its data structures inefficiently, but I've tried more or less
everything I can think of and can't really speed it up. I have done some
profiling which helped me find problem areas, but I couldn't speed things up
even with that information. I'm thinking I'm just fundamentally using R in a
silly way.

I've included code for the different versions. I wrote the python code in a
style to make it as clear to R programmers as possible. Thanks a lot! Any
help would be greatly appreciated!

Cheers,
Thomas


R code (with two versions depending on commenting):

-----

numbers <- numeric(0)
for (i in 1:5) {
     numbers <- c(numbers, sample(1:30000, 10000))
}

values <- numeric(0)
for (i in 1:length(numbers)) {
     values <- append(values, sample(1:10, 1))
}

The two loops above are classic cases of initializing empty vectors
and populating them one element at a time. This may be efficient in
Python, C or Java, but it is not so in R because R uses lazy
evaluation, which means that an expression is not evaluated until its
values are needed. In this case, they are needed at the end of the
loop.

When you have the time, download Patrick Burns' classic R Inferno,
which covers this problem as well as many other "gotchas" that bite
the inexperienced R programmer.

            starttime <- Sys.time()

d = list()
for (i in 1:length(numbers)) {
     number = toString(numbers[i])
     value = values[i]
     if (is.null(d[[number]])) {
     #if (number %in% names(d)) {
         d[[number]] <- c(value)
     } else {
         d[[number]] <- append(d[[number]], value)
     }
}

(1) Conversion to string is unnecessary.
(2) Use of append() is unnecessary. As Bill mentioned, split()
performs the same purpose as your code.

One use of list objects in R allows for its components to have the
same type but different lengths.

The takeaway: efficient programming in Python does not necessarily
translate to efficient programming in R. You need to think somewhat
differently in R. My approach is to think in terms of entire objects
rather than its elements and program accordingly, which comes in handy
as I can do

mean(x)      # find the arithmetic average of the numeric vector x
median(x)   # find the sample median of the numeric vector x

etc., which are vectorized functions. Trying to program these in a
standard programming language requires a lot more code and as a result
the code is clunkier and less transparent to an external reader.

Dennis


endtime <- Sys.time()

print(format(endtime - starttime))

-----

uncommented version: "45.64791 secs"
commented version: "1.423056 mins"



Another version of R code:

-----

numbers <- numeric(0)
for (i in 1:5) {
     numbers <- c(numbers, sample(1:30000, 10000))
}

values <- numeric(0)
for (i in 1:length(numbers)) {
     values <- append(values, sample(1:10, 1))
}

starttime <- Sys.time()

d = list()
for (number in unique(numbers)) {
     d[[toString(number)]] <- numeric(0)
}
for (i in 1:length(numbers)) {
     number = toString(numbers[i])
     value = values[i]
     d[[number]] <- append(d[[number]], value)
}

endtime <- Sys.time()

print(format(endtime - starttime))

-----

"47.15579 secs"



The python code:

-----

import random
import time

numbers = []
for i in range(5):
     numbers += random.sample(range(30000), 10000)

values = []
for i in range(len(numbers)):
     values.append(random.randint(1, 10))

starttime = time.time()

d = {}
for i in range(len(numbers)):
     number = numbers[i]
     value = values[i]
     if d.has_key(number):
         d[number].append(value)
     else:
         d[number] = [value]

endtime = time.time()

print endtime - starttime, "seconds"

-----

0.123021125793 seconds

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to