[R] More efficient way to use ifelse()?

2010-05-25 Thread Ian Dworkin
# This is more about trying to find a more efficient way to code some
simple vectorized computations using ifelse().

# Say you have a factor with a number of levels (6 in this case),
giving the location where each sample was collected.

Population <- gl( n=6, k=5,length=120, labels =c("CO", "CN","Ga","KO",
"Mw", "Ng"))


# You would like to assign a particular value to each level of
population (in this case the elevation at which they were collected).
In a vectorized approach (for speed... pretend this was a big data
set..)

elevation <- ifelse(Population=="CO", 2169,
             ifelse(Population=="CN", 1121,
             ifelse(Population=="Ga", 500,
             ifelse(Population=="KO", 2500,
             ifelse(Population=="Mw", 625,
             ifelse(Population=="Ng", 300, NA))))))

# Which is fine, but is a pain to write...

# So I was trying to think about how to vectorize directly, i.e., use
vectors within the test and as the return values for TRUE and FALSE

elevation.take.2 <- ifelse(Population==c("CO",  "CN", "Ga", "KO",
"Mw", "Ng"), c(2169, 1121, 500, 2500, 625, 300), c(NA, NA, NA, NA, NA,
NA))

# It makes sense to me why this does not work (elevation.take.2), but
I am not sure how to get it to work. Any suggestions? I suspect it
involves a trick using "any" or "||" or something, but I can't seem to
work it out.
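
For what it is worth, elevation.take.2 fails because `==` compares element by element and recycles the shorter vector, so element i of Population is only ever compared with label ((i - 1) %% 6) + 1. Below is a minimal sketch of that behaviour, followed by a lookup with match(), which returns each element's position in the label vector. The names `labs`, `elev`, `recycled` and `elevation.lookup` are made up for this illustration.

```r
labs <- c("CO", "CN", "Ga", "KO", "Mw", "Ng")
elev <- c(2169, 1121, 500, 2500, 625, 300)
Population <- gl(n = 6, k = 5, length = 120, labels = labs)

# `==` recycles labs along Population: element 1 is compared with "CO",
# element 2 with "CN", element 3 with "Ga", and so on, wrapping every 6.
recycled <- Population == labs
head(recycled, 6)

# match() instead returns, for every element, its position in labs
# (NA when there is no match), which can index the elevations directly.
elevation.lookup <- elev[match(Population, labs)]
table(Population, elevation.lookup)
```

With match() the pairing between labels and values lives in two parallel vectors, so adding a seventh population means extending both vectors rather than nesting another ifelse().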


# Thanks in advance!

# Ian Dworkin
# idwor...@msu.edu

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] More efficient way to use ifelse()? - A follow up

2010-05-26 Thread Ian Dworkin
# Thanks again to everyone who provided suggestions.
# I was curious about which approaches would be the fastest... so a
little benchmarking

# My approach was by far the worst :)

# The approach suggested by Duncan Murdoch and Peter Langfelder, based
on indexing, was by far the fastest (~66 times faster than using
nested ifelse()). All the details can be found below for those who
are interested. I found it interesting that the variant by Peter
Langfelder was somewhat slower, given that the only difference was
explicitly defining the class in the index. What is the speed cost for
this: O(n) or O(1)?

# I have one additional question. I would have guessed that
initializing an empty vector of the right size would have sped up the
subsequent operation, filling that vector, but it does not seem to
have much of an effect. Any thoughts?
# i.e. using

N <- 600 # number of observations
elevation <- rep(NA, length(Population)) # This does not really speed things up much.
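
On the preallocation question, one likely explanation: `elevation <- ifelse(...)` (or any of the indexing variants) builds a complete new vector and rebinds the name `elevation` to it, so the vector created by rep() is simply discarded. Preallocating helps when a vector is filled element by element inside a loop, because growing a vector with c() creates a new, longer copy on every iteration. A small sketch of the two situations follows; `fill_grow` and `fill_prealloc` are hypothetical helper names.

```r
n <- 1e4

# Growing: the result vector is extended on each iteration.
fill_grow <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# Preallocated: the vector has its final size and is filled by index.
fill_prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

system.time(a <- fill_grow(n))
system.time(b <- fill_prealloc(n))
identical(a, b)

# Whole-vector assignment: the preallocated object is thrown away,
# so it cannot affect the timing of the expression on the right.
x <- rep(NA, n)
x <- seq_len(n)^2
```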


#


Population <- gl( n=6, k=5,length=N, labels =c("Ga", "CO", "CN","KO",
"Ng", "Mw"))


# You would like to assign a particular value to each level of
population (in this case the elevation at which they were collected).
In a vectorized approach (for speed... pretend this was a big data
set..)

# Just to make a vector of the right size, to speed up filling it.
# In practice it does not seem to speed things up.
elevation <- rep(NA, length(Population))

# My original approach
system.time(
elevation <- ifelse(Population=="CO", 2169,
             ifelse(Population=="CN", 1121,
             ifelse(Population=="Ga", 500,
             ifelse(Population=="KO", 2500,
             ifelse(Population=="Mw", 625,
             ifelse(Population=="Ng", 300, NA))))))
)

#elapsed ~ 12s... by far the slowest approach

# Suggestions

#Peter Langfelder

values = c(500, 2169, 1121, 2500, 300, 625)

system.time( elevation.PL <- values[as.numeric(factor(Population))] ) # ~ 0.85s


# The values need to be in the same order as the levels of the factor.
# For a character vector such as
#   Pop2 <- rep(c("Ga", "CO", "CN", "Ng", "KO", "Mw"), 10)
# this would not work, because levels(factor(Pop2)) are sorted alphabetically.
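
To make the ordering requirement concrete: as.numeric() on a factor returns its integer codes, and code i corresponds to levels(f)[i]. When a factor is built from a character vector, factor() sorts the levels, so a values vector written in some other order is silently misassigned. One way to sidestep this, sketched below, is to keep the values in a named vector and reorder them by levels(). The names `f2`, `elev.by.name`, `values.sorted` and `elevation.f2` are invented here.

```r
Pop2 <- rep(c("Ga", "CO", "CN", "Ng", "KO", "Mw"), 10)
f2 <- factor(Pop2)
levels(f2)   # sorted, not in order of appearance

# Named lookup: the names carry the pairing, so order does not matter.
elev.by.name <- c(Ga = 500, CO = 2169, CN = 1121, KO = 2500, Ng = 300, Mw = 625)

# Reorder the values to follow the factor's levels, then index by code.
values.sorted <- unname(elev.by.name[levels(f2)])
elevation.f2 <- values.sorted[as.integer(f2)]

# Check against a lookup by the character labels.
all(elevation.f2 == elev.by.name[Pop2])
```

On an existing factor, as.integer() just returns the stored codes, whereas wrapping it in factor() re-encodes the whole vector first, so that extra step costs time proportional to the length of the vector.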


# Or, also from Peter Langfelder, a lookup table with match():
codeToElev = data.frame(codes = c("CO", "CN", "Ga", "KO", "Mw", "Ng"),
                        elev  = c(2169, 1121, 500, 2500, 625, 300))


system.time(
elevation.PL.2 <- codeToElev$elev[match(Population, codeToElev$codes)]
)
# ~ 0.5s elapsed


# Duncan Murdoch suggested
#In a case like this, often indexing is clearer than ifelse.  For example,

results <- c(CN=1121, CO=2169, Ga = 500, KO=2500, Mw = 625, Ng = 300)


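# Caution: Population is a factor here, so this indexes results by integer
# code rather than by label; results[as.character(Population)] looks up by label.
system.time(
elevation.DM <- results[Population]
)
# 0.181s elapsed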


#One followup:  don't do this if Population is a factor.  It will
index by the numeric values rather than the labels.  In this example
you should get the same answer since the labels in "results" are in
alphabetical order, but you won't in general.
#Generally vector indexing of atomic vectors and matrices is very
fast; indexing of data frames is much slower, so if speed is an issue,
avoid them.
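
The factor pitfall is easy to reproduce with the level order used in this follow-up ("Ga" first). In the sketch below, indexing with the factor uses its integer codes, while as.character() forces a lookup by name. `by.code` and `by.name` are illustrative names.

```r
Population <- gl(n = 6, k = 5, length = 600,
                 labels = c("Ga", "CO", "CN", "KO", "Ng", "Mw"))
results <- c(CN = 1121, CO = 2169, Ga = 500, KO = 2500, Mw = 625, Ng = 300)

by.code <- results[Population]                # uses codes 1..6, ignores labels
by.name <- results[as.character(Population)]  # uses the labels

# The first observation is "Ga" (code 1), so positional indexing
# returns the first element of results, which belongs to "CN".
as.character(Population[1])
by.code[1]
by.name[1]
```

The conversion with as.character() adds some work, so a timing taken with the factor index is not directly comparable with one that looks up by label.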

# Jorge Ivan Velez suggests looking at recode in the car package.
require(car)


system.time(
elevation.JIV <- recode(Population,
  "'CN'=1121; 'CO'=2169; 'Ga'=500; 'KO'=2500; 'Mw'=625; 'Ng'=300",
  as.factor.result=FALSE)
)
# ~ 3.5s elapsed

# David Winsemius suggests


system.time(
elevation.DW <-  (Population=="CO")* 2169+
 (Population=="CN")* 1121+
 (Population=="Ga")* 500+
 (Population=="KO")* 2500+
 (Population=="Mw")* 625+
 (Population=="Ng")* 300
 )
 # ~ 3.2s elapsed
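
This works because arithmetic coerces TRUE/FALSE to 1/0, so each observation picks up exactly one non-zero term. One behavioural difference from the nested ifelse() is worth noting: a label that matches none of the tests ends up as 0 rather than NA. A small sketch with a made-up label "ZZ" and a hypothetical vector `pop`:

```r
pop <- c("CO", "Ng", "ZZ", NA)   # "ZZ" is deliberately not in the table

by.product <- (pop == "CO") * 2169 + (pop == "Ng") * 300
by.ifelse  <- ifelse(pop == "CO", 2169,
               ifelse(pop == "Ng", 300, NA))

by.product
by.ifelse
```

If unknown labels can occur in the data, the 0 may be mistaken for a real elevation, so a check such as `all(pop %in% c("CO", "Ng"))` beforehand may be worthwhile.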


#Jeff Newmiller suggested using merge.. not implemented
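
Jeff Newmiller's exact code is not in this thread, so the following is only a guess at what a merge()-based version could look like. merge() does not promise to keep the original row order, so the sketch carries a row index (`row.id`, a made-up column name) and sorts on it afterwards. `obs`, `merged` and `elevation.merge` are likewise illustrative names.

```r
Population <- gl(n = 6, k = 5, length = 600,
                 labels = c("Ga", "CO", "CN", "KO", "Ng", "Mw"))
codeToElev <- data.frame(codes = c("CO", "CN", "Ga", "KO", "Mw", "Ng"),
                         elev  = c(2169, 1121, 500, 2500, 625, 300))

# One row per observation, with its position recorded.
obs <- data.frame(row.id = seq_along(Population),
                  codes  = as.character(Population))

# Left join on the population code; unmatched codes would get NA.
merged <- merge(obs, codeToElev, by = "codes", all.x = TRUE)
merged <- merged[order(merged$row.id), ]   # restore the original order

elevation.merge <- merged$elev
```

Given the remark above about data frames being slower to index than atomic vectors, this is probably better suited to cases where the lookup table has several columns than to a speed contest.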

# Dennis Murphy suggested switch.. I have not gotten it working yet..

elevation.DM <- switch(Population, "CO"= 2169, "CN" = 1121, "Ga" =
500, "KO" = 2500, "Mw" = 625, "Ng" = 300 )
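
The call above fails because switch() expects a single value as its first argument, not a vector, and a factor passed to it is treated as its integer code rather than its label. A possible workaround, sketched below, is to convert to character and apply switch() to one element at a time. `elev_one` and `elevation.switch` are hypothetical names, and this may not be what Dennis Murphy had in mind.

```r
Population <- gl(n = 6, k = 5, length = 600,
                 labels = c("Ga", "CO", "CN", "KO", "Ng", "Mw"))

# Hypothetical helper: look up one label at a time with switch().
# The final unnamed argument is the default for unmatched labels.
elev_one <- function(p) {
  switch(p,
         "CO" = 2169, "CN" = 1121, "Ga" = 500,
         "KO" = 2500, "Mw" = 625,  "Ng" = 300,
         NA_real_)
}

# switch() is not vectorized, so apply it to each element in turn.
elevation.switch <- vapply(as.character(Population), elev_one, numeric(1),
                           USE.NAMES = FALSE)
```

Since this calls an R function once per element, it is unlikely to compete with vector indexing on speed.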







On 26 May 2010 01:25, Ian Dworkin  wrote:
> # This is more about trying to find a more efficient way to code some
> simple vectorized computations using ifelse().
>
> # Say you have some vector representing a factor with a number of
> levels (6 in this case), representing the location that samples were
> collected.
>
> Population <- gl( n=6, k=5,length=120, labels =c("CO", "CN","Ga","KO",
> "Mw", "Ng"))
>
>
> # You would like to assign a particular value to each level of
> population (in this case the elevation at which they were collected).
> In a vectorized approach (for