The first thing to do is to run Rprof and see where the time is going; here is the output from my computer:

                      self.time self.pct total.time total.pct
tolower                    4.42    39.46       4.42     39.46
sub                        3.56    31.79       3.56     31.79
nchar                      1.54    13.75       1.54     13.75
canonicalize.language      0.62     5.54      11.14     99.46
!=                         0.52     4.64       0.52      4.64
==                         0.26     2.32       0.26      2.32
&                          0.22     1.96       0.22      1.96
gc                         0.06     0.54       0.06      0.54

More than half the time is in 'tolower' and 'nchar', so it is not all
'sub's problem. This version runs a little faster since it does not
need the 'tolower':

canonicalize.language <- function (s) {
  # s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([[:alpha:]]{2})[-_][[:alpha:]]{2}$","\\1",s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}

On Fri, Sep 14, 2012 at 12:30 PM, Sam Steingold <s...@gnu.org> wrote:
> this function is supposed to canonicalize the language:
>
> --8<---------------cut here---------------start------------->8---
> canonicalize.language <- function (s) {
>   s <- tolower(s)
>   long <- nchar(s) == 5
>   s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$","\\1",s[long])
>   s[nchar(s) != 2 & s != "c"] <- "unknown"
>   s
> }
> canonicalize.language(c("aa","bb-cc","DD-abc","eee","ff_FF","C"))
> [1] "aa" "bb" "unknown" "unknown" "ff" "c"
> --8<---------------cut here---------------end--------------->8---
>
> it does what I want it to do, but it takes 4.5 seconds on a vector of
> length 10,256,341 - I wonder if I might be doing something awfully stupid.
> I thought that sub() was slow, but my second attempt:
> --8<---------------cut here---------------start------------->8---
> canonicalize.language <- function (s) {
>   s <- tolower(s)
>   good <- nchar(s) == 5 & substr(s,3,3) %in% c("_","-")
>   s[good] <- substr(s[good],1,2)
>   s[nchar(s) != 2 & s != "c"] <- "unknown"
>   s
> }
> --8<---------------cut here---------------end--------------->8---
> was even slower (6.4 sec).
>
> My two concerns are:
>
> 1. avoid allocating many small objects which are never collected
> 2. run fast
>
> Which would be the best implementation?
>
> Thanks a lot for your insight!
>
> --
> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X
> 11.0.11103000
> http://www.childpsy.net/ http://think-israel.org
> http://openvotingconsortium.org
> http://memri.org http://camera.org http://truepeace.org
> WHO ATE MY BREAKFAST PANTS?
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
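
For anyone who wants to reproduce the comparison, here is a minimal sketch. It is not part of the original exchange: the names canonicalize.orig and canonicalize.notolower, the extra "EN-us" test value, and the size of the profiling vector are all made up for illustration. It first puts the two versions side by side on a small input, since dropping tolower() also changes the result for upper-case input ("C" becomes "unknown" and "EN-us" comes back as "EN"), and then shows the Rprof workflow that produces a table like the one at the top of this message.

```r
# Original version from the quoted message (lower-cases its input first).
canonicalize.orig <- function(s) {
  s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}

# Variant from the reply: no tolower(), [[:alpha:]] instead of [a-z].
canonicalize.notolower <- function(s) {
  long <- nchar(s) == 5
  s[long] <- sub("^([[:alpha:]]{2})[-_][[:alpha:]]{2}$", "\\1", s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}

# Example input from the thread plus one hypothetical mixed-case value.
x <- c("aa", "bb-cc", "DD-abc", "eee", "ff_FF", "C", "EN-us")
res.orig <- canonicalize.orig(x)
res.notolower <- canonicalize.notolower(x)
print(data.frame(input = x, orig = res.orig, notolower = res.notolower))

# Profile the original version on a larger, made-up vector.
# try() guards against builds without profiling support and against
# runs that are too short to record any samples.
big <- rep(x, 1e5)
prof <- tempfile()
try({
  Rprof(prof)                         # start the sampling profiler
  invisible(canonicalize.orig(big))
  Rprof(NULL)                         # stop profiling
  print(summaryRprof(prof)$by.self)   # time spent in each function itself
})
unlink(prof)
```

If the input is known to be lower case already, the two versions agree and the tolower-free one saves that call; otherwise the difference in the "C" and "EN" rows is worth checking before switching.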