You can always convert to lower case afterwards, probably on a shorter
vector. You did not indicate that you needed that conversion; it only
looked like you did it for the regular expression.
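
A minimal sketch of that idea, assuming "a shorter vector" can be read as the distinct values of `s`: let `sub` match case-insensitively, and lower-case only the unique results before mapping them back. The helper name `strip_region` is hypothetical, and it covers only the regex step of the function under discussion, not the final replacement step.

```r
# Hypothetical helper (not from the thread): strips a region suffix such as
# "-US" or "_gb" without lower-casing the full input vector first.
strip_region <- function(s) {
  u <- unique(s)                 # usually far fewer values than length(s)
  idx <- match(s, u)             # position of each input element in u
  long <- nchar(u) == 5
  # ignore.case = TRUE lets the regex accept mixed case on its own
  u[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", u[long],
                 ignore.case = TRUE)
  u <- tolower(u)                # lower-case the short vector only
  u[idx]                         # expand back to the original length
}

res <- strip_region(c("en-US", "EN", "fr_FR", "en-US", "C"))
```

Whether this is actually faster depends on how many distinct values the input has; with mostly unique strings the `unique`/`match` overhead may outweigh the savings.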
On Fri, Sep 14, 2012 at 3:13 PM, Sam Steingold wrote:
>> * jim holtman [2012-09-14 13:10:37 -0400]:
>>
> * jim holtman [2012-09-14 13:10:37 -0400]:
>
> more than half the time is in 'tolower' and 'nchar', so it is not all
> 'sub's problem.
aha, thanks!
> This version runs a little faster since it does not need the 'tolower':
>
> canonicalize.language <- function (s) {
> # s <- tolower(s)
> lo
First thing to do is to run Rprof and see where the time is going;
here it is from my computer:
        self.time self.pct total.time total.pct
tolower      4.42    39.46       4.42     39.46
sub          3.56    31.79       3.56     31.79
nchar
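
For anyone who wants to reproduce this kind of table, here is a rough sketch of the profiling step. The input vector is synthetic (the language codes and its size are assumptions, not the data from the thread), and only the first three statements of the function are profiled, so the timings will differ from the numbers above.

```r
# Synthetic input; these codes are made up for illustration only.
set.seed(1)
s <- sample(c("en-US", "EN", "fr_FR", "de", "C"), 1e6, replace = TRUE)

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)                       # start the sampling profiler
for (i in 1:3) {                       # repeat so the profiler gets samples
  x <- tolower(s)
  long <- nchar(x) == 5
  x[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", x[long])
}
Rprof(NULL)                            # stop profiling
prof <- summaryRprof(prof_file)
print(head(prof$by.self))              # self.time, self.pct, total.time, total.pct
unlink(prof_file)
```

The `by.self` component is sorted by self time, so the functions that dominate the run appear at the top.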
this function is supposed to canonicalize the language:
--8<---cut here---start->8---
canonicalize.language <- function (s) {
  s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", s[long])
  s[nchar(s) != 2 & s != "c"] <- "u