Some of this can be automated using the CRAN package hash.

Kjetil

On Sat, Nov 6, 2010 at 10:43 PM, William Dunlap <wdun...@tibco.com> wrote:
> I would make an environment called wfreqsEnv
> whose entry names are your words and whose entry
> values are the information about the words.  I find
> it convenient to use [[ to make it appear to be
> a list (instead of using exists(), assign(), and get()).
> E.g., the following enters 100,000 words sampled from a
> list of 17,576 and records their id numbers and the
> number of times each is found in the sample.
>
> > wfreqsEnv <- new.env(hash=TRUE, parent = emptyenv())
> > words <- do.call("paste", c(list(sep=""), expand.grid(LETTERS, letters, letters)))
> > # length(words) == 17576
> > set.seed(1)
> > samp <- sample(seq_along(words), size=100000, replace=TRUE)
> > system.time(for(i in samp) {
> +     word <- words[i]
> +     if (is.null(wfreqsEnv[[word]])) { # new entry
> +         wfreqsEnv[[word]] <- list(Count=1, EntryNo=i)
> +     } else { # update existing entry
> +         wfreqsEnv[[word]]$Count <- wfreqsEnv[[word]]$Count + 1
> +     }
> + })
>    user  system elapsed
>    2.28    0.00    2.14
> (The time, in seconds, is from an ancient Windows laptop, c. 2002.)
>
> Here is a small check that we are getting what we expect:
> > words[14736]
> [1] "Tuv"
> > wfreqsEnv[["Tuv"]]
> $Count
> [1] 8
>
> $EntryNo
> [1] 14736
>
> > sum(samp==14736)
> [1] 8
>
> If we do this with a non-hashed environment we get the same
> answers but the elapsed time is now 34.81 seconds instead of
> 2.14.  If you make wfreqsEnv be a list instead of an environment
> then that time is 74.12 seconds (and the answers are the same).
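
For comparison, the same bookkeeping can be spelled out with the exists(), assign(), and get() calls that the [[ form stands in for. This is only a minimal sketch and not part of Bill's message: countWords and the five-word input are made up for illustration, and EntryNo here records the position at which a word was first seen rather than an id from a master word list.

```r
# Hypothetical helper (not from the thread): tally words in a hashed
# environment using exists()/assign()/get() instead of the [[ shorthand.
countWords <- function(words) {
  env <- new.env(hash = TRUE, parent = emptyenv())
  for (i in seq_along(words)) {
    word <- words[i]
    if (!exists(word, envir = env, inherits = FALSE)) {
      # new entry: remember the position at which the word first appeared
      assign(word, list(Count = 1, EntryNo = i), envir = env)
    } else {
      # update existing entry: fetch it, bump the count, store it back
      entry <- get(word, envir = env, inherits = FALSE)
      entry$Count <- entry$Count + 1
      assign(word, entry, envir = env)
    }
  }
  env
}

freqs <- countWords(c("the", "person", "the", "dog", "the"))
get("the", envir = freqs)$Count   # 3
ls(freqs)                         # "dog" "person" "the"
```

The [[ version is shorter, but inherits = FALSE makes it explicit that only the lexicon's own environment is consulted, which may be worth the extra typing if the environment ever gets a parent other than emptyenv().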
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>> -----Original Message-----
>> From: r-help-boun...@r-project.org
>> [mailto:r-help-boun...@r-project.org] On Behalf Of Levy, Roger
>> Sent: Saturday, November 06, 2010 1:39 PM
>> To: r-help@r-project.org
>> Subject: [R] Hashing and environments
>>
>> Hi,
>>
>> I'm trying to write a general-purpose "lexicon" class and
>> associated methods for storing and accessing information
>> about large numbers of specific words (e.g., their
>> frequencies in different genres).  Crucial to making such a
>> class practically useful is to get hashing working correctly
>> so that information about specific words can be accessed
>> quickly.  But I've never really understood very well how
>> hashing works, so I'm having trouble.
>>
>> Here is an example of what I've done so far:
>>
>> ***
>>
>> setClass("Lexicon",representation(e="environment"))
>> setMethod("initialize","Lexicon",function(.Object,wfreqs) {
>>   .Object@e <- new.env(hash=T,parent=emptyenv())
>>   assign("wfreqs",wfreqs,envir=.Object@e)
>>   return(.Object)
>> })
>>
>> ## function to access word frequencies
>> wfreq <- function(lexicon,word) {
>>   return(get("wfreqs",envir=lexicon@e)[word])
>> }
>>
>> ## example of use
>> my.lexicon <- new("Lexicon",wfreqs=c("the"=2,"person"=1))
>> wfreq(my.lexicon,"the")
>>
>> ***
>>
>> However, testing indicates that the way I have set this up
>> does not achieve the intended benefits of having the
>> environment hashed:
>>
>> ***
>>
>> sample.wfreqs <- trunc(runif(1e5,max=100))
>> names(sample.wfreqs) <- as.character(1:length(sample.wfreqs))
>> lex <- new("Lexicon",wfreqs=sample.wfreqs)
>> words.to.lookup <- trunc(runif(100,min=1,max=1e5))
>> ## look up the words directly from the sample.wfreqs vector
>> system.time({
>>   for(i in words.to.lookup)
>>     sample.wfreqs[as.character(i)]
>> },gcFirst=TRUE)
>> ## look up the words through the wfreq() function; time approx the same
>> system.time({
>>   for(i in words.to.lookup)
>>     wfreq(lex,as.character(i))
>> },gcFirst=TRUE)
>>
>> ***
>>
>> I'm guessing that the problem is that the indexing of the
>> wfreqs vector in my wfreq() function is not happening inside
>> the actual lexicon's environment.  However, I have not been
>> able to figure out the proper call to get the lookup to
>> happen inside the lexicon's environment.  I've tried
>>
>> wfreq1 <- function(lexicon,word) {
>>   return(eval(wfreqs[word],envir=lexicon@e))
>> }
>>
>> which I'd thought should work, but this gives me an error:
>>
>> > wfreq1(my.lexicon,'the')
>> Error in eval(wfreqs[word], envir = lexicon@e) :
>>   object 'wfreqs' not found
>>
>> Any advice would be much appreciated!
>>
>> Best & many thanks in advance,
>>
>> Roger
>>
>> --
>>
>> Roger Levy                      Email: rl...@ucsd.edu
>> Assistant Professor             Phone: 858-534-7219
>> Department of Linguistics       Fax:   858-534-4789
>> UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
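
On the two symptoms in the original question: the environment's hash table only speeds up finding the single name "wfreqs", after which the word is still looked up in an ordinary named vector, so the timings come out about the same. The eval() attempt fails because wfreqs[word] is a regular argument and gets evaluated in the calling function before eval() ever uses the environment; wrapping it in quote() would not rescue it here either, since an environment whose parent is emptyenv() can see neither `[` nor word. A minimal sketch of the one-entry-per-word layout Bill describes, applied to the S4 class, might look like the following. Lexicon2, makeLexicon2, and wfreq2 are hypothetical names chosen for illustration, not anything from the thread.

```r
library(methods)  # for setClass()/new(); harmless if already attached

# Sketch only: one environment entry per word, so the hash table is keyed
# by the words themselves rather than by the single name "wfreqs".
setClass("Lexicon2", representation(e = "environment"))

# Hypothetical constructor; wfreqs is a named numeric vector as in the question.
makeLexicon2 <- function(wfreqs) {
  e <- new.env(hash = TRUE, parent = emptyenv())
  nms <- names(wfreqs)
  for (i in seq_along(wfreqs)) {
    # index by position so building the lexicon does not search names repeatedly
    assign(nms[i], wfreqs[[i]], envir = e)
  }
  new("Lexicon2", e = e)
}

# Look a word up directly in the environment; NA for unknown words.
wfreq2 <- function(lexicon, word) {
  if (exists(word, envir = lexicon@e, inherits = FALSE)) {
    get(word, envir = lexicon@e, inherits = FALSE)
  } else {
    NA_real_
  }
}

my.lexicon2 <- makeLexicon2(c("the" = 2, "person" = 1))
wfreq2(my.lexicon2, "the")    # 2
wfreq2(my.lexicon2, "dog")    # NA
```

Returning NA for a missing word is just one possible choice; signalling an error or returning 0 may suit a frequency lexicon better.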