I would make an environment called wfreqsEnv whose entry names are your words and whose entry values are the information about the words. I find it convenient to use [[ so that the environment appears to be a list (instead of using exists(), assign(), and get()). E.g., the following enters 100,000 words sampled with replacement from a list of 17,576 and records each word's id number and the number of times it is found in the sample.
> wfreqsEnv <- new.env(hash=TRUE, parent = emptyenv())
> words <- do.call("paste", c(list(sep=""), expand.grid(LETTERS, letters, letters))) # length(words) == 17576
> set.seed(1)
> samp <- sample(seq_along(words), size=100000, replace=TRUE)
> system.time(for(i in samp) {
+     word <- words[i]
+     if (is.null(wfreqsEnv[[word]])) { # new entry
+         wfreqsEnv[[word]] <- list(Count=1, EntryNo=i)
+     } else { # update existing entry
+         wfreqsEnv[[word]]$Count <- wfreqsEnv[[word]]$Count + 1
+     }
+ })
   user  system elapsed
   2.28    0.00    2.14

(The time, in seconds, is from an ancient Windows laptop, c. 2002.)

Here is a small check that we are getting what we expect:

> words[14736]
[1] "Tuv"
> wfreqsEnv[["Tuv"]]
$Count
[1] 8

$EntryNo
[1] 14736

> sum(samp==14736)
[1] 8

If we do this with a non-hashed environment we get the same answers, but the elapsed time is now 34.81 seconds instead of 2.14. If you make wfreqsEnv a list instead of an environment, the time is 74.12 seconds (and the answers are the same).

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Levy, Roger
> Sent: Saturday, November 06, 2010 1:39 PM
> To: r-help@r-project.org
> Subject: [R] Hashing and environments
>
> Hi,
>
> I'm trying to write a general-purpose "lexicon" class and
> associated methods for storing and accessing information
> about large numbers of specific words (e.g., their
> frequencies in different genres). Crucial to making such a
> class practically useful is to get hashing working correctly
> so that information about specific words can be accessed
> quickly. But I've never really understood very well how
> hashing works, so I'm having trouble.
>
> Here is an example of what I've done so far:
>
> ***
>
> setClass("Lexicon",representation(e="environment"))
> setMethod("initialize","Lexicon",function(.Object,wfreqs) {
>   .obj...@e <- new.env(hash=T,parent=emptyenv())
>   assign("wfreqs",wfreqs,envir=.obj...@e)
>   return(.Object)
> })
>
> ## function to access word frequencies
> wfreq <- function(lexicon,word) {
>   return(get("wfreqs",envir=lexi...@e)[word])
> }
>
> ## example of use
> my.lexicon <- new("Lexicon",wfreqs=c("the"=2,"person"=1))
> wfreq(my.lexicon,"the")
>
> ***
>
> However, testing indicates that the way I have set this up
> does not achieve the intended benefits of having the
> environment hashed:
>
> ***
>
> sample.wfreqs <- trunc(runif(1e5,max=100))
> names(sample.wfreqs) <- as.character(1:length(sample.wfreqs))
> lex <- new("Lexicon",wfreqs=sample.wfreqs)
> words.to.lookup <- trunc(runif(100,min=1,max=1e5))
> ## look up the words directly from the sample.wfreqs vector
> system.time({
>   for(i in words.to.lookup)
>     sample.wfreqs[as.character(i)]
> },gcFirst=TRUE)
> ## look up the words through the wfreq() function; time approx the same
> system.time({
>   for(i in words.to.lookup)
>     wfreq(lex,as.character(i))
> },gcFirst=TRUE)
>
> ***
>
> I'm guessing that the problem is that the indexing of the
> wfreqs vector in my wfreq() function is not happening inside
> the actual lexicon's environment. However, I have not been
> able to figure out the proper call to get the lookup to
> happen inside the lexicon's environment. I've tried
>
> wfreq1 <- function(lexicon,word) {
>   return(eval(wfreqs[word],envir=lexi...@e))
> }
>
> which I'd thought should work, but this gives me an error:
>
> > wfreq1(my.lexicon,'the')
> Error in eval(wfreqs[word], envir = lexi...@e) :
>   object 'wfreqs' not found
>
> Any advice would be much appreciated!
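On the eval() error quoted just above, a likely explanation is that eval() receives the value of its first argument, so wfreqs[word] is evaluated in the calling frame before eval() ever looks at envir. The sketch below reproduces that with a stand-alone environment (the name e and the two-word vector are illustrative stand-ins for the slot in the Lexicon object, not code from the original post). It also shows that quoting the expression is not enough when the parent is emptyenv(), and that fetching the vector first does work:

```r
# Stand-in for the environment stored in the Lexicon object (hypothetical name)
e <- new.env(hash = TRUE, parent = emptyenv())
assign("wfreqs", c(the = 2, person = 1), envir = e)
word <- "the"

# 1. The argument wfreqs[word] is evaluated in the caller, where no object
#    named wfreqs exists, so the error occurs before envir is consulted.
res1 <- tryCatch(eval(wfreqs[word], envir = e),
                 error = function(err) conditionMessage(err))

# 2. evalq() delays evaluation until inside e, but e's parent is emptyenv(),
#    so the function `[` and the variable `word` cannot be found from there.
res2 <- tryCatch(evalq(wfreqs[word], envir = e),
                 error = function(err) conditionMessage(err))

# 3. Fetch the vector out of the environment, then index it in the caller.
res3 <- e[["wfreqs"]][[word]]

print(res1)
print(res2)
print(res3)
```

Note that even the third form only hashes the single name "wfreqs"; the per-word lookup is still a search through the names of an ordinary vector, which is presumably why the two timings in the question come out about the same.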
>
> Best & many thanks in advance,
>
> Roger
>
> --
>
> Roger Levy                      Email: rl...@ucsd.edu
> Assistant Professor             Phone: 858-534-7219
> Department of Linguistics       Fax: 858-534-4789
> UC San Diego                    Web: http://ling.ucsd.edu/~rlevy
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
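For completeness, here is a minimal sketch of how the one-entry-per-word idea from the reply could be applied to the Lexicon class in the original message. It is an illustration under assumptions, not a tested replacement: it assumes every word is a non-empty, unique name of the wfreqs vector, and it returns NULL for a word that is absent.

```r
library(methods)  # setClass()/new(); needed explicitly when run via Rscript

setClass("Lexicon", representation(e = "environment"))

setMethod("initialize", "Lexicon", function(.Object, wfreqs) {
    e <- new.env(hash = TRUE, parent = emptyenv())
    nms <- names(wfreqs)
    # Store one environment entry per word so that each later lookup
    # goes through the environment's hash table.
    for (i in seq_along(wfreqs)) {
        assign(nms[i], wfreqs[[i]], envir = e)
    }
    .Object@e <- e
    .Object
})

## function to access word frequencies; NULL means "word not in lexicon"
wfreq <- function(lexicon, word) {
    lexicon@e[[word]]
}

## example of use
my.lexicon <- new("Lexicon", wfreqs = c("the" = 2, "person" = 1))
print(wfreq(my.lexicon, "the"))
print(wfreq(my.lexicon, "absent"))
```

The difference from the original class is that the environment holds the words themselves rather than a single vector named "wfreqs", so the hashing applies to the word being looked up.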