Wow, that is perfect: the hash package is exactly what I needed. Thank you!
Roger

On Nov 6, 2010, at 4:09 PM, Kjetil Halvorsen wrote:

> some of this can be automated using the CRAN package
> hash.
>
> Kjetil
>
> On Sat, Nov 6, 2010 at 10:43 PM, William Dunlap <wdun...@tibco.com> wrote:
>> I would make an environment called wfreqsEnv
>> whose entry names are your words and whose entry
>> values are the information about the words. I find
>> it convenient to use [[ to make it appear to be
>> a list (instead of using exists(), assign(), and get()).
>> E.g., the following enters the 100,000 words from a
>> list of 17,576 and records their id numbers and the
>> number of times each is found in the sample.
>>
>> > wfreqsEnv <- new.env(hash=TRUE, parent = emptyenv())
>> > words <- do.call("paste", c(list(sep=""), expand.grid(LETTERS, letters, letters)))
>> > # length(words) == 17576
>> > set.seed(1)
>> > samp <- sample(seq_along(words), size=100000, replace=TRUE)
>> > system.time(for(i in samp) {
>> +     word <- words[i]
>> +     if (is.null(wfreqsEnv[[word]])) { # new entry
>> +         wfreqsEnv[[word]] <- list(Count=1, EntryNo=i)
>> +     } else { # update existing entry
>> +         wfreqsEnv[[word]]$Count <- wfreqsEnv[[word]]$Count + 1
>> +     }
>> + })
>>    user  system elapsed
>>    2.28    0.00    2.14
>> (The time, in seconds, is from an ancient Windows laptop, c. 2002.)
>>
>> Here is a small check that we are getting what we expect:
>> > words[14736]
>> [1] "Tuv"
>> > wfreqsEnv[["Tuv"]]
>> $Count
>> [1] 8
>>
>> $EntryNo
>> [1] 14736
>>
>> > sum(samp==14736)
>> [1] 8
>>
>> If we do this with a non-hashed environment we get the same
>> answers but the elapsed time is now 34.81 seconds instead of
>> 2.14. If you make wfreqsEnv be a list instead of an environment
>> then that time is 74.12 seconds (and the answers are the same).
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>> -----Original Message-----
>>> From: r-help-boun...@r-project.org
>>> [mailto:r-help-boun...@r-project.org] On Behalf Of Levy, Roger
>>> Sent: Saturday, November 06, 2010 1:39 PM
>>> To: r-help@r-project.org
>>> Subject: [R] Hashing and environments
>>>
>>> Hi,
>>>
>>> I'm trying to write a general-purpose "lexicon" class and
>>> associated methods for storing and accessing information
>>> about large numbers of specific words (e.g., their
>>> frequencies in different genres). Crucial to making such a
>>> class practically useful is to get hashing working correctly
>>> so that information about specific words can be accessed
>>> quickly. But I've never really understood very well how
>>> hashing works, so I'm having trouble.
>>>
>>> Here is an example of what I've done so far:
>>>
>>> ***
>>>
>>> setClass("Lexicon",representation(e="environment"))
>>> setMethod("initialize","Lexicon",function(.Object,wfreqs) {
>>>     .obj...@e <- new.env(hash=T,parent=emptyenv())
>>>     assign("wfreqs",wfreqs,envir=.obj...@e)
>>>     return(.Object)
>>> })
>>>
>>> ## function to access word frequencies
>>> wfreq <- function(lexicon,word) {
>>>     return(get("wfreqs",envir=lexi...@e)[word])
>>> }
>>>
>>> ## example of use
>>> my.lexicon <- new("Lexicon",wfreqs=c("the"=2,"person"=1))
>>> wfreq(my.lexicon,"the")
>>>
>>> ***
>>>
>>> However, testing indicates that the way I have set this up
>>> does not achieve the intended benefits of having the
>>> environment hashed:
>>>
>>> ***
>>>
>>> sample.wfreqs <- trunc(runif(1e5,max=100))
>>> names(sample.wfreqs) <- as.character(1:length(sample.wfreqs))
>>> lex <- new("Lexicon",wfreqs=sample.wfreqs)
>>> words.to.lookup <- trunc(runif(100,min=1,max=1e5))
>>> ## look up the words directly from the sample.wfreqs vector
>>> system.time({
>>>     for(i in words.to.lookup)
>>>         sample.wfreqs[as.character(i)]
>>> },gcFirst=TRUE)
>>> ## look up the words through the wfreq() function; time approx the same
>>> system.time({
>>>     for(i in words.to.lookup)
>>>         wfreq(lex,as.character(i))
>>> },gcFirst=TRUE)
>>>
>>> ***
>>>
>>> I'm guessing that the problem is that the indexing of the
>>> wfreqs vector in my wfreq() function is not happening inside
>>> the actual lexicon's environment. However, I have not been
>>> able to figure out the proper call to get the lookup to
>>> happen inside the lexicon's environment. I've tried
>>>
>>> wfreq1 <- function(lexicon,word) {
>>>     return(eval(wfreqs[word],envir=lexi...@e))
>>> }
>>>
>>> which I'd thought should work, but this gives me an error:
>>>
>>> > wfreq1(my.lexicon,'the')
>>> Error in eval(wfreqs[word], envir = lexi...@e) :
>>>   object 'wfreqs' not found
>>>
>>> Any advice would be much appreciated!
>>>
>>> Best & many thanks in advance,
>>>
>>> Roger
>>>
>>> --
>>>
>>> Roger Levy                      Email: rl...@ucsd.edu
>>> Assistant Professor             Phone: 858-534-7219
>>> Department of Linguistics       Fax:   858-534-4789
>>> UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
--

Roger Levy                      Email: rl...@ucsd.edu
Assistant Professor             Phone: 858-534-7219
Department of Linguistics       Fax:   858-534-4789
UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy
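
For readers following the thread: the original wfreq() stores a single named vector called "wfreqs" in the environment, so the environment's hash table is only used to find that one object, and the per-word lookup is still ordinary name indexing on a vector. The attempted eval(wfreqs[word], ...) fails for a separate reason: wfreqs[word] is evaluated as a normal argument in the calling function before eval() ever sees it. Below is a minimal sketch, in base R only, of the approach suggested in the reply above applied to the class from the question, with one environment entry per word. The names Lexicon2 and wfreq2 are hypothetical and not from the thread, and returning NA for an unknown word is an assumption made for illustration.

```r
library(methods)  # needed for setClass()/setMethod() when run non-interactively

# Hypothetical variant of the Lexicon class: one environment entry per word.
setClass("Lexicon2", representation(e = "environment"))

setMethod("initialize", "Lexicon2",
          function(.Object, ..., wfreqs = numeric(0)) {
  e <- new.env(hash = TRUE, parent = emptyenv())
  # Store each frequency under its own word, so that later lookups
  # are resolved by the environment's hash table.
  for (w in names(wfreqs)) {
    assign(w, wfreqs[[w]], envir = e)
  }
  .Object@e <- e
  .Object
})

# Look up one word; returns NA for a word that is not in the lexicon
# (an assumption for this sketch, not something the thread specifies).
wfreq2 <- function(lexicon, word) {
  if (exists(word, envir = lexicon@e, inherits = FALSE)) {
    get(word, envir = lexicon@e, inherits = FALSE)
  } else {
    NA_real_
  }
}

## example of use
my.lexicon2 <- new("Lexicon2", wfreqs = c("the" = 2, "person" = 1))
wfreq2(my.lexicon2, "the")
wfreq2(my.lexicon2, "cat")
```

The sketch uses exists() and get() with inherits = FALSE so that a lookup never falls through to an enclosing environment; the [[ form shown earlier in the thread should work equally well. It does not handle an empty-string word, which assign() rejects.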