Have you looked into the 'igraph' package?

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
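A minimal sketch of that igraph route (not part of the original thread), assuming the pairs sit in a data frame d with columns a and b as in the code quoted below: each pair becomes an undirected edge, and the connected components of the resulting graph are exactly the equivalence classes.

library(igraph)

## toy version of the input: each row says "these two strings are equivalent"
d <- data.frame(a = c("A", "B", "D"),
                b = c("B", "C", "E"),
                stringsAsFactors = FALSE)

g    <- graph_from_data_frame(d, directed = FALSE)   # one vertex per string
memb <- components(g)$membership                     # component id per string

## smallest (alphabetical) member of each component, mirroring the
## "main" entry convention used in the quoted message
main  <- tapply(names(memb), memb, min)
equiv <- setNames(main[memb], names(memb))

equiv["C"] == equiv["A"]   # TRUE:  A and C are equivalent
equiv["C"] == equiv["D"]   # FALSE: different components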
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Magnus Thor Torfason
> Sent: Friday, November 01, 2013 8:23 AM
> To: [email protected]
> Subject: Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
>
> Sure,
>
> I was attempting to be concise and boil it down to what I saw as the
> root issue, but you are right, I could have taken it a step further. So
> here goes.
>
> I have a set of around 20M string pairs. A given string (say, A) can
> either be equivalent to another string (B) or not. If A and B occur
> together in the same pair, they are equivalent. But equivalence is
> transitive, so if A and B occur together in one pair, and A and C occur
> together in another pair, then A and C are also equivalent. I need a way
> to quickly determine whether any two strings from my data set are
> equivalent.
>
> The way I do this currently is to designate the smallest
> (alphabetically) string in each known equivalence set as the "main"
> entry. For each pair, I therefore insert two entries into the hash
> table, both pointing at the main value. So assuming the input data:
>
> A,B
> B,C
> D,E
>
> I would then have:
>
> A->A
> B->A
> C->B
> D->D
> E->D
>
> Except that I also follow each chain until I reach the end (key==value),
> and insert pointers to the "main" value for every value I find along the
> way. After doing that, I end up with:
>
> A->A
> B->A
> C->A
> D->D
> E->D
>
> And I can very quickly check equivalence, either by comparing the hash
> values of two strings, or simply by transforming each string into its
> hash value once and using simple comparisons from then on. The code for
> generating the final hash table is as follows:
>
> h : Empty hash table created with hash.new()
> d : Input data
> hash.deep.get : Function that iterates through the hash table until it
> finds a key whose value is equal to itself (until hash.get(X)==X), then
> returns all the values in a vector
>
> h = hash.new()
> for ( i in 1:nrow(d) )
> {
>     deep.a = hash.deep.get(h, d$a[i])
>     deep.b = hash.deep.get(h, d$b[i])
>     equivalents = sort(unique(c(deep.a, deep.b)))
>     equiv.id = min(equivalents)
>     for ( equivalent in equivalents )
>     {
>         hash.put(h, equivalent, equiv.id)
>     }
> }
>
> I would very much appreciate a simpler and faster way to do this.
> Keeping my fingers crossed that one of the R-help geniuses who sees
> this is sufficiently interested to crack the problem.
>
> Best,
> Magnus
>
> On 11/1/2013 1:49 PM, jim holtman wrote:
> > It would be nice if you followed the posting guidelines and at least
> > showed the script that was creating your entries now, so that we
> > understand the problem you are trying to solve. A bit more
> > explanation of why you want this would be useful. This gets to the
> > second part of my tag line: Tell me what you want to do, not how you
> > want to do it. There may be other solutions to your problem.
> >
> > Jim Holtman
> > Data Munger Guru
> >
> > What is the problem that you are trying to solve?
> > Tell me what you want to do, not how you want to do it.
> >
> > On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
> > <[email protected]> wrote:
> >> Pretty much what the subject says:
> >>
> >> I used an env as the basis for a hash table in R, based on information
> >> that this is in fact the way environments are implemented under the
> >> hood.
> >>
> >> I've been experimenting with doubling the number of entries, and so far
> >> it has seemed to be scaling more or less linearly, as expected.
> >>
> >> But as I went from 17 million entries to 34 million entries, the
> >> completion time has gone from 18 hours to 5 days and counting.
> >>
> >> The keys and values are in all cases strings of equal length.
> >>
> >> One might suspect that the slow-down has to do with memory being swapped
> >> to disk, but from what I know about my computing environment, that
> >> should not be the case.
> >>
> >> So my first question:
> >> Is anyone familiar with anything in the implementation of environments
> >> that would limit their use or slow them down (faster than O(n log(n)))
> >> as the number of entries is increased?
> >>
> >> And my second question:
> >> I realize that this is not strictly what R environments were designed
> >> for, but this is what my algorithm requires: I must go through these
> >> millions of entries, storing them in the hash table and sometimes
> >> retrieving them along the way, in a more or less random manner that is
> >> contingent on the data I am encountering and on the contents of the hash
> >> table at each moment.
> >>
> >> Does anyone have a good recommendation for alternatives for implementing
> >> huge, fast, table-like structures in R?
> >>
> >> Best,
> >> Magnus

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
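For anyone who wants to actually run the pseudocode quoted above, here is a rough sketch of the helper functions it assumes (hash.new, hash.put, hash.get, hash.deep.get), written on top of a plain environment. The size argument for pre-allocating the hash table is a guess at a useful knob, not something taken from the thread.

hash.new <- function(size = 1e6L) {
  ## hashed environment with no parent, used as a string -> string map
  new.env(hash = TRUE, parent = emptyenv(), size = size)
}

hash.put <- function(h, key, value) {
  assign(key, value, envir = h)
}

hash.get <- function(h, key) {
  ## NULL when the key has never been inserted
  if (exists(key, envir = h, inherits = FALSE))
    get(key, envir = h, inherits = FALSE)
  else
    NULL
}

hash.deep.get <- function(h, key) {
  ## follow key -> value -> ... until an entry maps to itself (or is
  ## missing), returning every string seen along the chain
  seen <- character(0)
  cur  <- key
  repeat {
    seen <- c(seen, cur)
    nxt  <- hash.get(h, cur)
    if (is.null(nxt) || identical(nxt, cur))
      return(seen)
    cur <- nxt
  }
}

With these definitions, the loop in the quoted message runs as written on the small A,B / B,C / D,E example.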
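On the second question (alternatives for this particular job), one option not spelled out in the thread is to translate the strings to integers once and run a plain disjoint-set (union-find) over an integer vector, so that the per-pair work never touches character data. A sketch under the same d$a / d$b assumption, not tuned for 20M pairs:

strings <- sort(unique(c(d$a, d$b)))   # dictionary: one entry per distinct string
parent  <- seq_along(strings)          # every string starts in its own set

find <- function(i) {
  ## walk up to the root of i's set, compressing the path as we go
  while (parent[i] != i) {
    parent[i] <<- parent[parent[i]]
    i <- parent[i]
  }
  i
}

ai <- match(d$a, strings)              # integer ids for the left-hand strings
bi <- match(d$b, strings)              # integer ids for the right-hand strings

for (k in seq_len(nrow(d))) {
  ra <- find(ai[k])
  rb <- find(bi[k])
  ## link the larger root under the smaller one; since 'strings' is sorted,
  ## the root is always the alphabetically smallest string of its set
  if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
}

## equivalence test: two strings are equivalent iff they share a root
find(match("A", strings)) == find(match("C", strings))   # TRUE
find(match("A", strings)) == find(match("D", strings))   # FALSE

Union by size or rank could be bolted on, but even this version avoids re-pointing whole chains of string keys on every insertion.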

