On Sat, Nov 2, 2013 at 11:12 AM, Martin Morgan <[email protected]> wrote:
> On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:
>
>> Sure,
>>
>> I was attempting to be concise and boiling it down to what I saw as the
>> root
>> issue, but you are right, I could have taken it a step further. So here
>> goes.
>>
>> I have a set of around around 20M string pairs. A given string (say, A)
>> can
>> either be equivalent to another string (B) or not. If A and B occur
>> together in
>> the same pair, they are equivalent. But equivalence is transitive, so if
>> A and B
>> occur together in one pair, and A and C occur together in another pair,
>> then A
>> and C are also equivalent. I need a way to quickly determine if any two
>> strings
>> from my data set are equivalent or not.
>>
>
> Do you mean that if A,B occur together and B,C occur together, then A,B
> and A,C are equivalent?
>
> Here's a function that returns a unique identifier (not well tested!),
> allowing for transitive relations but not circularity.
>
> uid <- function(x, y)
> {
> i <- seq_along(x) # global index
> xy <- paste0(x, y) # make unique identifiers
> idx <- match(xy, xy)
>
> repeat {
> ## transitive look-up
> y_idx <- match(y[idx], x) # look up 'y' in 'x'
> keep <- !is.na(y_idx)
> if (!any(keep)) # no transitive relations,
> done!
> break
> x[idx[keep]] <- x[y_idx[keep]]
> y[idx[keep]] <- y[y_idx[keep]]
>
> ## create new index of values
> xy <- paste0(x, y)
> idx <- match(xy, xy)
> }
> idx
> }
>
> Values with the same index are identical. Some tests
>
> > x <- c(1, 2, 3, 4)
> > y <- c(2, 3, 5, 6)
> > uid(x, y)
> [1] 1 1 1 4
> > i <- sample(x); uid(x[i], y[i])
> [1] 1 1 3 1
> > uid(as.character(x), as.character(y)) ## character() ok
> [1] 1 1 1 4
> > uid(1:10, 1 + 1:10)
> [1] 1 1 1 1 1 1 1 1 1 1
> > uid(integer(), integer())
> integer(0)
> > x <- c(1, 2, 3)
> > y <- c(2, 3, 1)
> > uid(x, y) ## circular!
> C-c C-c
>
> I think this will scale well enough, but the worst-case scenario can be
> made to be log(longest chain) and copying can be reduced by using an index
> i and subsetting the original vector on each iteration. I think you could
> test for circularity by checking that the updated x are not a permutation
> of the kept x, all(x[y_idx[keep]] %in% x[keep]))
>
> Martin
>
>
This problem (union-find) is discussed in Chapter 1 of Sedgwick's
"Algorithms". There's an algorithm given that takes linear time to build
the structure, worst-case logarithmic time to query, and effectively
constant average time to query (inverse-Ackerman amortized complexity).
-thomas
--
Thomas Lumley
Professor of Biostatistics
University of Auckland
[[alternative HTML version deleted]]
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.