On 07/15/2010 08:38 AM, Martin Morgan wrote: > On 07/15/2010 01:12 AM, Hervé Pagès wrote: >> Hi, >> >> I'm subsetting a named vector using character indices. >> My vector of indices (or keys) is 10x longer than the vector >> I'm subsetting. All my keys are distinct and only 10% of them >> are valid (i.e. match a name of the vector being subsetted). >> It is surprisingly slow: >> >> x1 <- 1:1000 >> names(x1) <- paste("a", x1, sep="") >> keys <- sample(c(names(x1), paste("b", 1:9000, sep=""))) >>> system.time(y1 <- x1[keys]) >> user system elapsed >> 0.410 0.000 0.416 >> >> x2 <- 1:2000 >> names(x2) <- paste("a", x2, sep="") >> keys <- sample(c(names(x2), paste("b", 1:18000, sep=""))) >>> system.time(y2 <- x2[keys]) >> user system elapsed >> 1.730 0.000 1.736 > > For what its worth, I think this comes about in the loop starting at > subscript.c:538, which seems to be there to allow [<-,*,character to > extend a vector with new named elements > >> x=c(a=1) >> x["b"] = 2 >> x > a b > 1 2 > > It seems to be irrelevant (?) for sub-setting per se (though by analogy > one might expect x["c"] to return a length-1 vector NA with name "c", > whereas it returns a vector with names NA). > > Seems like the O(n^2) loop through NonNullStringMatch could be replaced > by look-ups into a hash, or an additional argument could be propagated
this passes make check and does > x4 <- 1:8000 > names(x4) <- paste("a", x4, sep="") > keys <- sample(c(names(x4), paste("b", 1:72000, sep=""))) > system.time(y4 <- x4[keys]) user system elapsed 0.092 0.000 0.093 > identical(y4, x4[match(keys, names(x4))]) [1] TRUE but uses some additional memory. Martin Index: src/main/subscript.c =================================================================== --- src/main/subscript.c (revision 52526) +++ src/main/subscript.c (working copy) @@ -535,15 +535,17 @@ } + SEXP sindx = PROTECT(match(s, s, 0)); /* first match */ for (i = 0; i < ns; i++) { sub = INTEGER(indx)[i]; if (sub == 0) { - for (j = 0 ; j < i ; j++) - if (NonNullStringMatch(STRING_ELT(s, i), STRING_ELT(s, j))) { - sub = INTEGER(indx)[j]; - SET_VECTOR_ELT(indexnames, i, STRING_ELT(s, j)); - break; - } + j = INTEGER(sindx)[i] - 1; + if (NA_STRING != STRING_ELT(s, j) && + R_NilValue != STRING_ELT(s, j)) + { + sub = INTEGER(indx)[j]; + SET_VECTOR_ELT(indexnames, i, STRING_ELT(s, j)); + } } if (sub == 0) { if (!canstretch) { @@ -561,7 +563,7 @@ setAttrib(indx, R_UseNamesSymbol, indexnames); if (canstretch) *stretch = extra; - UNPROTECT(4); + UNPROTECT(5); return indx; } > to stringSubscript to exit early when names aren't required. Or the call > to makeSubscript at subset.c:164 could instead be made to matchE in > unique.c. > > Martin > >> >> x3 <- 1:4000 >> names(x3) <- paste("a", x3, sep="") >> keys <- sample(c(names(x3), paste("b", 1:36000, sep=""))) >>> system.time(y3 <- x3[keys]) >> user system elapsed >> 8.900 0.010 9.227 >> >> x4 <- 1:8000 >> names(x4) <- paste("a", x4, sep="") >> keys <- sample(c(names(x4), paste("b", 1:72000, sep=""))) >>> system.time(y4 <- x4[keys]) >> user system elapsed >> 130.390 0.000 132.316 >> >> And it's apparently worse than quadratic in time! >> >> I'm wondering why this subsetting by name is so slow since it >> seems it could be implemented with x4[match(keys, names(x4))], >> which is very fast: only 0.012s! >> >> This is with R-2.11.0 and R-2.12.0. >> >> Thanks, >> H. >> >> > > -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel