Re: [Rd] Very slow subsetting by name

Martin Morgan Thu, 15 Jul 2010 11:16:15 -0700

On 07/15/2010 08:38 AM, Martin Morgan wrote:
> On 07/15/2010 01:12 AM, Hervé Pagès wrote:
>> Hi,
>>
>> I'm subsetting a named vector using character indices.
>> My vector of indices (or keys) is 10x longer than the vector
>> I'm subsetting. All my keys are distinct and only 10% of them
>> are valid (i.e. match a name of the vector being subsetted).
>> It is surprisingly slow:
>>
>> x1 <- 1:1000
>> names(x1) <- paste("a", x1, sep="")
>> keys <- sample(c(names(x1), paste("b", 1:9000, sep="")))
>>> system.time(y1 <- x1[keys])
>>    user  system elapsed
>>   0.410   0.000   0.416
>>
>> x2 <- 1:2000
>> names(x2) <- paste("a", x2, sep="")
>> keys <- sample(c(names(x2), paste("b", 1:18000, sep="")))
>>> system.time(y2 <- x2[keys])
>>    user  system elapsed
>>   1.730   0.000   1.736
> 
> For what its worth, I think this comes about in the loop starting at
> subscript.c:538, which seems to be there to allow [<-,*,character to
> extend a vector with new named elements
> 
>> x=c(a=1)
>> x["b"] = 2
>> x
> a b
> 1 2
> 
> It seems to be irrelevant (?) for sub-setting per se (though by analogy
> one might expect x["c"] to return a length-1 vector NA with name "c",
> whereas it returns a vector with names NA).
> 
> Seems like the O(n^2) loop through NonNullStringMatch could be replaced
> by look-ups into a hash, or an additional argument could be propagated


this passes make check and does

> x4 <- 1:8000
> names(x4) <- paste("a", x4, sep="")
> keys <- sample(c(names(x4), paste("b", 1:72000, sep="")))
> system.time(y4 <- x4[keys])
   user  system elapsed
  0.092   0.000   0.093
> identical(y4, x4[match(keys, names(x4))])
[1] TRUE

but uses some additional memory.

Martin

Index: src/main/subscript.c
===================================================================
--- src/main/subscript.c        (revision 52526)
+++ src/main/subscript.c        (working copy)
@@ -535,15 +535,17 @@
     }


+    SEXP sindx = PROTECT(match(s, s, 0)); /* first match */
     for (i = 0; i < ns; i++) {
        sub = INTEGER(indx)[i];
        if (sub == 0) {
-           for (j = 0 ; j < i ; j++)
-               if (NonNullStringMatch(STRING_ELT(s, i), STRING_ELT(s, j))) {
-                   sub = INTEGER(indx)[j];
-                   SET_VECTOR_ELT(indexnames, i, STRING_ELT(s, j));
-                   break;
-               }
+            j = INTEGER(sindx)[i] - 1;
+            if (NA_STRING != STRING_ELT(s, j) &&
+                R_NilValue != STRING_ELT(s, j))
+            {
+                sub = INTEGER(indx)[j];
+                SET_VECTOR_ELT(indexnames, i, STRING_ELT(s, j));
+            }
        }
        if (sub == 0) {
            if (!canstretch) {
@@ -561,7 +563,7 @@
        setAttrib(indx, R_UseNamesSymbol, indexnames);
     if (canstretch)
        *stretch = extra;
-    UNPROTECT(4);
+    UNPROTECT(5);
     return indx;
 }

> to stringSubscript to exit early when names aren't required. Or the call
> to makeSubscript at subset.c:164 could instead be made to matchE in
> unique.c.
> 
> Martin
> 
>>
>> x3 <- 1:4000
>> names(x3) <- paste("a", x3, sep="")
>> keys <- sample(c(names(x3), paste("b", 1:36000, sep="")))
>>> system.time(y3 <- x3[keys])
>>    user  system elapsed
>>   8.900   0.010   9.227
>>
>> x4 <- 1:8000
>> names(x4) <- paste("a", x4, sep="")
>> keys <- sample(c(names(x4), paste("b", 1:72000, sep="")))
>>> system.time(y4 <- x4[keys])
>>    user  system elapsed
>> 130.390   0.000 132.316
>>
>> And it's apparently worse than quadratic in time!
>>
>> I'm wondering why this subsetting by name is so slow since it
>> seems it could be implemented with x4[match(keys, names(x4))],
>> which is very fast: only 0.012s!
>>
>> This is with R-2.11.0 and R-2.12.0.
>>
>> Thanks,
>> H.
>>
>>
> 
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Very slow subsetting by name

Reply via email to