Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

via GitHub Mon, 06 Nov 2023 04:22:28 -0800


mikemccand commented on code in PR #12769:
URL: https://github.com/apache/lucene/pull/12769#discussion_r1383241368



##########
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java:
##########
@@ -312,6 +316,111 @@ public int getOrdinal(FacetLabel cp) throws IOException {
     return ret;
   }
 
+  @Override
+  public int[] getBulkOrdinals(FacetLabel... categoryPaths) throws IOException 
{
+    ensureOpen();
+    if (categoryPaths.length == 0) {
+      return new int[0];
+    }
+    if (categoryPaths.length == 1) {
+      return new int[] {getOrdinal(categoryPaths[0])};
+    }
+    // First try to find results in the cache:
+    int[] result = new int[categoryPaths.length];
+    int[] indexesMissingFromCache = new int[10]; // initial size, will grow 
when required
+    int numberOfMissingFromCache = 0;
+    FacetLabel cp;
+    Integer res;
+    for (int i = 0; i < categoryPaths.length; i++) {
+      cp = categoryPaths[i];
+      synchronized (ordinalCache) {
+        res = ordinalCache.get(cp);
+      }
+      if (res != null) {
+        if (res < indexReader.maxDoc()) {
+          // Since the cache is shared with DTR instances allocated from
+          // doOpenIfChanged, we need to ensure that the ordinal is one that
+          // this DTR instance recognizes.
+          result[i] = res;
+        } else {
+          // if we get here, it means that the category was found in the cache,
+          // but is not recognized by this TR instance. Therefore, there's no
+          // need to continue search for the path on disk, because we won't 
find
+          // it there too.
+          result[i] = TaxonomyReader.INVALID_ORDINAL;
+        }
+      } else {
+        indexesMissingFromCache =
+            ArrayUtil.grow(indexesMissingFromCache, numberOfMissingFromCache + 
1);
+        indexesMissingFromCache[numberOfMissingFromCache++] = i;
+      }
+    }
+    // all ordinals found in cache
+    if (indexesMissingFromCache.length == 0) {
+      return result;
+    }
+
+    // If we're still here, we have at least one cache miss. We need to fetch 
the
+    // value from disk, and then also put results in the cache
+
+    // Create array of missing terms, and sort them so that later we scan 
terms dictionary
+    // forward-only.
+    // Note: similar functionality exists within BytesRefHash and 
BytesRefArray, but they don't
+    // reuse BytesRefs and assign their own ords. It is cheaper to have custom 
implementation here.
+    BytesRef[] termsToGet = new BytesRef[numberOfMissingFromCache];
+    for (int i = 0; i < termsToGet.length; i++) {
+      cp = categoryPaths[indexesMissingFromCache[i]];
+      termsToGet[i] = new BytesRef(FacetsConfig.pathToString(cp.components, 
cp.length));
+    }
+    // sort both terms and their indexes in the input parameter
+    int[] finalMissingFromCache = indexesMissingFromCache;
+
+    new StringSorter(BytesRefComparator.NATURAL) {
+
+      @Override
+      protected void swap(int i, int j) {
+        int tmp = finalMissingFromCache[i];
+        finalMissingFromCache[i] = finalMissingFromCache[j];
+        finalMissingFromCache[j] = tmp;
+        BytesRef tmpBytes = termsToGet[i];
+        termsToGet[i] = termsToGet[j];
+        termsToGet[j] = tmpBytes;
+      }
+
+      @Override
+      protected void get(BytesRefBuilder builder, BytesRef result, int i) {
+        BytesRef ref = termsToGet[i];
+        result.offset = ref.offset;
+        result.length = ref.length;
+        result.bytes = ref.bytes;
+      }
+    }.sort(0, numberOfMissingFromCache);
+
+    TermsEnum te = MultiTerms.getTerms(indexReader, Consts.FULL).iterator();
+    PostingsEnum postings = null;
+    int ord;
+    int resIndex;
+    for (int i = 0; i < numberOfMissingFromCache; i++) {
+      resIndex = indexesMissingFromCache[i];
+      if (te.seekExact(termsToGet[i])) {
+        postings = te.postings(postings, 0);
+        if (postings != null && postings.nextDoc() != 
DocIdSetIterator.NO_MORE_DOCS) {
+          ord = postings.docID();
+          result[resIndex] = ord;
+          // populate cache
+          synchronized (ordinalCache) {

Review Comment:
   You could maybe do this after looking up all the ords, under a single lock 
acquisition?  But this is a minor possible opto.



##########
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java:
##########
@@ -476,6 +479,86 @@ public void testOpenIfChangedReplaceTaxonomy() throws 
Exception {
     src.close();
   }
 
+  private void assertGettingOrdinals(
+      DirectoryTaxonomyReader reader, int[] expectedOrds, FacetLabel[] 
sourcePaths)
+      throws IOException {
+    // To exercise mix of cache hit and cache misses for getOrdinal and 
getBulkOrdinals this method:
+    // 1. Randomly gets a few ords using sequential calls.
+    // 2. Call bulk get method.
+    // 3. Continue sequential calls for the remaining items.
+    assertEquals(expectedOrds.length, sourcePaths.length);
+    int bulkOperationsIteration = random().nextInt(sourcePaths.length);
+    List<Integer> indexesShuffled =
+        new ArrayList<>(IntStream.range(0, 
sourcePaths.length).boxed().toList());
+    Collections.shuffle(indexesShuffled, random());
+
+    for (int i = 0; i < bulkOperationsIteration; i++) {
+      int nextIndex = indexesShuffled.get(i);
+      assertEquals(expectedOrds[nextIndex], 
reader.getOrdinal(sourcePaths[nextIndex]));
+    }
+
+    int[] bulkOrdResult = reader.getBulkOrdinals(sourcePaths);
+    assertArrayEquals(expectedOrds, bulkOrdResult);
+
+    for (int i = bulkOperationsIteration; i < sourcePaths.length; i++) {
+      int nextIndex = indexesShuffled.get(i);
+      assertEquals(expectedOrds[nextIndex], 
reader.getOrdinal(sourcePaths[nextIndex]));
+    }
+  }
+
+  private void assertGettingPaths(
+      DirectoryTaxonomyReader reader, FacetLabel[] expectedPaths, int[] 
sourceOrds)
+      throws IOException {
+    // To exercise mix of cache hit and cache misses for getPath and 
getBulkPath this method:

Review Comment:
   Cool!



##########
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java:
##########
@@ -476,6 +479,86 @@ public void testOpenIfChangedReplaceTaxonomy() throws 
Exception {
     src.close();
   }
 
+  private void assertGettingOrdinals(

Review Comment:
   Could we maybe add a thread safety test as well, if we aren't already 
testing across threads?  Index many `FacetLabel -> ord`, open reader, spawn 
threads all doing random lookups and confirming the ords are correct?  Maybe 
with random cache sizes so we sometimes face evictions?



##########
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyReader.java:
##########
@@ -201,6 +201,26 @@ public ChildrenIterator getChildren(final int ordinal) 
throws IOException {
    */
   public abstract int getOrdinal(FacetLabel categoryPath) throws IOException;
 
+  /**
+   * Returns the ordinals of the categories given as a path. The ordinal is 
the category's serial
+   * number, an integer which starts with 0 and grows as more categories are 
added (note that once a
+   * category is added, it can never be deleted).
+   *
+   * <p>The implementation in {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader} is 
generally faster than
+   * iteratively calling {@link #getOrdinal(FacetLabel)}
+   *
+   * @return array of the category's' ordinals or {@link #INVALID_ORDINAL} if 
the category wasn't
+   *     found.
+   */
+  public int[] getBulkOrdinals(FacetLabel... categoryPath) throws IOException {
+    int[] ords = new int[categoryPath.length];

Review Comment:
   I love the above javadoc, but maybe additionally add comment here that this 
is the simple & slow base class / default implementation, and note that 
`DirectoryTaxonomyReader` has an optimized and more complex version?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

Reply via email to