[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

GitBox Wed, 17 Feb 2021 15:34:59 -0800


mikemccand commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r578019284




##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,137 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int originalPosition[] = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    // parallel sort the ordinals and originalPosition array based on the values in the ordinals
+    // array
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          // this check is only needed once to confirm that the index uses BinaryDocValues
+          boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+          if (success == false) {
+            return getBulkPathForOlderIndexes(ordinals);
+          }
+        }
+        boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+        assert success;
+        bulkPath[originalPosition[i]] =
+            new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+      }
+    }
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      synchronized (categoryCache) {
+        categoryCache.put(ordinals[i], bulkPath[originalPosition[i]]);

Review comment:
       > This should also (theoretically) be faster than trying to get the lock again and again in a loop?
   
   Hmm, I'm confused: this code is already acquiring the lock inside a `for` 
loop?  I guess we could move the `synchronized` outside of the `for` loop?  Or 
maybe `javac` is doing this for us already?  But let's make it explicit, or 
let's just merge this `for` loop with the one before (and keep acquiring the 
lock inside the `for` loop)?  One big benefit of the latter approach is that if 
all of the ordinals were already cached (hopefully the common case), we do not 
need any locking to update the cache, whereas with the current approach we 
still do.
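
   The two alternatives above could look roughly like the sketch below. It is 
not the PR's code: `BulkCacheSketch`, `cacheAllUnderOneLock`, `resolve`, the 
`loader` callback, and the `putLockAcquisitions` counter are hypothetical names 
made up for illustration, and a plain `HashMap` of strings stands in for the 
reader's `categoryCache` of `FacetLabel`s.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical stand-in for the reader's cache handling; not the Lucene code. */
class BulkCacheSketch {
  // Assumption: a simple map guarded by its own monitor, like categoryCache.
  private final Map<Integer, String> categoryCache = new HashMap<>();

  // Counts how often the lock is taken for writes; only modified under the lock.
  int putLockAcquisitions = 0;

  /** Option 1: acquire the lock once, outside the loop, and put everything. */
  void cacheAllUnderOneLock(int[] ordinals, String[] labels) {
    synchronized (categoryCache) {
      putLockAcquisitions++;
      for (int i = 0; i < ordinals.length; i++) {
        categoryCache.put(ordinals[i], labels[i]);
      }
    }
  }

  /**
   * Option 2: merge caching into the lookup loop. The write lock is taken
   * only for ordinals that missed the cache, so a fully cached request never
   * locks for a put.
   */
  String[] resolve(int[] ordinals, IntFunction<String> loader) {
    String[] result = new String[ordinals.length];
    for (int i = 0; i < ordinals.length; i++) {
      String label;
      synchronized (categoryCache) {
        label = categoryCache.get(ordinals[i]);
      }
      if (label == null) {
        // Cache miss: load outside the lock, then publish under the lock.
        label = loader.apply(ordinals[i]);
        synchronized (categoryCache) {
          putLockAcquisitions++;
          categoryCache.put(ordinals[i], label);
        }
      }
      result[i] = label;
    }
    return result;
  }
}
```

   With option 2, the cost of writing to the cache scales with the number of 
misses rather than with the size of the request; the cache reads still take 
the lock in this sketch, as `getPathFromCache` does in the diff.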




