[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

GitBox Sat, 26 Jun 2021 17:56:28 -0700


mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r659242645




##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    /* parallel sort the ordinals and originalPosition array based on the 
values in the ordinals array */
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;

Review comment:
       Hmm, leftover semi-colon?

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)

Review comment:
       Hmm, the incoming `int indexReaderMaxDoc` is always 
`indexReader.maxDoc()` right?  Maybe remove these (redundant) parameter?

##########
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {

Review comment:
       Maybe just rename to `testBulkPath`?

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?

Review comment:
       Could you open an issue for this?  This seems like a low-hanging-fruit, 
now that we use lots of HPPC classes in the `facet` module today!  And this 
would improve RAM efficiency of this cache, allowing us to make it larger maybe.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    /* parallel sort the ordinals and originalPosition array based on the 
values in the ordinals array */
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        /*
+        If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that 
contains our ordinal
+         */
+        if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          /*
+          If the index is constructed with the older StoredFields it will not 
have any BinaryDocValues field and will return null
+           */
+          if (values == null) {
+            return getBulkPathForOlderIndexes(ordinals);
+          }
+        }
+        boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+        assert success;
+        bulkPath[originalPosition[i]] =
+            new 
FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+
+        // add the value to the categoryCache after computation
+        synchronized (categoryCache) {

Review comment:
       Could you add a `// TODO` to also bulk-insert into the cached the newly 
retrieved `ordinal -> FacetLabel`, rather than one at a time like this?  It 
might improve thread contention on this lock for concurrent searches.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    /* parallel sort the ordinals and originalPosition array based on the 
values in the ordinals array */
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        /*
+        If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that 
contains our ordinal
+         */
+        if (values == null || ordinals[i] >= leafReaderMaxDoc) {

Review comment:
       Hmm, don't we need to compare `ordinals[i] >= leafReaderMaxDoc + 
leafReaderDocBase` instead?  I.e. the ordinal is in the global `IndexReader` 
docid space, but the `leafReaderMaxDoc` is just for this leaf.  How come tests 
are not failing due to this :)

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java
##########
@@ -146,10 +147,18 @@ public FacetResult getTopChildren(int topN, String dim, 
String... path) throws I
     }
 
     LabelAndValue[] labelValues = new LabelAndValue[q.size()];
+    int[] ordinals = new int[labelValues.length];
+    float[] values = new float[labelValues.length];
+
     for (int i = labelValues.length - 1; i >= 0; i--) {
       TopOrdAndFloatQueue.OrdAndValue ordAndValue = q.pop();
-      FacetLabel child = taxoReader.getPath(ordAndValue.ord);
-      labelValues[i] = new LabelAndValue(child.components[cp.length], 
ordAndValue.value);
+      ordinals[i] = ordAndValue.ord;
+      values[i] = ordAndValue.value;
+    }
+
+    FacetLabel[] bulkPath = ((DirectoryTaxonomyReader) 
taxoReader).getBulkPath(ordinals);

Review comment:
       Hmm is this cast always safe? Are there other implementations of 
`TaxonomyReader`?

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "

Review comment:
       Maybe also include the maximum ordinal in this exception message too?

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on DocValues.

Review comment:
       Can you change to `index is based on BinaryDocValues`, and note the 
Lucene version that first switched to writing `BinaryDocValues` for the 
taxonomy index?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

Reply via email to