mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697513953
##########
File path:
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyReader.java
##########
@@ -212,6 +212,19 @@ public int getOrdinal(String dim, String... path) throws
IOException {
/** Returns the path name of the category with the given ordinal. */
public abstract FacetLabel getPath(int ordinal) throws IOException;
+ /**
+ * Returns the path names of the list of ordinals associated with different
categories. The
+ * implementation in {@link
org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader} is
+ * generally faster than iteratively calling {@link #getPath(int)}
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
Review comment:
Maybe add a comment that this is a default (slow) implementation that
does just iteratively call `getPath`?
##########
File path:
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate
ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextBoolean()) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
+ allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+ }
+
+ // create multiple threads to check result correctness and thread
contention in the cache
+ Thread[] addThreads = new Thread[atLeast(4)];
+ for (int z = 0; z < addThreads.length; z++) {
+ addThreads[z] =
+ new Thread() {
+ @Override
+ public void run() {
+ // each thread iterates for maxThreadIterations times
+ int maxThreadIterations = random().nextInt(10);
+ for (int threadIterations = 0;
+ threadIterations < maxThreadIterations;
+ threadIterations++) {
+
+ // length of the FacetLabel array that we are going to check
+ int numOfOrdinalsToCheck =
random().nextInt(allOrdinals.length);
Review comment:
Sometimes this will be `0`, which I guess is OK :) We confirm that
passing empty `int[]` is acceptable.
##########
File path:
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws
IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as
IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one
of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws
IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link
#getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it
detects that the index
+ * was created using StoredFields (with no performance gains) and uses
DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to
BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories
inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
+
+ /* parallel sort the ordinals and originalPosition array based on the
values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+ List<Integer> uncachedOrdinalPositions = new ArrayList<>();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find
the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence
we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase +
leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not
have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
+ return super.getBulkPath(ordinals);
+ }
+ }
+ // values is leaf specific so you only advance till you reach the
target within the leaf
+ boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+ assert success;
+ bulkPath[originalPosition[i]] =
+ new
FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+
+ uncachedOrdinalPositions.add(i);
+ }
+ }
+
+ synchronized (categoryCache) {
Review comment:
Can we add `if (uncachedOrdinalPositions.isEmpty() == false) { ... }`
around this? This way we don't try to lock if there is everything was already
cached (hopefully a common case). It is possible hotspot might do this opto
for us but let's be explicit :)
##########
File path:
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws
IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as
IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one
of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws
IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link
#getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it
detects that the index
+ * was created using StoredFields (with no performance gains) and uses
DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to
BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories
inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
Review comment:
This is a real hazard, but I think we can ignore it because users are
supposed to ensure all search/faceting threads complete before closing the
taxonomy reader.
##########
File path:
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate
ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextBoolean()) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
+ allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+ }
+
+ // create multiple threads to check result correctness and thread
contention in the cache
+ Thread[] addThreads = new Thread[atLeast(4)];
+ for (int z = 0; z < addThreads.length; z++) {
+ addThreads[z] =
+ new Thread() {
+ @Override
+ public void run() {
+ // each thread iterates for maxThreadIterations times
+ int maxThreadIterations = random().nextInt(10);
Review comment:
Maybe rename to `numThreadIterations` since it is exactly how many
iterations this thread will do?
##########
File path:
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate
ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextBoolean()) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
+ allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+ }
+
+ // create multiple threads to check result correctness and thread
contention in the cache
+ Thread[] addThreads = new Thread[atLeast(4)];
Review comment:
Hmm `atLeast(4)` is a bit dangerous because in NIGHTLY build that might
generate some really massive ints. How about
`RandomNumbers.randomIntBetween(random(), 1, 12)` instead?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]