Re: [PR] Split taxonomy arrays across chunks [lucene]

via GitHub Sun, 07 Jan 2024 22:46:02 -0800


stefanvodita commented on code in PR #12995:
URL: https://github.com/apache/lucene/pull/12995#discussion_r1444211059



##########
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java:
##########
@@ -68,25 +90,49 @@ public TaxonomyIndexArrays(IndexReader reader, 
TaxonomyIndexArrays copyFrom) thr
     // it may be caused if e.g. the taxonomy segments were merged, and so an 
updated
     // NRT reader was obtained, even though nothing was changed. this is not 
very likely
     // to happen.
-    int[] copyParents = copyFrom.parents();
-    this.parents = new int[reader.maxDoc()];
-    System.arraycopy(copyParents, 0, parents, 0, copyParents.length);
-    initParents(reader, copyParents.length);
-
+    int[][] parentArray = allocateChunkedArray(reader.maxDoc());
+    copyChunkedArray(copyFrom.parents.values, parentArray);
+    initParents(parentArray, reader, copyFrom.parents.length());
+    parents = new ChunkedArray(parentArray);
     if (copyFrom.initializedChildren) {
       initChildrenSiblings(copyFrom);
     }
   }
 
+  private static int[][] allocateChunkedArray(int size) {
+    int chunkCount = size / CHUNK_SIZE + 1;
+    int lastChunkSize = size % CHUNK_SIZE;
+    int[][] array = new int[chunkCount][];
+    if (array.length > 0) {
+      for (int i = 0; i < chunkCount - 1; i++) {
+        array[i] = new int[CHUNK_SIZE];
+      }
+      array[chunkCount - 1] = new int[lastChunkSize];
+    }
+    return array;
+  }
+
+  private static void copyChunkedArray(int[][] oldArray, int[][] newArray) {
+    // Copy all but the last (maybe partial) chunk from the old array
+    if (oldArray.length > 1) {
+      System.arraycopy(oldArray, 0, newArray, 0, oldArray.length - 1);
+    }
+    int[] lastCopyChunk = oldArray[oldArray.length - 1];
+    System.arraycopy(lastCopyChunk, 0, newArray[oldArray.length - 1], 0, 
lastCopyChunk.length);

Review Comment:
   Thank you for the thorough explanation! Follow-up question: Why can't we 
shallowly copy even the last array? If we expected the old reader to keep 
functioning as before, they I agree we would need to deep copy the last array, 
but I think that's already not the case with the taxo reader because of the way 
the caches are reused after a refresh. Wouldn't it be safe to keep working with 
`oldArray`?



##########
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java:
##########
@@ -38,27 +38,49 @@
  * @lucene.experimental
  */
 class TaxonomyIndexArrays extends ParallelTaxonomyArrays implements 
Accountable {
+  private static final int CHUNK_SIZE = 8192;
 
-  private final int[] parents;
+  private final ChunkedArray parents;
 
   // the following two arrays are lazily initialized. note that we only keep a
   // single boolean member as volatile, instead of declaring the arrays
   // volatile. the code guarantees that only after the boolean is set to true,
   // the arrays are returned.
   private volatile boolean initializedChildren = false;
-  private int[] children, siblings;
+  private ChunkedArray children, siblings;
+
+  private static class ChunkedArray extends ParallelTaxonomyArrays.IntArray {
+    private final int[][] values;
+
+    private ChunkedArray(int[][] values) {
+      this.values = values;
+    }
+
+    @Override
+    public int get(int i) {
+      return values[i / CHUNK_SIZE][i % CHUNK_SIZE];
+    }
+
+    public void set(int i, int val) {
+      values[i / CHUNK_SIZE][i % CHUNK_SIZE] = val;
+    }
+
+    @Override
+    public int length() {
+      return (values.length - 1) * CHUNK_SIZE + values[values.length - 
1].length;
+    }
+  }
 
   /** Used by {@link #add(int, int)} after the array grew. */
-  private TaxonomyIndexArrays(int[] parents) {
-    this.parents = parents;
+  private TaxonomyIndexArrays(int[][] parents) {
+    this.parents = new ChunkedArray(parents);
   }
 
   public TaxonomyIndexArrays(IndexReader reader) throws IOException {
-    parents = new int[reader.maxDoc()];
-    if (parents.length > 0) {
-      initParents(reader, 0);
-      parents[0] = TaxonomyReader.INVALID_ORDINAL;
-    }
+    int[][] parentArray = allocateChunkedArray(reader.maxDoc());

Review Comment:
   I'm not sure one way is better than the other either, so let's keep it as is.



##########
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java:
##########
@@ -68,25 +90,49 @@ public TaxonomyIndexArrays(IndexReader reader, 
TaxonomyIndexArrays copyFrom) thr
     // it may be caused if e.g. the taxonomy segments were merged, and so an 
updated
     // NRT reader was obtained, even though nothing was changed. this is not 
very likely
     // to happen.
-    int[] copyParents = copyFrom.parents();
-    this.parents = new int[reader.maxDoc()];
-    System.arraycopy(copyParents, 0, parents, 0, copyParents.length);
-    initParents(reader, copyParents.length);
-
+    int[][] parentArray = allocateChunkedArray(reader.maxDoc());
+    copyChunkedArray(copyFrom.parents.values, parentArray);
+    initParents(parentArray, reader, copyFrom.parents.length());
+    parents = new ChunkedArray(parentArray);
     if (copyFrom.initializedChildren) {
       initChildrenSiblings(copyFrom);
     }
   }
 
+  private static int[][] allocateChunkedArray(int size) {
+    int chunkCount = size / CHUNK_SIZE + 1;

Review Comment:
   I should have been more detailed in my comment. I didn't think the `size == 
0` case specifically would be an issue. I was wondering more generally if we 
really want the empty array at the end in cases where `size` is a multiple of 
`CHUNK_SIZE` (i.e. `lastChunkSize == 0`). And I think it's reasonable to do it 
that way - it sticks with this contract of having the last array be mutable in 
a way that the others aren't.
   
   My preferred solution right now would be to remove the special case for 
`size == 0` and remove the other if condition in the method, since it's never 
not met.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Split taxonomy arrays across chunks [lucene]

Reply via email to