stefanvodita commented on code in PR #12995: URL: https://github.com/apache/lucene/pull/12995#discussion_r1443667100
########## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java: ########## @@ -68,25 +90,49 @@ public TaxonomyIndexArrays(IndexReader reader, TaxonomyIndexArrays copyFrom) thr // it may be caused if e.g. the taxonomy segments were merged, and so an updated // NRT reader was obtained, even though nothing was changed. this is not very likely // to happen. - int[] copyParents = copyFrom.parents(); - this.parents = new int[reader.maxDoc()]; - System.arraycopy(copyParents, 0, parents, 0, copyParents.length); - initParents(reader, copyParents.length); - + int[][] parentArray = allocateChunkedArray(reader.maxDoc()); + copyChunkedArray(copyFrom.parents.values, parentArray); + initParents(parentArray, reader, copyFrom.parents.length()); + parents = new ChunkedArray(parentArray); if (copyFrom.initializedChildren) { initChildrenSiblings(copyFrom); } } + private static int[][] allocateChunkedArray(int size) { + int chunkCount = size / CHUNK_SIZE + 1; + int lastChunkSize = size % CHUNK_SIZE; + int[][] array = new int[chunkCount][]; + if (array.length > 0) { + for (int i = 0; i < chunkCount - 1; i++) { + array[i] = new int[CHUNK_SIZE]; + } + array[chunkCount - 1] = new int[lastChunkSize]; + } + return array; + } + + private static void copyChunkedArray(int[][] oldArray, int[][] newArray) { + // Copy all but the last (maybe partial) chunk from the old array + if (oldArray.length > 1) { + System.arraycopy(oldArray, 0, newArray, 0, oldArray.length - 1); + } + int[] lastCopyChunk = oldArray[oldArray.length - 1]; + System.arraycopy(lastCopyChunk, 0, newArray[oldArray.length - 1], 0, lastCopyChunk.length); Review Comment: Why are we making this a deep copy while the other arrays are shallow copies? 
########## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java: ########## @@ -38,27 +38,49 @@ * @lucene.experimental */ class TaxonomyIndexArrays extends ParallelTaxonomyArrays implements Accountable { + private static final int CHUNK_SIZE = 8192; - private final int[] parents; + private final ChunkedArray parents; // the following two arrays are lazily initialized. note that we only keep a // single boolean member as volatile, instead of declaring the arrays // volatile. the code guarantees that only after the boolean is set to true, // the arrays are returned. private volatile boolean initializedChildren = false; - private int[] children, siblings; + private ChunkedArray children, siblings; + + private static class ChunkedArray extends ParallelTaxonomyArrays.IntArray { Review Comment: Maybe we can call it `ChunkedIntArray` if the parent is `IntArray`? ########## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java: ########## @@ -38,27 +38,49 @@ * @lucene.experimental */ class TaxonomyIndexArrays extends ParallelTaxonomyArrays implements Accountable { + private static final int CHUNK_SIZE = 8192; - private final int[] parents; + private final ChunkedArray parents; // the following two arrays are lazily initialized. note that we only keep a // single boolean member as volatile, instead of declaring the arrays // volatile. the code guarantees that only after the boolean is set to true, // the arrays are returned. 
private volatile boolean initializedChildren = false; - private int[] children, siblings; + private ChunkedArray children, siblings; + + private static class ChunkedArray extends ParallelTaxonomyArrays.IntArray { + private final int[][] values; + + private ChunkedArray(int[][] values) { + this.values = values; + } + + @Override + public int get(int i) { + return values[i / CHUNK_SIZE][i % CHUNK_SIZE]; Review Comment: Since `CHUNK_SIZE` is a power of two, can we use bitwise operations to make sure this kind of lookup is fast, here and elsewhere? ########## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java: ########## @@ -38,27 +38,49 @@ * @lucene.experimental */ class TaxonomyIndexArrays extends ParallelTaxonomyArrays implements Accountable { + private static final int CHUNK_SIZE = 8192; - private final int[] parents; + private final ChunkedArray parents; // the following two arrays are lazily initialized. note that we only keep a // single boolean member as volatile, instead of declaring the arrays // volatile. the code guarantees that only after the boolean is set to true, // the arrays are returned. private volatile boolean initializedChildren = false; - private int[] children, siblings; + private ChunkedArray children, siblings; + + private static class ChunkedArray extends ParallelTaxonomyArrays.IntArray { + private final int[][] values; + + private ChunkedArray(int[][] values) { + this.values = values; + } + + @Override + public int get(int i) { + return values[i / CHUNK_SIZE][i % CHUNK_SIZE]; + } + + public void set(int i, int val) { + values[i / CHUNK_SIZE][i % CHUNK_SIZE] = val; + } + + @Override + public int length() { + return (values.length - 1) * CHUNK_SIZE + values[values.length - 1].length; + } + } /** Used by {@link #add(int, int)} after the array grew. 
*/ - private TaxonomyIndexArrays(int[] parents) { - this.parents = parents; + private TaxonomyIndexArrays(int[][] parents) { + this.parents = new ChunkedArray(parents); } public TaxonomyIndexArrays(IndexReader reader) throws IOException { - parents = new int[reader.maxDoc()]; - if (parents.length > 0) { - initParents(reader, 0); - parents[0] = TaxonomyReader.INVALID_ORDINAL; - } + int[][] parentArray = allocateChunkedArray(reader.maxDoc()); Review Comment: Is it better to work with `int[][]` in `TaxonomyArrays` rather than add support to the chunked array for the operations we want and only work with it? ########## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java: ########## @@ -68,25 +90,49 @@ public TaxonomyIndexArrays(IndexReader reader, TaxonomyIndexArrays copyFrom) thr // it may be caused if e.g. the taxonomy segments were merged, and so an updated // NRT reader was obtained, even though nothing was changed. this is not very likely // to happen. - int[] copyParents = copyFrom.parents(); - this.parents = new int[reader.maxDoc()]; - System.arraycopy(copyParents, 0, parents, 0, copyParents.length); - initParents(reader, copyParents.length); - + int[][] parentArray = allocateChunkedArray(reader.maxDoc()); + copyChunkedArray(copyFrom.parents.values, parentArray); + initParents(parentArray, reader, copyFrom.parents.length()); + parents = new ChunkedArray(parentArray); if (copyFrom.initializedChildren) { initChildrenSiblings(copyFrom); } } + private static int[][] allocateChunkedArray(int size) { + int chunkCount = size / CHUNK_SIZE + 1; Review Comment: Should we do the `+1` only if `lastChunkSize != 0`? 
########## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyCombined.java: ########## @@ -669,20 +668,26 @@ public void testChildrenArraysInvariants() throws Exception { // Find the youngest older sibling: int j; for (j = i - 1; j >= 0; j--) { - if (parents[j] == parents[i]) { + if (parents.get(j) == parents.get(i)) { break; // found youngest older sibling } } if (j < 0) { // no sibling found j = TaxonomyReader.INVALID_ORDINAL; } - assertEquals(j, olderSiblingArray[i]); + assertEquals(j, olderSiblingArray.get(i)); } tr.close(); indexDir.close(); } + private static void assertArrayEquals(int[] expected, ParallelTaxonomyArrays.IntArray actual) { Review Comment: I'm wondering if it's worth implementing `equals` for `ChunkedArray`. No strong opinion one way or the other. ########## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java: ########## @@ -98,26 +144,28 @@ private void computeChildrenSiblings(int first) { // reset the youngest child of all ordinals. while this should be done only // for the leaves, we don't know up front which are the leaves, so we reset // all of them. 
- for (int i = first; i < parents.length; i++) { - children[i] = TaxonomyReader.INVALID_ORDINAL; + int length = parents.length(); + for (int i = first; i < length; i++) { + children.set(i, TaxonomyReader.INVALID_ORDINAL); } // the root category has no parent, and therefore no siblings if (first == 0) { first = 1; - siblings[0] = TaxonomyReader.INVALID_ORDINAL; + siblings.set(0, TaxonomyReader.INVALID_ORDINAL); } - for (int i = first; i < parents.length; i++) { + for (int i = first; i < length; i++) { // note that parents[i] is always < i, so the right-hand-side of // the following line is already set when we get here - siblings[i] = children[parents[i]]; - children[parents[i]] = i; + siblings.set(i, children.get(parents.get(i))); Review Comment: Let's extract out `parents.get(i)` to a new variable, since it's a bit more than a simple array lookup now, and it's used twice. ########## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java: ########## @@ -153,12 +203,13 @@ private void initParents(IndexReader reader, int first) throws IOException { * <p><b>NOTE:</b> you should call this method from a thread-safe code. */ TaxonomyIndexArrays add(int ordinal, int parentOrdinal) { - if (ordinal >= parents.length) { - int[] newarray = ArrayUtil.grow(parents, ordinal + 1); - newarray[ordinal] = parentOrdinal; - return new TaxonomyIndexArrays(newarray); + if (ordinal >= parents.length()) { + int[][] newParents = allocateChunkedArray(ArrayUtil.oversize(ordinal + 1, Integer.BYTES)); Review Comment: Won't this make `parents.length()` invalid since we have empty space at the end of the chunked array and `length()` just looks at the size of the chunked array? ########## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java: ########## @@ -68,25 +90,49 @@ public TaxonomyIndexArrays(IndexReader reader, TaxonomyIndexArrays copyFrom) thr // it may be caused if e.g. 
the taxonomy segments were merged, and so an updated // NRT reader was obtained, even though nothing was changed. this is not very likely // to happen. - int[] copyParents = copyFrom.parents(); - this.parents = new int[reader.maxDoc()]; - System.arraycopy(copyParents, 0, parents, 0, copyParents.length); - initParents(reader, copyParents.length); - + int[][] parentArray = allocateChunkedArray(reader.maxDoc()); + copyChunkedArray(copyFrom.parents.values, parentArray); + initParents(parentArray, reader, copyFrom.parents.length()); + parents = new ChunkedArray(parentArray); if (copyFrom.initializedChildren) { initChildrenSiblings(copyFrom); } } + private static int[][] allocateChunkedArray(int size) { + int chunkCount = size / CHUNK_SIZE + 1; + int lastChunkSize = size % CHUNK_SIZE; + int[][] array = new int[chunkCount][]; + if (array.length > 0) { + for (int i = 0; i < chunkCount - 1; i++) { + array[i] = new int[CHUNK_SIZE]; + } + array[chunkCount - 1] = new int[lastChunkSize]; + } + return array; + } + + private static void copyChunkedArray(int[][] oldArray, int[][] newArray) { + // Copy all but the last (maybe partial) chunk from the old array + if (oldArray.length > 1) { + System.arraycopy(oldArray, 0, newArray, 0, oldArray.length - 1); + } + int[] lastCopyChunk = oldArray[oldArray.length - 1]; + System.arraycopy(lastCopyChunk, 0, newArray[oldArray.length - 1], 0, lastCopyChunk.length); + } + private synchronized void initChildrenSiblings(TaxonomyIndexArrays copyFrom) { if (!initializedChildren) { // must do this check ! - children = new int[parents.length]; - siblings = new int[parents.length]; + int[][] childrenArray = allocateChunkedArray(parents.length()); Review Comment: Similar comment as in the constructor - are we getting any memory back by allocating like this and then copying? 
########## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java: ########## @@ -68,25 +90,49 @@ public TaxonomyIndexArrays(IndexReader reader, TaxonomyIndexArrays copyFrom) thr // it may be caused if e.g. the taxonomy segments were merged, and so an updated // NRT reader was obtained, even though nothing was changed. this is not very likely // to happen. - int[] copyParents = copyFrom.parents(); - this.parents = new int[reader.maxDoc()]; - System.arraycopy(copyParents, 0, parents, 0, copyParents.length); - initParents(reader, copyParents.length); - + int[][] parentArray = allocateChunkedArray(reader.maxDoc()); Review Comment: Hmm, maybe I'm missing something, but if we're allocating here are we any better off than before? Most of this memory will be discarded anyway when we shallowly copy the old arrays. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org For queries about this service, please contact Infrastructure at: users@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org For additional commands, e-mail: issues-help@lucene.apache.org