mikemccand commented on code in PR #12738:
URL: https://github.com/apache/lucene/pull/12738#discussion_r1377320240


##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -461,9 +457,14 @@ long addNode(FSTCompiler.UnCompiledNode<T> nodeIn) throws IOException {
     final long thisNodeAddress = bytes.getPosition() - 1;
     bytes.reverse(startAddress, thisNodeAddress);
     nodeCount++;
-    return thisNodeAddress;
+    // nocommit: this is non-optimal, we should write the BytesStore to the ByteBlockPool directly
+    byte[] buf = new byte[Math.toIntExact(thisNodeAddress - startAddress + 1)];
+    bytes.copyBytes(startAddress, buf, 0, buf.length);
+    return new NodeAndBuffer(thisNodeAddress, buf);
   }
 
+  record NodeAndBuffer(long nodeAddress, byte[] bytes) {}

Review Comment:
   `record` was finalized in JDK 16 (preview in 14/15), right?  If we take this approach we won't be able to backport to 9.x (it must support down to JDK 11).
   
   Since `NodeAndBuffer` is only used transiently (as the return type from `addNode`), maybe we can just use a normal class?
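   
   For illustration, a JDK 11-friendly stand-in might look like this (just a sketch; visibility and naming are up for grabs):
   
   ```java
   // Plain-class equivalent of the NodeAndBuffer record above, fine on
   // JDK 11: two immutable package-private fields, no accessors needed
   // since it is only returned transiently from addNode.
   static final class NodeAndBuffer {
     final long nodeAddress;
     final byte[] bytes;
   
     NodeAndBuffer(long nodeAddress, byte[] bytes) {
       this.nodeAddress = nodeAddress;
       this.bytes = bytes;
     }
   }
   ```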



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -906,10 +900,6 @@ static final class UnCompiledNode<T> implements Node {
     T output;
     boolean isFinal;
 
-    // TODO: remove this tracking?  we used to use it for confusingly pruning NodeHash, but

Review Comment:
   Aha!  Thanks for tackling this `TODO` in stride ...



##########
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##########
@@ -269,36 +283,58 @@ private boolean nodesEqual(FSTCompiler.UnCompiledNode<T> node, long address) thr
     return false;
   }
 
+  record OffsetAndLength(long offset, int length) {}
+
   /** Inner class because it needs access to hash function and FST bytes. */
-  private class PagedGrowableHash {
+  class PagedGrowableHash {
     private PagedGrowableWriter entries;
-    private long count;
+    // nocommit: use PagedGrowableWriter? there was some size overflow issue with
+    // PagedGrowableWriter

Review Comment:
   This is interesting -- what exception did you hit?



##########
lucene/core/src/java/org/apache/lucene/util/fst/ReverseBytesReader.java:
##########
@@ -17,7 +17,7 @@
 package org.apache.lucene.util.fst;
 
 /** Reads in reverse from a single byte[]. */
-final class ReverseBytesReader extends FST.BytesReader {
+class ReverseBytesReader extends FST.BytesReader {

Review Comment:
   I wonder if this hurts FST traversal performance -- this is a hot class.  We can test perf later ... and we can accept some CPU loss here since capping RAM usage is important, I think.



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -749,7 +750,6 @@ public void add(IntsRef input, T output) throws IOException {
       // format cannot represent the empty input since
       // 'finalness' is stored on the incoming arc, not on
       // the node
-      frontier[0].inputCount++;

Review Comment:
   Oh -- is this now dead code (because we no longer support hard pruning while compiling the FST)?



##########
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##########
@@ -269,36 +283,58 @@ private boolean nodesEqual(FSTCompiler.UnCompiledNode<T> node, long address) thr
     return false;
   }
 
+  record OffsetAndLength(long offset, int length) {}
+
   /** Inner class because it needs access to hash function and FST bytes. */
-  private class PagedGrowableHash {

Review Comment:
   Hmm -- did something else in this package need access?



##########
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##########
@@ -214,7 +222,13 @@ private long hash(long node) throws IOException {
   * Compares an unfrozen node (UnCompiledNode) with a frozen node at byte location address (long),
   * returning true if they are equal.
   */
-  private boolean nodesEqual(FSTCompiler.UnCompiledNode<T> node, long address) throws IOException {
+  private boolean nodesEqual(
+      PagedGrowableHash table, FSTCompiler.UnCompiledNode<T> node, long address)
+      throws IOException {
+    // nocommit: this is non-optimal, we should have a BytesReader that wraps and reads the
+    // ByteBlockPool directly
+    byte[] bytes = table.getBytes(address);

Review Comment:
   I think we can change `address` from the "global" address (in the FST's growing append-only `byte[]`) to the more compact address within the `ByteBlockPool` belonging to each of the two (primary and fallback) hash sets?
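   
   Something like this hypothetical reverse reader over a hash table's `ByteBlockPool` is what I have in mind -- a sketch only, modeled on `ReverseBytesReader` and assuming `ByteBlockPool#readBytes(long, byte[], int, int)`; a real implementation would index the current block directly instead of copying one byte per call:
   
   ```java
   import org.apache.lucene.util.ByteBlockPool;
   
   // Hypothetical reverse reader over a NodeHash table's ByteBlockPool;
   // positions are offsets local to that pool, not global FST addresses.
   class ReverseByteBlockPoolReader extends FST.BytesReader {
     private final ByteBlockPool pool;
     private final byte[] scratch = new byte[1];
     private long pos;
   
     ReverseByteBlockPoolReader(ByteBlockPool pool) {
       this.pool = pool;
     }
   
     @Override
     public byte readByte() {
       // naive: copies one byte per call via readBytes; a real version
       // would index the current 32KB block directly, like ReverseBytesReader
       pool.readBytes(pos--, scratch, 0, 1);
       return scratch[0];
     }
   
     @Override
     public void readBytes(byte[] b, int offset, int len) {
       for (int i = 0; i < len; i++) {
         b[offset + i] = readByte();
       }
     }
   
     @Override
     public void skipBytes(long count) {
       pos -= count;
     }
   
     @Override
     public long getPosition() {
       return pos;
     }
   
     @Override
     public void setPosition(long pos) {
       this.pos = pos;
     }
   
     @Override
     public boolean reversed() {
       return true;
     }
   }
   ```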



##########
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##########
@@ -269,36 +283,58 @@ private boolean nodesEqual(FSTCompiler.UnCompiledNode<T> node, long address) thr
     return false;
   }
 
+  record OffsetAndLength(long offset, int length) {}
+
   /** Inner class because it needs access to hash function and FST bytes. */
-  private class PagedGrowableHash {
+  class PagedGrowableHash {
     private PagedGrowableWriter entries;
-    private long count;
+    // nocommit: use PagedGrowableWriter? there was some size overflow issue with
+    // PagedGrowableWriter
+    // mapping from FST real address to copiedNodes offsets & length
+    private Map<Long, OffsetAndLength> copiedOffsets;

Review Comment:
   Egads yeah we should not use a "real" `Map`.
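   
   For example, maybe a second packed-ints table parallel to `entries` (same hash slots) could hold the pool offsets -- a rough sketch with hypothetical names, not accounting for the size-overflow issue in the nocommit above:
   
   ```java
   // Sketch: a second packed-ints table, parallel to `entries` and keyed
   // by the same hash slot, holding each copied node's offset in
   // copiedNodes.  Values store localOffset + 1 so 0 can mean "empty";
   // it would be grown/rehashed alongside `entries`.
   private PagedGrowableWriter copiedOffsetsTable =
       new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, PackedInts.COMPACT);
   
   void setCopiedOffset(long slot, long localOffset) {
     copiedOffsetsTable.set(slot, localOffset + 1);
   }
   
   /** Returns the node's offset into copiedNodes, or -1 if the slot is empty. */
   long getCopiedOffset(long slot) {
     return copiedOffsetsTable.get(slot) - 1;
   }
   ```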



##########
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##########
@@ -269,36 +283,58 @@ private boolean nodesEqual(FSTCompiler.UnCompiledNode<T> node, long address) thr
     return false;
   }
 
+  record OffsetAndLength(long offset, int length) {}
+
   /** Inner class because it needs access to hash function and FST bytes. */
-  private class PagedGrowableHash {
+  class PagedGrowableHash {
     private PagedGrowableWriter entries;
-    private long count;
+    // nocommit: use PagedGrowableWriter? there was some size overflow issue with
+    // PagedGrowableWriter
+    // mapping from FST real address to copiedNodes offsets & length
+    private Map<Long, OffsetAndLength> copiedOffsets;
+    long count;
+    long currentOffsets;
     private long mask;
+    private final ByteBlockPool copiedNodes;
 
     // 256K blocks, but note that the final block is sized only as needed so it won't use the full
     // block size when just a few elements were written to it
     private static final int BLOCK_SIZE_BYTES = 1 << 18;
 
     public PagedGrowableHash() {
      entries = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, PackedInts.COMPACT);
+      copiedOffsets = new HashMap<>();
       mask = 15;
+      copiedNodes = new ByteBlockPool(new ByteBlockPool.DirectAllocator());
     }
 
     public PagedGrowableHash(long lastNodeAddress, long size) {
       entries =
           new PagedGrowableWriter(
              size, BLOCK_SIZE_BYTES, PackedInts.bitsRequired(lastNodeAddress), PackedInts.COMPACT);
+      copiedOffsets = new HashMap<>();
       mask = size - 1;
      assert (mask & size) == 0 : "size must be a power-of-2; got size=" + size + " mask=" + mask;
+      copiedNodes = new ByteBlockPool(new ByteBlockPool.DirectAllocator());
+    }
+
+    public byte[] getBytes(long index) {
+      OffsetAndLength offsetAndLength = copiedOffsets.get(index);

Review Comment:
   I think the `byte[]` encoding of an FST's node is self-delimiting?  So, we don't need to store the length (ideally) -- just the offset into the `ByteBlockPool`?
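   
   Rough sketch of what that could look like (a hypothetical helper; assumes the `fst`/`scratchArc` fields NodeHash already has, plus a reader `in` over this table's `ByteBlockPool` such as the reverse-reader sketch above; target and final-output comparisons elided for brevity):
   
   ```java
   // Sketch: with a self-delimiting node encoding, equality needs only the
   // node's local offset -- we walk the frozen arcs until the one flagged
   // "last", so no stored length is required.
   private boolean frozenNodeMatches(
       FSTCompiler.UnCompiledNode<T> node, long localOffset, FST.BytesReader in)
       throws IOException {
     fst.readFirstRealTargetArc(localOffset, scratchArc, in);
     for (int arcUpto = 0; arcUpto < node.numArcs; arcUpto++) {
       FSTCompiler.Arc<T> expected = node.arcs[arcUpto];
       if (expected.label != scratchArc.label()
           || expected.output.equals(scratchArc.output()) == false) {
         return false;
       }
       if (scratchArc.isLast()) {
         // the last-arc flag delimits the frozen node; no length was stored
         return arcUpto == node.numArcs - 1;
       }
       fst.readNextRealArc(scratchArc, in);
     }
     // the frozen node has more arcs than the unfrozen one
     return false;
   }
   ```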



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
