Re: [PR] Use value-based LRU cache in NodeHash [lucene]

via GitHub Tue, 31 Oct 2023 21:32:18 -0700


dungba88 commented on code in PR #12738:
URL: https://github.com/apache/lucene/pull/12738#discussion_r1378381108



##########
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##########
@@ -186,119 +194,85 @@ private long hash(FSTCompiler.UnCompiledNode<T> node) {
     return h;
   }
 
-  // hash code for a frozen node.  this must precisely match the hash 
computation of an unfrozen
-  // node!
-  private long hash(long node) throws IOException {
-    final int PRIME = 31;
-
-    long h = 0;
-    fstCompiler.fst.readFirstRealTargetArc(node, scratchArc, in);
-    while (true) {
-      h = PRIME * h + scratchArc.label();
-      h = PRIME * h + (int) (scratchArc.target() ^ (scratchArc.target() >> 
32));
-      h = PRIME * h + scratchArc.output().hashCode();
-      h = PRIME * h + scratchArc.nextFinalOutput().hashCode();
-      if (scratchArc.isFinal()) {
-        h += 17;
-      }
-      if (scratchArc.isLast()) {
-        break;
-      }
-      fstCompiler.fst.readNextRealArc(scratchArc, in);
-    }
-
-    return h;
-  }
-
-  /**
-   * Compares an unfrozen node (UnCompiledNode) with a frozen node at byte 
location address (long),
-   * returning true if they are equal.
-   */
-  private boolean nodesEqual(FSTCompiler.UnCompiledNode<T> node, long address) 
throws IOException {
-    fstCompiler.fst.readFirstRealTargetArc(address, scratchArc, in);
-
-    // fail fast for a node with fixed length arcs
-    if (scratchArc.bytesPerArc() != 0) {
-      assert node.numArcs > 0;
-      // the frozen node uses fixed-with arc encoding (same number of bytes 
per arc), but may be
-      // sparse or dense
-      switch (scratchArc.nodeFlags()) {
-        case FST.ARCS_FOR_BINARY_SEARCH:
-          // sparse
-          if (node.numArcs != scratchArc.numArcs()) {
-            return false;
-          }
-          break;
-        case FST.ARCS_FOR_DIRECT_ADDRESSING:
-          // dense -- compare both the number of labels allocated in the array 
(some of which may
-          // not actually be arcs), and the number of arcs
-          if ((node.arcs[node.numArcs - 1].label - node.arcs[0].label + 1) != 
scratchArc.numArcs()
-              || node.numArcs != FST.Arc.BitTable.countBits(scratchArc, in)) {
-            return false;
-          }
-          break;
-        default:
-          throw new AssertionError("unhandled scratchArc.nodeFlag() " + 
scratchArc.nodeFlags());
-      }
-    }
-
-    // compare arc by arc to see if there is a difference
-    for (int arcUpto = 0; arcUpto < node.numArcs; arcUpto++) {
-      final FSTCompiler.Arc<T> arc = node.arcs[arcUpto];
-      if (arc.label != scratchArc.label()
-          || arc.output.equals(scratchArc.output()) == false
-          || ((FSTCompiler.CompiledNode) arc.target).node != 
scratchArc.target()
-          || arc.nextFinalOutput.equals(scratchArc.nextFinalOutput()) == false
-          || arc.isFinal != scratchArc.isFinal()) {
-        return false;
-      }
-
-      if (scratchArc.isLast()) {
-        if (arcUpto == node.numArcs - 1) {
-          return true;
-        } else {
-          return false;
-        }
-      }
-
-      fstCompiler.fst.readNextRealArc(scratchArc, in);
-    }
-
-    // unfrozen node has fewer arcs than frozen node
-
-    return false;
-  }
-
   /** Inner class because it needs access to hash function and FST bytes. */
   private class PagedGrowableHash {
+    public long copiedBytes;
     private PagedGrowableWriter entries;
+    // mapping from FST real address to copiedNodes offsets
+    private PagedGrowableWriter copiedOffsets;
     private long count;
+    // mask for entries
     private long mask;
+    // mask for copiedOffsets
+    private long offsetMask;
+    private final List<byte[]> copiedNodes;
 
     // 256K blocks, but note that the final block is sized only as needed so 
it won't use the full
     // block size when just a few elements were written to it
     private static final int BLOCK_SIZE_BYTES = 1 << 18;
 
     public PagedGrowableHash() {
       entries = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, 
PackedInts.COMPACT);
+      copiedOffsets = new PagedGrowableWriter(32, BLOCK_SIZE_BYTES, 8, 
PackedInts.COMPACT);

Review Comment:
   We are doubling the bytes because we need to write both the node global 
address and the local copiedNodes offset (in adjacent position). Global address 
will be written on the even position.



##########
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##########
@@ -186,119 +194,85 @@ private long hash(FSTCompiler.UnCompiledNode<T> node) {
     return h;
   }
 
-  // hash code for a frozen node.  this must precisely match the hash 
computation of an unfrozen

Review Comment:
   I moved these methods to PagedGrowableHash, as the computation is 
table-dependent now, so moving there would help avoid calling with the wrong 
table. But I could keep it and move as a follow-up PR if that makes the review 
easier.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Use value-based LRU cache in NodeHash [lucene]

Reply via email to