dungba88 commented on code in PR #12633:
URL: https://github.com/apache/lucene/pull/12633#discussion_r1360875178


##########
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##########
@@ -17,50 +17,80 @@
 package org.apache.lucene.util.fst;
 
 import java.io.IOException;
-import org.apache.lucene.util.packed.PackedInts;
-import org.apache.lucene.util.packed.PagedGrowableWriter;
+import java.util.Arrays;
+import java.util.Locale;
 
 // Used to dedup states (lookup already-frozen states)
 final class NodeHash<T> {
 
-  private PagedGrowableWriter table;
-  private long count;
-  private long mask;
+  // primary table
+  private long[] table;
+
+  // how many nodes are stored in the primary table; when this gets full, we discard the fallback table and move the primary table to it
+  private int count;
+  private int promoteCount;
+  private int missCount;
+  private int hitCount;
+
+  // fallback table.  if we fall back and find the frozen node here, we promote it to the primary table, for a simplistic LRU behaviour
+  private long[] fallbackTable;
+
+  private final int mask;
   private final FST<T> fst;
   private final FST.Arc<T> scratchArc = new FST.Arc<>();
   private final FST.BytesReader in;
 
-  public NodeHash(FST<T> fst, FST.BytesReader in) {
-    table = new PagedGrowableWriter(16, 1 << 27, 8, PackedInts.COMPACT);
-    mask = 15;
+  public NodeHash(FST<T> fst, int tableSize, FST.BytesReader in) {
+    if (tableSize < 4) {
+      // 2 is a power of 2, but does not work because the rehash logic (at 2/3 capacity) becomes over-quantized, and the hash
+      // table becomes 100% full before moving to fallback, and then looking up an entry in a 100% full hash table spins forever
+      throw new IllegalArgumentException("tableSize must be at least 4; got: " + tableSize);
+    }
+    mask = tableSize - 1;
+    if ((mask & tableSize) != 0) {
+      throw new IllegalArgumentException("tableSize must be a power of 2; got: " + tableSize);
+    }
+    table = new long[tableSize];
+    fallbackTable = new long[tableSize];
     this.fst = fst;
     this.in = in;
   }
 
+  /** Compares an unfrozen node (UnCompiledNode) with a frozen node at byte location address (long), returning
+   *  true if they are equal. */
   private boolean nodesEqual(FSTCompiler.UnCompiledNode<T> node, long address) throws IOException {
     fst.readFirstRealTargetArc(address, scratchArc, in);
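
For readers skimming the diff: the new design is a "double-barrel" cache, a primary table backed by a fallback table. Below is a minimal, self-contained sketch of the lookup-and-promote idea the comments above describe; the class and method names (TwoLevelNodeCache, seen, mix) are invented for illustration, and unlike the real NodeHash it stores bare long keys rather than hashing and comparing actual FST nodes:

    final class TwoLevelNodeCache {
      private long[] primary;
      private long[] fallback;
      private int count; // entries currently in primary
      private final int mask;

      TwoLevelNodeCache(int tableSize) {
        if (tableSize < 4 || (tableSize & (tableSize - 1)) != 0) {
          throw new IllegalArgumentException("tableSize must be a power of 2 and at least 4; got: " + tableSize);
        }
        primary = new long[tableSize];
        fallback = new long[tableSize];
        mask = tableSize - 1;
      }

      /** Returns true if this address was seen before; addresses must be non-zero (0 marks an empty slot). */
      boolean seen(long address) {
        // probe the primary table first
        int slot = (int) (mix(address) & mask);
        while (primary[slot] != 0) {
          if (primary[slot] == address) {
            return true; // hit in primary
          }
          slot = (slot + 1) & mask; // linear probing
        }
        // miss in primary: probe the fallback table
        slot = (int) (mix(address) & mask);
        while (fallback[slot] != 0) {
          if (fallback[slot] == address) {
            insert(address); // promote, so recently used entries survive the next swap
            return true;
          }
          slot = (slot + 1) & mask;
        }
        insert(address); // genuinely new entry
        return false;
      }

      private void insert(long address) {
        if (count + 1 > 2 * primary.length / 3) {
          // primary reached 2/3 occupancy: discard the old fallback, demote
          // primary to fallback, and continue with an empty primary
          long[] tmp = fallback;
          fallback = primary;
          primary = tmp;
          java.util.Arrays.fill(primary, 0);
          count = 0;
        }
        int slot = (int) (mix(address) & mask);
        while (primary[slot] != 0) {
          slot = (slot + 1) & mask;
        }
        primary[slot] = address;
        count++;
      }

      // cheap bit mixer so nearby addresses spread across the table
      private static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        return x ^ (x >>> 33);
      }
    }

The 2/3 swap threshold is what keeps the linear-probe loops finite: a table that could reach 100% occupancy would spin forever probing for a missing key, which is the failure mode the tableSize < 4 check in the diff guards against.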

Review Comment:
   So does this mean we still need to read from the in-progress FST? I'm wondering whether it would be possible to read only from the cache, so that we could decouple the FST from NodeHash.
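
One hypothetical shape that suggestion could take (purely illustrative, not Lucene code and not what this PR does): have the hash keep its own copy of each frozen node's serialized bytes, so equality checks compare cached bytes instead of calling back into the in-progress FST via readFirstRealTargetArc.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: all names are invented, and a real implementation
    // would want a more compact store than a HashMap of byte arrays.
    final class CachedNodeStore {
      // maps a frozen node's address to a private copy of its serialized bytes
      private final Map<Long, byte[]> bytesByAddress = new HashMap<>();

      void put(long address, byte[] frozenBytes) {
        // defensive copy, so later writes to the FST's buffers cannot alias the cache
        bytesByAddress.put(address, Arrays.copyOf(frozenBytes, frozenBytes.length));
      }

      /** Equality check that never reads the in-progress FST. */
      boolean nodesEqual(long address, byte[] candidateBytes) {
        byte[] cached = bytesByAddress.get(address);
        return cached != null && Arrays.equals(cached, candidateBytes);
      }
    }

The trade-off is holding a second copy of every cached node's bytes; the double-barrel swap would at least bound how many of those copies stay live at once.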


