mikemccand commented on code in PR #12633:
URL: https://github.com/apache/lucene/pull/12633#discussion_r1365607237


##########
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##########
@@ -20,76 +20,159 @@
 import org.apache.lucene.util.packed.PackedInts;
 import org.apache.lucene.util.packed.PagedGrowableWriter;
 
+// TODO: any way to make a reverse suffix lookup (msokolov's idea) instead of 
more costly hash?
+// hmmm, though, hash is not so wasteful
+// since it does not have to store value of each entry: the value is the node 
pointer in the FST.
+// actually, there is much to save
+// there -- we would not need any long per entry -- we'd be able to start at 
the FST end node and
+// work backwards from the transitions
+
+// TODO: couldn't we prune naturally back until we see a transition with an 
output?  it's highly
+// unlikely (mostly impossible) such suffixes
+// can be shared?
+
 // Used to dedup states (lookup already-frozen states)
 final class NodeHash<T> {
 
-  private PagedGrowableWriter table;
-  private long count;
-  private long mask;
+  // primary table -- we add nodes into this until it reaches the requested 
tableSizeLimit/2, then
+  // we move it to fallback
+  private PagedGrowableHash primaryTable;
+
+  // how many nodes are allowed to store in both primary and fallback tables; 
when primary gets full
+  // (tableSizeLimit/2), we move it to the
+  // fallback table
+  private final long ramLimitBytes;
+
+  // fallback table.  if we fallback and find the frozen node here, we promote 
it to primary table,
+  // for a simplistic and lowish-RAM-overhead
+  // (compared to e.g. LinkedHashMap) LRU behaviour.  fallbackTable is 
read-only.
+  private PagedGrowableHash fallbackTable;
+
   private final FST<T> fst;
   private final FST.Arc<T> scratchArc = new FST.Arc<>();
   private final FST.BytesReader in;
 
-  public NodeHash(FST<T> fst, FST.BytesReader in) {
-    table = new PagedGrowableWriter(16, 1 << 27, 8, PackedInts.COMPACT);
-    mask = 15;
+  /**
+   * ramLimitMB is the max RAM we can use for recording suffixes. If we hit 
this limit, the least
+   * recently used suffixes are discarded, and the FST is no longer minimal. 
Still, larger
+   * ramLimitMB will make the FST smaller (closer to minimal).
+   */
+  public NodeHash(FST<T> fst, double ramLimitMB, FST.BytesReader in) {
+    if (ramLimitMB <= 0) {
+      throw new IllegalArgumentException("ramLimitMB must be > 0; got: " + 
ramLimitMB);
+    }
+    double asBytes = ramLimitMB * 1024 * 1024;
+    if (asBytes >= Long.MAX_VALUE) {
+      // quietly truncate to Long.MAX_VALUE in bytes too
+      ramLimitBytes = Long.MAX_VALUE;
+    } else {
+      ramLimitBytes = (long) asBytes;
+    }
+
+    primaryTable = new PagedGrowableHash();
     this.fst = fst;
     this.in = in;
   }
 
-  private boolean nodesEqual(FSTCompiler.UnCompiledNode<T> node, long address) 
throws IOException {
-    fst.readFirstRealTargetArc(address, scratchArc, in);
-
-    // Fail fast for a node with fixed length arcs.
-    if (scratchArc.bytesPerArc() != 0) {
-      if (scratchArc.nodeFlags() == FST.ARCS_FOR_BINARY_SEARCH) {
-        if (node.numArcs != scratchArc.numArcs()) {
-          return false;
-        }
-      } else {
-        assert scratchArc.nodeFlags() == FST.ARCS_FOR_DIRECT_ADDRESSING;
-        if ((node.arcs[node.numArcs - 1].label - node.arcs[0].label + 1) != 
scratchArc.numArcs()
-            || node.numArcs != FST.Arc.BitTable.countBits(scratchArc, in)) {
-          return false;
-        }
+  private long getFallback(FSTCompiler.UnCompiledNode<T> nodeIn, long hash) 
throws IOException {
+    if (fallbackTable == null) {
+      // no fallback yet (primary table is not yet large enough to swap)
+      return 0;
+    }
+    long pos = hash & fallbackTable.mask;
+    int c = 0;
+    while (true) {
+      long node = fallbackTable.get(pos);
+      if (node == 0) {
+        // not found
+        return 0;
+      } else if (nodesEqual(nodeIn, node)) {
+        // frozen version of this node is already here
+        return node;
       }
+
+      // quadratic probe (but is it, really?)
+      pos = (pos + (++c)) & fallbackTable.mask;
     }
+  }
 
-    for (int arcUpto = 0; arcUpto < node.numArcs; arcUpto++) {
-      final FSTCompiler.Arc<T> arc = node.arcs[arcUpto];
-      if (arc.label != scratchArc.label()
-          || !arc.output.equals(scratchArc.output())
-          || ((FSTCompiler.CompiledNode) arc.target).node != 
scratchArc.target()
-          || !arc.nextFinalOutput.equals(scratchArc.nextFinalOutput())
-          || arc.isFinal != scratchArc.isFinal()) {
-        return false;
-      }
+  public long add(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> 
nodeIn)
+      throws IOException {
 
-      if (scratchArc.isLast()) {
-        if (arcUpto == node.numArcs - 1) {
-          return true;
+    long hash = hash(nodeIn);
+
+    long pos = hash & primaryTable.mask;
+    int c = 0;
+
+    while (true) {
+
+      long node = primaryTable.get(pos);
+      if (node == 0) {
+        // node is not in primary table; is it in fallback table?
+        node = getFallback(nodeIn, hash);
+        if (node != 0) {
+          // it was already in fallback -- promote to primary
+          primaryTable.set(pos, node);
         } else {
-          return false;
+          // not in fallback either -- freeze & add the incoming node
+
+          // freeze & add
+          node = fstCompiler.addNode(nodeIn);
+
+          // we use 0 as empty marker in hash table, so it better be 
impossible to get a frozen node
+          // at 0:
+          assert node != 0;
+
+          // confirm frozen hash and unfrozen hash are the same
+          assert hash(node) == hash : "mismatch frozenHash=" + hash(node) + " 
vs hash=" + hash;
+
+          primaryTable.set(pos, node);
         }
+
+        // how many bytes would be used if we had "perfect" hashing:
+        long ramBytesUsed = primaryTable.count * PackedInts.bitsRequired(node) 
/ 8;
+
+        // NOTE: we could instead use the more precise RAM used, but this 
leads to unpredictable
+        // quantized behavior due to 2X rehashing where for large ranges of 
the RAM limit, the
+        // size of the FST does not change, and then suddenly when you cross a 
secret threshold,
+        // it drops.  With this approach (measuring "perfect" hash storage and 
approximating the
+        // overhead), the behaviour is more strictly monotonic: larger RAM 
limits smoothly result
+        // in smaller FSTs, even if the precise RAM used is not always under 
the limit.
+
+        // divide limit by 2 because fallback gets half the RAM and primary 
gets the other half
+        // divide by 2 again to account for approximate hash table overhead 
halfway between 33.3%
+        // and 66.7% occupancy = 50%
+        if (ramBytesUsed >= ramLimitBytes / (2 * 2)) {
+          // time to fallback -- fallback is now used read-only to promote a 
node (suffix) to
+          // primary if we encounter it again
+          fallbackTable = primaryTable;

Review Comment:
   We could indeed do that, but `PagedGrowableWriter` does not implement 
`clear` yet ... let's defer this small optimization?  I'd rather get this 
change in soon to unblock all the other cool changes you've been working on.  
I'll add a `TODO` :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to