Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

via GitHub Fri, 20 Oct 2023 04:27:00 -0700


bruno-roustant commented on code in PR #12633:
URL: https://github.com/apache/lucene/pull/12633#discussion_r1366758770



##########
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##########
@@ -20,76 +20,161 @@
 import org.apache.lucene.util.packed.PackedInts;
 import org.apache.lucene.util.packed.PagedGrowableWriter;
 
+// TODO: any way to make a reverse suffix lookup (msokolov's idea) instead of 
more costly hash?
+// hmmm, though, hash is not so wasteful
+// since it does not have to store value of each entry: the value is the node 
pointer in the FST.
+// actually, there is much to save
+// there -- we would not need any long per entry -- we'd be able to start at 
the FST end node and
+// work backwards from the transitions
+
+// TODO: couldn't we prune natrually babck until we see a transition with an 
output?  it's highly

Review Comment:
   "natrually babck"



##########
lucene/core/src/java/org/apache/lucene/util/packed/AbstractPagedMutable.java:
##########
@@ -110,8 +110,10 @@ protected long baseRamBytesUsed() {
   public long ramBytesUsed() {
     long bytesUsed = RamUsageEstimator.alignObjectSize(baseRamBytesUsed());
     bytesUsed += 
RamUsageEstimator.alignObjectSize(RamUsageEstimator.shallowSizeOf(subMutables));
+    // System.out.println("abstract.ramBytesUsed:");

Review Comment:
   Do we keep these commented prints?



##########
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##########
@@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode<T> node) {
         h += 17;
       }
     }
-    // System.out.println("  ret " + (h&Integer.MAX_VALUE));
+

Review Comment:
   Do we need to mask with Long.MAX_VALUE below, since we mask anyway with the 
table mask?
   
   Instead, we should multiply with the gold constant BitMixer#PHI_C64 (make it 
public).
   This really makes a difference in the evenness of the value distribution. 
This is one of the secrets of the HPPC hashing. By applying this, we get 
multiple advantages:
   - lookup should be improved (less hash collision)
   - we can try to rehash at 3/4 occupancy because the performance should not 
be impacted until this point.
   - in case of hash collision, we can lookup linearly with a pos = pos + 1 
instead of quadratic probe (lines 95 and 327); this may avoid some mem cache 
miss.
   
   (same for the other hash method)



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -135,32 +123,28 @@ public class FSTCompiler<T> {
    * Instantiates an FST/FSA builder with default settings and pruning options 
turned off. For more
    * tuning and tweaking, see {@link Builder}.
    */
+  // TODO: remove this?  Builder API should be the only entry point?
   public FSTCompiler(FST.INPUT_TYPE inputType, Outputs<T> outputs) {
-    this(inputType, 0, 0, true, true, Integer.MAX_VALUE, outputs, true, 15, 
1f);
+    this(inputType, 32.0, outputs, true, 15, 1f);
   }
 
   private FSTCompiler(
       FST.INPUT_TYPE inputType,
-      int minSuffixCount1,
-      int minSuffixCount2,
-      boolean doShareSuffix,
-      boolean doShareNonSingletonNodes,
-      int shareMaxTailLength,
+      double suffixRAMLimitMB, // pass 0 to disable suffix compression/trie; 
larger values create

Review Comment:
   `@param`in method javadoc instead? This is not easy to read with spotless 
truncation.



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -207,59 +187,26 @@ public Builder(FST.INPUT_TYPE inputType, Outputs<T> 
outputs) {
     }
 
     /**
-     * If pruning the input graph during construction, this threshold is used 
for telling if a node
-     * is kept or pruned. If transition_count(node) &gt;= minSuffixCount1, the 
node is kept.
+     * The approximate maximum amount of RAM (in MB) to use holding the suffix 
cache, which enables
+     * the FST to share common suffixes. Pass {@link Double#POSITIVE_INFINITY} 
to keep all suffixes
+     * and create an exactly minimal FST. In this case, the amount of RAM 
actually used will be
+     * bounded by the number of unique suffixes. If you pass a value smaller 
than the builder would
+     * use, the least recently used suffixes will be discarded, thus reducing 
suffix sharing and
+     * creating a non-minimal FST. In this case, the larger the limit, the 
closer the FST will be to
+     * its true minimal size, with diminishing returns as you increasea the 
limit. Pass {@code 0} to

Review Comment:
   "increasea"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

Reply via email to