bruno-roustant commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1366758770
########## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ########## @@ -20,76 +20,161 @@ import org.apache.lucene.util.packed.PackedInts; import org.apache.lucene.util.packed.PagedGrowableWriter; +// TODO: any way to make a reverse suffix lookup (msokolov's idea) instead of more costly hash? +// hmmm, though, hash is not so wasteful +// since it does not have to store value of each entry: the value is the node pointer in the FST. +// actually, there is much to save +// there -- we would not need any long per entry -- we'd be able to start at the FST end node and +// work backwards from the transitions + +// TODO: couldn't we prune natrually babck until we see a transition with an output? it's highly Review Comment: "natrually babck" ########## lucene/core/src/java/org/apache/lucene/util/packed/AbstractPagedMutable.java: ########## @@ -110,8 +110,10 @@ protected long baseRamBytesUsed() { public long ramBytesUsed() { long bytesUsed = RamUsageEstimator.alignObjectSize(baseRamBytesUsed()); bytesUsed += RamUsageEstimator.alignObjectSize(RamUsageEstimator.shallowSizeOf(subMutables)); + // System.out.println("abstract.ramBytesUsed:"); Review Comment: Do we keep these commented prints? ########## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ########## @@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode<T> node) { h += 17; } } - // System.out.println(" ret " + (h&Integer.MAX_VALUE)); + Review Comment: Do we need to mask with Long.MAX_VALUE below, since we mask anyway with the table mask? Instead, we should multiply with the gold constant BitMixer#PHI_C64 (make it public). This really makes a difference in the evenness of the value distribution. This is one of the secrets of the HPPC hashing. By applying this, we get multiple advantages: - lookup should be improved (less hash collision) - we can try to rehash at 3/4 occupancy because the performance should not be impacted until this point. - in case of hash collision, we can lookup linearly with a pos = pos + 1 instead of quadratic probe (lines 95 and 327); this may avoid some mem cache miss. (same for the other hash method) ########## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ########## @@ -135,32 +123,28 @@ public class FSTCompiler<T> { * Instantiates an FST/FSA builder with default settings and pruning options turned off. For more * tuning and tweaking, see {@link Builder}. */ + // TODO: remove this? Builder API should be the only entry point? public FSTCompiler(FST.INPUT_TYPE inputType, Outputs<T> outputs) { - this(inputType, 0, 0, true, true, Integer.MAX_VALUE, outputs, true, 15, 1f); + this(inputType, 32.0, outputs, true, 15, 1f); } private FSTCompiler( FST.INPUT_TYPE inputType, - int minSuffixCount1, - int minSuffixCount2, - boolean doShareSuffix, - boolean doShareNonSingletonNodes, - int shareMaxTailLength, + double suffixRAMLimitMB, // pass 0 to disable suffix compression/trie; larger values create Review Comment: `@param`in method javadoc instead? This is not easy to read with spotless truncation. ########## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ########## @@ -207,59 +187,26 @@ public Builder(FST.INPUT_TYPE inputType, Outputs<T> outputs) { } /** - * If pruning the input graph during construction, this threshold is used for telling if a node - * is kept or pruned. If transition_count(node) >= minSuffixCount1, the node is kept. + * The approximate maximum amount of RAM (in MB) to use holding the suffix cache, which enables + * the FST to share common suffixes. Pass {@link Double#POSITIVE_INFINITY} to keep all suffixes + * and create an exactly minimal FST. In this case, the amount of RAM actually used will be + * bounded by the number of unique suffixes. If you pass a value smaller than the builder would + * use, the least recently used suffixes will be discarded, thus reducing suffix sharing and + * creating a non-minimal FST. In this case, the larger the limit, the closer the FST will be to + * its true minimal size, with diminishing returns as you increasea the limit. Pass {@code 0} to Review Comment: "increasea" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org