mikemccand commented on PR #12633:
URL: https://github.com/apache/lucene/pull/12633#issuecomment-1753705229

   Translating/merging the above two tables into a graph:
   
   
![image](https://github.com/apache/lucene/assets/796508/6259f97c-a065-4a98-a1fc-1e4984e2386e)
   
   Some observations:
   
     * The PR is mostly better at using less RAM to make the same size FST, yay!
   
     * It is a smoother/more predictable/monotonic tradeoff: the larger the 
`NodeHash` size, the smaller the FST.  Whereas on `main`, using the god-like 
parameters, it's more dicey/spiky/unpredictable.  It's like you are the co-pilot 
trying to land a 747 alone using only toothpicks.
   
     * At the "spend all the RAM necessary to get a truly minimal FST" end (the 
right of the chart) the PR looks like it uses a bit more RAM than `main`.  I 
think I can improve on this by not wastefully using `long[]` but rather one of 
Lucene's many cool bit-packing dynamic/growable array thingys, like `main` does 
for its `NodeHash`.  Or maybe @msokolov's idea to somehow do a reversed suffix 
lookup against the growing FST. I'll try that.
   
     * Bang for the buck tapers off like you'd expect: the early MBs of RAM you 
spend have a bigger payoff in reducing the FST size, while later MBs of RAM have 
less and less impact.  This is nice 80/20-like behavior...
   
     * With the PR, you unfortunately cannot easily say "give me a minimal FST 
at all costs", like you can with `main` today.  You'd have to keep trying 
larger and larger `NodeHash` sizes until the final FST size gets no smaller.  I 
don't really like this regression -- I'll think about how to somehow keep that 
capability in the PR.  E.g. we would want to use this option when compiling 
FSTs for Kuromoji, or users may want this when compiling synonym maps.
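
   To make the "keep trying larger and larger `NodeHash` sizes" workaround from the last bullet concrete, here is a minimal sketch of that loop. It is not Lucene code: `MinimalFstSearch`, `findPlateau`, and the `buildFstBytes` callback (which stands in for "build the FST with this much `NodeHash` RAM and return its size in bytes") are hypothetical names invented for illustration. It also assumes the tradeoff is monotonic, as observed for the PR above, so the first doubling that fails to shrink the FST is treated as the plateau.

```java
import java.util.function.LongUnaryOperator;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class MinimalFstSearch {

  /** The RAM budget that was settled on and the FST size it produced. */
  static final class Result {
    final long ramMB;
    final long fstBytes;

    Result(long ramMB, long fstBytes) {
      this.ramMB = ramMB;
      this.fstBytes = fstBytes;
    }
  }

  /**
   * Doubles the NodeHash RAM budget until the FST stops shrinking or the
   * next doubling would exceed maxMB.
   *
   * @param buildFstBytes assumed callback: builds the FST with the given
   *     budget (in MB) and returns the resulting FST size in bytes
   */
  static Result findPlateau(LongUnaryOperator buildFstBytes, long startMB, long maxMB) {
    if (startMB <= 0 || maxMB < startMB) {
      throw new IllegalArgumentException("need 0 < startMB <= maxMB");
    }
    long ram = startMB;
    long size = buildFstBytes.applyAsLong(ram);
    // Compare against maxMB / 2 so that ram * 2 can never overflow.
    while (ram <= maxMB / 2) {
      long nextRam = ram * 2;
      long nextSize = buildFstBytes.applyAsLong(nextRam);
      if (nextSize >= size) {
        break; // more RAM bought nothing: assume we reached the minimum
      }
      ram = nextRam;
      size = nextSize;
    }
    return new Result(ram, size);
  }
}
```

   Each probe is a full FST build, so this costs several builds where `main` needs one, which is why it is a workaround rather than a replacement for a real "minimal at all costs" option.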


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

