Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

via GitHub Mon, 09 Oct 2023 13:19:14 -0700


mikemccand commented on PR #12633:
URL: https://github.com/apache/lucene/pull/12633#issuecomment-1753705229

Translating/merging the above two tables into a graph:

![image](https://github.com/apache/lucene/assets/796508/6259f97c-a065-4a98-a1fc-1e4984e2386e)

Some observations:

* The PR is mostly better at using less RAM to make the same size FST, yay!

* It is a smoother/more predictable/monotonic tradeoff: the larger the
`NodeHash` size, the smaller the FST. Whereas on `main`, using the god-like
parameters, it's more dicey/spiky/unpredictable. It's like you are the co-pilot
trying to land a 747 alone using only toothpicks.

* At the "spend all the RAM necessary to get a truly minimal FST" end (the
right of the chart) the PR looks like it uses a bit more RAM than `main`. I
think I can improve on this by not wastefully using `long[]` but rather one of
Lucene's many cool bit-packed dynamic/growable arrays, like `main` does
for its `NodeHash`. Or maybe @msokolov's idea to somehow do a reversed suffix
lookup against the growing FST. I'll try that.

* Bang for the buck tapers off like you'd expect: the early MB of RAM you
spend have a bigger payoff in reducing the FST size, while later MB of RAM have
less and less impact. This is nice 80/20-like behavior...

* With the PR, you unfortunately cannot easily say "give me a minimal FST
at all costs", like you can with `main` today. You'd have to keep trying
larger and larger `NodeHash` sizes until the final FST size gets no smaller. I
don't really like this regression -- I'll think about how to somehow keep that
capability in the PR. E.g. we would want to use this option when compiling
FSTs for Kuromoji, or users may want this when compiling synonym maps.
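
To make the "keep trying larger and larger `NodeHash` sizes" workaround concrete, here is a rough sketch of what a caller might have to do. It is not code from the PR: `MinimalFstSearch`, `findSmallestSufficientLimit`, and the `compileAndMeasure` callback (which stands in for "compile the FST with this `NodeHash` size limit and return the resulting FST size") are hypothetical names, and the doubling schedule is just one possible choice. It also assumes the FST size never grows as the limit grows, which the chart suggests but does not prove.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical helper, not part of Lucene or this PR.
class MinimalFstSearch {

  /**
   * Doubles the NodeHash size limit until the compiled FST stops shrinking.
   *
   * @param compileAndMeasure hypothetical callback: takes a NodeHash size limit,
   *     compiles the FST with it, and returns the resulting FST size
   * @param startLimit first limit to try (must be positive)
   * @param maxLimit largest limit the caller is willing to try
   * @return the smallest limit tried after which doubling gave no further reduction
   */
  static long findSmallestSufficientLimit(
      LongUnaryOperator compileAndMeasure, long startLimit, long maxLimit) {
    long limit = startLimit;
    long bestSize = compileAndMeasure.applyAsLong(limit);
    while (limit * 2 <= maxLimit) {
      long nextSize = compileAndMeasure.applyAsLong(limit * 2);
      if (nextSize >= bestSize) {
        // No improvement from doubling: assume we have reached the plateau.
        return limit;
      }
      bestSize = nextSize;
      limit *= 2;
    }
    // Ran out of budget before the FST size stopped shrinking.
    return limit;
  }
}
```

Each probe is a full FST compilation, so this costs several builds instead of one, which is exactly why it feels like a regression compared to `main`.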

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

