[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938389#comment-16938389
 ] 

Bruno Roustant commented on LUCENE-8920:
----------------------------------------

Here is a proposal for the heuristic to select the encoding of a FST node.

The idea is to have threshold values that vary according to a parameter, let's 
call it "timeSpaceBalance". timeSpaceBalance can have 4 values: MORE_COMPACT, 
COMPACT (default, current FST), FAST, FASTER.
Keep the current FST encoding/behavior for COMPACT.
Only try open-addressing encoding for the FAST or FASTER balance.
Be very demanding for direct-addressing for COMPACT or MORE_COMPACT.
Do we need the MORE_COMPACT mode? Even if I don't see the use-case now, since 
it's easy and does not involve more code to have it, I would say yes.

4 rules, one per possible encoding, ordered from top to bottom. The first 
encoding which condition matches is selected.

n: number of labels (num sub-nodes)
depth: depth of the node in the tree

[list-encoding] if n <= L1 || (depth >= L2 && n <= L3)

[direct-addressing] if n / (max label - min label) >= D1

[try open-addressing] if depth <= O1 || n >= O2

[binary search] otherwise


And below are the threshold values for each timeSpaceBalance:

timeSpaceBalance = MORE_COMPACT (memory < x1)
L1 = 6, L2 = 4, L3 = 11
D1 = 0.8
O1 = -1, O2 = infinite

timeSpaceBalance = COMPACT (memory x1)
L1 = 4, L2 = 4, L3 = 9
D1 = 0.66
O1 = -1, O2 = infinite

timeSpaceBalance = FAST (memory x2 ?)
L1 = 4, L2 = 4, L3 = 7
D1 = 0.5
O1 = 3, O2 = 10

timeSpaceBalance = FASTER (memory x3 ?)
L1 = 3, L2 = 4, L3 = 5
D1 = 0.33
O1 = infinite, O2 = 0


Thoughts?

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to