[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930608#comment-16930608
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 9/16/19 2:29 PM:
-----------------------------------------------------------------

Open-addressing benchmark to store byte labels in an array of size between 8 
and 256.

"Array" means array size.

"tries" means max number of slot comparisons allowed (when a slot we want to 
put a label to is not empty - move to the next slot).

"loadFactor" = max load factor (num labels / array size) for which we can add 
all the labels by respecting "tries". Above this max load factor, the 
open-addressing fails to store all labels and fallbacks to binary search. This 
max load factor is an average of the max load factor for 1000 random 
constructions done by the benchmark.

"memory" = space increase factor = 1 / loadFactor.

 

Array=8, tries=1, loadFactor=0.400125, memory x 2.499219
 Array=8, tries=2, loadFactor=0.63825, memory x 1.5667841
 Array=8, tries=3, loadFactor=0.78675, memory x 1.2710518
 Array=12, tries=1, loadFactor=0.34391674, memory x 2.9076805
 Array=12, tries=2, loadFactor=0.54183316, memory x 1.8455865
 Array=12, tries=3, loadFactor=0.6964172, memory x 1.4359208
 Array=16, tries=1, loadFactor=0.29825, memory x 3.352892
 Array=16, tries=2, loadFactor=0.5039375, memory x 1.9843731
 Array=16, tries=3, loadFactor=0.646625, memory x 1.5464914
 Array=16, tries=4, loadFactor=0.7493125, memory x 1.3345567
 Array=24, tries=1, loadFactor=0.25141662, memory x 3.9774618
 Array=24, tries=2, loadFactor=0.44929162, memory x 2.225726
 Array=24, tries=3, loadFactor=0.58083373, memory x 1.7216631
 Array=24, tries=4, loadFactor=0.6867924, memory x 1.4560441
 Array=32, tries=1, loadFactor=0.22528125, memory x 4.4388957
 Array=32, tries=2, loadFactor=0.4059375, memory x 2.4634335
 Array=32, tries=3, loadFactor=0.53625, memory x 1.8648019
 Array=32, tries=4, loadFactor=0.63653123, memory x 1.5710148
 Array=32, tries=5, loadFactor=0.7165625, memory x 1.3955517
 Array=48, tries=1, loadFactor=0.18899995, memory x 5.2910066
 Array=48, tries=2, loadFactor=0.36193773, memory x 2.762906
 Array=48, tries=3, loadFactor=0.49154142, memory x 2.0344167
 Array=48, tries=4, loadFactor=0.5871248, memory x 1.7032154
 Array=48, tries=5, loadFactor=0.6700212, memory x 1.4924902
 Array=64, tries=1, loadFactor=0.17101562, memory x 5.8474193
 Array=64, tries=2, loadFactor=0.33707812, memory x 2.9666712
 Array=64, tries=3, loadFactor=0.46217188, memory x 2.1636972
 Array=64, tries=4, loadFactor=0.56409377, memory x 1.7727549
 Array=64, tries=5, loadFactor=0.6392813, memory x 1.5642567
 Array=64, tries=6, loadFactor=0.6963125, memory x 1.4361368
 Array=96, tries=1, loadFactor=0.15294789, memory x 6.5381746
 Array=96, tries=2, loadFactor=0.31204194, memory x 3.2046974
 Array=96, tries=3, loadFactor=0.42829168, memory x 2.3348575
 Array=96, tries=4, loadFactor=0.5277291, memory x 1.8949116
 Array=96, tries=5, loadFactor=0.60920805, memory x 1.6414753
 Array=96, tries=6, loadFactor=0.66278124, memory x 1.5087935
 Array=128, tries=1, loadFactor=0.15130469, memory x 6.6091805
 Array=128, tries=2, loadFactor=0.3054219, memory x 3.2741597
 Array=128, tries=3, loadFactor=0.43253124, memory x 2.3119717
 Array=128, tries=4, loadFactor=0.5283203, memory x 1.8927912
 Array=128, tries=5, loadFactor=0.6043203, memory x 1.6547517
 Array=128, tries=6, loadFactor=0.66267186, memory x 1.5090425
 Array=128, tries=7, loadFactor=0.70567185, memory x 1.4170892
 Array=192, tries=1, loadFactor=0.14355215, memory x 6.9661093
 Array=192, tries=2, loadFactor=0.32579714, memory x 3.0693946
 Array=192, tries=3, loadFactor=0.42888057, memory x 2.3316514
 Array=192, tries=4, loadFactor=0.5238486, memory x 1.9089485
 Array=192, tries=5, loadFactor=0.59191173, memory x 1.6894411
 Array=192, tries=6, loadFactor=0.65411395, memory x 1.5287856
 Array=192, tries=7, loadFactor=0.6975258, memory x 1.4336387
 Array=256, tries=1, loadFactor=1.0, memory x 1.0
 Array=256, tries=2, loadFactor=1.0, memory x 1.0
 Array=256, tries=3, loadFactor=1.0, memory x 1.0
 Array=256, tries=4, loadFactor=1.0, memory x 1.0
 Array=256, tries=5, loadFactor=1.0, memory x 1.0
 Array=256, tries=6, loadFactor=1.0, memory x 1.0
 Array=256, tries=7, loadFactor=1.0, memory x 1.0
 Array=256, tries=8, loadFactor=1.0, memory x 1.0


was (Author: bruno.roustant):
Open-addressing benchmark to store byte labels in an array of size between 8 
and 256.

"Array" means array size.

"tries" means max number of slot comparisons allowed (when a slot we want to 
put a label to is not empty - move to the next slot).

"loadFactor" = max load factor (num labels / array size) for which we can add 
all the labels by respecting "tries". Above this max load factor, the 
open-addressing fails to store all labels and fallbacks to binary search. This 
max load factor is an average of 1000 random constructions done by the 
benchmark.

"memory" = space increase factor = 1 / loadFactor.

 

Array=8, tries=1, loadFactor=0.400125, memory x 2.499219
Array=8, tries=2, loadFactor=0.63825, memory x 1.5667841
Array=8, tries=3, loadFactor=0.78675, memory x 1.2710518
Array=12, tries=1, loadFactor=0.34391674, memory x 2.9076805
Array=12, tries=2, loadFactor=0.54183316, memory x 1.8455865
Array=12, tries=3, loadFactor=0.6964172, memory x 1.4359208
Array=16, tries=1, loadFactor=0.29825, memory x 3.352892
Array=16, tries=2, loadFactor=0.5039375, memory x 1.9843731
Array=16, tries=3, loadFactor=0.646625, memory x 1.5464914
Array=16, tries=4, loadFactor=0.7493125, memory x 1.3345567
Array=24, tries=1, loadFactor=0.25141662, memory x 3.9774618
Array=24, tries=2, loadFactor=0.44929162, memory x 2.225726
Array=24, tries=3, loadFactor=0.58083373, memory x 1.7216631
Array=24, tries=4, loadFactor=0.6867924, memory x 1.4560441
Array=32, tries=1, loadFactor=0.22528125, memory x 4.4388957
Array=32, tries=2, loadFactor=0.4059375, memory x 2.4634335
Array=32, tries=3, loadFactor=0.53625, memory x 1.8648019
Array=32, tries=4, loadFactor=0.63653123, memory x 1.5710148
Array=32, tries=5, loadFactor=0.7165625, memory x 1.3955517
Array=48, tries=1, loadFactor=0.18899995, memory x 5.2910066
Array=48, tries=2, loadFactor=0.36193773, memory x 2.762906
Array=48, tries=3, loadFactor=0.49154142, memory x 2.0344167
Array=48, tries=4, loadFactor=0.5871248, memory x 1.7032154
Array=48, tries=5, loadFactor=0.6700212, memory x 1.4924902
Array=64, tries=1, loadFactor=0.17101562, memory x 5.8474193
Array=64, tries=2, loadFactor=0.33707812, memory x 2.9666712
Array=64, tries=3, loadFactor=0.46217188, memory x 2.1636972
Array=64, tries=4, loadFactor=0.56409377, memory x 1.7727549
Array=64, tries=5, loadFactor=0.6392813, memory x 1.5642567
Array=64, tries=6, loadFactor=0.6963125, memory x 1.4361368
Array=96, tries=1, loadFactor=0.15294789, memory x 6.5381746
Array=96, tries=2, loadFactor=0.31204194, memory x 3.2046974
Array=96, tries=3, loadFactor=0.42829168, memory x 2.3348575
Array=96, tries=4, loadFactor=0.5277291, memory x 1.8949116
Array=96, tries=5, loadFactor=0.60920805, memory x 1.6414753
Array=96, tries=6, loadFactor=0.66278124, memory x 1.5087935
Array=128, tries=1, loadFactor=0.15130469, memory x 6.6091805
Array=128, tries=2, loadFactor=0.3054219, memory x 3.2741597
Array=128, tries=3, loadFactor=0.43253124, memory x 2.3119717
Array=128, tries=4, loadFactor=0.5283203, memory x 1.8927912
Array=128, tries=5, loadFactor=0.6043203, memory x 1.6547517
Array=128, tries=6, loadFactor=0.66267186, memory x 1.5090425
Array=128, tries=7, loadFactor=0.70567185, memory x 1.4170892
Array=192, tries=1, loadFactor=0.14355215, memory x 6.9661093
Array=192, tries=2, loadFactor=0.32579714, memory x 3.0693946
Array=192, tries=3, loadFactor=0.42888057, memory x 2.3316514
Array=192, tries=4, loadFactor=0.5238486, memory x 1.9089485
Array=192, tries=5, loadFactor=0.59191173, memory x 1.6894411
Array=192, tries=6, loadFactor=0.65411395, memory x 1.5287856
Array=192, tries=7, loadFactor=0.6975258, memory x 1.4336387
Array=256, tries=1, loadFactor=1.0, memory x 1.0
Array=256, tries=2, loadFactor=1.0, memory x 1.0
Array=256, tries=3, loadFactor=1.0, memory x 1.0
Array=256, tries=4, loadFactor=1.0, memory x 1.0
Array=256, tries=5, loadFactor=1.0, memory x 1.0
Array=256, tries=6, loadFactor=1.0, memory x 1.0
Array=256, tries=7, loadFactor=1.0, memory x 1.0
Array=256, tries=8, loadFactor=1.0, memory x 1.0

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to