[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930608#comment-16930608 ]
Bruno Roustant edited comment on LUCENE-8920 at 9/16/19 2:29 PM: ----------------------------------------------------------------- Open-addressing benchmark to store byte labels in an array of size between 8 and 256. "Array" means array size. "tries" means max number of slot comparisons allowed (when a slot we want to put a label to is not empty - move to the next slot). "loadFactor" = max load factor (num labels / array size) for which we can add all the labels by respecting "tries". Above this max load factor, the open-addressing fails to store all labels and fallbacks to binary search. This max load factor is an average of the max load factor for 1000 random constructions done by the benchmark. "memory" = space increase factor = 1 / loadFactor. Array=8, tries=1, loadFactor=0.400125, memory x 2.499219 Array=8, tries=2, loadFactor=0.63825, memory x 1.5667841 Array=8, tries=3, loadFactor=0.78675, memory x 1.2710518 Array=12, tries=1, loadFactor=0.34391674, memory x 2.9076805 Array=12, tries=2, loadFactor=0.54183316, memory x 1.8455865 Array=12, tries=3, loadFactor=0.6964172, memory x 1.4359208 Array=16, tries=1, loadFactor=0.29825, memory x 3.352892 Array=16, tries=2, loadFactor=0.5039375, memory x 1.9843731 Array=16, tries=3, loadFactor=0.646625, memory x 1.5464914 Array=16, tries=4, loadFactor=0.7493125, memory x 1.3345567 Array=24, tries=1, loadFactor=0.25141662, memory x 3.9774618 Array=24, tries=2, loadFactor=0.44929162, memory x 2.225726 Array=24, tries=3, loadFactor=0.58083373, memory x 1.7216631 Array=24, tries=4, loadFactor=0.6867924, memory x 1.4560441 Array=32, tries=1, loadFactor=0.22528125, memory x 4.4388957 Array=32, tries=2, loadFactor=0.4059375, memory x 2.4634335 Array=32, tries=3, loadFactor=0.53625, memory x 1.8648019 Array=32, tries=4, loadFactor=0.63653123, memory x 1.5710148 Array=32, tries=5, loadFactor=0.7165625, memory x 1.3955517 Array=48, tries=1, loadFactor=0.18899995, memory x 5.2910066 Array=48, tries=2, loadFactor=0.36193773, memory x 2.762906 Array=48, tries=3, loadFactor=0.49154142, memory x 2.0344167 Array=48, tries=4, loadFactor=0.5871248, memory x 1.7032154 Array=48, tries=5, loadFactor=0.6700212, memory x 1.4924902 Array=64, tries=1, loadFactor=0.17101562, memory x 5.8474193 Array=64, tries=2, loadFactor=0.33707812, memory x 2.9666712 Array=64, tries=3, loadFactor=0.46217188, memory x 2.1636972 Array=64, tries=4, loadFactor=0.56409377, memory x 1.7727549 Array=64, tries=5, loadFactor=0.6392813, memory x 1.5642567 Array=64, tries=6, loadFactor=0.6963125, memory x 1.4361368 Array=96, tries=1, loadFactor=0.15294789, memory x 6.5381746 Array=96, tries=2, loadFactor=0.31204194, memory x 3.2046974 Array=96, tries=3, loadFactor=0.42829168, memory x 2.3348575 Array=96, tries=4, loadFactor=0.5277291, memory x 1.8949116 Array=96, tries=5, loadFactor=0.60920805, memory x 1.6414753 Array=96, tries=6, loadFactor=0.66278124, memory x 1.5087935 Array=128, tries=1, loadFactor=0.15130469, memory x 6.6091805 Array=128, tries=2, loadFactor=0.3054219, memory x 3.2741597 Array=128, tries=3, loadFactor=0.43253124, memory x 2.3119717 Array=128, tries=4, loadFactor=0.5283203, memory x 1.8927912 Array=128, tries=5, loadFactor=0.6043203, memory x 1.6547517 Array=128, tries=6, loadFactor=0.66267186, memory x 1.5090425 Array=128, tries=7, loadFactor=0.70567185, memory x 1.4170892 Array=192, tries=1, loadFactor=0.14355215, memory x 6.9661093 Array=192, tries=2, loadFactor=0.32579714, memory x 3.0693946 Array=192, tries=3, loadFactor=0.42888057, memory x 2.3316514 Array=192, tries=4, loadFactor=0.5238486, memory x 1.9089485 Array=192, tries=5, loadFactor=0.59191173, memory x 1.6894411 Array=192, tries=6, loadFactor=0.65411395, memory x 1.5287856 Array=192, tries=7, loadFactor=0.6975258, memory x 1.4336387 Array=256, tries=1, loadFactor=1.0, memory x 1.0 Array=256, tries=2, loadFactor=1.0, memory x 1.0 Array=256, tries=3, loadFactor=1.0, memory x 1.0 Array=256, tries=4, loadFactor=1.0, memory x 1.0 Array=256, tries=5, loadFactor=1.0, memory x 1.0 Array=256, tries=6, loadFactor=1.0, memory x 1.0 Array=256, tries=7, loadFactor=1.0, memory x 1.0 Array=256, tries=8, loadFactor=1.0, memory x 1.0 was (Author: bruno.roustant): Open-addressing benchmark to store byte labels in an array of size between 8 and 256. "Array" means array size. "tries" means max number of slot comparisons allowed (when a slot we want to put a label to is not empty - move to the next slot). "loadFactor" = max load factor (num labels / array size) for which we can add all the labels by respecting "tries". Above this max load factor, the open-addressing fails to store all labels and fallbacks to binary search. This max load factor is an average of 1000 random constructions done by the benchmark. "memory" = space increase factor = 1 / loadFactor. Array=8, tries=1, loadFactor=0.400125, memory x 2.499219 Array=8, tries=2, loadFactor=0.63825, memory x 1.5667841 Array=8, tries=3, loadFactor=0.78675, memory x 1.2710518 Array=12, tries=1, loadFactor=0.34391674, memory x 2.9076805 Array=12, tries=2, loadFactor=0.54183316, memory x 1.8455865 Array=12, tries=3, loadFactor=0.6964172, memory x 1.4359208 Array=16, tries=1, loadFactor=0.29825, memory x 3.352892 Array=16, tries=2, loadFactor=0.5039375, memory x 1.9843731 Array=16, tries=3, loadFactor=0.646625, memory x 1.5464914 Array=16, tries=4, loadFactor=0.7493125, memory x 1.3345567 Array=24, tries=1, loadFactor=0.25141662, memory x 3.9774618 Array=24, tries=2, loadFactor=0.44929162, memory x 2.225726 Array=24, tries=3, loadFactor=0.58083373, memory x 1.7216631 Array=24, tries=4, loadFactor=0.6867924, memory x 1.4560441 Array=32, tries=1, loadFactor=0.22528125, memory x 4.4388957 Array=32, tries=2, loadFactor=0.4059375, memory x 2.4634335 Array=32, tries=3, loadFactor=0.53625, memory x 1.8648019 Array=32, tries=4, loadFactor=0.63653123, memory x 1.5710148 Array=32, tries=5, loadFactor=0.7165625, memory x 1.3955517 Array=48, tries=1, loadFactor=0.18899995, memory x 5.2910066 Array=48, tries=2, loadFactor=0.36193773, memory x 2.762906 Array=48, tries=3, loadFactor=0.49154142, memory x 2.0344167 Array=48, tries=4, loadFactor=0.5871248, memory x 1.7032154 Array=48, tries=5, loadFactor=0.6700212, memory x 1.4924902 Array=64, tries=1, loadFactor=0.17101562, memory x 5.8474193 Array=64, tries=2, loadFactor=0.33707812, memory x 2.9666712 Array=64, tries=3, loadFactor=0.46217188, memory x 2.1636972 Array=64, tries=4, loadFactor=0.56409377, memory x 1.7727549 Array=64, tries=5, loadFactor=0.6392813, memory x 1.5642567 Array=64, tries=6, loadFactor=0.6963125, memory x 1.4361368 Array=96, tries=1, loadFactor=0.15294789, memory x 6.5381746 Array=96, tries=2, loadFactor=0.31204194, memory x 3.2046974 Array=96, tries=3, loadFactor=0.42829168, memory x 2.3348575 Array=96, tries=4, loadFactor=0.5277291, memory x 1.8949116 Array=96, tries=5, loadFactor=0.60920805, memory x 1.6414753 Array=96, tries=6, loadFactor=0.66278124, memory x 1.5087935 Array=128, tries=1, loadFactor=0.15130469, memory x 6.6091805 Array=128, tries=2, loadFactor=0.3054219, memory x 3.2741597 Array=128, tries=3, loadFactor=0.43253124, memory x 2.3119717 Array=128, tries=4, loadFactor=0.5283203, memory x 1.8927912 Array=128, tries=5, loadFactor=0.6043203, memory x 1.6547517 Array=128, tries=6, loadFactor=0.66267186, memory x 1.5090425 Array=128, tries=7, loadFactor=0.70567185, memory x 1.4170892 Array=192, tries=1, loadFactor=0.14355215, memory x 6.9661093 Array=192, tries=2, loadFactor=0.32579714, memory x 3.0693946 Array=192, tries=3, loadFactor=0.42888057, memory x 2.3316514 Array=192, tries=4, loadFactor=0.5238486, memory x 1.9089485 Array=192, tries=5, loadFactor=0.59191173, memory x 1.6894411 Array=192, tries=6, loadFactor=0.65411395, memory x 1.5287856 Array=192, tries=7, loadFactor=0.6975258, memory x 1.4336387 Array=256, tries=1, loadFactor=1.0, memory x 1.0 Array=256, tries=2, loadFactor=1.0, memory x 1.0 Array=256, tries=3, loadFactor=1.0, memory x 1.0 Array=256, tries=4, loadFactor=1.0, memory x 1.0 Array=256, tries=5, loadFactor=1.0, memory x 1.0 Array=256, tries=6, loadFactor=1.0, memory x 1.0 Array=256, tries=7, loadFactor=1.0, memory x 1.0 Array=256, tries=8, loadFactor=1.0, memory x 1.0 > Reduce size of FSTs due to use of direct-addressing encoding > ------------------------------------------------------------- > > Key: LUCENE-8920 > URL: https://issues.apache.org/jira/browse/LUCENE-8920 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Mike Sokolov > Priority: Blocker > Fix For: 8.3 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Some data can lead to worst-case ~4x RAM usage due to this optimization. > Several ideas were suggested to combat this on the mailing list: > bq. I think we can improve thesituation here by tracking, per-FST instance, > the size increase we're seeing while building (or perhaps do a preliminary > pass before building) in order to decide whether to apply the encoding. > bq. we could also make the encoding a bit more efficient. For instance I > noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) > which make gaps very costly. Associating each label with a dense id and > having an intermediate lookup, ie. lookup label -> id and then id->arc offset > instead of doing label->arc directly could save a lot of space in some cases? > Also it seems that we are repeating the label in the arc metadata when > array-with-gaps is used, even though it shouldn't be necessary since the > label is implicit from the address? -- This message was sent by Atlassian Jira (v8.3.2#803003) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org