[ https://issues.apache.org/jira/browse/LUCENE-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447282#comment-17447282 ]
Dawid Weiss commented on LUCENE-10247: -------------------------------------- Hi [~hendrikmuhs]! This sounds interesting - didn't look at the patch yet - but I think the examination of performance should be twofold: the size of the output dictionary and the performance impact of additional calculations. 5% size improvement may not be worth it if there is a speed degradation there. {code} 4.2 million key dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 5.5%. {code} I recall we _did_ have space-related optimizations in Lucene at some point but they were removed exactly for this reason (time it took to optimize/ serialize and then search over the automaton vs. size benefits). You can take a look at this ancient paper which discusses those size-reduction techniques, including global graph node rearrangements to minimize pointer size: https://www.cs.put.poznan.pl/dweiss/site/publications/download/fsacomp.pdf These are still used in Morfologik, by the way, so you could try to estimate what kind of benefits you're looking at for your dictionaries: https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/main/java/morfologik/fsa/builders/CFSA2Serializer.java My experience with the above project is that in 99% of cases runtime speed is much preferred over disk size (and this is achieved by simpler state encoding). > Reduce FST size by using absolute and relative coding for target pointers > ------------------------------------------------------------------------- > > Key: LUCENE-10247 > URL: https://issues.apache.org/jira/browse/LUCENE-10247 > Project: Lucene - Core > Issue Type: Improvement > Components: core/FSTs > Reporter: Hendrik Muhs > Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > FST's use various tricks to reduce size. One more trick that can be added is > using relative coding for the target pointers that connect nodes. > Currently if a node has a target and it's not optimized using the `next` > flag, it writes a VLong which contains the address of the child. This is an > absolute address. A relative address can be shorter, relative address means > taking the address of the node and expressing the result of child node as > diff between the parent and the child. E.g. if the child is _10099_ and the > parent {_}10000{_}, an absolute pointer requires 2 bytes, but the relative > pointer ({_}10099 - 10000{_}) requires only 1 byte. Of course that's not > always true, the absolute pointer could be smaller. Therefore the idea is to > switch between the two, dependent on what's smaller. > (FWIW: From a theory standpoint this works nicely, because the FST is > constructed from the leafs to the root, child node addresses are always > smaller than parent ones and due to good locality, connected nodes are > usually stored close to each other ) > I have a prototype for this, it will be pushed as a draft PR and added as > link here. It introduces a new bit in _flags_ and changes the compiler to use > relative coding if possible. The lookup side needs the current node address > and checks one more bit flag. Both additions seem reasonable cheap and do not > add a lot of runtime. > In my measurements I was able to reduce the size of a 4.2 million key > dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 5.5%. If you > have better reference datasets, let me know. It obviously works better the > more data. > The POC implementation only supports 1 output type (positive int output) and > I just made it compile. For this 1 type of dictionary it's possible to create > and dump dictionaries correctly. > Before spending more time on this I like to hear feedback and get agreement > on whether this is something you like to have. It seems to me that a size > reduction of FST's is always welcome and as pointed out, the runtime > additions are reasonable small. If you wonder about the implementation, I > happily discuss the nitty gritty details on gh, it had some interesting > challenges. -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org