[ https://issues.apache.org/jira/browse/LUCENE-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447323#comment-17447323 ]
Dawid Weiss commented on LUCENE-10247: -------------------------------------- Sure, I'll take a look later. Multi-word suggestion dictionaries trained on large data are typically good candidates for a stress test (as they contain high branching factors and are large). > Reduce FST size by using absolute and relative coding for target pointers > ------------------------------------------------------------------------- > > Key: LUCENE-10247 > URL: https://issues.apache.org/jira/browse/LUCENE-10247 > Project: Lucene - Core > Issue Type: Improvement > Components: core/FSTs > Reporter: Hendrik Muhs > Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > FST's use various tricks to reduce size. One more trick that can be added is > using relative coding for the target pointers that connect nodes. > Currently if a node has a target and it's not optimized using the `next` > flag, it writes a VLong which contains the address of the child. This is an > absolute address. A relative address can be shorter, relative address means > taking the address of the node and expressing the result of child node as > diff between the parent and the child. E.g. if the child is _10099_ and the > parent {_}10000{_}, an absolute pointer requires 2 bytes, but the relative > pointer ({_}10099 - 10000{_}) requires only 1 byte. Of course that's not > always true, the absolute pointer could be smaller. Therefore the idea is to > switch between the two, dependent on what's smaller. > (FWIW: From a theory standpoint this works nicely, because the FST is > constructed from the leafs to the root, child node addresses are always > smaller than parent ones and due to good locality, connected nodes are > usually stored close to each other ) > I have a prototype for this, it will be pushed as a draft PR and added as > link here. It introduces a new bit in _flags_ and changes the compiler to use > relative coding if possible. The lookup side needs the current node address > and checks one more bit flag. Both additions seem reasonable cheap and do not > add a lot of runtime. > In my measurements I was able to reduce the size of a 4.2 million key > dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 5.5%. If you > have better reference datasets, let me know. It obviously works better the > more data. > The POC implementation only supports 1 output type (positive int output) and > I just made it compile. For this 1 type of dictionary it's possible to create > and dump dictionaries correctly. > Before spending more time on this I like to hear feedback and get agreement > on whether this is something you like to have. It seems to me that a size > reduction of FST's is always welcome and as pointed out, the runtime > additions are reasonable small. If you wonder about the implementation, I > happily discuss the nitty gritty details on gh, it had some interesting > challenges. -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org