Hendrik Muhs created LUCENE-10247:
-------------------------------------

             Summary: Reduce FST size by using absolute and relative coding for 
target pointers
                 Key: LUCENE-10247
                 URL: https://issues.apache.org/jira/browse/LUCENE-10247
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/FSTs
            Reporter: Hendrik Muhs


FST's use various tricks to reduce size. One more trick that can be added is 
using relative coding for the target pointers that connect nodes.

Currently if a node has a target and it's not optimized using the `next` flag, 
it writes a VLong which contains the address of the child. This is an absolute 
address. A relative address can be shorter, relative address means taking the 
address of the node and expressing the result of child node as diff between the 
parent and the child. E.g. if the child is _10099_ and the parent {_}10000{_}, 
an absolute pointer requires 2 bytes, but the relative pointer ({_}10099 - 
10000{_}) requires only 1 byte. Of course that's not always true, the absolute 
pointer could be smaller. Therefore the idea is to switch between the two, 
dependent on what's smaller.

(FWIW: From a theory standpoint this works nicely, because the FST is 
constructed from the leafs to the root, child node addresses are always smaller 
than parent ones and due to good locality, connected nodes are usually stored 
close to each other )

I have a prototype for this, it will be pushed as a draft PR and added as link 
here. It introduces a new bit in _flags_ and changes the compiler to use 
relative coding if possible. The lookup side needs the current node address and 
checks one more bit flag. Both additions seem reasonable cheap and do not add a 
lot of runtime.

In my measurements I was able to reduce the size of a 4.2 million key 
dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 5.5%. If you have 
better reference datasets, let me know. It obviously works better the more data.

 

The POC implementation only supports 1 output type (positive int output) and I 
just made it compile. For this 1 type of dictionary it's possible to create and 
dump dictionaries correctly.

 

Before spending more time on this I like to hear feedback and get agreement on 
whether this is something you like to have. It seems to me that a size 
reduction of FST's is always welcome and as pointed out, the runtime additions 
are reasonable small. If you wonder about the implementation, I happily discuss 
the nitty gritty details on gh, it had some interesting challenges.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to