[GitHub] [lucene] hendrikmuhs commented on a change in pull request #460: LUCENE-10247 - reduce size of FSTs by relative coding

GitBox Fri, 03 Dec 2021 06:57:27 -0800


hendrikmuhs commented on a change in pull request #460:
URL: https://github.com/apache/lucene/pull/460#discussion_r762005851




##########
File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java
##########
@@ -1000,6 +1027,98 @@ private void writePresenceBits(
     assert bytePos - dest == numPresenceBytes;
   }
 
+  private long estimateNodeAddress(

Review comment:
       I agree that one pass would be easier. What makes it complicated is the 
variable length encodings. In order to patch up a cell you still need to 
reserve the right number of cells to write into. If you reserved 1 byte but 
later realize you need 2 you have a problem. A buffer where you can insert a 
cell might help, but than the whole calculation falls apart again, because the 
cell that you patched a step ago might need another patch and might overflow. 
If you look at the implementation below, this is what `tolerance` is about:
   
   I have an estimate that says, I need to store `125`, that's `1` byte, but if 
it the real value will be `128` I need `2` bytes, that means my `tolerance` is 
`2`. When the address slides I am still fine as long as it only goes up by 
maximum `2`. If above `2`, I stretched my budget. The algorithm returns `0` in 
this case, which means it falls back to absolute coding. In the end it is a 
heuristic which might fail, we could do 3rd pass to catch a couple more cases, 
but the added benefit is probably so small that it isn't worth the extra loop 
(I can test this later, right now it sounds premature).
   
   ^ fits with your question regarding always using relative offsets. It would 
be quite hard to implement. The extra bit is for free at the moment, because 
the flag has a fixed size. It might be a problem if you already have other uses 
of flags in mind and want to reserve it.

##########
File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java
##########
@@ -720,9 +745,9 @@ long addNode(FSTCompiler<T> fstCompiler, 
FSTCompiler.UnCompiledNode<T> nodeIn)
       }
 
       if (targetHasArcs && (flags & BIT_TARGET_NEXT) == 0) {
-        assert target.node > 0;
+        assert targetNode > 0;

Review comment:
       > also - if we were able to encode negative offsets we could always 
apply this variable encoding
   
   A negative offset would mean you target a node that has not been written 
yet. That's not possible by design. In addition it is not necessary, the way 
the fst is constructed is from the leafs to the root. A parent always has a 
higher offset than its child. Therefore negative offsets never happen.
   
   Absolute vs. relative coding: I wasn't able to use relative addresses all 
times, if "fixed length arcs" are used I switch this optimization off 
completely. To always use relative coding I must replace all absolute 
addresses. I am not sure this is even possible.
   
   When 1st experimenting with this technique years ago, I did some 
measurements and calculations. In smaller fst's absolute coding is sufficient, 
we have small addresses anyway. The larger the fst gets the more interesting 
relative coding becomes. If the fst is highly compressible the absolute address 
was often smaller than the relative one. E.g. leaf usually compress like that. 
For larger fst's and fst's that do not compress well (because of values) 
relative coding works nicely, because the construction yields good locality.

##########
File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java
##########
@@ -720,9 +745,9 @@ long addNode(FSTCompiler<T> fstCompiler, 
FSTCompiler.UnCompiledNode<T> nodeIn)
       }
 
       if (targetHasArcs && (flags & BIT_TARGET_NEXT) == 0) {
-        assert target.node > 0;
+        assert targetNode > 0;

Review comment:
       An absolute node address targeting `0` is not possible due to the 
reverse reading logic and that's the old assert anyway.
   
   So the question remains, can a relative target be `0`?
   
   In theory this would be a epsilon transition if I remember correctly. So the 
arc would follow a label and target itself like a regexp with a `*`, e.g. `a*`.
   
   In practice I think such a regex-like matching functionality is not what 
this implementation is made for nor do I think it is possible at the moment. 
The nodes are constructed as `UncompiledNode`, but in order to make such an 
epsilon transition you need to know the target before you push it for 
persistence. In simpler words, this is a chicken-egg problem.
   
   I know about some experimentation to use fst's as regexp engine, but at the 
moment I think this is not a problem. If it ever will be implemented other 
things have to be solved and at that time this assert can be removed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] hendrikmuhs commented on a change in pull request #460: LUCENE-10247 - reduce size of FSTs by relative coding

Reply via email to