hendrikmuhs commented on a change in pull request #460: URL: https://github.com/apache/lucene/pull/460#discussion_r762005851
########## File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java ########## @@ -1000,6 +1027,98 @@ private void writePresenceBits( assert bytePos - dest == numPresenceBytes; } + private long estimateNodeAddress( Review comment: I agree that one pass would be easier. What makes it complicated is the variable length encodings. In order to patch up a cell you still need to reserve the right number of cells to write into. If you reserved 1 byte but later realize you need 2 you have a problem. A buffer where you can insert a cell might help, but than the whole calculation falls apart again, because the cell that you patched a step ago might need another patch and might overflow. If you look at the implementation below, this is what `tolerance` is about: I have an estimate that says, I need to store `125`, that's `1` byte, but if it the real value will be `128` I need `2` bytes, that means my `tolerance` is `2`. When the address slides I am still fine as long as it only goes up by maximum `2`. If above `2`, I stretched my budget. The algorithm returns `0` in this case, which means it falls back to absolute coding. In the end it is a heuristic which might fail, we could do 3rd pass to catch a couple more cases, but the added benefit is probably so small that it isn't worth the extra loop (I can test this later, right now it sounds premature). ^ fits with your question regarding always using relative offsets. It would be quite hard to implement. The extra bit is for free at the moment, because the flag has a fixed size. It might be a problem if you already have other uses of flags in mind and want to reserve it. ########## File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java ########## @@ -720,9 +745,9 @@ long addNode(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn) } if (targetHasArcs && (flags & BIT_TARGET_NEXT) == 0) { - assert target.node > 0; + assert targetNode > 0; Review comment: > also - if we were able to encode negative offsets we could always apply this variable encoding A negative offset would mean you target a node that has not been written yet. That's not possible by design. In addition it is not necessary, the way the fst is constructed is from the leafs to the root. A parent always has a higher offset than its child. Therefore negative offsets never happen. Absolute vs. relative coding: I wasn't able to use relative addresses all times, if "fixed length arcs" are used I switch this optimization off completely. To always use relative coding I must replace all absolute addresses. I am not sure this is even possible. When 1st experimenting with this technique years ago, I did some measurements and calculations. In smaller fst's absolute coding is sufficient, we have small addresses anyway. The larger the fst gets the more interesting relative coding becomes. If the fst is highly compressible the absolute address was often smaller than the relative one. E.g. leaf usually compress like that. For larger fst's and fst's that do not compress well (because of values) relative coding works nicely, because the construction yields good locality. ########## File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java ########## @@ -720,9 +745,9 @@ long addNode(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn) } if (targetHasArcs && (flags & BIT_TARGET_NEXT) == 0) { - assert target.node > 0; + assert targetNode > 0; Review comment: An absolute node address targeting `0` is not possible due to the reverse reading logic and that's the old assert anyway. So the question remains, can a relative target be `0`? In theory this would be a epsilon transition if I remember correctly. So the arc would follow a label and target itself like a regexp with a `*`, e.g. `a*`. In practice I think such a regex-like matching functionality is not what this implementation is made for nor do I think it is possible at the moment. The nodes are constructed as `UncompiledNode`, but in order to make such an epsilon transition you need to know the target before you push it for persistence. In simpler words, this is a chicken-egg problem. I know about some experimentation to use fst's as regexp engine, but at the moment I think this is not a problem. If it ever will be implemented other things have to be solved and at that time this assert can be removed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org