Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

via GitHub Thu, 12 Oct 2023 16:50:01 -0700


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1353826324



##########
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##########
@@ -21,19 +21,18 @@
 import java.util.List;
 import org.apache.lucene.store.DataInput;
 import org.apache.lucene.store.DataOutput;
-import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.RamUsageEstimator;
 
 // TODO: merge with PagedBytes, except PagedBytes doesn't
 // let you read while writing which FST needs
 
-class BytesStore extends DataOutput implements Accountable {
+class BytesStore extends FSTWriter {

Review Comment:
   > Once FST.addNode completes, those written bytes are never altered?
   
   ~~More precisely, those bytes are never altered after `FSTCompiler.add` 
completes. It seems we need to write all nodes of the input in reverse as a 
whole.~~ It seems it's possible to flush the scratch bytes after every addNode. 
yeah the BytesStore use the head of the buffer as scratch area.
   
   > Hmm getReverseBytesReader is indeed a problem. I wonder how the [Tantivy 
FST impl](https://blog.burntsushi.net/transducers/) deals with this?
   
   It seems Tantivy segregate the building and the traverse of FST as 2 
different entity. The FST Builder will just write the FST to a DataOutput and 
not allow it to be read directly. I was thinking of this too, as currently we 
are mixing up the writing and reading:
   - Load a previously saved FST from a DataInput. This is read-only and is 
fine, and it's how Tantivy FST is created as well.
   - Construct a FST on-the-fly and use it right away. This is both read & 
write and it uses BytesStore.
   
   I'm kind of favoring the way Tantivy is doing, it's cleaner and more "facade 
pattern". Maybe we could first refactor so that the FST created on the fly will 
be written directly to a DataOutput, and then instead of using it directly, we 
construct a FST from that DataOutput? From the Builder point-of-view it will 
still create the FST eventually, but could pave the way for segregation of 
building & reading later (i.e Builder will only write to the DataOutput and 
it's up to the users to create the corresponding DataInput and construct a FST 
from that).
   
   UPDATE: The above seems to be a breaking change and might not be a small, 
incremental one, as NodeHash needs the FST to be initialized first so that it 
can search over the previously written nodes.
   
   If we are doing that, then we can get rid of the `getReverseBytesReader`. 
However one issue remains: we still need 
`getReverseBytesReaderForSuffixSharing` for NodeHash. Or at least some 
operation for random-access. I think Tantivy is using LRU cache for this in 
write-through mode: write the node into both the DataOutput and the LRU at the 
same time. This means we don't even need to read from the DataOutput, but it 
won't be perfectly minimal (as there will be false-negative cache-miss). I 
understand that there is the trade-off, but we might also need to support the 
current minimalist mode.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

Reply via email to