mikemccand commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1360701669
########## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ########## @@ -21,19 +21,18 @@ import java.util.List; import org.apache.lucene.store.DataInput; import org.apache.lucene.store.DataOutput; -import org.apache.lucene.util.Accountable; import org.apache.lucene.util.RamUsageEstimator; // TODO: merge with PagedBytes, except PagedBytes doesn't // let you read while writing which FST needs -class BytesStore extends DataOutput implements Accountable { +class BytesStore extends FSTWriter { Review Comment: > I'm kind of favoring the way Tantivy is doing, it's cleaner and more "facade pattern". Maybe we could first refactor so that the FST created on the fly will be written directly to a DataOutput, and then instead of using it directly, we construct a FST from that DataOutput? From the Builder point-of-view it will still create the FST eventually, but could pave the way for segregation of building & reading later (i.e Builder will only write to the DataOutput and it's up to the users to create the corresponding DataInput and construct a FST from that). +1 -- I like this approach. Let's not special case the "build FST in memory and immediately use it from memory" case. Let the OS cache the bytes we write and then keep them hot (in RAM) when we immediately open it for reading. That's a nice standalone simplification. > UPDATE: The above seems to be a breaking change and might not be a small, incremental one, as NodeHash needs the FST to be initialized first so that it can search over the previously written nodes. It indeed reads from the still-writing FST, but only to confirm whether a newly created node is already in the `NodeHash`. `NodeHash` is really a `Set<Node>`. Normally a Set would hold pointers to java objects as its values, but in our case, instead of holding a true object pointer, we hold a pointer back into the FST since the FST has the node already. We could consider changing that so that `NodeHash` holds a copy of the full node itself (a `byte[]` or a slice into a shared `byte[]`). This is a bit wasteful in RAM, but it'd fully decouple `NodeHash` from the growing / appending FST's `byte[]`? > If we are doing that, then we can get rid of the getReverseBytesReader. However one issue remains: we still need getReverseBytesReaderForSuffixSharing for NodeHash. Or at least some operation for random-access. I think Tantivy is using LRU cache for this in write-through mode: write the node into both the DataOutput and the LRU at the same time. This means we don't even need to read from the DataOutput, but it won't be perfectly minimal (as there will be false-negative cache-miss). I understand that there is the trade-off, but we might also need to support the current minimalist mode. Aha! It sounds like Tantivy indeed does store-by-value in its LRU suffix cache, not store-by-reference like Lucene does today? Yet another option would be to consider allowing a still open `IndexOutput` to be read by an `IndexInput`? This would be a change in the semantics of our IO abstractions (`Directory` and its babies), and likely problematic for some filesystems (HDFS?), but for most common filesystems would be easily supported. Then we wouldn't have to duplicate the nodes in the `NodeHash` and in the FST. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org