Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

via GitHub Mon, 16 Oct 2023 06:52:17 -0700


mikemccand commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1360701669



##########
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##########
@@ -21,19 +21,18 @@
 import java.util.List;
 import org.apache.lucene.store.DataInput;
 import org.apache.lucene.store.DataOutput;
-import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.RamUsageEstimator;
 
 // TODO: merge with PagedBytes, except PagedBytes doesn't
 // let you read while writing which FST needs
 
-class BytesStore extends DataOutput implements Accountable {
+class BytesStore extends FSTWriter {

Review Comment:
   > I'm kind of favoring the way Tantivy is doing, it's cleaner and more 
"facade pattern". Maybe we could first refactor so that the FST created on the 
fly will be written directly to a DataOutput, and then instead of using it 
directly, we construct a FST from that DataOutput? From the Builder 
point-of-view it will still create the FST eventually, but could pave the way 
for segregation of building & reading later (i.e Builder will only write to the 
DataOutput and it's up to the users to create the corresponding DataInput and 
construct a FST from that).
   
   +1 -- I like this approach.  Let's not special case the "build FST in memory 
and immediately use it from memory" case.  Let the OS cache the bytes we write 
and then keep them hot (in RAM) when we immediately open it for reading.  
That's a nice standalone simplification.
   
   > UPDATE: The above seems to be a breaking change and might not be a small, 
incremental one, as NodeHash needs the FST to be initialized first so that it 
can search over the previously written nodes.
   
   It indeed reads from the still-writing FST, but only to confirm whether a 
newly created node is already in the `NodeHash`.  `NodeHash` is really a 
`Set<Node>`.  Normally a Set would hold pointers to java objects as its values, 
but in our case, instead of holding a true object pointer, we hold a pointer 
back into the FST since the FST has the node already.  We could consider 
changing that so that `NodeHash` holds a copy of the full node itself (a 
`byte[]` or a slice into a shared `byte[]`).  This is a bit wasteful in RAM, 
but it'd fully decouple `NodeHash` from the growing / appending FST's `byte[]`?
   
   > If we are doing that, then we can get rid of the getReverseBytesReader. 
However one issue remains: we still need getReverseBytesReaderForSuffixSharing 
for NodeHash. Or at least some operation for random-access. I think Tantivy is 
using LRU cache for this in write-through mode: write the node into both the 
DataOutput and the LRU at the same time. This means we don't even need to read 
from the DataOutput, but it won't be perfectly minimal (as there will be 
false-negative cache-miss). I understand that there is the trade-off, but we 
might also need to support the current minimalist mode.
   
   Aha!  It sounds like Tantivy indeed does store-by-value in its LRU suffix 
cache, not store-by-reference like Lucene does today?
   
   Yet another option would be to consider allowing a still open `IndexOutput` 
to be read by an `IndexInput`?  This would be a change in the semantics of our 
IO abstractions (`Directory` and its babies), and likely problematic for some 
filesystems (HDFS?), but for most common filesystems would be easily supported. 
 Then we wouldn't have to duplicate the nodes in the `NodeHash` and in the FST.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

Reply via email to