Hi,

We are trying an implementation where we use a custom PostingsFormat for one field to write the postings directly to a third-party stable storage. The intention is to support partial updates for this field. For now, though, I want to ask about one specific problem regarding merges.
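For context, the per-field routing is done by overriding the codec, roughly as in the minimal sketch below. "KVStorePostings" and the field name are placeholders for our custom format (registered via SPI) and the field we partially update; the override itself is standard Lucene 4.2 API.

    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene42.Lucene42Codec;

    // Route one field to the custom PostingsFormat; every other field
    // keeps the default Lucene42 formats.
    public class KVStoreCodec extends Lucene42Codec {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        if ("partial_update_field".equals(field)) {          // placeholder field name
          return PostingsFormat.forName("KVStorePostings");  // our custom format
        }
        return super.getPostingsFormatForField(field);
      }
    }

The codec is then installed with IndexWriterConfig.setCodec(new KVStoreCodec()).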
Main Issue:
***********
In the indexing chain of Lucene, PerFieldPostingsFormat (which is a consumer for FreqProxTermsWriter) is only invoked at flush time. This means that if you implement a custom PostingsFormat for a field, a custom FieldsConsumer/TermsConsumer/PostingsConsumer can only act at flush time or merge time. We are stuck because a commit can happen without anything being flushed. In such a commit, an uncommitted merged segment gets committed, but our custom FieldsConsumer is not aware of this commit at all. Is there any way for a custom FieldsConsumer of a PostingsFormat to know about a "commit without anything to flush" event?

More details on what we are trying:
***********************************
We have our own implementation of the merge logic, which primarily renumbers the docIds. Because of Lucene's design, whenever a merge happens we end up writing the new merged state to the stable storage. This got us into an inconsistency with Lucene's uncommitted merged segment: our FieldsConsumer had already committed the new merged state to the stable storage.

Some details and points to note:
********************************
- Our custom FieldsConsumer is called only at flush time by Lucene, because FreqProxTermsWriterPerField invokes its consumer only from flush().
- I cannot give a lot of details here, but assume we store something like (key: segment_field_term, value: postingsArray) in the stable storage.
- There are other fields as well, which follow the default Lucene42 codec formats.

The problem: how to keep the segment information in the stable storage in sync with Lucene's uncommitted merged segments.
***************
In this case the opened Lucene directory may still have the old segments, but our key-value store has only the new merged state. The searches are done with the old segment readers, and since the old segments are no longer in the stable storage, the searches fail.

We are seeking a solution for the case where a merged segment is just checkpointed (not yet committed) but may or may not participate in a search.
**********************
So our flow is something like this (skipping the whole indexing chain for brevity):

a) updateDocument -> InvertedDocConsumer (DocFieldProcessorPerField) -> TermsHashPerField -> FreqProxTermsWriterPerField
b) DocumentsWriter.flush() -> DWPT.flush(), called through commit (do not consider merge for now)
c) This also goes through the same indexing chain, calling flush() on the consumers in the chain.
d) Finally, FreqProxTermsWriterPerField.flush() calls its consumer, which is PerFieldPostingsFormat; for this field, the custom TermsConsumer and PostingsConsumer are used.

For merge, the flow is: SegmentMerger.mergeTerms -> FieldsConsumer.merge -> TermsConsumer.merge -> PostingsConsumer.merge

Since a merge also flushes a new segment, it is no surprise that merge and flush call almost the same methods of TermsConsumer and PostingsConsumer (startTerm, startDoc, etc.). But this design is a problem for us.
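For reference, the flush-time write path of our custom format looks roughly like the sketch below. KVStore and its put() are hypothetical stand-ins for the third-party storage client, and the class names are illustrative, not our real ones; the overridden methods follow the Lucene 4.2 codec API.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    import org.apache.lucene.codecs.FieldsConsumer;
    import org.apache.lucene.codecs.PostingsConsumer;
    import org.apache.lucene.codecs.TermStats;
    import org.apache.lucene.codecs.TermsConsumer;
    import org.apache.lucene.index.FieldInfo;
    import org.apache.lucene.index.SegmentWriteState;
    import org.apache.lucene.util.BytesRef;

    // Flush-time write path: Lucene drives startTerm/startDoc/finishDoc/finishTerm,
    // and we persist one (segment_field_term -> postingsArray) entry per term.
    public class KVStoreFieldsConsumer extends FieldsConsumer {
      private final KVStore store;        // hypothetical stable-storage client
      private final String segmentName;

      public KVStoreFieldsConsumer(KVStore store, SegmentWriteState state) {
        this.store = store;
        this.segmentName = state.segmentInfo.name;
      }

      @Override
      public TermsConsumer addField(FieldInfo field) {
        return new KVTermsConsumer(field.name);
      }

      @Override
      public void close() throws IOException {}

      private class KVTermsConsumer extends TermsConsumer {
        private final String fieldName;
        private final KVPostingsConsumer postings = new KVPostingsConsumer();

        KVTermsConsumer(String fieldName) { this.fieldName = fieldName; }

        @Override
        public PostingsConsumer startTerm(BytesRef text) {
          postings.docIds.clear();
          return postings;
        }

        @Override
        public void finishTerm(BytesRef text, TermStats stats) throws IOException {
          // key: segment_field_term, value: the postings collected for this term
          store.put(segmentName + "_" + fieldName + "_" + text.utf8ToString(),
                    postings.docIds);
        }

        @Override
        public void finish(long sumTotalTermFreq, long sumDocFreq, int docCount) {}

        @Override
        public Comparator<BytesRef> getComparator() {
          return BytesRef.getUTF8SortedAsUnicodeComparator();
        }
      }

      private static class KVPostingsConsumer extends PostingsConsumer {
        final List<Integer> docIds = new ArrayList<Integer>();

        @Override
        public void startDoc(int docID, int freq) { docIds.add(docID); }

        @Override
        public void addPosition(int position, BytesRef payload,
                                int startOffset, int endOffset) {}

        @Override
        public void finishDoc() {}
      }
    }

Because Lucene drives exactly this sequence at flush time, the writes to the stable storage happen immediately. That is fine for flush, but not for merge.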
We want "flush" to write directly to the stable storage but "merge" should wait when the merged segment gets committed. And so we implement our own merge method where we writed the merged state to an in-memory structure. But we may not be able to commit this at the same time as the uncommitted merged segment gets committed. One solution we tried but did not work completely : *********************************************************** We do not allow Lucene's TermConsumer.merge to process anything because we end the TermsEnum iterator before calling super.merge. And in our custom merge method:The custom TermsConsumer creates an in-memory merge information from the given mergeState. We write(commit, so to say) this to the stable storage at the next flush because we have no other signal to write this. However, the problem with this approach is we are dependent on another document to be added and flushed for our in-memory merge info to get committed to stable storage. IndexWriter's commit does not communicate with the FieldConsumers directly but only through DocumentWriter.flush() and this goes via the indexing chain. However, same is not true with the merged checkpointed segment, any commit will commit all the uncommitted segments without any flush requirement. So for eg, if we use Solr's Optimize command, after doing a forceMerge() everything is flushed and then a commit is issued. In this commit, the custom FieldConsumer are not invoked and they do not get a chance to commit any uncommitted in-memory information. So we end up in a problem with Optimize command since the merged segment is now committed but our own in-memory merged state is not committed. Thanks in advance for reading this long question. Any thoughts are welcome. If you are aware of some implementation doing partial updates through custom codecs, please do let me know. Kind Regards, Aditya Tripathi.