Hi,

We are trying an implementation where we use a custom PostingsFormat for one field to write the postings directly to a third-party stable storage. The intention is to support partial updates for this field. For now, though, I want to ask about one specific problem regarding merges.
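For context, the per-field routing is done by overriding the codec, roughly as in the minimal sketch below. "KVStorePostings" and the field name are placeholders for our custom format (registered via SPI) and the field we partially update; the override itself is standard Lucene 4.2 API.

    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene42.Lucene42Codec;

    // Route one field to the custom PostingsFormat; every other field
    // keeps the default Lucene42 formats.
    public class KVStoreCodec extends Lucene42Codec {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        if ("partial_update_field".equals(field)) {          // placeholder field name
          return PostingsFormat.forName("KVStorePostings");  // our custom format
        }
        return super.getPostingsFormatForField(field);
      }
    }

The codec is then installed with IndexWriterConfig.setCodec(new KVStoreCodec()).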
Main Issue:
***********
In the indexing chain of Lucene, PerFieldPostingsFormat (which is a consumer for FreqProxTermsWriter) is only invoked at flush time. This means that if you implement a custom PostingsFormat for a field, a custom FieldsConsumer/TermsConsumer/PostingsConsumer can only act at flush time or merge time. We are stuck because a commit can happen without anything being flushed. In such a commit, an uncommitted merged segment gets committed, but our custom FieldsConsumer is not aware of this commit at all. Is there any way for a custom FieldsConsumer of a PostingsFormat to know about a "commit without anything to flush" event?

More details on what we are trying:
***********************************
We have our own implementation of the merge logic, which primarily renumbers the docIds. Because of Lucene's design, whenever a merge happens we end up writing the new merged state to the stable storage. This got us into an inconsistency with Lucene's uncommitted merged segment: our FieldsConsumer had already committed the new merged state to the stable storage.

Some details and points to note:
********************************
- Our custom FieldsConsumer is called only at flush time by Lucene, because FreqProxTermsWriterPerField invokes its consumer only from flush().
- I cannot give a lot of details here, but assume we store something like (key: segment_field_term, value: postingsArray) in the stable storage.
- There are other fields as well, which follow the default Lucene42 codec formats.

The problem: how to keep the segment information in the stable storage in sync with Lucene's uncommitted merged segments.
***************
In this case the opened Lucene directory may still have the old segments, but our key-value store has only the new merged state. The searches are done with the old segment readers, and since the old segments are no longer in the stable storage, the searches fail.

We are seeking a solution for the case where a merged segment is just checkpointed (not yet committed) but may or may not participate in a search.
**********************
So our flow is something like this (skipping the whole indexing chain for brevity):

a) updateDocument -> InvertedDocConsumer (DocFieldProcessorPerField) -> TermsHashPerField -> FreqProxTermsWriterPerField
b) DocumentsWriter.flush() -> DWPT.flush(), called through commit (do not consider merge for now)
c) This also goes through the same indexing chain, calling flush() on the consumers in the chain.
d) Finally, FreqProxTermsWriterPerField.flush() calls its consumer, which is PerFieldPostingsFormat; for this field, the custom TermsConsumer and PostingsConsumer are used.

For merge, the flow is: SegmentMerger.mergeTerms -> FieldsConsumer.merge -> TermsConsumer.merge -> PostingsConsumer.merge

Since a merge also flushes a new segment, it is no surprise that merge and flush call almost the same methods of TermsConsumer and PostingsConsumer (startTerm, startDoc, etc.). But this design is a problem for us.
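For reference, the flush-time write path of our custom format looks roughly like the sketch below. KVStore and its put() are hypothetical stand-ins for the third-party storage client, and the class names are illustrative, not our real ones; the overridden methods follow the Lucene 4.2 codec API.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    import org.apache.lucene.codecs.FieldsConsumer;
    import org.apache.lucene.codecs.PostingsConsumer;
    import org.apache.lucene.codecs.TermStats;
    import org.apache.lucene.codecs.TermsConsumer;
    import org.apache.lucene.index.FieldInfo;
    import org.apache.lucene.index.SegmentWriteState;
    import org.apache.lucene.util.BytesRef;

    // Flush-time write path: Lucene drives startTerm/startDoc/finishDoc/finishTerm,
    // and we persist one (segment_field_term -> postingsArray) entry per term.
    public class KVStoreFieldsConsumer extends FieldsConsumer {
      private final KVStore store;        // hypothetical stable-storage client
      private final String segmentName;

      public KVStoreFieldsConsumer(KVStore store, SegmentWriteState state) {
        this.store = store;
        this.segmentName = state.segmentInfo.name;
      }

      @Override
      public TermsConsumer addField(FieldInfo field) {
        return new KVTermsConsumer(field.name);
      }

      @Override
      public void close() throws IOException {}

      private class KVTermsConsumer extends TermsConsumer {
        private final String fieldName;
        private final KVPostingsConsumer postings = new KVPostingsConsumer();

        KVTermsConsumer(String fieldName) { this.fieldName = fieldName; }

        @Override
        public PostingsConsumer startTerm(BytesRef text) {
          postings.docIds.clear();
          return postings;
        }

        @Override
        public void finishTerm(BytesRef text, TermStats stats) throws IOException {
          // key: segment_field_term, value: the postings collected for this term
          store.put(segmentName + "_" + fieldName + "_" + text.utf8ToString(),
                    postings.docIds);
        }

        @Override
        public void finish(long sumTotalTermFreq, long sumDocFreq, int docCount) {}

        @Override
        public Comparator<BytesRef> getComparator() {
          return BytesRef.getUTF8SortedAsUnicodeComparator();
        }
      }

      private static class KVPostingsConsumer extends PostingsConsumer {
        final List<Integer> docIds = new ArrayList<Integer>();

        @Override
        public void startDoc(int docID, int freq) { docIds.add(docID); }

        @Override
        public void addPosition(int position, BytesRef payload,
                                int startOffset, int endOffset) {}

        @Override
        public void finishDoc() {}
      }
    }

Because Lucene drives exactly this sequence at flush time, the writes to the stable storage happen immediately. That is fine for flush, but not for merge.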
We want "flush" to write directly to the stable storage but "merge" should wait when the merged segment gets committed. And so we implement our own merge method where we writed the merged state to an in-memory structure. But we may not be able to commit this at the same time as the uncommitted merged segment gets committed. One solution we tried but did not work completely : *********************************************************** We do not allow Lucene's TermConsumer.merge to process anything because we end the TermsEnum iterator before calling super.merge. And in our custom merge method:The custom TermsConsumer creates an in-memory merge information from the given mergeState. We write(commit, so to say) this to the stable storage at the next flush because we have no other signal to write this. However, the problem with this approach is we are dependent on another document to be added and flushed for our in-memory merge info to get committed to stable storage. IndexWriter's commit does not communicate with the FieldConsumers directly but only through DocumentWriter.flush() and this goes via the indexing chain. However, same is not true with the merged checkpointed segment, any commit will commit all the uncommitted segments without any flush requirement. So for eg, if we use Solr's Optimize command, after doing a forceMerge() everything is flushed and then a commit is issued. In this commit, the custom FieldConsumer are not invoked and they do not get a chance to commit any uncommitted in-memory information. So we end up in a problem with Optimize command since the merged segment is now committed but our own in-memory merged state is not committed. Thanks in advance for reading this long question. Any thoughts are welcome. If you are aware of some implementation doing partial updates through custom codecs, please do let me know. Kind Regards, Aditya Tripathi.