On Sun, Nov 24, 2013 at 10:44 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> We should probably talk about "internal" Lucene document IDs and
> "external" or "rebased" Lucene document IDs. The internal document IDs are
> always "per-segment" and never, ever change for that closed segment. But...
> the application would not normally see these IDs. Usually the externally
> visible Lucene document IDs have been "rebased" to add the sum total count
> of documents (both existing and deleted) of all preceding segments to the
> document IDs of a given segment, producing a "global" (across the full
> index of all segments) Lucene document ID.
>
> So, if you have those three segments, with deleted documents in the first
> two segments, and then merge those first two segments, the
> externally-visible Lucene document IDs for the third segment will suddenly
> all be different, shifted lower by the number of deleted documents that
> were just merged away, even though nothing changed in the third segment
> itself.
>

That's right, and I'm starting to think that if I keep the segment id and
the original (segment-local) offset, I don't need to rebuild that part of
the cache, because it has not been rebased (though I can always update the
deleted docs). It seems simple, so I suspect there is a catch somewhere. But
if it works, it could speed up any cache building.
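
To make the idea concrete, here is a minimal sketch in plain Java, with no
Lucene dependency. Cache entries are keyed by a segment identifier plus the
segment-local docid, so an entry never needs rebasing; it only has to be
dropped once its segment disappears in a merge. `PerSegmentCache`, its
methods, and the string segment names are hypothetical, made up for
illustration. In real code the key would have to come from the segment
reader, which I have not worked out yet.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not a Solr or Lucene class.
final class PerSegmentCache {
    // segment key -> cached values, indexed by segment-local docid
    private final Map<String, int[]> bySegment = new HashMap<>();

    // Store the values computed for one closed segment.
    void put(String segmentKey, int[] valuesByLocalDocId) {
        bySegment.put(segmentKey, valuesByLocalDocId);
    }

    boolean contains(String segmentKey) {
        return bySegment.containsKey(segmentKey);
    }

    // Look up by (segment, local docid); no docBase is involved.
    int get(String segmentKey, int localDocId) {
        int[] values = bySegment.get(segmentKey);
        if (values == null) {
            throw new IllegalArgumentException("segment not cached: " + segmentKey);
        }
        return values[localDocId];
    }

    // Called when a new searcher opens: forget segments that were merged away.
    // Segments that are still present keep their entries untouched.
    void retainOnly(Collection<String> liveSegmentKeys) {
        bySegment.keySet().retainAll(liveSegmentKeys);
    }
}
```

If the first two segments are merged into a new one, only the new segment
would need to be read; the entry for the third segment survives as is.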

Do you have any information on where the docbase of each segment is stored?
Or which Java class I should start my exploration from? [The code base is
somewhat sprawling and complex, so I'm a bit lost :)]
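
The arithmetic Jack describes can be sketched in plain Java, with no Lucene
dependency. `DocBases` and its methods are hypothetical names, and the
segment sizes used below are invented. In Lucene 4.x the corresponding value
appears to be the `docBase` field of each `AtomicReaderContext` returned by
`IndexReader.leaves()`, but treat that as a pointer to check rather than a
confirmed answer.

```java
// Hypothetical illustration of docid rebasing: not a Lucene class.
final class DocBases {
    // maxDocs[i] = number of docid slots in segment i, live and deleted alike.
    // The base of a segment is the sum of the slots in all preceding segments.
    static int[] docBases(int[] maxDocs) {
        int[] bases = new int[maxDocs.length];
        int base = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            bases[i] = base;
            base += maxDocs[i];
        }
        return bases;
    }

    // Segment-local docid -> "global" docid.
    static int toGlobal(int[] bases, int segment, int localDocId) {
        return bases[segment] + localDocId;
    }

    // "Global" docid -> index of the segment that contains it,
    // i.e. the last segment whose base is <= globalDocId.
    static int segmentOf(int[] bases, int globalDocId) {
        int segment = 0;
        for (int i = 1; i < bases.length && bases[i] <= globalDocId; i++) {
            segment = i;
        }
        return segment;
    }
}
```

With segments of 10, 8, and 5 slots the bases are 0, 10, and 18, so local
docid 4 in the third segment is global docid 22. If the first two segments
hold 3 and 2 deleted docs and get merged, the merged segment has 13 slots,
the third segment's base drops to 13, and the same document becomes global
docid 17, although its local docid is still 4.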


>
> Maybe these should be called "local" (to the segment) Lucene document IDs
> and "global" (across all segments) Lucene document IDs. Or, maybe internal
> vs. external is good enough.
>
> In short, it is completely safe to use and save Lucene document IDs, but
> only as long as no merging of segments is performed. Even one tiny merge
> and all subsequent saved document IDs are invalidated. Be careful with your
> merge policy - normally merges are happening in the background,
> automatically.
>

My tests, as per my previous email, showed that the docids in the last
segment are not that stable. I don't know whether it matters that I used a
RAMDirectory for the test, but the docids were being 'recycled': the deleted
docs were in the previous segment, then suddenly their docids were assigned
to newly added documents (so maybe Solr/Lucene does not count deleted docs
if they are at the end of a segment...?). I don't know. I'll need to explore
the index segments to understand what was going on there. Thanks for any
pointers.


  roman




>
> -- Jack Krupansky
>
> -----Original Message----- From: Erick Erickson
> Sent: Sunday, November 24, 2013 8:31 AM
> To: solr-user@lucene.apache.org
> Subject: Re: building custom cache - using lucene docids
>
>
> bq: Do I understand you correctly that when two segments get merged, the
> docids (of the original segments) remain the same?
>
> The original segments are unchanged, segments are _never_ changed after
> they're closed. But they'll be thrown away. Say you have segment1 and
> segment2 that get merged into segment3. As soon as the last searcher
> that is looking at segment1 and segment2 is closed, those two segments
> will be deleted from your disk.
>
> But for any given doc, the docid in segment3 will very likely be different
> than it was in segment1 or 2.
>
> I think you're reading too much into LUCENE-2897. I'm pretty sure the
> segment in question is not available to you anyway before this rewrite is
> done, but freely admit I don't know much about it.
>
> You're probably going to get into the whole PerSegment family of
> operations, which is something I'm not all that familiar with, so I'll
> leave explanations to others.
>
>
> On Sat, Nov 23, 2013 at 8:22 PM, Roman Chyla <roman.ch...@gmail.com>
> wrote:
>
>> Hi Erick,
>> Many thanks for the info. An additional question:
>>
>> Do I understand you correctly that when two segments get merged, the docids
>> (of the original segments) remain the same?
>>
>> (unless, perhaps, they were merged with the last index segment, which was
>> open for writing and where the docids could have suddenly changed in a
>> commit just before the merge)
>>
>> Yes, you guessed right that I am putting my code into the custom cache,
>> so it gets notified on index changes. I don't know how yet, but I think I
>> can find my way to the currently active, open (last) index segment, the one
>> that is actively updated (as opposed to just being merged). So my
>> definition of 'not last ones' is: segments where docids don't change. I'd
>> be grateful if someone could spot any problem with this assumption.
>>
>> roman
>>
>>
>>
>>
>> On Sat, Nov 23, 2013 at 7:39 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>> > bq: But can I assume that docids in other segments (other than the
>> > last one) will be relatively stable?
>> >
>> > Kinda. Maybe. Maybe not. It depends on how you define "other than the
>> > last one".
>> >
>> > The key is that the internal doc IDs may change when segments are
>> > merged. And old segments get merged. Doc IDs will _never_ change
>> > in a segment once it's closed (although as you note they may be
>> > marked as deleted). But that segment may be written to a new segment
>> > when merging and the internal ID for a given document in the new
>> > segment bears no relationship to its internal ID in the old segment.
>> >
>> > BTW, I think you only really care when opening a new searcher. There is
>> > a UserCache (see solrconfig.xml) that gets notified when a new searcher
>> > is being opened to give it an opportunity to refresh itself. Is that
>> > useful?
>> >
>> > As long as a searcher is open, it's guaranteed that nothing is changing.
>> > Hard commits with openSearcher=false don't open new searchers, which
>> > is why changes aren't visible until a softCommit or a hard commit with
>> > openSearcher=true despite the fact that the segments are closed.
>> >
>> > FWIW,
>> > Erick
>> >
>> > Best
>> > Erick
>> >
>> >
>> >
>> > On Sat, Nov 23, 2013 at 12:40 AM, Roman Chyla <roman.ch...@gmail.com>
>> > wrote:
>> >
>> > > Hi,
>> > > docids are 'ephemeral', but I'd still like to build a search cache with
>> > > them (they allow for the fastest joins).
>> > >
>> > > I'm seeing docids keep changing with updates (especially in the last
>> > > index segment), as per
>> > > https://issues.apache.org/jira/browse/LUCENE-2897
>> > >
>> > > That would be fine, because I could build the cache from a diff (of
>> > > index state) plus reading the latest index segment in its entirety. But
>> > > can I assume that docids in other segments (other than the last one)
>> > > will be relatively stable? (i.e. when an old doc is deleted, the docid
>> > > is marked as removed; update doc = delete old & create a new docid)?
>> > >
>> > > thanks
>> > >
>> > > roman
>> > >
>> >
>>
>>
>
