On Sun, Nov 24, 2013 at 10:44 AM, Jack Krupansky <j...@basetechnology.com> wrote:
> We should probably talk about "internal" Lucene document IDs and
> "external" or "rebased" Lucene document IDs. The internal document IDs are
> always "per-segment" and never, ever change for that closed segment. But...
> the application would not normally see these IDs. Usually the externally
> visible Lucene document IDs have been "rebased" to add the sum total count
> of documents (both existing and deleted) of all preceding segments to the
> document IDs of a given segment, producing a "global" (across the full
> index of all segments) Lucene document ID.
>
> So, if you have those three segments, with deleted documents in the first
> two segments, and then merge those first two segments, the
> externally-visible Lucene document IDs for the third segment will suddenly
> all be different, shifted lower by the number of deleted documents that
> were just merged away, even though nothing changed in the third segment
> itself.
>

That's right, and I'm starting to think that if I keep the segment id and
the original offset, I don't need to rebuild that part of the cache, because
it has not been rebased (but I can always update the deleted docs). It seems
simple, so I suspect there is a catch somewhere. But if it works, it could
potentially speed up any cache building.

Do you have information on where the docbase of a segment is stored? Or
which Java class I should start my exploration from? [It is a somewhat
sprawling complex, so I'm a bit lost :)]

>
> Maybe these should be called "local" (to the segment) Lucene document IDs
> and "global" (across all segments) Lucene document IDs. Or, maybe internal
> vs. external is good enough.
>
> In short, it is completely safe to use and save Lucene document IDs, but
> only as long as no merging of segments is performed. Even one tiny merge
> and all subsequent saved document IDs are invalidated. Be careful with your
> merge policy - normally merges are happening in the background,
> automatically.
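
A minimal sketch of the rebasing arithmetic described above, written in plain Java with no Lucene dependency so it can be run on its own. The class name `DocBaseDemo`, its methods, and the segment sizes are made up for illustration; they are not Lucene API. In Lucene 4.x itself, the per-segment base is exposed as the `docBase` field of `AtomicReaderContext`, reachable through `IndexReader.leaves()`, which is probably a reasonable place to start exploring.

```java
import java.util.Arrays;

/** Toy model of docid rebasing; hypothetical names, not Lucene classes. */
final class DocBaseDemo {

    /**
     * docBase of segment i = sum of maxDoc over segments 0..i-1.
     * maxDoc counts deleted documents too, because a closed segment
     * keeps their slots until it is merged away.
     */
    static int[] docBases(int[] maxDocs) {
        int[] bases = new int[maxDocs.length];
        int running = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            bases[i] = running;
            running += maxDocs[i];
        }
        return bases;
    }

    /** Global ("rebased") id = docBase of the segment + local id inside it. */
    static int globalId(int[] bases, int segment, int localId) {
        return bases[segment] + localId;
    }

    public static void main(String[] args) {
        // Three segments with maxDoc 10 (3 deleted), 8 (2 deleted), 5 (none deleted).
        int[] before = docBases(new int[] {10, 8, 5});
        // Merging the first two drops the 5 deleted docs: one segment of 13 docs.
        int[] after = docBases(new int[] {13, 5});

        System.out.println(Arrays.toString(before)); // [0, 10, 18]
        System.out.println(Arrays.toString(after));  // [0, 13]
        // Local id 2 in the untouched last segment: global 20 before, 15 after.
        System.out.println(globalId(before, 2, 2) + " -> " + globalId(after, 1, 2));
    }
}
```

The local id (2) never changes; only the base in front of it moves, by exactly the number of deleted documents that the merge removed.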

My tests, as per my previous email, showed that the docids in the last
segment are not that stable. I don't know if it matters that I used a
RAMDirectory for the test, but the docids were being 'recycled': the deleted
docs were in the previous segment, then suddenly their docids belonged to
newly added documents (so maybe Solr/Lucene is not counting deleted docs if
they are at the end of a segment...?). I don't know. I'll need to explore
the index segments to understand what was going on there.

Thanks for any possible pointers,

roman

>
> -- Jack Krupansky
>
> -----Original Message----- From: Erick Erickson
> Sent: Sunday, November 24, 2013 8:31 AM
> To: solr-user@lucene.apache.org
> Subject: Re: building custom cache - using lucene docids
>
>
> bq: Do i understand you correctly that when two segments get merged, the
> docids (of the original segments) remain the same?
>
> The original segments are unchanged, segments are _never_ changed after
> they're closed. But they'll be thrown away. Say you have segment1 and
> segment2 that get merged into segment3. As soon as the last searcher
> that is looking at segment1 and segment2 is closed, those two segments
> will be deleted from your disk.
>
> But for any given doc, the docid in segment3 will very likely be different
> than it was in segment1 or 2.
>
> I think you're reading too much into LUCENE-2897. I'm pretty sure the
> segment in question is not available to you anyway before this rewrite is
> done, but freely admit I don't know much about it.
>
> You're probably going to get into the whole PerSegment family of
> operations, which is something I'm not all that familiar with so I'll
> leave explanations to others.
>
>
> On Sat, Nov 23, 2013 at 8:22 PM, Roman Chyla <roman.ch...@gmail.com>
> wrote:
>
>> Hi Erick,
>> Many thanks for the info. An additional question:
>>
>> Do i understand you correctly that when two segments get merged, the docids
>> (of the original segments) remain the same?
>>
>> (unless, perhaps in situation, they were merged using the last index
>> segment which was opened for writing and where the docids could have
>> suddenly changed in a commit just before the merge)
>>
>> Yes, you guessed right that I am putting my code into the custom cache -
>> so it gets notified on index changes. I don't know yet how, but I think I
>> can find the way to the current active, opened (last) index segment. Which
>> is actively updated (as opposed to just being merged) -- so my definition
>> of 'not last ones' is: where docids don't change. I'd be grateful if
>> someone could spot any problem with such an assumption.
>>
>> roman
>>
>>
>>
>>
>> On Sat, Nov 23, 2013 at 7:39 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>> > bq: But can I assume
>> > that docids in other segments (other than the last one) will be
>> > relatively stable?
>> >
>> > Kinda. Maybe. Maybe not. It depends on how you define "other than the
>> > last one".
>> >
>> > The key is that the internal doc IDs may change when segments are
>> > merged. And old segments get merged. Doc IDs will _never_ change
>> > in a segment once it's closed (although as you note they may be
>> > marked as deleted). But that segment may be written to a new segment
>> > when merging and the internal ID for a given document in the new
>> > segment bears no relationship to internal ID in the old segment.
>> >
>> > BTW, I think you only really care when opening a new searcher. There is
>> > a UserCache (see solrconfig.xml) that gets notified when a new searcher
>> > is being opened to give it an opportunity to refresh itself, is that
>> > useful?
>> >
>> > As long as a searcher is open, it's guaranteed that nothing is changing.
>> > Hard commits with openSearcher=false don't open new searchers, which
>> > is why changes aren't visible until a softCommit or a hard commit with
>> > openSearcher=true despite the fact that the segments are closed.
>> >
>> > FWIW,
>> > Erick
>> >
>> > Best
>> > Erick
>> >
>> >
>> >
>> > On Sat, Nov 23, 2013 at 12:40 AM, Roman Chyla <roman.ch...@gmail.com>
>> > wrote:
>> >
>> > > Hi,
>> > > docids are 'ephemeral', but I'd still like to build a search cache
>> > > with them (they allow for the fastest joins).
>> > >
>> > > I'm seeing docids keep changing with updates (especially in the last
>> > > index segment) - as per
>> > > https://issues.apache.org/jira/browse/LUCENE-2897
>> > >
>> > > That would be fine, because I could build the cache from a diff (of
>> > > index state) + reading the latest index segment in its entirety. But
>> > > can I assume that docids in other segments (other than the last one)
>> > > will be relatively stable? (i.e. when an old doc is deleted, the docid
>> > > is marked as removed; update doc = delete old & create a new docid)?
>> > >
>> > > thanks
>> > >
>> > > roman
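
The idea raised in this thread (keep the segment id plus the original offset, and rebuild only what is new) could look roughly like the sketch below. It is plain Java with no Lucene dependency; `PerSegmentCacheDemo`, `Segment`, the segment names, and the placeholder `int[]` payload are all hypothetical. It only shows the bookkeeping, under the assumption that a closed segment's local docids never change. Deletions inside a kept segment would still have to be checked separately.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy cache keyed by segment name; hypothetical names, not Lucene classes. */
final class PerSegmentCacheDemo {

    /** Stand-in for a closed segment: a name and its number of docid slots. */
    static final class Segment {
        final String name;
        final int maxDoc;
        Segment(String name, int maxDoc) { this.name = name; this.maxDoc = maxDoc; }
    }

    /** One array per segment, indexed by the segment-local docid. */
    private Map<String, int[]> bySegment = new HashMap<>();
    int rebuilds = 0;

    /** Builds entries only for unseen segments; drops entries of vanished ones. */
    void refresh(List<Segment> current) {
        Map<String, int[]> next = new HashMap<>();
        for (Segment s : current) {
            int[] values = bySegment.get(s.name);
            if (values == null) {
                values = new int[s.maxDoc]; // placeholder for the expensive per-doc work
                rebuilds++;
            }
            next.put(s.name, values);
        }
        bySegment = next;
    }

    int[] entry(String segmentName) {
        return bySegment.get(segmentName);
    }

    public static void main(String[] args) {
        PerSegmentCacheDemo cache = new PerSegmentCacheDemo();
        cache.refresh(Arrays.asList(
                new Segment("_0", 10), new Segment("_1", 8), new Segment("_2", 5)));
        int[] last = cache.entry("_2");

        // _0 and _1 are merged into _3; _2 is untouched, so its entry is reused.
        cache.refresh(Arrays.asList(new Segment("_3", 13), new Segment("_2", 5)));

        System.out.println(cache.rebuilds);            // 4
        System.out.println(last == cache.entry("_2")); // true
    }
}
```

Because entries are addressed by (segment, local docid), the shift in global docids after a merge does not touch the entry for `_2`; a global id would only be computed at lookup time by adding the segment's current base.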