Re: building custom cache - using lucene docids

Roman Chyla Mon, 25 Nov 2013 21:35:40 -0800

OK, I've spent some time reading the solr/lucene4x classes, and this is
my understanding (feel free to correct me ;-))


DirectoryReader holds the opened segments -- each segment has its own
reader, and the BaseCompositeReader (or classes extending it) stores the
starting docid offset of each segment; e.g. [0, 5, 22] means there are
2 segments, with 5 and 17 docs respectively
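
To make the offsets concrete, here is a minimal standalone sketch of the
lookup (it does not use the Lucene classes; DocBaseLookup, segmentOf and
localDocId are names I made up, and it assumes no segment is empty):

```java
import java.util.Arrays;

// Standalone model of mapping a global docid to (segment, local docid).
// Hypothetical helper for illustration only; not a Lucene class.
final class DocBaseLookup {

    // starts[i] is the first global docid of segment i;
    // the last entry is the total number of docs (maxDoc).
    static int segmentOf(int[] starts, int globalDocId) {
        int maxDoc = starts[starts.length - 1];
        if (globalDocId < 0 || globalDocId >= maxDoc) {
            throw new IllegalArgumentException("docid out of range: " + globalDocId);
        }
        // Search only the segment start offsets (exclude the trailing maxDoc).
        int pos = Arrays.binarySearch(starts, 0, starts.length - 1, globalDocId);
        // Exact hit: the docid is the first doc of segment pos.
        // Miss: pos == -(insertionPoint) - 1, and the owning segment is
        // the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Docid relative to the start of its own segment.
    static int localDocId(int[] starts, int globalDocId) {
        return globalDocId - starts[segmentOf(starts, globalDocId)];
    }

    public static void main(String[] args) {
        int[] starts = {0, 5, 22}; // 2 segments: 5 docs and 17 docs
        int doc = 7;
        System.out.println("segment " + segmentOf(starts, doc)
                + ", local docid " + localDocId(starts, doc));
        // prints: segment 1, local docid 2
    }
}
```

So a cache keyed by (segment, local docid) stays valid for a segment even
when the global offsets shift.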

The segments are listed in the segments_N file,
http://lucene.apache.org/core/3_0_3/fileformats.html#Segments File

So theoretically, the order of segments could change when a merge happens -
yet every SegmentReader is identified by a unique name, and this name
doesn't change unless the segment itself changed (ie. docs were deleted; or
got more docs) - so it is possible to rely on this name to know what has
not changed

The name comes from SegmentInfo (check its toString method) -- the
SegmentInfo has an equals() method that treats readers with the same name
and the same dir as equal (which is useful to know - two readers, one with
deletes, one without, are equal)

Lucene's FieldCache itself is rather complex, but it shows there is a very
clever mechanism (a few actually!) -- a class can register a listener that
will be called whenever an index segment is being closed (this could be
used to invalidate portions of a cache). The relevant classes are:
SegmentReader.CoreClosedListener, IndexReader.ReaderClosedListener
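
The callback pattern, as I understand it, can be sketched roughly like this.
It is a standalone model: ClosedListener, FakeSegment and SegmentCache are
hypothetical stand-ins I made up, not the Lucene classes or their exact
signatures:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Model of "drop a cache entry when its segment is closed".
// All types here are hypothetical stand-ins, not Lucene API.
final class CloseListenerSketch {

    // Stand-in for a close listener: called with the key of the closed segment.
    interface ClosedListener {
        void onClose(Object coreKey);
    }

    // Stand-in for a segment reader that notifies listeners on close.
    static final class FakeSegment {
        final String name;
        private final List<ClosedListener> listeners = new ArrayList<>();

        FakeSegment(String name) {
            this.name = name;
        }

        void addClosedListener(ClosedListener listener) {
            listeners.add(listener);
        }

        void close() {
            for (ClosedListener listener : listeners) {
                listener.onClose(name);
            }
        }
    }

    // Cache with one entry per segment; each entry removes itself on close.
    static final class SegmentCache {
        final Map<Object, int[]> entries = new ConcurrentHashMap<>();

        void put(FakeSegment segment, int[] values) {
            entries.put(segment.name, values);
            segment.addClosedListener(key -> entries.remove(key));
        }
    }

    public static void main(String[] args) {
        SegmentCache cache = new SegmentCache();
        FakeSegment a = new FakeSegment("_a");
        FakeSegment b = new FakeSegment("_b");
        cache.put(a, new int[] {1, 2, 3});
        cache.put(b, new int[] {4, 5});
        a.close(); // only the entry for _a is purged
        System.out.println(cache.entries.keySet());
    }
}
```

The point is that only the closed segment's portion of the cache goes away;
entries for segments that are still open are left alone.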

But Lucene is using this mechanism only to purge the cache - so
effectively, every commit triggers a cache rebuild. This is the interesting
bit: lots of work could be spared if segment data were reused (but
admittedly, only sometimes - for data that was fully read into memory; for
anything else, such as terms, the cache reads only some values and
fetches the rest from the index - so Lucene must close the reader and
rebuild the cache on every commit; but that is not my case, as I intend to
copy values from an index, and store them in memory...)

The weird 'recycling' of docids I've observed can probably be explained
by the fact that the index reader contains segments and near realtime
readers (but I'm not sure about this)

To conclude: it is possible to build a cache that updates itself (with only
the changes committed since the last build) - this will affect how fast a
new searcher is ready to serve requests
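
A rough sketch of what I have in mind, as a standalone model.
IncrementalSegmentCache and the loader function are hypothetical names, the
values are plain ints, and it assumes the segment name alone identifies
unchanged data. Since readers with and without deletes compare as equal (see
above), deletes would have to be tracked separately:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Model of a cache that is rebuilt per segment instead of per index.
// Hypothetical class for illustration; not part of Lucene or Solr.
final class IncrementalSegmentCache {
    private Map<String, int[]> bySegment = new LinkedHashMap<>();

    // Names of the segments that actually had to be read (for the demo).
    final List<String> loaded = new ArrayList<>();

    // segmentNames: names of the segments in the newly opened reader.
    // loader: reads the values of one segment from the index (assumed given).
    void refresh(List<String> segmentNames, Function<String, int[]> loader) {
        Map<String, int[]> next = new LinkedHashMap<>();
        for (String name : segmentNames) {
            int[] values = bySegment.get(name);
            if (values == null) {
                // New segment since the last build: read it once.
                values = loader.apply(name);
                loaded.add(name);
            }
            next.put(name, values); // unchanged segments are reused as-is
        }
        // Segments that were merged away are not copied over, so they drop out.
        bySegment = next;
    }

    int get(String segmentName, int localDocId) {
        return bySegment.get(segmentName)[localDocId];
    }

    int size() {
        return bySegment.size();
    }

    public static void main(String[] args) {
        IncrementalSegmentCache cache = new IncrementalSegmentCache();
        // Fake loader: one value per segment, derived from the name length.
        Function<String, int[]> loader = name -> new int[] {name.length()};
        cache.refresh(Arrays.asList("_0", "_1"), loader);
        cache.refresh(Arrays.asList("_0", "_2"), loader); // _1 gone, _2 new
        System.out.println("loaded: " + cache.loaded + ", cached: " + cache.size());
    }
}
```

On the second refresh only _2 is read from the index; _0 is carried over,
which is where the warm-up time of a new searcher would be saved.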

HTH somebody else too :)

  roman



On Mon, Nov 25, 2013 at 7:54 PM, Roman Chyla <roman.ch...@gmail.com> wrote:

>
>
>
> On Mon, Nov 25, 2013 at 12:54 AM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
>> Roman,
>>
>> I don't fully understand your question. After segment is flushed it's
>> never
>> changed, hence segment-local docids are always the same. Due to merge
>> segment can gone, its' docs become new ones in another segment.  This is
>> true for 'global' (Solr-style) docnums, which can flip after merge is
>> happened in the middle of the segments' chain.
>> As well you are saying about segmented cache I can propose you to look at
>> CachingWrapperFilter and NoOpRegenerator as a pattern for such data
>> structures.
>>
>
> Thanks Mikhail, the CWF confirms that the idea of regenerating just part
> of the cache is doable. The CacheRegenerators, on the other hand, make no
> sense to me - and they are not given any 'signals', so they don't know if
> they are in the middle of some regeneration or not, and they should not
> keep a state (of previous index) - as they can be shared by threads that
> build the cache
>
> Best,
>
>   roman
>
>
>>
>>
>> On Sat, Nov 23, 2013 at 9:40 AM, Roman Chyla <roman.ch...@gmail.com>
>> wrote:
>>
>> > Hi,
>> > docids are 'ephemeral', but i'd still like to build a search cache with
>> > them (they allow for the fastest joins).
>> >
>> > i'm seeing docids keep changing with updates (especially, in the last
>> index
>> > segment) - as per
>> > https://issues.apache.org/jira/browse/LUCENE-2897
>> >
>> > That would be fine, because i could build the cache from diff (of index
>> > state) + reading the latest index segment in its entirety. But can I
>> assume
>> > that docids in other segments (other than the last one) will be
>> relatively
>> > stable? (ie. when an old doc is deleted, the docid is marked as removed;
>> > update doc = delete old & create a new docid)?
>> >
>> > thanks
>> >
>> > roman
>> >
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Principal Engineer,
>> Grid Dynamics
>>
>> <http://www.griddynamics.com>
>>  <mkhlud...@griddynamics.com>
>>
>
>
