Re: Understanding Lucene's File Format

Michael McCandless Fri, 17 Sep 2010 08:08:21 -0700

You're welcome!

Mike


On Fri, Sep 17, 2010 at 10:44 AM, Giovanni Fernandez-Kincade
<gfernandez-kinc...@capitaliq.com> wrote:
> Interesting. Thanks for your help Mike!
>
> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Friday, September 17, 2010 10:29 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Understanding Lucene's File Format
>
> Yes.
>
> They are decoded from the deltas in the tii file into absolutes in memory, on 
> load.
>
> Note that trunk (w/ flex indexing) has changed this substantially: we store 
> only the offset into the terms dict file, as an absolute in a packed int 
> array (no object per indexed term).  Then, at the seek points in the terms 
> index we store absolute frq/prx pointers, so that on seek we can rebase the 
> decoding.
>
> Mike
>
> On Fri, Sep 17, 2010 at 10:02 AM, Giovanni Fernandez-Kincade 
> <gfernandez-kinc...@capitaliq.com> wrote:
>>> The terms index (once loaded into RAM) has absolute longs, too.
>>
>> So in the TermInfo Index (.tii), the FreqDelta, ProxDelta, and SkipDelta 
>> stored with each TermInfo are actually absolute?
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> Sent: Friday, September 17, 2010 5:24 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Understanding Lucene's File Format
>>
>> The entry for each term in the terms dict stores a long file offset pointer, 
>> into the .frq file, and another long for the .prx file.
>>
>> But, these longs are delta-coded, so as you scan you have to sum up these 
>> deltas to get the absolute file pointers.
>>
>> The terms index (once loaded into RAM) has absolute longs, too.
>>
>> So when looking up a term, we first binary search to the nearest indexed 
>> term less than what you seek, then seek to that spot in the terms dict, 
>> then scan, summing the deltas.
>>
>> Mike
>>
>> On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade 
>> <gfernandez-kinc...@capitaliq.com> wrote:
>>> Hi,
>>> I've been trying to understand Lucene's file format and I keep getting hung 
>>> up on one detail - how can Lucene quickly find the frequency data (or 
>>> proximity data) for a particular term? According to the file formats page 
>>> on the Lucene website
>>> <http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary>,
>>> the FreqDelta field in the Term Info file (.tis) is relative to the 
>>> previous term. How is this helpful? The few references I've found on the 
>>> web for this subject make it sound like the Term Dictionary has direct 
>>> pointers to the frequency data for a given term, but that isn't consistent 
>>> with the aforementioned reference.
>>>
>>> Thanks for your help,
>>> Gio.
>>>
>>
>
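
The lookup Mike describes for the pre-flex format (binary search the in-memory terms index, seek into the terms dict, then scan while summing deltas) can be sketched roughly as below. This is only an illustrative model, not Lucene's actual code: the names DictEntry, IndexEntry, build_index and lookup are hypothetical, the terms dict is modelled as a Python list rather than a binary .tis file, and the index interval is an arbitrary parameter.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class DictEntry:
    """One term in the (hypothetical) terms dict; pointers are delta-coded."""
    term: str
    freq_delta: int  # .frq offset relative to the previous term
    prox_delta: int  # .prx offset relative to the previous term


@dataclass
class IndexEntry:
    """One indexed term held in RAM; pointers are absolute."""
    term: str
    dict_pos: int      # position of this term in the terms dict
    freq_pointer: int  # absolute .frq offset
    prox_pointer: int  # absolute .prx offset


def build_index(entries, interval):
    """Decode deltas into absolutes, keeping every `interval`-th term."""
    index = []
    freq = prox = 0
    for pos, entry in enumerate(entries):
        freq += entry.freq_delta
        prox += entry.prox_delta
        if pos % interval == 0:
            index.append(IndexEntry(entry.term, pos, freq, prox))
    return index


def lookup(entries, index, term):
    """Return absolute (freq, prox) pointers for `term`, or None if absent."""
    # Binary search for the last indexed term that is <= the sought term.
    i = bisect_right([ie.term for ie in index], term) - 1
    if i < 0:
        return None  # sorts before the first term in the dict
    start = index[i]
    freq, prox = start.freq_pointer, start.prox_pointer
    # Scan forward from the indexed term, summing deltas as we go.
    for pos in range(start.dict_pos, len(entries)):
        entry = entries[pos]
        if pos > start.dict_pos:
            freq += entry.freq_delta
            prox += entry.prox_delta
        if entry.term == term:
            return freq, prox
        if entry.term > term:
            return None  # passed the slot where the term would be
    return None


entries = [
    DictEntry("apple", 10, 5),
    DictEntry("banana", 20, 7),
    DictEntry("cherry", 5, 3),
    DictEntry("date", 15, 2),
    DictEntry("elder", 8, 1),
]
index = build_index(entries, interval=2)
print(lookup(entries, index, "date"))  # pointers rebuilt starting from "cherry"
```

The point of the sketch is that the scan never has to start from the first term: the absolute pointers kept with each indexed term give a base value, so at most one interval's worth of deltas is summed per lookup.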
