You're welcome!

Mike

On Fri, Sep 17, 2010 at 10:44 AM, Giovanni Fernandez-Kincade
<gfernandez-kinc...@capitaliq.com> wrote:
> Interesting. Thanks for your help Mike!
>
> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Friday, September 17, 2010 10:29 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Understanding Lucene's File Format
>
> Yes.
>
> They are decoded from the deltas in the tii file into absolutes in memory, on
> load.
>
> Note that trunk (w/ flex indexing) has changed this substantially: we store
> only the offset into the terms dict file, as an absolute in a packed int
> array (no object per indexed term). Then, at the seek points in the terms
> index we store absolute frq/prx pointers, so that on seek we can rebase the
> decoding.
>
> Mike
>
> On Fri, Sep 17, 2010 at 10:02 AM, Giovanni Fernandez-Kincade
> <gfernandez-kinc...@capitaliq.com> wrote:
>>> The terms index (once loaded into RAM) has absolute longs, too.
>>
>> So in the TermInfo Index(.tii), the FreqDelta, ProxDelta, And SkipDelta
>> stored with each TermInfo are actually absolute?
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> Sent: Friday, September 17, 2010 5:24 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Understanding Lucene's File Format
>>
>> The entry for each term in the terms dict stores a long file offset pointer,
>> into the .frq file, and another long for the .prx file.
>>
>> But, these longs are delta-coded, so as you scan you have to sum up these
>> deltas to get the absolute file pointers.
>>
>> The terms index (once loaded into RAM) has absolute longs, too.
>>
>> So when looking up a term, we first bin search to the nearest indexed term
>> less than what you seek, then seek to that spot in the terms dict, then
>> scan, summing the deltas.
>>
>> Mike
>>
>> On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade
>> <gfernandez-kinc...@capitaliq.com> wrote:
>>> Hi,
>>> I've been trying to understand Lucene's file format and I keep getting hung
>>> up on one detail - how can Lucene quickly find the frequency data (or
>>> proximity data) for a particular term? According to the file formats page
>>> on the Lucene
>>> website<http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary>,
>>> the FreqDelta field in the Term Info file (.tis) is relative to the
>>> previous term. How is this helpful? The few references I've found on the
>>> web for this subject make it sound like the Term Dictionary has direct
>>> pointers to the frequency data for a given term, but that isn't consistent
>>> with the aforementioned reference.
>>>
>>> Thanks for your help,
>>> Gio.
>>>
>>
>
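
For anyone reading this thread later, the lookup Mike describes (binary search the in-memory terms index, seek into the terms dict, then scan while summing deltas) can be sketched roughly as below. This is only a toy model in Python, not Lucene's code: the names `build_terms_dict`, `lookup`, and `INDEX_INTERVAL` are made up for illustration, the terms dict is a plain list instead of an on-disk .tis file, and details such as prefix-coded terms, VInt encoding, and SkipDelta are left out.

```python
from bisect import bisect_right

# Hypothetical interval: every INDEX_INTERVAL-th term is copied into the index.
INDEX_INTERVAL = 3


def build_terms_dict(terms):
    """Build a toy terms dict and terms index.

    `terms` is a sorted list of (term, freq_pointer, prox_pointer) with
    absolute pointers. The dict stores deltas relative to the previous
    term; the index stores absolutes for the sampled terms.
    """
    entries = []  # (term, freq_delta, prox_delta), stands in for .tis
    index = []    # (term, position, freq_abs, prox_abs), stands in for .tii in RAM
    prev_freq = prev_prox = 0
    for position, (term, freq, prox) in enumerate(terms):
        entries.append((term, freq - prev_freq, prox - prev_prox))
        if position % INDEX_INTERVAL == 0:
            index.append((term, position, freq, prox))
        prev_freq, prev_prox = freq, prox
    return entries, index


def lookup(entries, index, target):
    """Return (freq_pointer, prox_pointer) for `target`, or None if absent."""
    # 1. Binary search for the last indexed term that is <= target.
    indexed_terms = [entry[0] for entry in index]
    slot = bisect_right(indexed_terms, target) - 1
    if slot < 0:
        return None  # target sorts before the first term
    _, start, freq, prox = index[slot]

    # 2. "Seek" to that spot and scan forward, summing the deltas of
    #    every entry after the indexed one to rebuild absolute pointers.
    for position in range(start, len(entries)):
        term, freq_delta, prox_delta = entries[position]
        if position > start:
            freq += freq_delta
            prox += prox_delta
        if term == target:
            return freq, prox
        if term > target:
            break  # passed the place where target would be
    return None
```

The point of the sketch is that the scan never covers more than `INDEX_INTERVAL` entries, so the deltas in the dict cost only a short sequential read once the index has supplied an absolute starting point.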