Processing of prx file for phrase queries: Whole position list for term read?

Burton-West, Tom Tue, 18 Nov 2008 13:01:35 -0800

Hello,

We are working with a very large index built from large documents
(books of 300+ pages). The bottleneck on our system appears to be the
disk I/O involved in reading position information from the .prx file
for commonly occurring terms.


An example slow query is "the new economics".

To process the above phrase query, does the entire portion of the .prx
file for the term "the" need to be read into memory, or only the
position entries for "the" that belong to specific doc ids?

In reading the Lucene index file formats document
(http://lucene.apache.org/java/2_4_0/fileformats.html), it's not clear
which of the following holds:

- The .tis file stores a pointer into the .prx file for a term (and
  therefore the entire list of doc ids and positions for that term
  needs to be read into memory).
- The .tis file stores a pointer to the term **and doc id** in the .prx
  file, in which case only the positions for a given doc id would need
  to be read.
- The .frq file somehow has information on where to find the doc id in
  the .prx file.


The documentation for the .tis file says that it stores ProxDelta, which
is based on the term (rather than on the term/doc id pair). On the other
hand, the documentation for the .prx file states that Positions entries
are "ordered by increasing document number (the document number is
implicit from the .frq file)".
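
To make that reading concrete, here is a minimal Python sketch of a toy
model. It is not Lucene's actual code or on-disk encoding, and it
deliberately ignores any skip data. The names `frq`, `prx`, and
`positions_for_doc` are hypothetical. It assumes the term dictionary only
gives the start of a term's position data, so the positions for one doc
can only be found by stepping past the positions of every earlier doc,
using the frequencies from the .frq side.

```python
from typing import List, Tuple


def positions_for_doc(
    frq: List[Tuple[int, int]],  # toy .frq data: (doc_id, freq), ascending doc_id
    prx: List[int],              # toy .prx data: position deltas, doc after doc
    target_doc: int,
) -> Tuple[List[int], int]:
    """Return (positions, entries_read) for target_doc in the toy model.

    entries_read counts how many position entries had to be consumed
    from the start of the term's position data to answer the request.
    """
    offset = 0  # index into prx of the current doc's first position delta
    for doc_id, freq in frq:
        if doc_id == target_doc:
            positions = []
            pos = 0
            # Positions are stored as deltas from the previous position.
            for delta in prx[offset:offset + freq]:
                pos += delta
                positions.append(pos)
            return positions, offset + freq
        if doc_id > target_doc:
            break  # doc ids are ascending, so the target is not present
        offset += freq  # step over this doc's positions
    return [], offset


# Hypothetical data: the term occurs in docs 2, 5 and 9.
frq = [(2, 3), (5, 2), (9, 4)]
prx = [1, 4, 2, 7, 3, 0, 5, 5, 1]

print(positions_for_doc(frq, prx, 5))  # ([7, 10], 5)
print(positions_for_doc(frq, prx, 9))  # ([0, 5, 10, 11], 9)
```

In this model, fetching positions for the last doc consumes every entry
for the term, which is the behavior I am worried about for a term like
"the".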


Tom
