Hello,

We are working with a very large index and with large documents (books of 300+ pages). It appears that the bottleneck on our system is the disk I/O involved in reading position information from the .prx file for commonly occurring terms.
An example of a slow query is the phrase query "the new economics". To process this phrase query for the word "the", does the entire part of the .prx file for the word "the" need to be read into memory, or only the fragments of the entries for the word "the" that belong to specific doc ids?

In reading the Lucene index file formats document (http://lucene.apache.org/java/2_4_0/fileformats.html), it's not clear whether the .tis file stores a pointer into the .prx file for a term (in which case the entire list of doc ids and positions for that term would need to be read into memory), or whether the .tis file stores a pointer to the term **and doc id** in the .prx file (in which case only the positions for a given doc id would need to be read). A third possibility is that the .frq file somehow has information on where to find a given doc id in the .prx file.

The documentation for the .tis file says that it stores ProxDelta, which is based on the term (rather than on the term/doc id pair). On the other hand, the documentation for the .prx file states that Positions entries are "ordered by increasing document number (the document number is implicit from the .frq file)".

Tom