On 12/29/2014 12:07 PM, Mahmoud Almokadem wrote:
> What do you mean with "important parts of index"? and how to calculate
> their size?
I have no formal education in what's important when it comes to doing a query, but I can make some educated guesses. Starting with this as a reference:

http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/codecs/lucene410/package-summary.html#file-names

I would guess that the segment info (.si) files and the term index (*.tip) files are supremely important to *always* have in memory, and they are fairly small. Next would be the term dictionary (*.tim) files. The term dictionary is pretty big, and having it in memory is very important for fast queries.

Frequencies, positions, and norms may also be important, depending on exactly what kind of query you have. Frequencies and positions are quite large. Frequencies are critical for relevance ranking (the default sort by score), and positions are important for phrase queries. Position data may also be used by relevance ranking, but I am not familiar enough with it to say for sure.

If you have docValues defined, then the *.dvm and *.dvd files are used for facets and for sorting on those specific fields. The *.dvd files can be very big, depending on your schema.

The *.fdx and *.fdt files become important when actually retrieving results, after the matching documents have been determined. The stored data is compressed, so additional CPU power is required to uncompress it before it is sent to the client. Stored data may be large or small, depending on your schema. Stored data does not directly affect search speed, but if memory space is limited, every block of stored data that gets read will push some other part of the index out of the OS disk cache, which means that part might need to be re-read from disk on the next query.

Thanks,
Shawn
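As for calculating the size of each part: the segment files in the index directory use those extensions, so you can just group the files by extension and sum their sizes. A minimal sketch in Python (the example path in the comment is an assumption; point it at your own core's data/index directory):

```python
import os
from collections import defaultdict

def sizes_by_extension(index_dir):
    """Sum file sizes in a Lucene index directory, grouped by extension."""
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if os.path.isfile(path):
            # Lucene files look like _0.tim, _0.tip, _0.dvd, segments_2, ...
            ext = os.path.splitext(name)[1] or name  # extensionless files keep their name
            totals[ext] += os.path.getsize(path)
    return dict(totals)

# Hypothetical example path -- substitute your own index directory:
# sizes_by_extension("/var/solr/data/collection1/data/index")
```

Sort the result by size, descending, and you'll see roughly how much memory it would take to keep the .tip, .tim, .dvd, etc. portions cached.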