Due to multiple languages and dirty OCR, our indexes have over 2 billion unique terms (http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
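
As a rough illustration of why the term index size matters at this scale, here is a minimal sketch of the arithmetic. It assumes the fixed-gap scheme of Lucene 3.x, where the tii file holds every termIndexInterval-th term (128 by default) and an index divisor of N loads only every N-th of those entries into memory. The function name is hypothetical, and the 2 billion figure is treated as a single index for simplicity.

```python
def in_memory_index_terms(unique_terms, term_index_interval=128, divisor=1):
    """Estimate how many term-index entries end up in memory.

    Assumes fixed-gap sampling: one tii entry per `term_index_interval`
    terms, of which every `divisor`-th entry is loaded by the reader.
    """
    # Ceiling division: entries written to the tii file.
    indexed = -(-unique_terms // term_index_interval)
    # Ceiling division: entries actually loaded into memory.
    return -(-indexed // divisor)


UNIQUE_TERMS = 2_000_000_000  # "over 2 billion unique terms"

# (interval, divisor) pairs: defaults, a divisor of 12, a larger interval.
for interval, divisor in [(128, 1), (128, 12), (1024, 1)]:
    n = in_memory_index_terms(UNIQUE_TERMS, interval, divisor)
    print(f"interval={interval:5d} divisor={divisor:3d} -> {n:,} entries in memory")
```

Under these assumptions the default interval leaves roughly 15.6 million entries in memory, and a divisor of 12 cuts that to roughly 1.3 million. This estimate only applies to fixed-gap term indexes, not to block-based postings formats.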

In Solr 3.6 and earlier we needed to reduce the memory used for the in-memory representation of the tii file. We originally used termInfosIndexDivisor, which affects how the tii file is sampled when it is read into memory. Later we used termIndexInterval. Please see
http://lucene.472066.n3.nabble.com/Solr-4-0-Beta-termIndexInterval-vs-termIndexDivisor-vs-termInfosIndexDivisor-tt4006182.html
for more background.

Neither of these works with the default postings format in Solr 4.x. However, the latest Solr 4.x example/solrconfig.xml file contains commented-out text implying that you can still use setTermIndexDivisor (appended below). That should probably be removed from the example if it does not work in Solr 4.x.

At the Lucene level there are parameters that affect the size of the in-memory representation of the index into the terms dictionary (the tip file):
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html

The Javadoc for IndexWriterConfig.setTermIndexInterval contains the following statement:

"This parameter does not apply to all PostingsFormat implementations, including the default one in this release. It only makes sense for term indexes that are implemented as a fixed gap between terms. For example, Lucene41PostingsFormat <http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html> implements the term index instead based upon how terms share prefixes. To configure its parameters (the minimum and maximum size for a block), you would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int, int) <http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Lucene41PostingsFormat%28int,%20int%29>. which can also be configured on a per-field basis"

http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29

This is followed by an example of how to set the min and max block size in Lucene.

Is the ability to set the min and max block size available in Solr? If not, should I open a JIRA?

Tom

----------
Excerpt from the latest Solr 4.3 revision of the example/solrconfig.xml file:
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/example/solr/collection1/conf/solrconfig.xml?revision=1470617&view=co

<!-- By explicitly declaring the Factory, the termIndexDivisor can
     be specified. -->
<!--
   <indexReaderFactory name="IndexReaderFactory"
                       class="solr.StandardIndexReaderFactory">
     <int name="setTermIndexDivisor">12</int>
   </indexReaderFactory >
   -->