Hi,

We are trying to upgrade our index from 3.6.1 to 4.9.1, and I wanted to make sure our existing indexing strategy is still valid. The statistics of the raw corpus are:

- 4.8 billion tokens in the entire corpus
- 13MM documents

We have three requirements:

1) We want to index and search all tokens in a document (i.e., we do not rely on external stores).
2) We need search time to be fast and are willing to pay with larger indexing time and index size.
3) We want to search ngrams of 3 tokens or fewer (i.e., unigrams, bigrams, and trigrams) as fast as possible.

To satisfy (1), in our 3.6.1 index we used the default <maxFieldLength>2147483647</maxFieldLength> in solrconfig.xml to specify the total number of tokens to index in an article. In Solr 4 we are specifying it via the tokenizer in the analyzer chain:

<tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="2147483647"/>

To satisfy (2) and (3), in our 3.6.1 index we indexed using the following ShingleFilterFactory in the analyzer chain:

<filter class="solr.ShingleFilterFactory" outputUnigrams="true" maxShingleSize="3"/>

This was based on this thread: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200808.mbox/%3c856ac15f0808161539p54417df2ga5a6fdfa35889...@mail.gmail.com%3E

The open questions we are trying to understand now are:

1) Is shingling still the best strategy for phrase (ngram) search given our requirements above?
2) If not, what would be a better strategy?

Thank you in advance for your help,
Peyman
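For context, here is a sketch of the Solr 4 field type we are describing, combining the tokenizer and shingle filter settings quoted above (the fieldType and field names are placeholders, not our actual schema):

```xml
<!-- Hypothetical field type; "text_shingle" and "body" are placeholder names. -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ClassicTokenizerFactory with the large maxTokenLength discussed above -->
    <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="2147483647"/>
    <!-- Emit unigrams, bigrams, and trigrams at index time -->
    <filter class="solr.ShingleFilterFactory" outputUnigrams="true" maxShingleSize="3"/>
  </analyzer>
</fieldType>

<field name="body" type="text_shingle" indexed="true" stored="false"/>
```

With this chain, a phrase of up to three tokens matches a single indexed term, which is what makes shingled phrase search fast at the cost of index size.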