Our current production index size is 1.5 TB with 3 shards. Currently we have the following field type:
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>

This field type works well for our US and other English-language clients.

We now have some new Chinese and Japanese clients, so I googled for the best approach to multilingual indexing and found these:

http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/
https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search

There seem to be pros and cons to every approach. I then experimented with a single-field approach, and here is my new field type:

<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>

I have kept the same tokenizer and only changed the filters. It works well for all existing searches and use cases on English documents, as well as for the new use cases on Chinese and Japanese documents.

I have the following questions for the Solr experts and developers:

1) Is this the correct approach, or am I missing something?
2) Can you give me an example where this new field type would cause problems? A use case or scenario with an example would be very helpful.
3) Could this cause problems in the future as different clients come on board?

Please provide some guidance.
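
To make the index-side chain easier to discuss, here is a rough Python sketch of the tokens it should emit for a single field value. It is only an approximation under stated assumptions: CustomNGramFilterFactory is a custom class, so the sketch assumes it behaves like the stock n-gram filter (every substring of length minGramSize to maxGramSize, with preserveOriginal keeping the whole token only when it falls outside that length range). It does not emulate CJKWidthFilterFactory or CJKBigramFilterFactory, and the function name index_tokens is made up for illustration.

```python
def index_tokens(text, min_gram=3, max_gram=30, preserve_original=True):
    """Approximate KeywordTokenizer -> LowerCase -> n-gram filter for one value.

    Assumption: the custom n-gram filter mirrors the stock n-gram filter.
    """
    # KeywordTokenizer emits the entire field value as a single token.
    token = text.lower()

    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break
            grams.append(token[start:start + size])

    # preserveOriginal: keep the full token when it is shorter than
    # min_gram or longer than max_gram (otherwise it is already a gram).
    if preserve_original and not (min_gram <= len(token) <= max_gram):
        grams.append(token)
    return grams


print(index_tokens("Solr"))   # ['sol', 'solr', 'olr']
print(index_tokens("東京"))    # ['東京']
```

Under these assumptions, a two-character value is shorter than minGramSize, so the only token emitted for it is the preserved original.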