Could you tell us a little more about the nature of that multilingual field? I can see the KeywordTokenizer, and then a lot of grams calculated from that single token. What is the field used for?
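To make the "lot of grams" point concrete, here is a rough Python sketch of what the index-time chain in text_ngram would emit for one field value. It assumes CustomNGramFilterFactory behaves like the stock NGramFilterFactory with preserveOriginal support; that class is your own, so this is a guess. The function name keyword_ngrams is made up for illustration.

```python
def keyword_ngrams(text, min_gram=3, max_gram=30, preserve_original=True):
    """Approximate KeywordTokenizer -> LowerCaseFilter -> n-gram filter.

    Assumption: the custom n-gram filter emits every substring whose
    length lies in [min_gram, max_gram], plus the original token when
    preserve_original is set and the token is not already among the grams.
    """
    # KeywordTokenizer keeps the whole field value as a single token.
    token = text.lower()
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    if preserve_original and token not in grams:
        grams.append(token)
    return grams


grams = keyword_ngrams("Solr Search")
print(len(grams))  # 45 terms for an 11-character value
```

For a value of n characters (with n no larger than maxGramSize) that is (n-2)(n-1)/2 terms, so the term count grows quadratically with the length of the value, which is why the purpose of the field matters.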
2015-05-07 19:23 GMT+01:00 Kuntal Ganguly <gangulykuntal1...@gmail.com>:

> Our current production index size is 1.5 TB with 3 shards. Currently we
> have the following field type:
>
> <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="query">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="index">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.CustomNGramFilterFactory" minGramSize="3"
>             maxGramSize="30" preserveOriginal="true"/>
>   </analyzer>
> </fieldType>
>
> The above field type is working well for our US and English-language
> clients.
>
> Now we have some new Chinese and Japanese clients, so I googled for the
> best approach to a multilingual index:
>
> http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/
>
> https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search
>
> There seem to be pros and cons associated with every approach.
>
> Then I experimented with a single-field approach, and here is my new
> field type:
>
> <fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="query">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.CJKBigramFilterFactory"/>
>   </analyzer>
>   <analyzer type="index">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.CJKBigramFilterFactory"/>
>     <filter class="solr.CustomNGramFilterFactory" minGramSize="3"
>             maxGramSize="30" preserveOriginal="true"/>
>   </analyzer>
> </fieldType>
>
> I have kept the same tokenizer and only changed the filters. It is working
> well with all existing searches and use cases for English documents, as
> well as the new use case for Chinese/Japanese documents.
>
> Now I have the following questions for the Solr experts/developers:
>
> 1) Is this a correct approach, or am I missing something?
>
> 2) Can you give me an example where there will be a problem with the
> new field type above? A use case or scenario with an example would be
> very helpful.
>
> 3) Will there be any problems in the future as different clients come on
> board?
>
> Please provide some guidance.
>

--
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England