Hi, after much study I really think there is something "not quite right" going on here. Here are my findings.
Using MLT, I get terms that appear to be long concatenations of words that are space-delimited in the original text. I can't think of any reason for these sentence-like terms to exist (see below). All my data and config follow.

Here is the output from MLT:

<lst name="interestingTerms">
  <float name="text_t:result">1.0</float>
  <float name="text_t:concepts">1.0</float>
  <float name="text_t:identified">1.0</float>
  <float name="text_t:row">1.0</float>
  <float name="text_t:based">1.0</float>
  <float name="text_t:000">1.0</float>
  <float name="text_t:ontreweb">1.0</float>
  <float name="text_t:in">1.0</float>
  <float name="text_t:and">1.0</float>
  <float name="text_t:2">1.0</float>
  <!-- These do not look like valid or useful terms to have in the index. -->
  <!-- Why do these exist? -->
  <float name="text_t:searchinonelanguagefindresultsinanother">1.0</float>
  <float name="text_t:ontrewebstartpage">1.0</float>
  <float name="text_t:unlimitedmutliwordandphrasematching">1.0</float>
  <float name="text_t:wordsandphrases">1.0</float>
  <float name="text_t:pluggablevocabulariesontologies">1.0</float>
  <float name="text_t:mappedconcepts">1.0</float>
  <float name="text_t:ontrewebproductfeatures">1.0</float>
  <float name="text_t:multilinguallexiconsfrenchenglishetc">1.0</float>
  <float name="text_t:multipleinheritanceofconcepts">1.0</float>
  <float name="text_t:4">1.0</float>
  <float name="text_t:string">1.0</float>
  <float name="text_t:english">1.0</float>
  <float name="text_t:mapped">1.0</float>
  <float name="text_t:multilingual">1.0</float>
  <float name="text_t:mutliword">1.0</float>
</lst>

My field:

<field name="text_t" type="textgen" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

Field definition, taken from the default schema.xml:

<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Original text (partially snipped) as it appears in the stored index:

"Ontreweb Product Features Unlimited mutliword and phrase matching Multiple inheritance of concepts Pluggable vocabularies, ontologies Multilingual lexicons: french, english, etc. Search in one language, find results in another 200,000+ words and phrases, 35,000 mapped concepts. 1. 2. 3. 4."
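
One thing that might be worth checking (this is a guess, nothing above confirms it) is whether some of the "spaces" in the source text are really non-breaking spaces (U+00A0) or similar characters. As far as I understand, WhitespaceTokenizer splits on Java's Character.isWhitespace, which does not count non-breaking spaces as whitespace, so a phrase joined by them would reach WordDelimiterFilterFactory as a single token, and catenateWords="1" in the index analyzer would then emit the word parts glued together. The Python sketch below only imitates that behavior roughly. The function names and the sample string are hypothetical, and this is not Solr's actual implementation.

```python
import re
import unicodedata

# Assumption: Java's Character.isWhitespace treats these non-breaking
# spaces as NON-whitespace, so a whitespace tokenizer will not split on them.
NON_BREAKING = {"\u00a0", "\u2007", "\u202f"}


def java_like_is_whitespace(ch):
    """Approximate Character.isWhitespace for a single character."""
    if ch in NON_BREAKING:
        return False
    if ch in "\t\n\u000b\f\r\u001c\u001d\u001e\u001f":
        return True
    return unicodedata.category(ch) in ("Zs", "Zl", "Zp")


def whitespace_tokenize(text):
    """Split text on characters that java_like_is_whitespace accepts."""
    tokens, current = [], []
    for ch in text:
        if java_like_is_whitespace(ch):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def catenate_words(token):
    """Rough stand-in for catenateWords="1" followed by lowercasing.

    Splits the token on non-alphanumeric characters and joins each run of
    two or more adjacent alphabetic parts into one lowercased term.
    """
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    joined, run = [], []
    for part in parts:
        if part.isalpha():
            run.append(part)
        else:
            if len(run) > 1:
                joined.append("".join(run).lower())
            run = []
    if len(run) > 1:
        joined.append("".join(run).lower())
    return joined


def odd_spaces(text):
    """List (index, code point) for separator characters other than U+0020."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if ch != " " and unicodedata.category(ch).startswith("Z")
    ]


# Hypothetical sample: the first three words are joined by U+00A0.
sample = "Ontreweb\u00a0Product\u00a0Features Unlimited mutliword"
for tok in whitespace_tokenize(sample):
    print(repr(tok), catenate_words(tok))
print(odd_spaces(sample))
```

Running odd_spaces over the stored field value (fetched as raw text, not copied from a browser, which may normalize the characters) would show whether any such separators are present. If they are, replacing them with ordinary spaces before indexing, or setting catenateWords="0" in the index analyzer, would be the obvious things to try.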