Bizarre Terms revisited

Darren Govoni Wed, 30 Jun 2010 12:48:28 -0700

Hi,
  After much study, I really think there is something "not quite right"
going on here. Here are my findings.


Using MLT (MoreLikeThis), I get terms that appear to be long
concatenations of words that are space-delimited in the original text.
I can't think of any reason for these sentence-like terms to exist (see
below).

All my data and config follow:

Here is the output from MLT:

<lst name="interestingTerms"> 
<float name="text_t:result">1.0</float> 
<float name="text_t:concepts">1.0</float> 
<float name="text_t:identified">1.0</float> 
<float name="text_t:row">1.0</float> 
<float name="text_t:based">1.0</float> 
<float name="text_t:000">1.0</float>  
<float name="text_t:ontreweb">1.0</float> 
<float name="text_t:in">1.0</float> 
<float name="text_t:and">1.0</float> 
<float name="text_t:2">1.0</float> 

<!-- These do not look like valid or useful terms to have in the index. -->
<!-- Why do these exist? -->
<float name="text_t:searchinonelanguagefindresultsinanother">1.0</float> 
<float name="text_t:ontrewebstartpage">1.0</float> 
<float name="text_t:unlimitedmutliwordandphrasematching">1.0</float> 
<float name="text_t:wordsandphrases">1.0</float> 
<float name="text_t:pluggablevocabulariesontologies">1.0</float> 
<float name="text_t:mappedconcepts">1.0</float> 
<float name="text_t:ontrewebproductfeatures">1.0</float> 
<float name="text_t:multilinguallexiconsfrenchenglishetc">1.0</float> 
<float name="text_t:multipleinheritanceofconcepts">1.0</float> 

<float name="text_t:4">1.0</float> 
<float name="text_t:string">1.0</float> 
<float name="text_t:english">1.0</float> 
<float name="text_t:mapped">1.0</float> 
<float name="text_t:multilingual">1.0</float> 
<float name="text_t:mutliword">1.0</float> 
</lst> 

My field:

   <field name="text_t" type="textgen" indexed="true" stored="true"
          multiValued="true" termVectors="true" termPositions="true"
          termOffsets="true"/>


Field type definition, taken from the default schema.xml:

    <fieldType name="textgen" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"
                splitOnCaseChange="0"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt"
                ignoreCase="true"
                expand="true"
                />
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                splitOnCaseChange="0"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Original text (partially snipped), as it appears in the stored index:

"Ontreweb Product Features 

     

Unlimited mutliword and phrase matching Multiple inheritance of concepts 
Pluggable vocabularies, ontologies Multilingual 
lexicons: french, english, etc. Search in one language, find results in another 
200,000+ words and phrases, 35,000 mapped 
concepts.

1. 2. 3. 4."
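
To make the question concrete, here is a rough Python sketch of how the index-time chain above could produce such terms. It is an approximation written for illustration, not Solr's code, and it rests on an unconfirmed assumption: that the "spaces" in the stored text are not ordinary spaces but something like a non-breaking space (U+00A0), which Java's Character.isWhitespace() does not treat as whitespace. The function names are made up for this sketch. The stop filter and catenateNumbers are left out to keep it short.

```python
import re


def whitespace_tokenize(text):
    # Rough stand-in for WhitespaceTokenizer: split on ordinary ASCII
    # whitespace only. U+00A0 (non-breaking space) is deliberately NOT a
    # separator here, mirroring Java's Character.isWhitespace().
    return [t for t in re.split(r"[ \t\n\r\f\v]+", text) if t]


def word_delimiter(token):
    # Rough stand-in for WordDelimiterFilter with generateWordParts=1,
    # generateNumberParts=1, catenateWords=1, followed by lowercasing.
    # Any non-alphanumeric character acts as a delimiter inside the token.
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    out = list(parts)
    run = []
    for part in parts + [None]:  # None flushes the final run
        if part is not None and part.isalpha():
            run.append(part)
        else:
            # catenateWords: emit each run of two or more adjacent word parts
            if len(run) > 1:
                out.append("".join(run))
            run = []
    return [t.lower() for t in out]


def index_terms(text):
    # Tokenize, then expand every token through the word-delimiter step.
    terms = []
    for token in whitespace_tokenize(text):
        terms.extend(word_delimiter(token))
    return terms


if __name__ == "__main__":
    # Ordinary spaces: three separate tokens, nothing to catenate.
    print(index_terms("Ontreweb Product Features"))
    # Non-breaking spaces (assumed): one token, so the parts get catenated.
    print(index_terms("Ontreweb\u00a0Product\u00a0Features"))
```

With ordinary spaces the sketch yields only "ontreweb", "product" and "features". With non-breaking spaces it additionally yields "ontrewebproductfeatures", which matches one of the odd terms in the MLT output. Whether that is what actually happens in my index is exactly what I would like to find out.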
