While trying to understand the various options for WordDelimiterFilterFactory, I 
tried setting all of them to 0.
This seems to prevent a number of words from being output at all. In particular, 
"can't" and "99dxl" are not output, nor are any words containing hyphens. Is 
this the correct behavior?


Here is what the Solr Analyzer outputs:

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position   1       2         3       4           5          6         7       8       9
term text       ca-55   99_3_a9   55-67   powerShot   ca999x15   foo-bar   can't   joe's   99dxl

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=0, generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, catenateNumbers=0}

term position   1       5
term text       powerShot       joe
term type       word    word
source start,end        20,29   53,56

Here is the schema:

<fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="0"
                generateWordParts="0"
                generateNumberParts="0"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldtype>
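
For comparison, here is a sketch of a variant that could be run through the 
same Analyzer page. It assumes the installed WordDelimiterFilterFactory 
supports the preserveOriginal option, which is not part of the schema above, 
and the field type name is hypothetical. If the option is available, the 
original token should be emitted alongside whatever parts the filter generates.

<!-- Hypothetical variant: identical to the schema above except for
     preserveOriginal="1", which is assumed to be supported by the
     Solr version in use. -->
<fieldtype name="mbooksOcrPreserveOriginal" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="0"
                generateWordParts="0"
                generateNumberParts="0"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                preserveOriginal="1"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldtype>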

Tom
