Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundaries

meghana Wed, 24 Apr 2013 03:49:40 -0700

I have configured WordDelimiterFilterFactory with custom character types for
'&' and '-', and for a few characters (like . _ :) we need to split on
boundaries only, i.e. only when they appear at the start or end of a token.


e.g.
test.com (should be tokenized to test.com)
newyear. (should be tokenized to newyear)
new_car (should be tokenized to new_car)
..
..

Below is the definition of the text field:

<fieldType name="text_general_preserved" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="false" />
         <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange ="0"
                splitOnNumerics ="0"
                stemEnglishPossessive ="0"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                preserveOriginal="0"
                protected="protwords_general.txt"
                types="wdfftypes_general.txt"
                />
        
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="false" />
        <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange ="0"
                splitOnNumerics ="0"
                stemEnglishPossessive ="0"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                preserveOriginal="0"
                protected="protwords_general.txt"
                types="wdfftypes_general.txt"
                />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Below is the content of wdfftypes_general.txt:

& => ALPHA
- => ALPHA
_ => SUBWORD_DELIM
: => SUBWORD_DELIM
. => SUBWORD_DELIM

The types that can be used in WordDelimiterFilterFactory are LOWER, UPPER,
ALPHA, DIGIT, ALPHANUM and SUBWORD_DELIM. I could not find a description of
what each type does. Going by the name, I thought SUBWORD_DELIM might fulfill
my need, but it doesn't seem to work.
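
One direction that might be worth trying (an untested sketch, not part of the
configuration above): as far as I can tell, SUBWORD_DELIM is the type these
characters already have by default, so mapping them to it changes nothing and
the filter still splits on them. Mapping . _ : to ALPHA in
wdfftypes_general.txt (the same way & and - are mapped) should keep them
inside a token, and a separate solr.PatternReplaceFilterFactory could then
strip them from the token edges. The regular expression and the position of
the filter in the chain are assumptions and would need checking on the
analysis screen.

```xml
<!-- Hypothetical addition, placed directly after the WordDelimiterFilterFactory
     filter in both the index and the query analyzer.
     Assumes wdfftypes_general.txt now maps . _ : to ALPHA. -->
<!-- Removes any run of . _ : at the start or at the end of a token:
     "newyear." becomes "newyear", while "test.com" and "new_car" are unchanged. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="^[._:]+|[._:]+$"
        replacement=""
        replace="all"/>
```

A token made up only of these characters would end up empty after this step,
so a length filter may also be needed to drop it.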

Can anybody suggest how I can configure WordDelimiterFilterFactory to meet
this requirement?

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557.html
Sent from the Solr - User mailing list archive at Nabble.com.
