The WDF "types" will treat a character the same regardless of where it appears.

For something conditional, like a dot between letters vs. a dot not both preceded and followed by a letter, you need either a custom tokenizer or a character filter.

Interestingly, although the standard tokenizer messes up embedded hyphens, it does handle the embedded dot vs. trailing dot case as you wish (though it messes up "U.S.A." by stripping the trailing dot). That doesn't help your case, however.

A character filter like the following might help your case:

<fieldType name="text_ws_dot" class="solr.TextField" positionIncrementGap="100">
 <analyzer>
   <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2" />
   <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(^|[^\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2" />
   <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(^|[^\w\d])[\._&amp;]+([\w\d])" replacement="$1 $2" />
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 </analyzer>
</fieldType>

I'm not a regular expression expert, so I'm not sure whether/how those patterns could be combined.

Also, that doesn't allow the case of a single ".", "&", or "_" as a word - but you didn't specify how that case should be handled.
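
To see what the three patterns above do to the examples from the question, here is a minimal Java sketch that approximates the analyzer outside Solr. It assumes that PatternReplaceCharFilterFactory behaves like a plain java.util.regex replace-all applied in the listed order and that the whitespace tokenizer can be approximated by splitting on runs of whitespace. The class CharFilterSketch and its tokenize method are hypothetical names for illustration, not part of Solr.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for experimenting with the patterns; not a Solr class.
class CharFilterSketch {
    // The three charFilter patterns from the field type, with &amp; unescaped to &.
    private static final String[] PATTERNS = {
        "([\\w\\d])[\\._&]+($|[^\\w\\d])",      // delimiter run after a word character
        "(^|[^\\w\\d])[\\._&]+($|[^\\w\\d])",   // delimiter run with no word character on either side
        "(^|[^\\w\\d])[\\._&]+([\\w\\d])"       // delimiter run before a word character
    };
    private static final String REPLACEMENT = "$1 $2";

    // Apply each pattern in order, then split on whitespace
    // (an approximation of WhitespaceTokenizerFactory).
    static List<String> tokenize(String input) {
        String text = input;
        for (String pattern : PATTERNS) {
            text = text.replaceAll(pattern, REPLACEMENT);
        }
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"test.com", "newyear.", "new_car"}) {
            System.out.println(s + " -> " + tokenize(s));
        }
    }
}
```

Under those assumptions, "test.com" and "new_car" come through unchanged and "newyear." loses its trailing dot, while a lone "." yields no token at all. Note that in Java regular expressions \w already covers digits and the underscore, so [\w\d] matches the same characters as \w.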



-- Jack Krupansky
-----Original Message-----
From: meghana
Sent: Wednesday, April 24, 2013 6:49 AM
To: solr-user@lucene.apache.org
Subject: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires

I have configured WordDelimiterFilterFactory with custom tokenization for '&'
and '-', and for a few characters (like . _ :) we need to split on boundaries
only.

e.g.
test.com (should be tokenized to test.com)
newyear.  (should be tokenized to newyear)
new_car (should be tokenized to new_car)
..
..

Below is the definition of the text field:

<fieldType name="text_general_preserved" class="solr.TextField"
positionIncrementGap="100">
     <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="false" />
        <filter class="solr.WordDelimiterFilterFactory"
               splitOnCaseChange ="0"
               splitOnNumerics ="0"
               stemEnglishPossessive ="0"
               generateWordParts="1"
               generateNumberParts="1"
               catenateWords="0"
               catenateNumbers="0"
               catenateAll="0"
               preserveOriginal="0"
               protected="protwords_general.txt"
               types="wdfftypes_general.txt"
               />

       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="false" />
       <filter class="solr.WordDelimiterFilterFactory"
               splitOnCaseChange ="0"
               splitOnNumerics ="0"
               stemEnglishPossessive ="0"
               generateWordParts="1"
               generateNumberParts="1"
               catenateWords="0"
               catenateNumbers="0"
               catenateAll="0"
               preserveOriginal="0"
               protected="protwords_general.txt"
               types="wdfftypes_general.txt"
               />
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

Below is the content of wdfftypes_general.txt:

& => ALPHA
- => ALPHA
_ => SUBWORD_DELIM
: => SUBWORD_DELIM
. => SUBWORD_DELIM

The types that can be used in the word delimiter filter are LOWER, UPPER,
ALPHA, DIGIT, ALPHANUM, and SUBWORD_DELIM. There is no description available
of how each type is used. Going by the name, I thought SUBWORD_DELIM might
fulfill my need, but it doesn't seem to work.

Can anybody suggest how I can configure the word delimiter filter factory
to fulfill my requirement?

Thanks.



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557.html
Sent from the Solr - User mailing list archive at Nabble.com.
