Re: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires

Jack Krupansky Wed, 24 Apr 2013 08:17:28 -0700

The WDF "types" will treat a character the same regardless of where it appears.

For something conditional, like a dot between letters vs. a dot not preceded and followed by a letter, you either have to have a custom tokenizer or a character filter.

Interesting that although the standard tokenizer messes up embedded hyphens, it does handle the embedded dot vs. trailing dot case as you wish (but messes up "U.S.A." by stripping the trailing dot) - but that doesn't help your case.


A character filter like the following might help your case:

<fieldType name="text_ws_dot" class="solr.TextField" positionIncrementGap="100">

 <analyzer>

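   <!-- Note: "&" must be written as "&amp;" inside an XML attribute value. -->
   <!-- 1. Delimiters that follow a word character but are not followed by one (trailing). -->
   <charFilter class="solr.PatternReplaceCharFilterFactory"
               pattern="([\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2" />
   <!-- 2. Delimiters with no word character on either side (standalone). -->
   <charFilter class="solr.PatternReplaceCharFilterFactory"
               pattern="(^|[^\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2" />
   <!-- 3. Delimiters that precede a word character but do not follow one (leading). -->
   <charFilter class="solr.PatternReplaceCharFilterFactory"
               pattern="(^|[^\w\d])[\._&amp;]+([\w\d])" replacement="$1 $2" />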

   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 </analyzer>
</fieldType>

I'm not a regular expression expert, so I'm not sure whether/how those patterns could be combined.
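The three patterns above can be tried outside Solr with plain java.util.regex, which is the regex engine the pattern-replace char filter is built on. The sketch below is only an approximation of the analysis chain: the class and method names are made up for illustration, whitespace splitting stands in for WhitespaceTokenizerFactory, and the single lookaround-based COMBINED pattern is an assumption that is not part of the original message and has only been checked against the sample strings shown here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

public class BoundaryDelimiterDemo {
    // The three patterns from the charFilter definitions, applied in the same order.
    private static final Pattern TRAILING =
        Pattern.compile("([\\w\\d])[\\._&]+($|[^\\w\\d])");
    private static final Pattern STANDALONE =
        Pattern.compile("(^|[^\\w\\d])[\\._&]+($|[^\\w\\d])");
    private static final Pattern LEADING =
        Pattern.compile("(^|[^\\w\\d])[\\._&]+([\\w\\d])");

    // Hypothetical single-pattern variant (an assumption, not from the message):
    // a run of delimiters not preceded by a word character, or not followed by one.
    private static final Pattern COMBINED =
        Pattern.compile("(?<![\\w\\d])[._&]+|[._&]+(?![\\w\\d])");

    // Mimics the three char filters running one after another.
    static String applyThree(String text) {
        text = TRAILING.matcher(text).replaceAll("$1 $2");
        text = STANDALONE.matcher(text).replaceAll("$1 $2");
        return LEADING.matcher(text).replaceAll("$1 $2");
    }

    // Replaces each boundary delimiter run with a single space.
    static String applyCombined(String text) {
        return COMBINED.matcher(text).replaceAll(" ");
    }

    // Stand-in for the whitespace tokenizer: split on runs of whitespace.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String[] samples = {"test.com", "newyear.", "new_car", ".net", "U.S.A."};
        for (String s : samples) {
            System.out.println(s
                + " -> three filters: " + whitespaceTokens(applyThree(s))
                + ", combined: " + whitespaceTokens(applyCombined(s)));
        }
    }
}
```

Note that \w already covers digits and the underscore, so an underscore between letters is never treated as a boundary by these patterns. Anything beyond these samples should be checked in the Solr analysis screen before relying on it.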

Also, that doesn't allow the case of a single ".", "&", or "_" as a word - but you didn't specify how that case should be handled.




-- Jack Krupansky

-----Original Message-----
From: meghana

Sent: Wednesday, April 24, 2013 6:49 AM
To: solr-user@lucene.apache.org

Subject: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires


I have configured WordDelimiterFilterFactory with custom character types for
'&' and '-', and for a few characters (like . _ :) we need to split on
boundaries only.

e.g.
test.com (should be tokenized to test.com)
newyear. (should be tokenized to newyear)
new_car (should be tokenized to new_car)
..
..

Below is the definition of the text field:

<fieldType name="text_general_preserved" class="solr.TextField"
positionIncrementGap="100">
     <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="false" />
        <filter class="solr.WordDelimiterFilterFactory"
               splitOnCaseChange ="0"
               splitOnNumerics ="0"
               stemEnglishPossessive ="0"
               generateWordParts="1"
               generateNumberParts="1"
               catenateWords="0"
               catenateNumbers="0"
               catenateAll="0"
               preserveOriginal="0"
               protected="protwords_general.txt"
               types="wdfftypes_general.txt"
               />

       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="false" />
       <filter class="solr.WordDelimiterFilterFactory"
               splitOnCaseChange ="0"
               splitOnNumerics ="0"
               stemEnglishPossessive ="0"
               generateWordParts="1"
               generateNumberParts="1"
               catenateWords="0"
               catenateNumbers="0"
               catenateAll="0"
               preserveOriginal="0"
               protected="protwords_general.txt"
               types="wdfftypes_general.txt"
               />
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

Below is the content of wdfftypes_general.txt:

& => ALPHA
- => ALPHA
_ => SUBWORD_DELIM
: => SUBWORD_DELIM
. => SUBWORD_DELIM

The types that can be used with the word delimiter filter are LOWER, UPPER,
ALPHA, DIGIT, ALPHANUM, and SUBWORD_DELIM. There is no description available
for the use of each type. Going by the name, I thought SUBWORD_DELIM might
fulfill my need, but it doesn't seem to work.

Can anybody suggest how I can configure the word delimiter filter factory
to fulfill my requirement?

Thanks.



--

View this message in context: http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557.html
Sent from the Solr - User mailing list archive at Nabble.com.
