Re: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires

meghana Tue, 30 Apr 2013 07:28:23 -0700

Thanks, Jack Krupansky, it's very helpful :)

Jack Krupansky-2 wrote
> The WDF "types" will treat a character the same regardless of where it 
> appears.
> 
> For something conditional, like a dot between letters vs. a dot not preceded
> and followed by a letter, you either have to have a custom tokenizer or a
> character filter.
> 
> Interesting that although the standard tokenizer messes up embedded hyphens,
> it does handle the embedded dot vs. trailing dot case as you wish (but
> messes up "U.S.A." by stripping the trailing dot) - but that doesn't help
> your case.
> 
> A character filter like the following might help your case:
>
> <fieldType name="text_ws_dot" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <!-- delimiter run after a word character, before end of input or a non-word character -->
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="([\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2" />
>     <!-- delimiter run with no word character on either side -->
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="(^|[^\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2" />
>     <!-- delimiter run before a word character, after start of input or a non-word character -->
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="(^|[^\w\d])[\._&amp;]+([\w\d])" replacement="$1 $2" />
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
>
> I'm not a regular expression expert, so I'm not sure whether/how those 
> patterns could be combined.
> 
> Also, that doesn't allow the case of a single ".", "&", or "_" as a word - 
> but you didn't specify how that case should be handled.
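
The effect of the three patterns in the quoted configuration can be checked outside Solr with a small sketch. This is only an approximation under stated assumptions: the patterns are translated to Python's `re` module (Solr uses Java regular expressions), `re.ASCII` stands in for Java's default ASCII-only `\w` and `\d`, and `str.split()` stands in for the whitespace tokenizer. The `COMBINED` pattern is a hypothetical single-pattern candidate that is not from the thread; it agrees with the three patterns on the samples below but has not been shown to be equivalent in general.

```python
import re

# Python translations of the three charFilter patterns, applied in order.
# Assumption: re.ASCII approximates Java's default ASCII-only \w and \d.
THREE_PATTERNS = [
    r"([\w\d])[\._&]+($|[^\w\d])",     # delimiters after a word character
    r"(^|[^\w\d])[\._&]+($|[^\w\d])",  # delimiters with no word character on either side
    r"(^|[^\w\d])[\._&]+([\w\d])",     # delimiters before a word character
]

# Hypothetical combined candidate (an assumption, not from the thread):
# a delimiter run that is not preceded, or not followed, by a word character.
COMBINED = r"(?<!\w)[._&]+|[._&]+(?!\w)"


def tokens_three(text):
    """Apply the three replacements in sequence, then split on whitespace."""
    for pattern in THREE_PATTERNS:
        text = re.sub(pattern, r"\1 \2", text, flags=re.ASCII)
    return text.split()


def tokens_combined(text):
    """Apply the single candidate pattern, then split on whitespace."""
    return re.sub(COMBINED, " ", text, flags=re.ASCII).split()


SAMPLES = [
    "test.com",
    "newyear.",
    "new_car",
    ".",
    "U.S.A.",
    "visit test.com today.",
    "_draft notes&",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample), tokens_three(sample), tokens_combined(sample))
```

If the combined candidate were tried in a schema, the `<` and `&` characters would have to be written as `&lt;` and `&amp;` inside the XML attribute, and the replacement would be a single space rather than `$1 $2`.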
> 
> 
> 
> -- Jack Krupansky
> -----Original Message----- 
> From: meghana
> Sent: Wednesday, April 24, 2013 6:49 AM
> To: [email protected]
> Subject: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires
> 
> I have configured WordDelimiterFilterFactory with custom handling for '&'
> and '-', and for a few delimiters (such as . _ :) we need to split on
> boundaries only.
> 
> e.g.
> test.com (should be tokenized to test.com)
> newyear. (should be tokenized to newyear)
> new_car (should be tokenized to new_car)
> ..
> ..
> 
> Below is the definition of the text field:
> <fieldType name="text_general_preserved" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="false" />
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnCaseChange="0"
>             splitOnNumerics="0"
>             stemEnglishPossessive="0"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="0"
>             preserveOriginal="0"
>             protected="protwords_general.txt"
>             types="wdfftypes_general.txt" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="false" />
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnCaseChange="0"
>             splitOnNumerics="0"
>             stemEnglishPossessive="0"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="0"
>             preserveOriginal="0"
>             protected="protwords_general.txt"
>             types="wdfftypes_general.txt" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Below is the content of wdfftypes_general.txt:
> 
> & => ALPHA
> - => ALPHA
> _ => SUBWORD_DELIM
> : => SUBWORD_DELIM
> . => SUBWORD_DELIM
> 
> The types that can be used in the word delimiter are LOWER, UPPER, ALPHA,
> DIGIT, ALPHANUM, and SUBWORD_DELIM. There's no description available for the
> use of each type. Going by the name, I thought SUBWORD_DELIM might fulfill my
> need, but it doesn't seem to work.
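
A toy model may help show why a type mapping alone cannot express "split only on boundaries". The sketch below is not Solr's implementation: `split_by_type` is a hypothetical helper that gives each mapped character a single type and never looks at where the character sits in the token, which is the behavior described in the reply at the top of this message. Characters missing from the table are simply treated as part of the word here, which is a simplification.

```python
# Toy model only (an assumption for illustration, not Solr code).
# The table mirrors the wdfftypes_general.txt content quoted above.
TYPES = {
    "&": "ALPHA",
    "-": "ALPHA",
    "_": "SUBWORD_DELIM",
    ":": "SUBWORD_DELIM",
    ".": "SUBWORD_DELIM",
}


def split_by_type(token):
    """Split a token at every SUBWORD_DELIM character, wherever it appears."""
    parts = []
    current = ""
    for ch in token:
        if TYPES.get(ch) == "SUBWORD_DELIM":
            # The decision depends only on the character, not on its position.
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


if __name__ == "__main__":
    for token in ["test.com", "newyear.", "new_car", "AT&T"]:
        print(token, split_by_type(token))
```

In this model the trailing dot of `newyear.` is dropped as wanted, but `test.com` and `new_car` are split as well, because the table has no way to say "delimiter only at the edge of a token".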
> 
> Can anybody suggest how I can configure the word delimiter factory to
> fulfill my requirement?
> 
> Thanks.
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557.html
> Sent from the Solr - User mailing list archive at Nabble.com.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557p4060011.html
Sent from the Solr - User mailing list archive at Nabble.com.
