Thanks Jack Krupansky, it's very helpful :)

Jack Krupansky-2 wrote
> The WDF "types" will treat a character the same regardless of where it
> appears.
>
> For something conditional, like dot between letters vs. dot not preceded
> and followed by a letter, you either have to have a custom tokenizer or a
> character filter.
>
> Interesting that although the standard tokenizer messes up embedded
> hyphens, it does handle the embedded dot vs. trailing dot case as you wish
> (but messes up "U.S.A." by stripping the trailing dot) - but that doesn't
> help your case.
>
> A character filter like the following might help your case:
>
> <fieldType name="text_ws_dot" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="([\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2" />
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="(^|[^\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2" />
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="(^|[^\w\d])[\._&amp;]+([\w\d])" replacement="$1 $2" />
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
>
> I'm not a regular expression expert, so I'm not sure whether/how those
> patterns could be combined.
>
> Also, that doesn't allow the case of a single ".", "&", or "_" as a word -
> but you didn't specify how that case should be handled.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: meghana
> Sent: Wednesday, April 24, 2013 6:49 AM
> To: solr-user@.apache
> Subject: Solr - WordDelimiterFactory with Custom Tokenizer to split only
> on Boundires
>
> I have configured WordDelimiterFilterFactory with custom types for '&'
> and '-', and for a few characters (like . _ :) we need to split on
> boundaries only.
>
> e.g.
> test.com (should be tokenized to test.com)
> newyear. (should be tokenized to newyear)
> new_car (should be tokenized to new_car)
> ..
> ..
>
> Below is the definition for the text field:
>
> <fieldType name="text_general_preserved" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="false" />
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnCaseChange="0"
>             splitOnNumerics="0"
>             stemEnglishPossessive="0"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="0"
>             preserveOriginal="0"
>             protected="protwords_general.txt"
>             types="wdfftypes_general.txt"
>     />
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="false" />
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnCaseChange="0"
>             splitOnNumerics="0"
>             stemEnglishPossessive="0"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="0"
>             preserveOriginal="0"
>             protected="protwords_general.txt"
>             types="wdfftypes_general.txt"
>     />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Below is the content of wdfftypes_general.txt:
>
> & => ALPHA
> - => ALPHA
> _ => SUBWORD_DELIM
> : => SUBWORD_DELIM
> . => SUBWORD_DELIM
>
> The types that can be used in the word delimiter are LOWER, UPPER, ALPHA,
> DIGIT, ALPHANUM, and SUBWORD_DELIM. There's no description available for
> the use of each type. Going by the name, I thought the SUBWORD_DELIM type
> might fulfill my need, but it doesn't seem to work.
>
> Can anybody suggest how I can configure the word delimiter factory to
> fulfill my requirement?
>
> Thanks.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557p4060011.html
Sent from the Solr - User mailing list archive at Nabble.com.
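
In case it helps anyone else reading this thread: the three patterns from the quoted charFilter chain can be sanity-checked outside Solr before re-indexing. The following is only a rough sketch in Python, not Solr code. The names `RULES` and `analyze` are made up for illustration, Python's `re` module stands in for the `java.util.regex` engine that Solr uses, and `str.split()` stands in for `WhitespaceTokenizerFactory`, so treat the results as an approximation of what the analyzer would produce.

```python
import re

# The three pattern/replacement pairs from the quoted charFilter chain,
# applied in the same order. Solr's replacement "$1 $2" is written as
# r"\1 \2" in Python. re.ASCII keeps \w and \d ASCII-only, which is the
# default behaviour of java.util.regex (an assumption about parity, not
# a guarantee for every input).
RULES = [
    (re.compile(r"([\w\d])[\._&]+($|[^\w\d])", re.ASCII), r"\1 \2"),
    (re.compile(r"(^|[^\w\d])[\._&]+($|[^\w\d])", re.ASCII), r"\1 \2"),
    (re.compile(r"(^|[^\w\d])[\._&]+([\w\d])", re.ASCII), r"\1 \2"),
]


def analyze(text):
    """Hypothetical helper: run the three replacements, then split on whitespace."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.split()


if __name__ == "__main__":
    for sample in ["test.com", "newyear.", "new_car",
                   "visit test.com now. & new_car."]:
        print(repr(sample), "->", analyze(sample))
```

With the examples from the original question, `test.com` and `new_car` come through unchanged and `newyear.` loses its trailing dot. A standalone `&` surrounded by spaces disappears entirely, which matches the caveat in the quoted reply about a single ".", "&", or "_" not surviving as a word.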