My point is that I WANT the AT and DOT tokens to be indexed, so that these two are not treated the same: foo-...@brown.fox and foo-bar.brown.fox. By placing the LowerCaseFilterFactory before the replacements, you ensure that a search for email:at will not match, because the query term is lower-cased and so differs from the indexed term "AT". For the same reason I would not add the special tokens to the stopword list either, since you DO want them in the index.
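
As a rough sketch of that ordering, the filters from the quoted field type below could be rearranged as follows. This is an illustration, not necessarily the exact field type suggested earlier in the thread. It assumes the tokenizer hands the whole address to the filters as a single token, which may depend on the Solr version.

```xml
<!-- Sketch only: the filters from the quoted field type, reordered. -->
<fieldType name="email" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lower-case first, so the markers inserted below keep their upper case. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\."
            replacement=" DOT " replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="@"
            replacement=" AT " replace="all"/>
    <!-- Splits on non-alphanumeric characters, including the inserted spaces;
         splitOnCaseChange="0" leaves AT and DOT whole. -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0"/>
    <!-- No StopFilterFactory here: AT and DOT are meant to stay in the index. -->
  </analyzer>
</fieldType>
```

With this order, a hypothetical address such as user@example.com would be expected to produce the terms user, AT, example, DOT, com, while a query for email:at is lower-cased to "at" and so should not match the upper-case marker.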
--
Jan Høydahl - search architect
Cominvent AS - www.cominvent.com

On 10. feb. 2010, at 08.34, abhishes wrote:

>
> Thank you! it works very well.
>
> I think that the field type suggested by you will index words like DOT, AT,
> com also
>
> In order to prevent these words from getting indexed, I have changed the
> field type to
>
> <fieldType name="email" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="\."
>             replacement=" DOT " replace="all" />
>     <filter class="solr.PatternReplaceFilterFactory" pattern="@"
>             replacement=" AT " replace="all" />
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>             catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true" />
>   </analyzer>
> </fieldType>
>
> I have added the words dot, com to the stoplist file (at was already there).
>
> Is this correct?
>
> --
> View this message in context:
> http://old.nabble.com/Question-on-Tokenizing-email-address-tp27518673p27527033.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>