Re: Question on Tokenizing email address

Jan Høydahl / Cominvent Thu, 11 Feb 2010 05:10:03 -0800

My point is that I WANT the AT and DOT tokens to be indexed, so that these two
are not treated the same: foo-...@brown.fox and foo-bar.brown.fox
By placing the LowerCaseFilterFactory before the replacements, you ensure that
a search for email:at will not match: the query term is lower-cased to "at",
while the indexed replacement token stays upper-case "AT". For this reason I
would not add the special tokens to the stopword list either, since you DO want
them in the index.
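
For illustration, the ordering described above might look roughly like this.
It is only a sketch adapted from the field type quoted below: the
LowerCaseFilterFactory is moved ahead of the two PatternReplaceFilterFactory
steps and the StopFilterFactory is dropped. It is untested, and how the
tokenizer splits an address depends on the Solr version, so check the result
on the admin analysis page before relying on it.

    <fieldType name="email" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- Lower-case first, so only the original text is affected -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Replacements run after lower-casing: DOT and AT stay upper-case -->
        <filter class="solr.PatternReplaceFilterFactory" pattern="\."
                replacement=" DOT " replace="all"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="@"
                replacement=" AT " replace="all"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="0"/>
        <!-- No StopFilterFactory: DOT and AT are meant to be indexed -->
      </analyzer>
    </fieldType>

With this ordering a query for email:at should be analyzed to "at" and so not
match the indexed "AT", while a full address goes through the same chain at
query time and still matches.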


--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 10. feb. 2010, at 08.34, abhishes wrote:

> 
> Thank you! It works very well.
> 
> I think that the field type you suggested will also index words like DOT, AT
> and com.
> 
> In order to prevent these words from getting indexed, I have changed the
> field type to
> 
> <fieldType name="email" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="\."
>             replacement=" DOT " replace="all"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="@"
>             replacement=" AT " replace="all"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>             catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>   </analyzer>
> </fieldType>
> 
> I have added the words dot, com to the stoplist file (at was already there).
> 
> Is this correct?
> 
> -- 
> View this message in context: 
> http://old.nabble.com/Question-on-Tokenizing-email-address-tp27518673p27527033.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
