The WordDelimiterFilter (WDF) has a "types" attribute that can specify one or more character type mapping files. You could create a file like:

@ => ALPHA
_ => ALPHA

For example (from the book!):

Example - Treat at-sign and underscores as text

 <fieldType name="text_at_under" class="solr.TextField"
            positionIncrementGap="100" autoGeneratePhraseQueries="true">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             types="at-under-alpha.txt"/>
   </analyzer>
 </fieldType>

The file at-under-alpha.txt would contain:

 @ => ALPHA
 _ => ALPHA

The analysis results:

   Source: Hello @World_bar, r@end.
   Tokens: 1: Hello 2: @World_bar 3: r@end
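
To see why the mapping changes the output, here is a rough Python sketch of the idea. It is a simplified model, not Lucene's actual implementation: it splits on whitespace, then breaks each chunk at any character that is neither alphanumeric nor mapped to ALPHA. It ignores the other WDF behaviors (case-change splitting, numeric splitting, catenation, and so on), and the function names parse_types and tokenize are made up for illustration.

```python
def parse_types(text):
    """Parse lines of the form 'char => TYPE' into a dict (hypothetical helper)."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        char, _, type_name = line.partition("=>")
        mapping[char.strip()] = type_name.strip()
    return mapping


def tokenize(text, types=None):
    """Simplified model: whitespace split, then split on non-word characters.

    Characters mapped to ALPHA in `types` are treated like letters, so they
    no longer act as delimiters. This is NOT the real WordDelimiterFilter.
    """
    types = types or {}
    alpha_extra = {c for c, t in types.items() if t == "ALPHA"}
    tokens = []
    for chunk in text.split():
        part = ""
        for ch in chunk:
            if ch.isalnum() or ch in alpha_extra:
                part += ch          # letter, digit, or character remapped to ALPHA
            else:
                if part:            # delimiter: close the current token
                    tokens.append(part)
                part = ""
        if part:
            tokens.append(part)
    return tokens


types = parse_types("@ => ALPHA\n_ => ALPHA")
print(tokenize("Hello @World_bar, r@end.", types))
print(tokenize("Hello @World_bar, r@end."))
```

With the mapping, the "@" and "_" stay inside the tokens and only the trailing comma and period are dropped. Without it, the same characters act as delimiters and @World_bar falls apart into separate pieces, which is the behavior described in the question below.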


-- Jack Krupansky

-----Original Message-----
From: Mingfeng Yang
Sent: Tuesday, June 18, 2013 6:58 PM
To: solr-user@lucene.apache.org
Subject: preserve special characters

We need to index and search lots of tweets, which can look like "@solr:  solr is
great" or "@solr_lucene, good combination".

And we want to search with "@solr" or "@solr_lucene".  How can we preserve
"@" and "_" in the index?

If using WhitespaceTokenizer followed by WordDelimiterFilter, @solr_lucene
will be broken down into "solr" and "lucene", which makes the search results
contain lots of non-relevant docs.

If using StandardTokenizer, the "@" symbol is stripped.

Thanks,
Ming-
