Re: Preceding special characters in ClassicTokenizerFactory

Ahmet Arslan Mon, 03 Oct 2016 13:24:45 -0700

Hi Andy,

WordDelimiterFilter has a "types" option. There is an example file named 
wdftypes.txt in the source tree that preserves #hashtags and @mentions. If you 
follow this path, please use the whitespace tokenizer, so that the # and @ 
characters are still attached to the tokens when the filter sees them.
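
A minimal sketch of that setup, under stated assumptions: the field type name 
"text_social" is hypothetical, and the two mapping lines shown in the comment 
are an illustration of what a types file could contain, not a copy of the 
wdftypes.txt that ships in the source tree.

```xml
<!-- Hypothetical wdftypes.txt contents (an assumption, not the shipped file):
       \u0023 => ALPHA
       @ => ALPHA
     The # is written as \u0023 because lines starting with # are treated
     as comments when the file is loaded. -->
<fieldType name="text_social" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so leading # and @ survive tokenization -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Treat # and @ as letters instead of delimiters via the types file -->
    <filter class="solr.WordDelimiterFilterFactory"
            types="wdftypes.txt"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnCaseChange="0"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, terms such as #solr and @andy should come through as 
single tokens while other delimiters (hyphens, slashes) are still split on. It 
is worth checking the result on the Analysis screen before reindexing.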


Ahmet



On Monday, October 3, 2016 9:52 PM, "Whelan, Andy" <awhe...@srcinc.com> wrote:
Hello,
I am guessing that what I am looking for is probably going to require extending 
StandardTokenizerFactory or ClassicTokenizerFactory. But I thought I would ask 
the group here before attempting this. We are indexing documents from an 
eclectic set of sources. There is, however, a heavy interest in computing and 
social media sources, so we would like computer terminology and social media 
terms (those beginning with #, @, etc.) to be searchable.

We are considering the ClassicTokenizerFactory because we like the fact that it 
does not use the Unicode standard annex 
UAX#29<http://unicode.org/reports/tr29/#Word_Boundaries> word boundary rules. 
It preserves email addresses, internet domain names, etc. We would also like 
to use it as the tokenizer element of index and query analyzers that would 
preserve @<rest of token> or #<rest of token> patterns.

I have seen examples where folks replace the StandardTokenizerFactory in 
their analyzer with chains made up of charFilters, 
WhitespaceTokenizerFactory, etc. to remedy such problems, as in the following 
article: http://www.prowave.io/indexing-special-terms-using-solr/

Example:
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\.\s)" replacement=" " />
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\.$)" replacement="" />
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(,)" replacement=" " />
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(;)" replacement=" " />
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\|)" replacement=" " />
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\/)" replacement=" " />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="punctuation-whitelist.txt" ignoreCase="true"
            expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>


I am just wondering if anyone knows of a smart way (without extending classes) 
to preserve most of the ClassicTokenizerFactory functionality without 
stripping leading special characters. The "Solr in Action" book (page 179) 
claims that it is hard to extend the StandardTokenizerFactory. I'm assuming 
the same is true for ClassicTokenizerFactory.

Thanks
-Andrew
