Concatenate multiple tokens into one

Robert Gründler Wed, 10 Nov 2010 15:39:46 -0800

Hi,

I've created the following filter chain in a field type; the idea is to use it 
for autocompletion purposes:


<!-- create tokens separated by whitespace -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- lowercase everything -->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- throw out stopwords -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
        enablePositionIncrements="true"/>
<!-- throw out everything except a-z -->
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
        replacement="" replace="all"/>

<!-- actually, here I would like to join multiple tokens together again, to
     provide one token for the EdgeNGramFilterFactory -->

<!-- create edgeNGram tokens for autocomplete matches -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>

With this filter chain, the EdgeNGramFilterFactory receives multiple tokens 
whenever the input string contains whitespace. This leads to the following 
results:
Input Query: "George Cloo"
Matches:
- "George Harrison"
- "John Clooridge"
- "George Smith"
- "George Clooney"
- etc

However, only "George Clooney" should match in the autocompletion use case.
Therefore, I'd like to add a filter before the EdgeNGramFilterFactory that 
concatenates all the tokens generated by the WhitespaceTokenizerFactory.
Is there a filter that can do this?

If not, are there examples of how to implement a custom TokenFilter?
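
To make the desired behaviour concrete, here is a rough Python simulation of the chain, not Solr code. It rests on assumptions that are not part of the config above: the query side runs the same chain without the EdgeNGram step, a document matches if it shares any term with the query, and the stopword filter is left out. The function names and the `concatenate` flag are made up for illustration.

```python
import re

def tokens(text):
    """Split on whitespace, lowercase, and strip everything except a-z."""
    cleaned = (re.sub(r"[^a-z]", "", t) for t in text.lower().split())
    return [t for t in cleaned if t]

def edge_ngrams(token, min_size=1, max_size=25):
    """All prefixes of the token between min_size and max_size characters."""
    return {token[:n] for n in range(min_size, min(len(token), max_size) + 1)}

def index_terms(text, concatenate):
    """Index side: optionally join the tokens, then build edge n-grams."""
    toks = tokens(text)
    if concatenate:
        toks = ["".join(toks)]  # the hypothetical "concatenate" step
    terms = set()
    for t in toks:
        terms |= edge_ngrams(t)
    return terms

def query_terms(text, concatenate):
    """Query side (assumed): same chain, but without edge n-grams."""
    toks = tokens(text)
    return {"".join(toks)} if concatenate else set(toks)

def search(query, docs, concatenate):
    """A document matches if it shares at least one term with the query."""
    q = query_terms(query, concatenate)
    return [d for d in docs if q & index_terms(d, concatenate)]

docs = ["George Harrison", "John Clooridge", "George Smith", "George Clooney"]
print(search("George Cloo", docs, concatenate=False))  # all four names
print(search("George Cloo", docs, concatenate=True))   # ['George Clooney']
```

Without the concatenation step, "george" and "cloo" each match on their own; with it, the query becomes the single term "georgecloo", which is only a prefix of "georgeclooney".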

thanks!

-robert
