RE: Concatenate multiple tokens into one

Jonathan Rochkind Wed, 10 Nov 2010 16:15:30 -0800

Are you sure you really want to throw out stopwords for your use case?  I don't 
think autocompletion will work how you want if you do.


And if you don't... then why use the WhitespaceTokenizer and then try to jam 
the tokens back together? Why not just NOT tokenize in the first place? Use the 
KeywordTokenizer, which really should be called the NonTokenizingTokenizer, 
because it doesn't tokenize at all; it just creates one token from the entire 
input string. 

Then lowercase, remove whitespace (or not), do whatever else you want to do to 
your single token to normalize it, and then run it through the 
EdgeNGramFilter. 
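
As a rough sketch of that chain (not a tested config): the field type name 
"autocomplete_edge" is made up for illustration, and leaving the 
EdgeNGramFilter out of the query-side analyzer is my assumption, not something 
your original chain specifies. The idea is that the user's partial input then 
has to match one of the indexed prefixes as a whole. The pattern and gram 
sizes are copied from your chain.

```xml
<!-- Hypothetical field type name; adjust to your schema. -->
<fieldType name="autocomplete_edge" class="solr.TextField">
  <analyzer type="index">
    <!-- One token for the entire input string, no splitting -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drop everything except a-z, including whitespace -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
            replacement="" replace="all"/>
    <!-- Index every prefix of the single normalized token -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same normalization, but no n-grams (assumption, see above) -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
            replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

With this, "George Clooney" would be indexed as the prefixes "g", "ge", ... up 
to "georgeclooney", so a query normalized to "georgecloo" should only match 
values that start with exactly that. Note that with maxGramSize="25", 
normalized input longer than 25 characters would not match anything.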

If you include whitespace in the token, then when making your queries for 
auto-complete, be sure to use a query parser that doesn't do 
"pre-tokenization" (splitting the query on whitespace before the analyzer 
sees it); the 'field' query parser should work well for this. 

Jonathan



________________________________________
From: Robert Gründler [rob...@dubture.com]
Sent: Wednesday, November 10, 2010 6:39 PM
To: solr-user@lucene.apache.org
Subject: Concatenate multiple tokens into one

Hi,

I've created the following filter chain in a field type; the idea is to use it 
for autocompletion purposes:

<!-- create tokens separated by whitespace -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- lowercase everything -->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- throw out stopwords -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
        enablePositionIncrements="true"/>
<!-- throw out everything except a-z -->
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
        replacement="" replace="all"/>

<!-- actually, here I would like to join multiple tokens together again, to
     provide one token for the EdgeNGramFilterFactory -->

<!-- create edgeNGram tokens for autocomplete matches -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>

With that kind of filter chain, the EdgeNGramFilterFactory will receive 
multiple tokens for input strings that contain whitespace. This leads to the 
following results:
Input Query: "George Cloo"
Matches:
- "George Harrison"
- "John Clooridge"
- "George Smith"
- "George Clooney"
- etc

However, only "George Clooney" should match in the autocompletion use case.
Therefore, I'd like to add a filter before the EdgeNGramFilterFactory that 
concatenates all the tokens generated by the WhitespaceTokenizerFactory.
Are there filters which can do such a thing?

If not, are there examples of how to implement a custom TokenFilter?

thanks!

-robert
