Are you sure you really want to throw out stopwords for your use case? I don't think autocompletion will work how you want if you do.
And if you don't... then why use the WhitespaceTokenizer and then try to jam the tokens back together? Why not just NOT tokenize in the first place? Use the KeywordTokenizer, which really should be called the NonTokenizingTokenizer, because it doesn't tokenize at all; it just creates one token from the entire input string. Then lowercase, remove whitespace (or not), do whatever else you want to do to your single token to normalize it, and then edgengram it.

If you include whitespace in the token, then when making your queries for auto-complete, be sure to use a query parser that doesn't do "pre-tokenization"; the 'field' query parser should work well for this.

Jonathan

________________________________________
From: Robert Gründler [rob...@dubture.com]
Sent: Wednesday, November 10, 2010 6:39 PM
To: solr-user@lucene.apache.org
Subject: Concatenate multiple tokens into one

Hi,

i've created the following filterchain in a field type, the idea is to use it for autocompletion purposes:

<tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens separated by whitespace -->
<filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <!-- throw out stopwords -->
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" /> <!-- throw out everything except a-z -->
<!-- actually, here i would like to join multiple tokens together again, to provide one token for the EdgeNGramFilterFactory -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" /> <!-- create edgeNGram tokens for autocomplete matches -->

With that kind of filterchain, the EdgeNGramFilterFactory will receive multiple tokens for input strings that contain whitespace.
This leads to the following results:

Input Query: "George Cloo"

Matches:
- "George Harrison"
- "John Clooridge"
- "George Smith"
- "George Clooney"
- etc

However, only "George Clooney" should match in the autocompletion use case. Therefore, i'd like to add a filter before the EdgeNGramFilterFactory, which concatenates all the tokens generated by the WhitespaceTokenizerFactory.

Are there filters which can do such a thing? If not, are there examples how to implement a custom TokenFilter?

thanks!

-robert
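
For reference, the KeywordTokenizer approach suggested in the reply at the top of this thread might look roughly like the field type below. This is only a sketch under assumptions: the field type name "autocomplete_edge" is hypothetical, the stopword filter is left out (with a single token it would only ever act on the whole input string), and the pattern from the original chain is reused so that whitespace is stripped along with everything else that is not a-z. The edge n-grams are produced at index time only; the query side normalizes the input the same way but keeps it as one token.

```xml
<!-- Hypothetical field type name; the analyzer classes are the ones named in this thread. -->
<fieldType name="autocomplete_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- emit the entire input string as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- lowercase the token -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- strip everything except a-z, whitespace included -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <!-- create prefixes of the single normalized token for autocomplete matches -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- same normalization, but no n-grams: the typed prefix is matched as-is -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

With a chain like this, "George Clooney" would be indexed as prefixes of "georgeclooney", and the input "George Cloo" would normalize to "georgecloo", which should match only names starting with that prefix. The query still has to reach the analyzer as one string, so something along the lines of q={!field f=name_ac}George Cloo (where name_ac is a hypothetical field using this type) would be needed rather than the default query parser, which splits on whitespace before analysis.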