I've posted a ConcatFilter in my previous mail which concatenates tokens. This works fine, but I realized that what I wanted to achieve is implemented more easily in another way (by using 2 separate field types).
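A rough sketch of what such a two-field-type setup might look like (the field type names are hypothetical, and the details are illustrative guesses assembled from the filters discussed later in this thread, not taken from an actual schema):

```xml
<!-- Hypothetical sketch: two separate field types instead of one filter chain. -->

<!-- 1) prefix matching against the whole phrase as a single token -->
<fieldType name="autocomplete_full" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
            replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
</fieldType>

<!-- 2) prefix matching against individual words, with stopwords removed -->
<fieldType name="autocomplete_words" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
</fieldType>
```

Querying both fields (e.g. with different boosts) gives whole-phrase prefix matches priority while still allowing per-word matches.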
Have a look at a previous mail I wrote to the list and the reply from Ahmet Arslan (topic: "EdgeNGram relevancy").

best
-robert

On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:

> Hi Robert, All,
>
> I have a similar problem, here is my fieldType:
> http://paste.pocoo.org/show/289910/
> I want to include stopword removal and lowercase the incoming terms. The idea
> being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the
> EdgeNgram filter factory.
> If anyone can tell me a simple way to concatenate tokens into one token
> again, similar to the KeywordTokenizer, that would be super helpful.
>
> Many thanks
>
> Nick
>
> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>
>>
>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>
>>> Are you sure you really want to throw out stopwords for your use case? I
>>> don't think autocompletion will work how you want if you do.
>>
>> In our case I think it makes sense. The content is targeting the electronic
>> music / dj scene, so we have a lot of words like "DJ" or "featuring" which
>> make sense to throw out of the query. Also, searches for "the beastie boys"
>> and "beastie boys" should return a match in the autocompletion.
>>
>>>
>>> And if you don't... then why use the WhitespaceTokenizer and then try to
>>> jam the tokens back together? Why not just NOT tokenize in the first place?
>>> Use the KeywordTokenizer, which really should be called the
>>> NonTokenizingTokenizer, because it doesn't tokenize at all, it just creates
>>> one token from the entire input string.
>>
>> I started out with the KeywordTokenizer, which worked well, except for the
>> stopword problem.
>>
>> For now, I've come up with a quick-and-dirty custom "ConcatFilter", which
>> does what I'm after:
>>
>> public class ConcatFilter extends TokenFilter {
>>
>>   private TokenStream tstream;
>>
>>   protected ConcatFilter(TokenStream input) {
>>     super(input);
>>     this.tstream = input;
>>   }
>>
>>   @Override
>>   public Token next() throws IOException {
>>
>>     Token token = new Token();
>>     StringBuilder builder = new StringBuilder();
>>
>>     TermAttribute termAttribute = (TermAttribute)
>>         tstream.getAttribute(TermAttribute.class);
>>     TypeAttribute typeAttribute = (TypeAttribute)
>>         tstream.getAttribute(TypeAttribute.class);
>>
>>     boolean incremented = false;
>>
>>     while (tstream.incrementToken()) {
>>       if (typeAttribute.type().equals("word")) {
>>         builder.append(termAttribute.term());
>>       }
>>       incremented = true;
>>     }
>>
>>     token.setTermBuffer(builder.toString());
>>
>>     if (incremented)
>>       return token;
>>
>>     return null;
>>   }
>> }
>>
>> I'm not sure if this is a safe way to do this, as I'm not familiar with the
>> whole solr/lucene implementation after all.
>>
>> best
>>
>> -robert
>>
>>>
>>> Then lowercase, remove whitespace (or not), do whatever else you want to do
>>> to your single token to normalize it, and then edgengram it.
>>>
>>> If you include whitespace in the token, then when making your queries for
>>> auto-complete, be sure to use a query parser that doesn't do
>>> "pre-tokenization"; the 'field' query parser should work well for this.
>>>
>>> Jonathan
>>>
>>> ________________________________________
>>> From: Robert Gründler [rob...@dubture.com]
>>> Sent: Wednesday, November 10, 2010 6:39 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Concatenate multiple tokens into one
>>>
>>> Hi,
>>>
>>> I've created the following filterchain in a field type, the idea is to use
>>> it for autocompletion purposes:
>>>
>>> <!-- create tokens separated by whitespace -->
>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>> <!-- lowercase everything -->
>>> <filter class="solr.LowerCaseFilterFactory"/>
>>> <!-- throw out stopwords -->
>>> <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>         words="stopwords.txt" enablePositionIncrements="true"/>
>>> <!-- throw out everything except a-z -->
>>> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
>>>         replacement="" replace="all"/>
>>>
>>> <!-- actually, here I would like to join multiple tokens together again,
>>>      to provide one token for the EdgeNGramFilterFactory -->
>>>
>>> <!-- create edgeNGram tokens for autocomplete matches -->
>>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
>>>
>>> With that kind of filterchain, the EdgeNGramFilterFactory will receive
>>> multiple tokens for input strings with whitespace in them. This leads to the
>>> following results:
>>>
>>> Input Query: "George Cloo"
>>>
>>> Matches:
>>> - "George Harrison"
>>> - "John Clooridge"
>>> - "George Smith"
>>> - "George Clooney"
>>> - etc.
>>>
>>> However, only "George Clooney" should match in the autocompletion use case.
>>> Therefore, I'd like to add a filter before the EdgeNGramFilterFactory which
>>> concatenates all the tokens generated by the WhitespaceTokenizerFactory.
>>> Are there filters which can do such a thing?
>>>
>>> If not, are there examples how to implement a custom TokenFilter?
>>>
>>> thanks!
>>>
>>> -robert
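To make the "concatenate, then edge-ngram" idea concrete outside of Solr, here is a small self-contained Java sketch. It is not Lucene API: the class `ConcatNGramDemo`, the helper names, and the tiny stopword list are made up for illustration. It mimics the intended analysis (lowercase, drop stopwords, strip non-letters, join into one token, then build edge n-grams) and shows why, on the names above, the query "George Cloo" then matches only "George Clooney".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class ConcatNGramDemo {

    // Illustrative stopword list, standing in for stopwords.txt.
    static final Set<String> STOPWORDS = Set.of("the", "dj", "featuring");

    // Lowercase, split on whitespace, drop stopwords, strip non-letters,
    // and join the remaining tokens into a single token.
    static String normalize(String input) {
        StringBuilder sb = new StringBuilder();
        for (String token : input.toLowerCase().split("\\s+")) {
            if (STOPWORDS.contains(token)) continue;
            sb.append(token.replaceAll("[^a-z]", ""));
        }
        return sb.toString();
    }

    // Edge n-grams of one token, analogous to EdgeNGramFilterFactory
    // with minGramSize=1 / maxGramSize=25 on the concatenated token.
    static List<String> edgeNGrams(String token) {
        List<String> grams = new ArrayList<>();
        for (int len = 1; len <= Math.min(token.length(), 25); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "George Harrison", "John Clooridge", "George Smith", "George Clooney");

        String query = normalize("George Cloo"); // "georgecloo"

        for (String name : names) {
            boolean match = edgeNGrams(normalize(name)).contains(query);
            System.out.println(name + " -> " + (match ? "match" : "no match"));
        }
    }
}
```

Because the whole phrase is indexed as one token before n-gramming, "georgecloo" is a prefix of "georgeclooney" only, whereas per-word n-grams would also match "George Harrison" (via "george") and "John Clooridge" (via "cloo").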