Hi Robert, all,

I have a similar problem; here is my fieldType: http://paste.pocoo.org/show/289910/

I want to include stopword removal and lowercase the incoming terms. The idea is to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNGram filter factory. If anyone can tell me a simple way to concatenate tokens back into one token, similar to what the KeywordTokenizer produces, that would be super helpful.
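For concreteness, the analysis chain I'm aiming for is roughly the one below. The field type name is illustrative, and ConcatFilterFactory is only a placeholder for whatever does the joining, since as far as I can tell no stock Solr filter does this:

  <fieldType name="text_autocomplete" class="solr.TextField">
    <analyzer>
      <!-- split on whitespace so the stop filter can drop words like "Ltd" -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <!-- placeholder: glue the remaining tokens back into a single token -->
      <filter class="com.example.ConcatFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
  </fieldType>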
Many thanks,

Nick

On 11 Nov 2010, at 00:23, Robert Gründler wrote:

>
> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>
>> Are you sure you really want to throw out stopwords for your use case? I
>> don't think autocompletion will work how you want if you do.
>
> in our case i think it makes sense. the content is targeting the electronic
> music / dj scene, so we have a lot of words like "DJ" or "featuring" which
> make sense to throw out of the query. Also searches for "the beastie boys"
> and "beastie boys" should return a match in the autocompletion.
>
>>
>> And if you don't... then why use the WhitespaceTokenizer and then try to jam
>> the tokens back together? Why not just NOT tokenize in the first place. Use
>> the KeywordTokenizer, which really should be called the
>> NonTokenizingTokenizer, because it doesn't tokenize at all, it just creates
>> one token from the entire input string.
>
> I started out with the KeywordTokenizer, which worked well, except for the
> stopword problem.
>
> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which
> does what i'm after:
>
> public class ConcatFilter extends TokenFilter {
>
>     private TokenStream tstream;
>
>     protected ConcatFilter(TokenStream input) {
>         super(input);
>         this.tstream = input;
>     }
>
>     @Override
>     public Token next() throws IOException {
>
>         Token token = new Token();
>         StringBuilder builder = new StringBuilder();
>
>         TermAttribute termAttribute = (TermAttribute) tstream.getAttribute(TermAttribute.class);
>         TypeAttribute typeAttribute = (TypeAttribute) tstream.getAttribute(TypeAttribute.class);
>
>         boolean incremented = false;
>
>         while (tstream.incrementToken()) {
>             if (typeAttribute.type().equals("word")) {
>                 builder.append(termAttribute.term());
>             }
>             incremented = true;
>         }
>
>         token.setTermBuffer(builder.toString());
>
>         if (incremented == true)
>             return token;
>
>         return null;
>     }
> }
>
> I'm not sure if this is a safe way to do this, as i'm not familiar with the
> whole solr/lucene implementation after all.
>
>
> best
>
>
> -robert
>
>
>
>
>>
>> Then lowercase, remove whitespace (or not), do whatever else you want to do
>> to your single token to normalize it, and then edgengram it.
>>
>> If you include whitespace in the token, then when making your queries for
>> auto-complete, be sure to use a query parser that doesn't do
>> "pre-tokenization"; the 'field' query parser should work well for this.
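On the 'field' query parser point: I assume that means issuing the autocomplete query along these lines, so the whole input is run through the field's query analyzer rather than being split on whitespace by the default query parser first (the field name here is made up):

  q={!field f=name_autocomplete}George Cloo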
>>
>> Jonathan
>>
>>
>>
>> ________________________________________
>> From: Robert Gründler [rob...@dubture.com]
>> Sent: Wednesday, November 10, 2010 6:39 PM
>> To: solr-user@lucene.apache.org
>> Subject: Concatenate multiple tokens into one
>>
>> Hi,
>>
>> i've created the following filterchain in a field type, the idea is to use
>> it for autocompletion purposes:
>>
>> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens separated by whitespace -->
>> <filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
>> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <!-- throw out stopwords -->
>> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" /> <!-- throw out everything except a-z -->
>>
>> <!-- actually, here i would like to join multiple tokens together again, to provide one token for the EdgeNGramFilterFactory -->
>>
>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" /> <!-- create edgeNGram tokens for autocomplete matches -->
>>
>> With that kind of filterchain, the EdgeNGramFilterFactory will receive
>> multiple tokens on input strings with whitespace in them. This leads to the
>> following results:
>>
>> Input Query: "George Cloo"
>> Matches:
>> - "George Harrison"
>> - "John Clooridge"
>> - "George Smith"
>> - "George Clooney"
>> - etc.
>>
>> However, only "George Clooney" should match in the autocompletion use case.
>> Therefore, i'd like to add a filter before the EdgeNGramFilterFactory, which
>> concatenates all the tokens generated by the WhitespaceTokenizerFactory.
>> Are there filters which can do such a thing?
>>
>> If not, are there examples how to implement a custom TokenFilter?
>>
>> thanks!
>>
>> -robert
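Robert's ConcatFilter above looks close to what I'm after. Rewritten against the attribute-based incrementToken() API (next() is deprecated in recent Lucene), I'd expect it to look roughly like the sketch below; this is untested and assumes Lucene 2.9/3.x, where TermAttribute is the current API:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public final class ConcatFilter extends TokenFilter {

    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
    private boolean done = false;

    public ConcatFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (done) {
            return false;
        }
        done = true;

        // consume the whole upstream token stream and glue the terms together
        StringBuilder builder = new StringBuilder();
        while (input.incrementToken()) {
            if ("word".equals(typeAtt.type())) {
                builder.append(termAtt.term());
            }
        }

        if (builder.length() == 0) {
            return false;
        }

        // emit a single token containing the concatenated terms
        clearAttributes();
        termAtt.setTermBuffer(builder.toString());
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        done = false;
    }
}

If that holds up, it would still need a small TokenFilterFactory wrapper so it can be referenced from the fieldType definition.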