I've posted a ConcatFilter in my previous mail which concatenates tokens. This works fine, but I realized that what I wanted to achieve is implemented more easily in another way (by using 2 separate field types).
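A rough sketch of what such a two-field-type setup might look like (the field type names are hypothetical, and the details are illustrative guesses assembled from the filters discussed later in this thread, not taken from an actual schema):

```xml
<!-- Hypothetical sketch: two separate field types instead of one filter chain. -->

<!-- 1) prefix matching against the whole phrase as a single token -->
<fieldType name="autocomplete_full" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
            replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
</fieldType>

<!-- 2) prefix matching against individual words, with stopwords removed -->
<fieldType name="autocomplete_words" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
</fieldType>
```

Querying both fields (e.g. with different boosts) gives whole-phrase prefix matches priority while still allowing per-word matches.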
Have a look at a previous mail I wrote to the list and the reply from Ahmet Arslan (topic: "EdgeNGram relevancy").

best
-robert

On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:

> Hi Robert, All,
>
> I have a similar problem, here is my fieldType:
> http://paste.pocoo.org/show/289910/
> I want to include stopword removal and lowercase the incoming terms. The idea
> being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the
> EdgeNgram filter factory.
> If anyone can tell me a simple way to concatenate tokens into one token
> again, similar to the KeywordTokenizer, that would be super helpful.
>
> Many thanks
>
> Nick
>
> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>
>>
>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>
>>> Are you sure you really want to throw out stopwords for your use case? I
>>> don't think autocompletion will work how you want if you do.
>>
>> In our case I think it makes sense. The content is targeting the electronic
>> music / dj scene, so we have a lot of words like "DJ" or "featuring" which
>> make sense to throw out of the query. Also, searches for "the beastie boys"
>> and "beastie boys" should return a match in the autocompletion.
>>
>>>
>>> And if you don't... then why use the WhitespaceTokenizer and then try to
>>> jam the tokens back together? Why not just NOT tokenize in the first place?
>>> Use the KeywordTokenizer, which really should be called the
>>> NonTokenizingTokenizer, because it doesn't tokenize at all, it just creates
>>> one token from the entire input string.
>>
>> I started out with the KeywordTokenizer, which worked well, except for the
>> stopword problem.
>>
>> For now, I've come up with a quick-and-dirty custom "ConcatFilter", which
>> does what I'm after:
>>
>> public class ConcatFilter extends TokenFilter {
>>
>>   private TokenStream tstream;
>>
>>   protected ConcatFilter(TokenStream input) {
>>     super(input);
>>     this.tstream = input;
>>   }
>>
>>   @Override
>>   public Token next() throws IOException {
>>
>>     Token token = new Token();
>>     StringBuilder builder = new StringBuilder();
>>
>>     TermAttribute termAttribute = (TermAttribute)
>>         tstream.getAttribute(TermAttribute.class);
>>     TypeAttribute typeAttribute = (TypeAttribute)
>>         tstream.getAttribute(TypeAttribute.class);
>>
>>     boolean incremented = false;
>>
>>     while (tstream.incrementToken()) {
>>       if (typeAttribute.type().equals("word")) {
>>         builder.append(termAttribute.term());
>>       }
>>       incremented = true;
>>     }
>>
>>     token.setTermBuffer(builder.toString());
>>
>>     if (incremented)
>>       return token;
>>
>>     return null;
>>   }
>> }
>>
>> I'm not sure if this is a safe way to do this, as I'm not familiar with the
>> whole solr/lucene implementation after all.
>>
>> best
>>
>> -robert
>>
>>>
>>> Then lowercase, remove whitespace (or not), do whatever else you want to do
>>> to your single token to normalize it, and then edgengram it.
>>>
>>> If you include whitespace in the token, then when making your queries for
>>> auto-complete, be sure to use a query parser that doesn't do
>>> "pre-tokenization"; the 'field' query parser should work well for this.
>>>
>>> Jonathan
>>>
>>> ________________________________________
>>> From: Robert Gründler [rob...@dubture.com]
>>> Sent: Wednesday, November 10, 2010 6:39 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Concatenate multiple tokens into one
>>>
>>> Hi,
>>>
>>> I've created the following filterchain in a field type, the idea is to use
>>> it for autocompletion purposes:
>>>
>>> <!-- create tokens separated by whitespace -->
>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>> <!-- lowercase everything -->
>>> <filter class="solr.LowerCaseFilterFactory"/>
>>> <!-- throw out stopwords -->
>>> <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>         words="stopwords.txt" enablePositionIncrements="true"/>
>>> <!-- throw out everything except a-z -->
>>> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
>>>         replacement="" replace="all"/>
>>>
>>> <!-- actually, here I would like to join multiple tokens together again,
>>>      to provide one token for the EdgeNGramFilterFactory -->
>>>
>>> <!-- create edgeNGram tokens for autocomplete matches -->
>>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
>>>
>>> With that kind of filterchain, the EdgeNGramFilterFactory will receive
>>> multiple tokens for input strings with whitespace in them. This leads to the
>>> following results:
>>>
>>> Input Query: "George Cloo"
>>>
>>> Matches:
>>> - "George Harrison"
>>> - "John Clooridge"
>>> - "George Smith"
>>> - "George Clooney"
>>> - etc.
>>>
>>> However, only "George Clooney" should match in the autocompletion use case.
>>> Therefore, I'd like to add a filter before the EdgeNGramFilterFactory which
>>> concatenates all the tokens generated by the WhitespaceTokenizerFactory.
>>> Are there filters which can do such a thing?
>>>
>>> If not, are there examples how to implement a custom TokenFilter?
>>>
>>> thanks!
>>>
>>> -robert
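To make the "concatenate, then edge-ngram" idea concrete outside of Solr, here is a small self-contained Java sketch. It is not Lucene API: the class `ConcatNGramDemo`, the helper names, and the tiny stopword list are made up for illustration. It mimics the intended analysis (lowercase, drop stopwords, strip non-letters, join into one token, then build edge n-grams) and shows why, on the names above, the query "George Cloo" then matches only "George Clooney".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class ConcatNGramDemo {

    // Illustrative stopword list, standing in for stopwords.txt.
    static final Set<String> STOPWORDS = Set.of("the", "dj", "featuring");

    // Lowercase, split on whitespace, drop stopwords, strip non-letters,
    // and join the remaining tokens into a single token.
    static String normalize(String input) {
        StringBuilder sb = new StringBuilder();
        for (String token : input.toLowerCase().split("\\s+")) {
            if (STOPWORDS.contains(token)) continue;
            sb.append(token.replaceAll("[^a-z]", ""));
        }
        return sb.toString();
    }

    // Edge n-grams of one token, analogous to EdgeNGramFilterFactory
    // with minGramSize=1 / maxGramSize=25 on the concatenated token.
    static List<String> edgeNGrams(String token) {
        List<String> grams = new ArrayList<>();
        for (int len = 1; len <= Math.min(token.length(), 25); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "George Harrison", "John Clooridge", "George Smith", "George Clooney");

        String query = normalize("George Cloo"); // "georgecloo"

        for (String name : names) {
            boolean match = edgeNGrams(normalize(name)).contains(query);
            System.out.println(name + " -> " + (match ? "match" : "no match"));
        }
    }
}
```

Because the whole phrase is indexed as one token before n-gramming, "georgecloo" is a prefix of "georgeclooney" only, whereas per-word n-grams would also match "George Harrison" (via "george") and "John Clooridge" (via "cloo").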