Hi,

The suggested approach with a TokenFilter extending the BufferedTokenStream class works fine, and performance is OK - the external stemmer is now invoked only once for the complete search text. The approach is also useful from a functional point of view, because it allows other filtering (e.g. WordDelimiterFilter with its various useful options) to be done before stemming takes place.
Code is roughly like this for the process() function of the custom Filter class:

    protected Token process(Token token) throws IOException {
        StringBuilder stringBuilder = new StringBuilder();
        Token nextToken;
        int tokenPos = 0;
        Map<Integer, Token> tokenMap = new LinkedHashMap<Integer, Token>();

        stringBuilder.append(token.term()).append(' ');
        tokenMap.put(tokenPos++, token);

        nextToken = read();
        while (nextToken != null) {
            stringBuilder.append(nextToken.term()).append(' ');
            tokenMap.put(tokenPos++, nextToken);
            nextToken = read();
        }

        String inputText = stringBuilder.toString();
        String stemmedText = stemText(inputText);
        String[] stemmedWords = stemmedText.split("\\s");

        for (Map.Entry<Integer, Token> entry : tokenMap.entrySet()) {
            Integer pos = entry.getKey();
            Token tok = entry.getValue();
            tok.setTermBuffer(stemmedWords[pos]);
            write(tok);
        }
        return null;
    }

This will need some work and additional error checking, and I'll probably put a maximum on the number of tokens that is processed in one go, to make sure things don't get too big in memory.

Thanks for helping out!

Bye,

Jaco.

2008/9/26 Jaco <[EMAIL PROTECTED]>

> Thanks for these suggestions, will try it in the coming days and post my
> findings in this thread.
>
> Bye,
>
> Jaco.
>
> 2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>
>
>> On Sep 26, 2008, at 12:05 PM, Jaco wrote:
>>
>> Hi Grant,
>>>
>>> In reply to your questions:
>>>
>>> 1. Are you having to restart/initialize the stemmer every time for your
>>> "slow" approach? Does that really need to happen?
>>>
>>> It is invoking a COM object in Windows. The object is instantiated once
>>> for a token stream, and then invoked once for each token. The invoke
>>> always has an overhead, not much to do about that (sigh...)
>>>
>>> 2. Can the stemmer return something other than a String? Say a String
>>> array of all the stemmed words? Or maybe even some type of object that
>>> tells you the original word and the stemmed word?
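For readers following along, the index-correlation trick in the snippet above can be exercised outside Solr with plain Strings standing in for Lucene Tokens. This is a minimal editorial sketch, not real Solr API: the class and method names are made up, and the stemmer is a stub that just strips a trailing "xxx" (where the real code makes one call to the external COM stemmer).

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the buffering/correlation step: join all tokens,
// stem the whole text once, split the result on whitespace, and write
// the stemmed words back by position.
public class StemCorrelator {

    // Stub stemmer: strips a trailing "xxx" from each word. A stand-in
    // for the single call to the external stemmer.
    static String stemText(String input) {
        StringBuilder sb = new StringBuilder();
        for (String word : input.trim().split("\\s+")) {
            sb.append(word.endsWith("xxx")
                    ? word.substring(0, word.length() - 3)
                    : word).append(' ');
        }
        return sb.toString().trim();
    }

    // Buffer all "tokens", stem the joined text once, then map the
    // stemmed words back by position - the same flow as process() above.
    static List<String> correlate(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            sb.append(t).append(' ');
        }
        String[] stemmedWords = stemText(sb.toString()).split("\\s+");
        List<String> out = new ArrayList<String>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            out.add(stemmedWords[pos]); // positions line up one-to-one
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(correlate(List.of("abcdxxx", "defgxxx")));
        // [abcd, defg]
    }
}
```

The one-to-one mapping only holds under Jaco's stated guarantee that the stemmer returns exactly as many words as it was given.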
>>>
>>> The stemmer can only return a String. But, I do know that the returned
>>> string always has exactly the same number of words as the input string.
>>> So logically, it would be possible to:
>>> a) first calculate the position/start/end of each token in the input
>>> string (usual tokenization by Whitespace), resulting in token list 1
>>> b) then invoke the stemmer, and tokenize that result by Whitespace,
>>> resulting in token list 2
>>> c) 'merge' the token values of token list 2 into token list 1, which is
>>> possible because each token's position is the same in both lists...
>>> d) return that 'merged' token list for further processing
>>>
>>> Would this work in Solr?
>>
>> I think so, assuming your stemmer tokenizes on whitespace as well.
>>
>>>
>>> I can do some Java coding to achieve that from a logical point of view,
>>> but I wouldn't know how to structure this flow into the
>>> MyTokenizerFactory, so some hints to achieve that would be great!
>>
>> One thought:
>> Don't create an all-in-one Tokenizer. Instead, keep the Whitespace
>> Tokenizer as is. Then, create a TokenFilter that buffers the whole
>> document into memory (via the next() implementation) and also creates,
>> using StringBuilder, a string containing the whole text. Once you've
>> read it all in, send the string to your stemmer, parse it back out, and
>> associate it back to your token buffer. If you are guaranteed position,
>> you could even keep a (linked) hash, so that it is really quick to look
>> up tokens after stemming.
>>
>> Pseudocode looks something like:
>>
>>   while (token.next != null)
>>     tokenMap.put(token.position, token)
>>     stringBuilder.append(' ').append(token.text)
>>
>>   stemmedText = comObj.stem(stringBuilder.toString())
>>   correlateStemmedText(stemmedText, tokenMap)
>>
>>   spit out the tokens one by one...
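Steps a)-d) above can be sketched in standalone Java. The Tok class below is a hypothetical stand-in for Lucene's Token (not real API); the point it demonstrates is that only the term text changes in the merge, while the start/end offsets into the original text survive - which is what keeps highlighting correct.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of steps a)-d): tokenize the original text to keep its offsets,
// then merge in term text from the separately tokenized stemmed text.
public class OffsetMerge {

    static class Tok {
        String term;
        int start, end; // offsets into the ORIGINAL text

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // a) whitespace-tokenize, recording each word's start/end offset
    static List<Tok> tokenize(String text) {
        List<Tok> toks = new ArrayList<Tok>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++;
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++;
            if (i > start) toks.add(new Tok(text.substring(start, i), start, i));
        }
        return toks;
    }

    // c)+d) copy term text from list 2 onto list 1, keeping list 1's offsets
    static List<Tok> merge(List<Tok> originals, String stemmedText) {
        String[] stemmed = stemmedText.split("\\s+");
        for (int pos = 0; pos < originals.size(); pos++) {
            originals.get(pos).term = stemmed[pos];
        }
        return originals;
    }

    public static void main(String[] args) {
        List<Tok> merged = merge(tokenize("abcdxxx defgxxx"), "abcd defg");
        for (Tok t : merged) {
            System.out.println(t.term + " " + t.start + "-" + t.end);
        }
        // abcd 0-7
        // defg 8-15
    }
}
```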
>>
>> I think this approach should be fast (but maybe not as fast as your
>> all-in-one tokenizer) and will provide the correct position and offsets.
>> You do have to be careful w/ really big documents, as that map can be
>> big. You also want to be careful about map reuse, token reuse, etc.
>>
>> I believe there are a couple of buffering TokenFilters in Solr that you
>> could examine for inspiration. I think the RemoveDuplicatesTokenFilter
>> (or whatever it's called) does buffering.
>>
>> -Grant
>>
>>>
>>> Thanks for helping out!
>>>
>>> Jaco.
>>>
>>> 2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>
>>>
>>>> On Sep 26, 2008, at 9:40 AM, Jaco wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Here's some of the code of my Tokenizer:
>>>>>
>>>>> public class MyTokenizerFactory extends BaseTokenizerFactory
>>>>> {
>>>>>     public WhitespaceTokenizer create(Reader input)
>>>>>     {
>>>>>         String text, normalizedText;
>>>>>
>>>>>         try {
>>>>>             text = IOUtils.toString(input);
>>>>>             normalizedText = *invoke my stemmer(text)*;
>>>>>         }
>>>>>         catch (IOException ex) {
>>>>>             throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, ex);
>>>>>         }
>>>>>
>>>>>         StringReader stringReader = new StringReader(normalizedText);
>>>>>
>>>>>         return new WhitespaceTokenizer(stringReader);
>>>>>     }
>>>>> }
>>>>>
>>>>> I see what's going on in the analysis tool now, and I think I
>>>>> understand the problem. For instance, take the text: abcdxxx defgxxx.
>>>>> Let's assume the stemmer gets rid of xxx.
>>>>>
>>>>> I would then see this in the analysis tool after the tokenizer stage:
>>>>> - abcd - term position 1; start: 1; end: 3
>>>>> - defg - term position 2; start: 4; end: 7
>>>>>
>>>>> These positions are not in line with the initial search text - this
>>>>> must be why the highlighting goes wrong.
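One way to act on Grant's "really big documents" caveat, and on Jaco's plan to cap the token count, is to stem in fixed-size batches rather than buffering the whole document. A minimal standalone sketch; the batch size and all names are made up, the stemmer is a stub, and note that a context-sensitive stemmer might behave differently across batch boundaries:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a "maximum tokens per stemmer call" cap: the document is
// stemmed in fixed-size batches so the in-memory buffer stays bounded.
public class BatchedStemming {

    static final int MAX_TOKENS_PER_CALL = 3; // tune for real documents

    // Stand-in for the external stemmer call
    static String stemBatch(String text) {
        return text.toLowerCase();
    }

    static List<String> stemAll(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i += MAX_TOKENS_PER_CALL) {
            int end = Math.min(i + MAX_TOKENS_PER_CALL, tokens.size());
            // join one bounded batch, stem it, and collect the results
            String joined = String.join(" ", tokens.subList(i, end));
            for (String w : stemBatch(joined).split("\\s+")) {
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(stemAll(List.of("Abc", "DEF", "Ghi", "JkL")));
        // [abc, def, ghi, jkl]
    }
}
```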
>>>>> I guess my little trick to do this was a bit too simple, because it
>>>>> messes up the positions - basically, something different from the
>>>>> original source text is tokenized.
>>>>
>>>> Yes, this is exactly the problem. I don't know enough about com4J or
>>>> your stemmer, but some things come to mind:
>>>>
>>>> 1. Are you having to restart/initialize the stemmer every time for
>>>> your "slow" approach? Does that really need to happen?
>>>> 2. Can the stemmer return something other than a String? Say a String
>>>> array of all the stemmed words? Or maybe even some type of object that
>>>> tells you the original word and the stemmed word?
>>>>
>>>> -Grant
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
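To make the offset mismatch from the quoted abcdxxx/defgxxx example concrete: when the text is stemmed before tokenization, the tokenizer computes offsets into the stemmed string rather than the original, which is exactly what throws highlighting off. A tiny standalone illustration (helper name is made up):

```java
// Tokenizing AFTER stemming yields offsets into the wrong string:
// highlighting then underlines the wrong spans of the original text.
public class OffsetMismatch {

    // start offset of the second whitespace-separated token
    static int secondTokenStart(String text) {
        return text.indexOf(' ') + 1;
    }

    public static void main(String[] args) {
        String original = "abcdxxx defgxxx";
        String stemmed  = "abcd defg";
        // offsets from the stemmed text vs. where the word really
        // starts in the original text
        System.out.println(secondTokenStart(stemmed));  // 5
        System.out.println(secondTokenStart(original)); // 8
    }
}
```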