On Sep 26, 2008, at 12:05 PM, Jaco wrote:
Hi Grant,
In reply to your questions:
1. Are you having to restart/initialize the stemmer every time for your "slow" approach? Does that really need to happen?
It is invoking a COM object in Windows. The object is instantiated once for a token stream, and then invoked once for each token. The invoke always has an overhead; not much to do about that (sigh...)
2. Can the stemmer return something other than a String? Say a String array of all the stemmed words? Or maybe even some type of object that tells you the original word and the stemmed word?
The stemmer can only return a String. But I do know that the returned string always has exactly the same number of words as the input string. So logically, it would be possible to:
a) first calculate the position/start/end of each token in the input string (usual tokenization by whitespace), resulting in token list 1
b) then invoke the stemmer, and tokenize its result by whitespace, resulting in token list 2
c) 'merge' the token values of token list 2 into token list 1, which is possible because each token's position is the same in both lists
d) return that 'merged' token list for further processing
Would this work in Solr?
I think so, assuming your stemmer tokenizes on whitespace as well.
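Just to make step c) concrete, the merge itself could be as simple as the following untested sketch (written against the old Lucene 2.x Token API; TokenListMerger is just an illustrative name, and the two lists are assumed to line up one-for-one as you describe):

import java.util.List;
import org.apache.lucene.analysis.Token;

class TokenListMerger {
    // Step c): both lists are assumed to contain one token per whitespace-separated
    // word, in the same order, per your observation about the stemmer output.
    static List<Token> merge(List<Token> original, List<Token> stemmed) {
        for (int i = 0; i < original.size(); i++) {
            // keep the original position and offsets, replace only the term text
            original.get(i).setTermText(stemmed.get(i).termText());
        }
        return original;   // offsets still point at the source text
    }
}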
I can do some Java coding to achieve that from a logical point of view, but I wouldn't know how to structure this flow into the MyTokenizerFactory, so some hints on how to achieve that would be great!
One thought:
Don't create an all-in-one Tokenizer. Instead, keep the WhitespaceTokenizer as is. Then, create a TokenFilter that buffers the whole document in memory (via the next() implementation) and also builds, using a StringBuilder, a string containing the whole text. Once you've read it all in, send that string to your stemmer, parse the result back out, and associate it with your token buffer. If you are guaranteed position, you could even keep a (linked) hash, so that it is really quick to look up tokens after stemming.
Pseudocode looks something like:

while ((token = input.next()) != null)
    tokenMap.put(token.position, token)
    stringBuilder.append(' ').append(token.text)
stemmedText = comObj.stem(stringBuilder.toString())
correlateStemmedText(stemmedText, tokenMap)
spit out the tokens one by one...
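Fleshed out a bit, a completely untested sketch of such a filter (against the old Lucene 2.x TokenFilter/Token API; MyComStemmer is just a placeholder for however you wrap the COM object) might look like:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class BufferingStemFilter extends TokenFilter {
    private final MyComStemmer stemmer;   // placeholder wrapper around the COM object
    private Iterator<Token> buffered;

    public BufferingStemFilter(TokenStream input, MyComStemmer stemmer) {
        super(input);
        this.stemmer = stemmer;
    }

    public Token next() throws IOException {
        if (buffered == null) {
            // 1) buffer the whole stream, building the full text as we go
            List<Token> tokens = new ArrayList<Token>();
            StringBuilder sb = new StringBuilder();
            for (Token t = input.next(); t != null; t = input.next()) {
                tokens.add(t);
                if (sb.length() > 0) sb.append(' ');
                sb.append(t.termText());
            }
            // 2) one COM call for the whole text
            String[] stemmedWords = stemmer.stem(sb.toString()).split("\\s+");
            // 3) correlate: same word count, so copy the stemmed text back by index,
            //    keeping each token's original position and offsets
            for (int i = 0; i < tokens.size() && i < stemmedWords.length; i++) {
                tokens.get(i).setTermText(stemmedWords[i]);
            }
            buffered = tokens.iterator();
        }
        // 4) spit the tokens out one by one
        return buffered.hasNext() ? buffered.next() : null;
    }
}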
I think this approach should be fast (though maybe not as fast as your all-in-one tokenizer) and will provide the correct positions and offsets. You do have to be careful with really big documents, as that map can get big. You also want to be careful about map reuse, token reuse, etc.
I believe there are a couple of buffering TokenFilters in Solr that you could examine for inspiration. I think the RemoveDuplicatesTokenFilter (or whatever it's called) does buffering.
-Grant
Thanks for helping out!
Jaco.
2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>
On Sep 26, 2008, at 9:40 AM, Jaco wrote:
Hi,
Here's some of the code of my Tokenizer:
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.commons.io.IOUtils;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;
import org.apache.solr.common.SolrException;

public class MyTokenizerFactory extends BaseTokenizerFactory
{
    public WhitespaceTokenizer create(Reader input)
    {
        String text, normalizedText;
        try {
            text = IOUtils.toString(input);
            normalizedText = *invoke my stemmer(text)*;   // the COM-based stemmer call
        }
        catch (IOException ex) {
            throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, ex);
        }
        // Tokenize the stemmed text rather than the original input
        StringReader stringReader = new StringReader(normalizedText);
        return new WhitespaceTokenizer(stringReader);
    }
}
I see what's going on in the analysis tool now, and I think I understand the problem. For instance, take the text: abcdxxx defgxxx. Let's assume the stemmer gets rid of xxx. I would then see this in the analysis tool after the tokenizer stage:
- abcd - term position 1; start: 1; end: 3
- defg - term position 2; start: 4; end: 7
These positions are not in line with the initial search text - this must be why the highlighting goes wrong. I guess my little trick to do this was a bit too simple: it messes up the positions because something different from the original source text is tokenized.
Yes, this is exactly the problem. I don't know enough about com4J or your stemmer, but some things come to mind:
1. Are you having to restart/initialize the stemmer every time for your "slow" approach? Does that really need to happen?
2. Can the stemmer return something other than a String? Say a String array of all the stemmed words? Or maybe even some type of object that tells you the original word and the stemmed word?
-Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ